zeroshade commented on issue #46555: URL: https://github.com/apache/arrow/issues/46555#issuecomment-2905956585
> For implementation and the buffer handling stack is inspiring, I think for !sorted, this is great example. I may need to think about using a tmp buffer when build sub-object, it could change memmove to memcpy. But when object/array stack size grows, the required buffer also grows, which makes it even worse

Notice that for building the sub-objects I'm simply growing the single buffer and then shifting the built variant data down to make room for the header. This works just fine for nested objects as long as the user calls `FinishObject` in the correct order (the ordering would still need care with sub-objects). What does the temp buffer get you instead? Eventually you'll need to grow the top buffer to accommodate it anyway when you copy it back to the parent builder, so you end up with more allocations by using the sub-objects.

> For sorted, maybe it's easy to do it in one-layer object, for multi layer object, a sorted & unique Builder might need maintain a "object builder tree" and need to finish it in build order 😅

Well, my current implementation does guarantee *uniqueness* for the keys, but sorting is definitely much more complex. There are only two ways I can think of to efficiently get the final metadata sorted:

The simple case: If the user adds all the keys they are going to use up-front, the builder could provide a function to sort the keys and assign them new IDs based on their sorted positions. If you require this before creating *any* objects, everything else becomes simple. But requiring ALL of the keys up-front is not really a reasonable thing, IMO.

The more complex case: You have to maintain your object builder tree with nested builders and children and so on, but NONE of them can construct their object header until the entire builder is being finished. At that point you do a depth-first traversal of the tree, recursively constructing the object headers and copying the buffer data into the parents as you go.
The reason for this is that the object headers contain the ID for each key in the object, and the ID is based on the order in which the key was written to the metadata. That means every time a new key is added to the dictionary, you would potentially have to go through all of the objects and update ALL of the key IDs (in the worst case scenario). The only way to do this efficiently is to not write any object headers until you *know* you have ALL of the keys, so that the IDs won't change. For a simple value this probably isn't difficult, but for a complex, multi-level nested object it could get pretty inefficient real quick.

In the case of the Rust code you linked, they basically keep multiple buffers, one for each value/object/element, and if the keys are sorted it has to rewrite the header of every child object each time it finishes one of the parent objects (to update the IDs).

In the Go code, I just didn't bother even attempting to sort the keys. I check on the final result call whether or not the keys *happen* to be sorted; if they are, I set the sorted bit in the metadata lol