zeroshade commented on issue #46555: URL: https://github.com/apache/arrow/issues/46555#issuecomment-2905956585
> For implementation and the buffer handling stack is inspiring, I think for !sorted, this is great example. I may need to think about using a tmp buffer when build sub-object, it could change memmove to memcpy. But when object/array stack size grows, the required buffer also grows, which makes it even worse

Notice that for building the sub-objects I'm simply growing the single buffer and then shifting the built variant data down to make room for the header. This works just fine for nested objects as long as the user calls `FinishObject` in the correct order (the ordering would still need care with sub-objects). What does the temp buffer get you instead? Eventually you'll need to grow the top buffer to accommodate it anyway when you copy it back to the parent builder, so you end up with more allocations by using the sub-objects.

> For sorted, maybe it's easy to do it in one-layer object, for multi layer object, a sorted & unique Builder might need maintain a "object builder tree" and need to finish it in build order 😅

Well, my current implementation does guarantee *uniqueness* for the keys, but sorting is definitely much more complex. There are only two ways I can think of to efficiently get the final metadata sorted:

The simple case: If the user adds all the keys they are going to use up-front, the builder could provide a function to sort the keys and assign them new IDs based on their sorted positions. If you require this before creating *any* objects, everything else becomes simple. But requiring ALL of the keys up-front is not really a reasonable thing, IMO.

The more complex case: You have to maintain your object builder tree with nested builders and children and so on, but NONE of them can construct their object header until the entire builder is being finished. At that point you do a depth-first traversal of the tree, recursively constructing the object headers and copying the buffer data into the parents as you go.
The reason for this is that the object headers contain the ID for each key in the object, and the ID is based on the order in which the key was written to the metadata. That means every time a new key is added to the dictionary, you would potentially have to go through all of the objects and update ALL of the key IDs (in the worst case scenario). The only way to do this efficiently is to not write any object headers until you *know* you have ALL of the keys, so that the IDs won't change. For a simple value this probably isn't difficult, but for a complex, multi-level nested object it could get pretty inefficient real quick.

In the case of the Rust code you linked, they basically keep multiple buffers, one for each value/object/element, and if the keys are sorted it has to rewrite the header of every child object each time it finishes one of the parent objects (to update the IDs).

In the Go code, I just didn't bother even attempting to sort the keys. I check on the final result call whether or not the keys *happen* to be sorted; if they are, I set the sorted bit in the metadata lol