JakeDern commented on PR #8001: URL: https://github.com/apache/arrow-rs/pull/8001#issuecomment-3184791114
@albertlockett Thanks for the review, I'll try to go through it this afternoon! >One general thought I had is: I wonder if eventually we'd want to produce deltas for record batches where the dictionary of the subsequent batches fed into the writer don't necessarily need to be extensions of the previous dict. (I think this has maybe been called out elsewhere, but I'll call it out here for posterity). >That would allow us to produce IPC dict deltas without having to copy the values the each time we call .finish_preserve_values(). Of course, it makes the logic in writer::compare_dictionaries more complicated (and maybe as we change the implementation, we could also maybe fix the performance issue identified https://github.com/apache/arrow-rs/pull/8001#discussion_r2270751815), and it also means that we need to adjust the dictionary keys when writing the subsequent batches. >Obviously this is a larger piece of work than what this change is trying to achieve, but maybe we could create a follow up issue to open the discussion? Yeah I 100% agree - The implementation that I have is very similar to the go implementation, but I think there's opportunity to be more efficient. The naive way I think would be to have the IPC writer track all the values and re-intern them, but while that would get us nice deltas it would be super expensive. An alternative I was thinking about is we could: 1. Have an IPC stream writer which owns the record batch building rather than just having finished record batches pushed into it. 2. Add APIs to the dictionary builders to track and emit the incremental deltas for whatever they're building intsead of the full set. That way the IPC writer will know for sure that every time it seals a batch that it has the exact delta for every dictionary since it owns the record batch production and is controlling what the dictionary builders do. But like you say it's a bigger change and better suited to a follow-up IMO. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org