JakeDern commented on PR #8001:
URL: https://github.com/apache/arrow-rs/pull/8001#issuecomment-3184791114

   @albertlockett  Thanks for the review, I'll try to go through it this 
afternoon!
   
   >One general thought I had is: I wonder if eventually we'd want to produce 
deltas for record batches where the dictionary of the subsequent batches fed 
into the writer don't necessarily need to be extensions of the previous dict. 
(I think this has maybe been called out elsewhere, but I'll call it out here 
for posterity).
   
   >That would allow us to produce IPC dict deltas without having to copy the 
values the each time we call .finish_preserve_values(). Of course, it makes the 
logic in writer::compare_dictionaries more complicated (and maybe as we change 
the implementation, we could also maybe fix the performance issue identified 
https://github.com/apache/arrow-rs/pull/8001#discussion_r2270751815), and it 
also means that we need to adjust the dictionary keys when writing the 
subsequent batches.
   
   >Obviously this is a larger piece of work than what this change is trying to 
achieve, but maybe we could create a follow up issue to open the discussion?
   
   Yeah I 100% agree - The implementation that I have is very similar to the go 
implementation, but I think there's opportunity to be more efficient. The naive 
way I think would be to have the IPC writer track all the values and re-intern 
them, but while that would get us nice deltas it would be super expensive.
   
   An alternative I was thinking about is we could:
   
   1. Have an IPC stream writer which owns the record batch building rather 
than just having finished record batches pushed into it. 
   2. Add APIs to the dictionary builders to track and emit the incremental 
deltas for whatever they're building intsead of the full set.
   
   That way the IPC writer will know for sure that every time it seals a batch 
that it has the exact delta for every dictionary since it owns the record batch 
production and is controlling what the dictionary builders do.
   
   But like you say it's a bigger change and better suited to a follow-up IMO.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to