Re: [I] Further refine the Top K sort operator [arrow-datafusion]

via GitHub Sat, 02 Mar 2024 16:08:17 -0800


yjshen commented on issue #9417:
URL: 
https://github.com/apache/arrow-datafusion/issues/9417#issuecomment-1974948936


   The main cause of the resource exhaustion appears to be incorrect memory 
accounting for the record batches stored in TopK's RecordBatchStore (as noted in 
the issue description, each batch is accounted as ~220MB). After adding some 
debug prints to the memory size calculation, I saw:
   
   ```
   Getting mem size of batch in topk::insert with batch size: 8192
   Column 0 mem: 37561184
   Column 1 mem: 37561184
   Column 2 mem: 78416312
   Column 3 mem: 72507488
   Inserting batch with mem size: 226046168
   ```
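
As a quick sanity check on the figures above, the per-column sizes can be summed and divided by the row count. This is only arithmetic on the printed numbers; the constant and function names are made up for illustration and are not part of DataFusion.

```rust
/// Per-column sizes in bytes, copied from the debug output above.
const COLUMN_MEM: [u64; 4] = [37_561_184, 37_561_184, 78_416_312, 72_507_488];

/// Number of rows in the batch, from the same output.
const BATCH_ROWS: u64 = 8192;

/// Sum of the reported per-column sizes.
fn reported_batch_bytes() -> u64 {
    COLUMN_MEM.iter().sum()
}

/// Reported bytes per row (integer division, remainder dropped).
fn reported_bytes_per_row() -> u64 {
    reported_batch_bytes() / BATCH_ROWS
}

fn main() {
    println!("reported batch size: {} bytes", reported_batch_bytes());
    println!("reported bytes per row: {}", reported_bytes_per_row());
}
```

The sum matches the logged 226046168 bytes, which works out to roughly 27 KB per row across four columns. That would be unusually large for typical data, which fits the reading that the accounting overstates what the batch really holds.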
   
    If we correct the calculation, spilling to disk for TopK would be less of 
a concern.
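
The comment does not say which mechanism inflates the numbers. One way this kind of over-counting can arise, offered here purely as an assumption, is when several columns or slices share one large backing buffer and each is charged for the whole allocation. The sketch below models that with the standard library only; `ColumnSlice`, `naive_total`, and `deduplicated_total` are hypothetical names, not Arrow or DataFusion APIs.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for a column: a window over a shared,
/// reference-counted buffer. Not an Arrow type.
struct ColumnSlice {
    buffer: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl ColumnSlice {
    /// The bytes this slice actually refers to.
    fn as_bytes(&self) -> &[u8] {
        &self.buffer[self.offset..self.offset + self.len]
    }
}

/// Naive accounting: charge every slice for its whole backing buffer,
/// even when several slices share the same allocation.
fn naive_total(slices: &[ColumnSlice]) -> usize {
    slices.iter().map(|s| s.buffer.len()).sum()
}

/// Count each distinct backing buffer once, identified by its pointer.
fn deduplicated_total(slices: &[ColumnSlice]) -> usize {
    let mut seen = HashSet::new();
    slices
        .iter()
        .filter(|s| seen.insert(Arc::as_ptr(&s.buffer)))
        .map(|s| s.buffer.len())
        .sum()
}

/// Count only the bytes each slice refers to.
fn referenced_total(slices: &[ColumnSlice]) -> usize {
    slices.iter().map(|s| s.as_bytes().len()).sum()
}

fn main() {
    // One 1000-byte buffer shared by four 10-byte slices.
    let buf = Arc::new(vec![0u8; 1000]);
    let slices: Vec<ColumnSlice> = (0..4)
        .map(|i| ColumnSlice { buffer: Arc::clone(&buf), offset: i * 10, len: 10 })
        .collect();

    println!("naive:        {}", naive_total(&slices));
    println!("deduplicated: {}", deduplicated_total(&slices));
    println!("referenced:   {}", referenced_total(&slices));
}
```

In this toy setup the naive total is four times the real allocation and a hundred times the bytes actually referenced. Whether the real fix should deduplicate buffers or count only referenced bytes is a design choice this sketch does not settle.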
    
    As for option 3, TopK already has `maybe_compact`, which serves a similar 
purpose, though it still keeps the relevant records in record batches.


