pablodcar commented on issue #37801:
URL: https://github.com/apache/arrow/issues/37801#issuecomment-1744394914

   Thanks for the response.
   
   > In a process where you are continuously appending data, the better approach would be a two-step logic: first gather the batches of data in a separate Table, and only when this reaches a certain size threshold (eg 65k), actually combine the chunks of this table (involving a copy) into a single chunk, and then append this combined chunk to the overall `rates_table`.
   > 
   > It would be interesting to see when using such an approach if you still notice memory issues.
   > 
   > > and if I use `combine_chunks` from time to time, things get worse.
   > 
   > Given my explanation above, I have to admit that this is a bit strange ..
   
   Calling `combine_chunks` every 65k items, instead of a single call at the end, helps with both CPU and memory. In fact, memory usage remains almost constant: if I run the cycle twice, memory is not doubled, it only increases slightly.
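   
   For reference, this is a minimal sketch of the two-step approach described above, roughly what I plan to test. The schema and the `incoming_batches()` generator are placeholders, not my actual code:
   
   ```python
   import pyarrow as pa

   CHUNK_THRESHOLD = 65_000  # combine the staging chunks once this many rows accumulate

   # placeholder schema; the real one matches rates_table
   schema = pa.schema([("timestamp", pa.timestamp("ms")), ("rate", pa.float64())])
   rates_table = schema.empty_table()
   staging = schema.empty_table()

   for batch in incoming_batches():  # hypothetical generator yielding pyarrow.RecordBatch
       staging = pa.concat_tables([staging, pa.Table.from_batches([batch])])
       if staging.num_rows >= CHUNK_THRESHOLD:
           # single copy: collapse the many small chunks into one chunk,
           # then append that combined chunk to the overall table
           rates_table = pa.concat_tables([rates_table, staging.combine_chunks()])
           staging = schema.empty_table()

   # flush whatever is still in the staging table at the end
   if staging.num_rows > 0:
       rates_table = pa.concat_tables([rates_table, staging.combine_chunks()])
   ```
   
   This way the many small per-batch chunks only live in the staging table, so the copy done by `combine_chunks` is paid once per ~65k rows instead of over the whole `rates_table`.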
   
   Will test with a long-running process, thanks.
   
   
   

