jorisvandenbossche commented on issue #37801: URL: https://github.com/apache/arrow/issues/37801#issuecomment-1740891131
> Could you advise if this is a limitation of the tool (i.e. not being designed to continuously append data) or if I should use another approach?

This is not an answer to whether there might be an actual memory issue or leak in the implementation, but the way you are constructing the Table is definitely sub-optimal.

What you are doing is creating tables of 5 rows at a time and appending them to the existing table (`rates_table = pa.concat_tables((rates_table, chunk))`, where `chunk` has 5 rows). `concat_tables` doesn't actually make any copy; it preserves the original chunks and puts them together in a single Table object. So you end up with a table of more than 10,000 chunks of 5 rows each, which is very inefficient (each chunk carries _some_ overhead).

In a process where you are continuously appending data, a better approach would be a two-step logic: first gather the incoming batches of data in a separate buffer table, and only when that buffer reaches a certain size threshold (e.g. 65k rows), combine its chunks into a single chunk (this step involves a copy), and then append that combined chunk to the overall `rates_table`. It would be interesting to see whether you still notice memory issues with such an approach.

> and if I use `combine_chunks` from time to time, things get worse.

Given my explanation above, I have to admit that this is a bit strange ..
