jorisvandenbossche commented on issue #37801: URL: https://github.com/apache/arrow/issues/37801#issuecomment-1740891131
> Could you advise if this is a limitation of the tool (i.e. not being designed to continuously append data) or if I should use another approach?

This is not an answer to whether there might be an actual memory issue or leak in the implementation, but the way you are constructing the Table is definitely sub-optimal.

What you are doing is creating tables of 5 rows at a time and appending them to the existing table (`rates_table = pa.concat_tables((rates_table, chunk))`, where `chunk` has 5 rows). `concat_tables` doesn't actually make any copy; it preserves the original chunks and puts them together in a single Table object. So you end up with a table of more than 10,000 chunks of 5 rows each, which is very inefficient (each chunk carries _some_ overhead).

In a process where you are continuously appending data, a better approach would be a two-step logic: first gather the incoming batches of data in a separate buffer table, and only when that buffer reaches a certain size threshold (e.g. 65k rows), combine its chunks into a single chunk (this step involves a copy), and then append that combined chunk to the overall `rates_table`. It would be interesting to see whether you still notice memory issues with such an approach.

> and if I use `combine_chunks` from time to time, things get worse.

Given my explanation above, I have to admit that this is a bit strange ..
