Dandandan commented on issue #623:
URL: https://github.com/apache/arrow-rs/issues/623#issuecomment-887589188


   The CSV reader reuses some allocations across batches to cut down on allocation count and time.
   In general this can increase memory usage a bit, since allocations from previous batches are kept around.
   
   However, with a very small batch size of 10, that reuse is not what causes the high 
memory usage; the data and metadata around each `RecordBatch` is. Every batch 
carries a schema with field names, pointers to its data buffers, and so on, and at a 
low batch size that fixed per-batch overhead makes up most of the memory. If you 
store the batches in a `Vec` instead of iterating over them (where each batch would 
be dropped after use), you keep all of that in memory, which I expect accounts for 
most of the consumption.
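   To see why the per-batch overhead dominates at a small batch size, here is a back-of-the-envelope sketch in plain Rust. The overhead and row-size figures (1 KiB per batch, 16 bytes per row) are illustrative assumptions, not measured arrow-rs numbers:

```rust
// Rough model: total memory = row payload + fixed overhead per RecordBatch.
// Both constants below are assumed for illustration only.
fn total_memory(rows: u64, batch_size: u64, row_bytes: u64, per_batch_overhead: u64) -> u64 {
    let batches = (rows + batch_size - 1) / batch_size; // ceiling division
    rows * row_bytes + batches * per_batch_overhead
}

fn main() {
    let rows = 1_000_000;
    let row_bytes = 16; // assumed payload per row
    let overhead = 1_024; // assumed schema + buffer-pointer overhead per batch

    let small = total_memory(rows, 10, row_bytes, overhead);
    let large = total_memory(rows, 4_096, row_bytes, overhead);

    // With batch_size = 10 there are 100,000 batches, so ~100 MiB of
    // overhead dwarfs the 16 MB of actual row data; with batch_size =
    // 4096 the overhead is a few hundred KiB.
    println!("batch_size=10:   {} bytes", small);
    println!("batch_size=4096: {} bytes", large);
}
```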
   
   So, generally:
   
   * Use a batch size of some 1000s, so there is less metadata overhead and you 
make use of the columnar Arrow format.
   * If you don't have to store the batches in a `Vec`, don't; iterate over them 
as in your first example.
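   Both points together might look like the sketch below. It assumes the `arrow` crate's CSV reader (`ReaderBuilder`, `with_batch_size`, `with_schema`; names may differ in your version) and a hypothetical `data.csv` file:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::csv::ReaderBuilder;
use arrow::datatypes::{DataType, Field, Schema};

fn main() -> arrow::error::Result<()> {
    let file = File::open("data.csv")?;
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("value", DataType::Float64, false),
    ]));

    // A batch size in the thousands amortizes the per-batch schema and
    // buffer-pointer overhead over many rows.
    let reader = ReaderBuilder::new()
        .with_schema(schema)
        .with_batch_size(4096)
        .build(file)?;

    // Iterate instead of collecting into a Vec: each RecordBatch is
    // dropped at the end of the loop body, so memory stays bounded.
    for batch in reader {
        let batch = batch?;
        println!("{} rows", batch.num_rows());
    }
    Ok(())
}
```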


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

