dbr commented on issue #623:
URL: https://github.com/apache/arrow-rs/issues/623#issuecomment-888788592


   I ran the example code under the 
[`memory-profiler`](https://github.com/koute/memory-profiler) tool with a huge 
batch size:
   ![graphs of memory 
allocation/fragmentation](https://user-images.githubusercontent.com/509/127428396-d6848bd4-92f4-47d4-ad74-d158e2bb9a17.png)
   
   If I understand correctly, this explains the remaining mystery (why a giant 
batch size causes the process to hold on to so much memory). In my "ELI5-level" 
understanding of memory allocators:
   
   - With a huge batch size, the readers allocate large contiguous chunks of 
memory as part of parsing
   - The `RecordBatch`es are allocated while parsing is happening, so they have 
to "fit around" those allocations
   - When the reader's buffers are deallocated, they leave big "holes" in the 
allocated memory
   - Since Rust won't magically rearrange items in memory, the process ends up 
holding on to all that memory, holes and all
   - The example script is especially bad since there are no subsequent 
allocations that might fill in some of those holes
   - With a smaller batch size, the "holes" left by the parser are much smaller, 
so the overhead is insignificant (see the sketch after this list)
   
   I might try to make a PR adding some basic docs to the `with_batch_size` 
methods when I have time, incorporating some of the advice above - but 
otherwise I think this issue can be closed, as it seems to be working "as 
intended".
   
   Thanks @Dandandan !

