mapleFU commented on issue #38245:
URL: https://github.com/apache/arrow/issues/38245#issuecomment-1766937874

   Oops, I found a good way to debug: let me use `LoggingMemoryPool`.
   
   1. The `RecordReader` allocates `65536` bytes per column; 5000 columns use ~312 MiB.
   2. The file is fetched per column chunk, which takes the whole file size (120 MiB); IO might take more.
   3. Each column allocates a buffer for def levels, 16384 bytes per column, summing to ~78 MiB.
   4. The output validity buffer allocates 1024 bytes per column.
   5. `dictionary` needs a huge buffer, 35072 bytes per dict, summing to ~167 MiB.
   6. Decompression needs 35072 bytes per column, summing to ~167 MiB.
   
   This sums up to ~712 MiB. @jorisvandenbossche 
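   As a rough sketch (my own arithmetic, not from the issue), the per-column byte counts above can be tallied like this; the buffer names in the dict are mine, and the exact grand total depends on which allocations overlap or are released during the read:
   
   ```python
   # Back-of-the-envelope tally of the per-column allocations listed above
   # (5000 columns; byte counts taken from the comment, labels are mine).
   MIB = 1024 * 1024
   n_cols = 5000
   
   per_column = {
       "RecordReader buffer": 65536,    # step 1: ~312 MiB total
       "def-level buffer": 16384,       # step 3: ~78 MiB total
       "validity buffer": 1024,         # step 4: ~5 MiB total
       "dictionary buffer": 35072,      # step 5: ~167 MiB total
       "decompression buffer": 35072,   # step 6: ~167 MiB total
   }
   
   for name, nbytes in per_column.items():
       print(f"{name}: {nbytes * n_cols / MIB:.1f} MiB")
   
   total = sum(per_column.values()) * n_cols
   print(f"listed buffers: {total / MIB:.1f} MiB "
         f"(+ ~120 MiB for the column-chunk file fetch in step 2)")
   ```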

