mapleFU commented on issue #38245:
URL: https://github.com/apache/arrow/issues/38245#issuecomment-1766937874

   Oops, I found a good way to debug: let me use `LoggingMemoryPool`.
   
   1. The `RecordReader` allocates `65536` bytes per column; 5000 columns use ~312 MiB.
   2. The file is fetched per column chunk, which takes the whole file size (120 MiB); IO might take more.
   3. Each column allocates a buffer for def levels, 16384 bytes per column, summing to ~78 MiB.
   4. The output validity buffer allocates 1024 bytes per column.
   5. `dictionary` needs a huge buffer, 35072 bytes per dict, summing to ~167 MiB.
   6. Decompression needs 35072 bytes per column, summing to ~167 MiB.
   
   This sums up to ~712 MiB. @jorisvandenbossche 
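   As a rough sketch (my own arithmetic, not from the issue), the per-column byte counts above can be tallied like this; the buffer names in the dict are mine, and the exact grand total depends on which allocations overlap or are released during the read:
   
   ```python
   # Back-of-the-envelope tally of the per-column allocations listed above
   # (5000 columns; byte counts taken from the comment, labels are mine).
   MIB = 1024 * 1024
   n_cols = 5000
   
   per_column = {
       "RecordReader buffer": 65536,    # step 1: ~312 MiB total
       "def-level buffer": 16384,       # step 3: ~78 MiB total
       "validity buffer": 1024,         # step 4: ~5 MiB total
       "dictionary buffer": 35072,      # step 5: ~167 MiB total
       "decompression buffer": 35072,   # step 6: ~167 MiB total
   }
   
   for name, nbytes in per_column.items():
       print(f"{name}: {nbytes * n_cols / MIB:.1f} MiB")
   
   total = sum(per_column.values()) * n_cols
   print(f"listed buffers: {total / MIB:.1f} MiB "
         f"(+ ~120 MiB for the column-chunk file fetch in step 2)")
   ```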

