mapleFU commented on issue #38245: URL: https://github.com/apache/arrow/issues/38245#issuecomment-1766937874
Oops, I found a good way to debug this: using `LoggingMemoryPool`. The allocations break down as follows:

1. The `RecordReader` reserves `65536` bytes per column; 5000 columns use ~312 MiB.
2. The file is fetched column-chunk by column-chunk, which takes roughly the file size (120 MiB); IO might take more.
3. Each column allocates buffers for def levels, 16384 bytes per column, summing to ~78 MiB.
4. The output validity buffer allocates 1024 bytes per column.
5. `dictionary` needs a huge buffer, 35072 bytes per dict, summing to ~167 MiB.
6. Decompression needs 35072 bytes per column, summing to ~167 MiB.

This sums up to 712 MiB. @jorisvandenbossche
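As a back-of-envelope check, the per-column figures above can be totaled with a short script (a rough sketch; the constants are copied from the breakdown, and note that a straight sum of all six terms lands near 850 MiB rather than 712 MiB, so presumably not all of these buffers are live at the same moment):

```python
# Rough memory accounting for reading a 5000-column Parquet file,
# using the per-column byte counts reported by LoggingMemoryPool.
NUM_COLUMNS = 5000
MIB = 1024 * 1024

record_reader = 65536 * NUM_COLUMNS  # ~312 MiB of RecordReader buffers
def_levels = 16384 * NUM_COLUMNS     # ~78 MiB of definition levels
validity = 1024 * NUM_COLUMNS        # ~5 MiB of output validity buffers
dictionary = 35072 * NUM_COLUMNS     # ~167 MiB of dictionary buffers
decompress = 35072 * NUM_COLUMNS     # ~167 MiB of decompression scratch
file_fetch = 120 * MIB               # whole file fetched column-chunk-wise

total = (record_reader + def_levels + validity
         + dictionary + decompress + file_fetch)
print(round(total / MIB))  # → 850 (MiB), the straight sum of all terms
```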
