pitrou commented on issue #45287: URL: https://github.com/apache/arrow/issues/45287#issuecomment-2616455259

> Awesome, thanks. Can you point me to how you were able to tell 250 MB was spent on Column Chunk metadata using the memory pool statistics debugging? I think I was getting only high-level summary statistics with `PrintStats()`.

What I did in a Python prompt:

1. `tab = ds.dataset(...).to_table(memory_pool=pa.system_memory_pool())`
2. Look up "in use bytes" in `pa.system_memory_pool().print_stats()`: I get around 932 MB.
3. `tab = tab.combine_chunks(memory_pool=pa.system_memory_pool())`
4. Look up "in use bytes" in `pa.system_memory_pool().print_stats()` again: I get around 655 MB.

The difference between the measurements in steps 2 and 4 is the space saved by combining the table chunks. I admit I'm not entirely sure all of it is Arrow column chunk metadata, as perhaps some buffers were overallocated when reading the Parquet file. I would have to check that explicitly.
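Put together as a script, the sequence above would look roughly like this. The `data/` path is a placeholder for your Parquet dataset, and the `"in use bytes"` figure assumes the system pool is backed by glibc malloc (whose `malloc_stats()` output `print_stats()` surfaces); the output format is allocator-specific on other platforms.

```python
import pyarrow as pa
import pyarrow.dataset as ds

pool = pa.system_memory_pool()

# 1. Read the dataset into a (possibly heavily chunked) table,
#    allocating from the system pool so its stats reflect this table.
tab = ds.dataset("data/", format="parquet").to_table(memory_pool=pool)

# 2. First measurement: note the "in use bytes" line in the output.
pool.print_stats()

# 3. Concatenate each column's chunks into a single contiguous chunk.
#    Rebinding `tab` releases the original chunked table (assuming no
#    other references to it), so its buffers can be freed.
tab = tab.combine_chunks(memory_pool=pool)

# 4. Second measurement: the drop in "in use bytes" relative to step 2
#    is the space saved by combining the chunks.
pool.print_stats()
```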
> Awesome, thanks. Can you point me to how you were able to tell 250MB was spent on Column Chunk metadata using the memory pool statistics debugging? I think I was getting only high-level summary statistics with `PrintStats()`. What I did in a Python prompt: 1. `tab = pd.dataset(...).to_table(memory_pool=pa.system_memory_pool())` 2. Look up "in use bytes" in `pa.system_memory_pool().print_stats()`: I get around 932 MB 3. `tab = tab.combine_chunks(memory_pool=pa.system_memory_pool())` 4. Look up "in use bytes" in `pa.system_memory_pool().print_stats()` again: I get around 655 MB The diff between 4 and 1 is the space saved when combining the table chunks. I admit I'm not entirely sure this would be Arrow column chunk metadata, as perhaps some buffers were overallocated when reading the Parquet file. I would have to check that explicitly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org