pitrou commented on issue #45287:
URL: https://github.com/apache/arrow/issues/45287#issuecomment-2616455259

   > Awesome, thanks. Can you point me to how you were able to tell 250MB was spent on Column Chunk metadata using the memory pool statistics debugging? I think I was getting only high-level summary statistics with `PrintStats()`.
   
   What I did in a Python prompt (a runnable sketch of these steps follows the list):
   1. `tab = ds.dataset(...).to_table(memory_pool=pa.system_memory_pool())`, where `ds` is the `pyarrow.dataset` module
   2. Look up "in use bytes" in the output of `pa.system_memory_pool().print_stats()`: I get around 932 MB
   3. `tab = tab.combine_chunks(memory_pool=pa.system_memory_pool())`
   4. Look up "in use bytes" in `pa.system_memory_pool().print_stats()` again: I get around 655 MB
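   
   Putting those steps together as a self-contained sketch (the `data/` path is a placeholder, and the exact `print_stats()` layout depends on the allocator backend; with the system pool on glibc it is `malloc_stats()` output, which includes an "in use bytes" line):
   
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds
   
   pool = pa.system_memory_pool()
   
   # Step 1: read the Parquet dataset into a (chunked) Table, allocating
   # from the system memory pool so the malloc statistics reflect it.
   tab = ds.dataset("data/", format="parquet").to_table(memory_pool=pool)
   
   # Step 2: print allocator statistics and note the "in use bytes" line.
   pool.print_stats()
   
   # Step 3: concatenate the chunks of each column into contiguous arrays.
   tab = tab.combine_chunks(memory_pool=pool)
   
   # Step 4: print statistics again; the drop in "in use bytes" is the
   # space reclaimed by combining chunks.
   pool.print_stats()
   ```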
   
   The difference between the two measurements (steps 2 and 4, roughly 277 MB) is the space saved by combining the table chunks.
   I admit I'm not entirely sure all of this would be Arrow column chunk metadata, as perhaps some buffers were overallocated when reading the Parquet file. I would have to check that explicitly.
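   
   One way to check the overallocation hypothesis (a sketch of my own, not something done above) would be to compare the bytes actually referenced by the table's buffers with the live bytes the pool has handed out; any slack would point at overallocated or otherwise unreferenced memory. This assumes the table was read entirely through the given pool and nothing else allocates from it, and `report_slack` is a hypothetical helper name:
   
   ```python
   import pyarrow as pa
   
   def report_slack(tab: pa.Table, pool: pa.MemoryPool) -> None:
       # Bytes referenced by the table's array buffers (each buffer
       # counted once, even if shared between columns or chunks).
       referenced = tab.get_total_buffer_size()
       # Live bytes currently allocated from this pool.
       allocated = pool.bytes_allocated()
       print(f"referenced: {referenced}, allocated: {allocated}, "
             f"slack: {allocated - referenced}")
   ```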
   
   

