timothydijamco commented on issue #45287: URL: https://github.com/apache/arrow/issues/45287#issuecomment-2616422607
Awesome, thanks. Could you point me to how you were able to tell that 250MB was spent on column chunk metadata using the memory pool statistics debugging? With `PrintStats()` I was only getting high-level summary statistics.

I think I may be seeing what you're describing about column chunk metadata in the output I get when running `valgrind --tool=massif` on the C++ repro I posted above and visualizing it with `massif-visualizer`. Here's what the memory usage graph looks like when running the two scans in one script:

<img width="1314" alt="Image" src="https://github.com/user-attachments/assets/5f3e600d-dce6-4524-baad-568c3201300d" />

At the peak (middle of the graph), the top three memory consumers appear to be:

* 342.9 MiB: a "`parquet::schema::node` to 'schema field'" map
  * <img width="1148" alt="Image" src="https://github.com/user-attachments/assets/c95b98d6-9632-4b27-a3d5-3fdf4034dda6" />
* 155.4 MiB: a "name to index" map
  * <img width="1143" alt="Image" src="https://github.com/user-attachments/assets/6fd5c862-fa23-4d9b-9367-91b5f7d1a7c9" />
* 109.9 MiB: a vector of `parquet::format::ColumnChunk`s
  * <img width="1139" alt="Image" src="https://github.com/user-attachments/assets/9980b3ab-fd91-4274-b4b3-b0ffb437fb00" />
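For anyone who wants to reproduce the numbers above, the massif workflow is roughly the following sketch. The binary name `./parquet-repro` is a placeholder for however the C++ repro above was compiled; the commands are guarded so they only run when valgrind and the binary are actually present:

```shell
# Hypothetical binary name; substitute your compiled C++ repro
BIN=${BIN:-./parquet-repro}

if command -v valgrind >/dev/null 2>&1 && [ -x "$BIN" ]; then
  # Heap-profile the run; massif writes its snapshots to the named file
  valgrind --tool=massif --massif-out-file=massif.out "$BIN"

  # Plain-text breakdown of each snapshot (ms_print ships with valgrind);
  # the peak snapshot's allocation call tree is where the per-structure
  # totals in the screenshots come from
  ms_print massif.out

  # For the interactive graphs shown above:
  #   massif-visualizer massif.out
fi
```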
