timothydijamco commented on issue #45287: URL: https://github.com/apache/arrow/issues/45287#issuecomment-2622689751
> `ReleaseUnused` is best effort, so you can't really deduce this unfortunately. The new https://github.com/apache/arrow/pull/45359 might allow you to get a better idea, though the allocator stats are not always easy to understand.

I see, that's fair.

---

> Adding `physical_schema_.reset()` to the `ClearCachedMetadata()` method (from https://github.com/apache/arrow/pull/45330) seems to reduce memory usage a bit further

I did some memory profiling on a version of Arrow with `physical_schema_.reset()` and I notice that memory usage actually looks bounded now.

Here's the memory usage graph of a C++ program that scans that "250 files, 10k columns, 200-character-long column names" dataset twice:

| Clearing `metadata_`, `manifest_`, `original_metadata_` | Clearing `metadata_`, `manifest_`, `original_metadata_`, **`physical_schema_`** |
|------|---------|
| <img width="1002" alt="Image" src="https://github.com/user-attachments/assets/ec750ab1-fab1-4712-97f3-13cabb0d06f5" /> | <img width="1000" alt="Image" src="https://github.com/user-attachments/assets/48280ac2-7dcf-49a1-898c-c7712d7378b6" /> |

And for good measure, here's the same thing but on a dataset with twice as many files (from 250 files -> 500 files) to show memory accumulation better:

| Clearing `metadata_`, `manifest_`, `original_metadata_` | Clearing `metadata_`, `manifest_`, `original_metadata_`, **`physical_schema_`** |
|------|---------|
| <img width="1002" alt="Image" src="https://github.com/user-attachments/assets/64cf67a6-3a32-4d48-a321-104ac44017e8" /> | <img width="1002" alt="Image" src="https://github.com/user-attachments/assets/43484c53-57e8-4816-90ec-8a3d93005dc4" /> |

Overall, clearing `metadata_`, `manifest_`, `original_metadata_`, and `physical_schema_` all together seems to do the trick of preventing metadata-related objects from accumulating over a scan.

Going to test on some real datasets as well and see how they are affected.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
