XiangpengHao commented on issue #5854: URL: https://github.com/apache/arrow-rs/issues/5854#issuecomment-2154921008
FWIW, simply moving this field to the heap (i.e., `Option<Statistics>` -> `Option<Box<Statistics>>`) gives a 30% performance improvement (as the blog in #5770 will show).

https://github.com/apache/arrow-rs/blob/087f34b70e97ee85e1a54b3c45c5ed814f500b0a/parquet/src/format.rs#L3407

The `Option<Statistics>` field occupies 136 bytes even if the file has no stats at all (i.e., the field is `None`). This not only slows down decoding (due to poor memory locality) but also causes high memory consumption when decoding metadata (parquet-rs consumes 10MB of memory per MB of metadata).

I think this example motivates custom parquet type definitions and, thus, a custom Thrift decoder.
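
To make the size argument concrete, here is a minimal sketch of the inline versus boxed layout. `StatsStandIn`, `InlineMeta`, and `BoxedMeta` are hypothetical stand-ins whose fields are assumed for illustration, not copied from `parquet/src/format.rs`, so the printed inline size will not necessarily be 136 bytes. The only size the language guarantees here is that `Option<Box<T>>` is one pointer wide.

```rust
use std::mem::size_of;

// Hypothetical stand-in for the generated Thrift `Statistics` struct.
// The field set is an assumption made for this sketch.
#[allow(dead_code)]
struct StatsStandIn {
    max: Option<Vec<u8>>,
    min: Option<Vec<u8>>,
    null_count: Option<i64>,
    distinct_count: Option<i64>,
    max_value: Option<Vec<u8>>,
    min_value: Option<Vec<u8>>,
}

// Inline layout: even when `statistics` is `None`, the parent struct
// reserves space for a full `StatsStandIn`.
#[allow(dead_code)]
struct InlineMeta {
    num_values: i64,
    statistics: Option<StatsStandIn>,
}

// Boxed layout: `None` is stored as a null pointer, so the parent struct
// only pays one pointer for the field.
#[allow(dead_code)]
struct BoxedMeta {
    num_values: i64,
    statistics: Option<Box<StatsStandIn>>,
}

fn inline_field_size() -> usize {
    size_of::<Option<StatsStandIn>>()
}

fn boxed_field_size() -> usize {
    size_of::<Option<Box<StatsStandIn>>>()
}

fn main() {
    println!("Option<StatsStandIn>:      {} bytes", inline_field_size());
    println!("Option<Box<StatsStandIn>>: {} bytes", boxed_field_size());
    println!("InlineMeta: {} bytes", size_of::<InlineMeta>());
    println!("BoxedMeta:  {} bytes", size_of::<BoxedMeta>());
}
```

The trade-off is that a column chunk that does carry statistics now pays one extra heap allocation and a pointer indirection, so boxing helps most when the field is usually `None` or rarely read.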
