alamb commented on PR #8160: URL: https://github.com/apache/arrow-rs/pull/8160#issuecomment-3197249445
> I will say that the page indexes are pretty darn expensive to parse, and the file used for the benchmark (`parquet-testing/data/all_types_tiny_pages.parquet`) is pretty pathological. Looking into where the time goes, the offset index is hobbled by the fact that it's defined as an array of structs, which adds considerable overhead to the parsing. The column index is a struct of arrays that parses very quickly, but then must be transformed into an array of structs after decoding, so that takes a good bit of time.

What drives the need to convert to an array of structs? Is that the representation of the ColumnIndex in Rust, or is it something about how the thrift is encoded?

> Copying of the min/max statistics for byte arrays takes the majority of that time (note that the test file does not contain the level histograms...those would be very costly as well if present). We could look into rethinking how we represent the column index. Perhaps saving the bytes read and presenting slices rather than copies will work (at least as far as the histograms in the column index...we may be stuck with min/max value copying).

As you say, perhaps we could keep around a `Bytes` with the byte statistics in it, and store an offset there (rather than copying into their own structure). Maybe we could also contemplate some way to defer decoding/copying the structures out until they are requested.

> @alamb, not sure how radical you want to go here 😅

I have no preconceived ideas. I have personally always found the ColumnIndex representation in Rust (`Vec<Vec<Index>>` as I recall) quite complicated to work with, so if we have to change that to improve the performance I would be fully in support of it.
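
To make the "keep a `Bytes` and store offsets" idea concrete, here is a minimal sketch of what such a view might look like. Everything in it is hypothetical: `PageStatsView`, its fields, and `example_view` are illustration-only names, not existing arrow-rs APIs, and `Arc<[u8]>` stands in for `bytes::Bytes` so the example needs only the standard library. It assumes the min/max values sit contiguously in one buffer, which is not how the thrift bytes are actually laid out.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical zero-copy view of per-page min/max statistics.
/// One shared buffer plus per-page ranges replaces one owned
/// `Vec<u8>` per value. `Arc<[u8]>` is a stand-in for `bytes::Bytes`.
#[derive(Clone)]
struct PageStatsView {
    buf: Arc<[u8]>,
    min_ranges: Vec<Range<usize>>,
    max_ranges: Vec<Range<usize>>,
}

impl PageStatsView {
    fn num_pages(&self) -> usize {
        self.min_ranges.len()
    }

    /// Borrow the min value for `page`; nothing is copied.
    /// Returns `None` if the page or its range is out of bounds.
    fn min(&self, page: usize) -> Option<&[u8]> {
        self.buf.get(self.min_ranges.get(page)?.clone())
    }

    /// Borrow the max value for `page`; nothing is copied.
    fn max(&self, page: usize) -> Option<&[u8]> {
        self.buf.get(self.max_ranges.get(page)?.clone())
    }
}

/// Made-up data: two pages whose min/max values are packed back to back.
fn example_view() -> PageStatsView {
    let buf: Arc<[u8]> = Arc::from(&b"applebananacherrydate"[..]);
    PageStatsView {
        buf,
        min_ranges: vec![0..5, 11..17],
        max_ranges: vec![5..11, 17..21],
    }
}

fn main() {
    let view = example_view();
    for page in 0..view.num_pages() {
        let min = String::from_utf8_lossy(view.min(page).unwrap());
        let max = String::from_utf8_lossy(view.max(page).unwrap());
        println!("page {page}: min={min} max={max}");
    }
}
```

Cloning such a view only bumps a reference count and copies the ranges, so deferring any real decoding until a caller asks for a value seems feasible with this shape.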