alamb commented on PR #8160:
URL: https://github.com/apache/arrow-rs/pull/8160#issuecomment-3197249445

   > I will say that the page indexes are pretty darn expensive to parse, and 
the file used for the benchmark 
(`parquet-testing/data/all_types_tiny_pages.parquet`) is pretty pathological. 
Looking into where the time goes, the offset index is hobbled by the fact that 
it's defined as an array of structs, which adds considerable overhead to the 
parsing. The column index is a struct of arrays that parses very quickly, but 
then must be transformed into an array of structs after decoding, so that takes 
a good bit of time. 
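
To make the quoted cost concrete, here is a minimal sketch of the struct-of-arrays to array-of-structs step. The field names follow the Parquet `ColumnIndex` thrift definition, but `ColumnIndexSoa`, `PageEntry`, and `to_page_entries` are hypothetical names for illustration, not the arrow-rs types. The point is that every non-null page pays for two `Vec<u8>` clones.

```rust
/// Decoded layout: one vector per field, one element per page.
/// Hypothetical type mirroring the thrift `ColumnIndex` fields.
pub struct ColumnIndexSoa {
    pub null_pages: Vec<bool>,
    pub min_values: Vec<Vec<u8>>,
    pub max_values: Vec<Vec<u8>>,
}

/// Per-page layout: one struct per page (hypothetical).
pub struct PageEntry {
    pub min: Option<Vec<u8>>,
    pub max: Option<Vec<u8>>,
}

/// Transpose struct-of-arrays into array-of-structs.
/// Each non-null page clones its min and max byte arrays.
pub fn to_page_entries(soa: &ColumnIndexSoa) -> Vec<PageEntry> {
    soa.null_pages
        .iter()
        .zip(soa.min_values.iter().zip(soa.max_values.iter()))
        .map(|(&is_null, (min, max))| {
            if is_null {
                PageEntry { min: None, max: None }
            } else {
                PageEntry {
                    min: Some(min.clone()),
                    max: Some(max.clone()),
                }
            }
        })
        .collect()
}
```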
   
   What drives the need to convert to array of structs? Is that the 
representation of the ColumnIndex in Rust or is it something about how the 
thrift is encoded?
   
   > Copying of the min/max statistics for byte arrays takes the majority of that 
time (note that the test file does not contain the level histograms...those 
would be very costly as well if present). We could look into rethinking how we 
represent the column index. Perhaps saving the bytes read and presenting slices 
rather than copies will work (at least as far as the histograms in the column 
index...we may be stuck with min/max value copying).
   
   As you say, perhaps we could keep around a `Bytes` with the byte statistics 
in it, and store an offset there (rather than copying into their own structure).
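
A rough sketch of that idea, assuming one shared buffer plus per-page byte ranges. `PageStats` and its methods are hypothetical, and `Arc<[u8]>` stands in for `Bytes` only to keep the example free of external crates. The constructor copies into the buffer for demonstration; a real decoder would presumably record the ranges while walking the already-read thrift bytes.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical statistics store: one shared buffer, per-page ranges into it.
pub struct PageStats {
    buf: Arc<[u8]>, // stand-in for `Bytes`
    min_ranges: Vec<Range<usize>>,
    max_ranges: Vec<Range<usize>>,
}

impl PageStats {
    /// Demo constructor: packs (min, max) pairs into a single buffer.
    pub fn from_pairs(pairs: &[(&[u8], &[u8])]) -> Self {
        let mut data = Vec::new();
        let mut min_ranges = Vec::with_capacity(pairs.len());
        let mut max_ranges = Vec::with_capacity(pairs.len());
        for (min, max) in pairs {
            let start = data.len();
            data.extend_from_slice(min);
            min_ranges.push(start..data.len());
            let start = data.len();
            data.extend_from_slice(max);
            max_ranges.push(start..data.len());
        }
        Self {
            buf: Arc::from(data),
            min_ranges,
            max_ranges,
        }
    }

    /// Borrow the min value for a page without copying it.
    pub fn min(&self, page: usize) -> Option<&[u8]> {
        self.min_ranges.get(page).map(|r| &self.buf[r.clone()])
    }

    /// Borrow the max value for a page without copying it.
    pub fn max(&self, page: usize) -> Option<&[u8]> {
        self.max_ranges.get(page).map(|r| &self.buf[r.clone()])
    }
}
```

Lookups then hand out slices of the shared buffer, so the per-page cost is a range rather than a separate allocation.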
   
   Maybe we could also contemplate some way to defer decoding/copying the 
structures until they are requested.
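
One possible shape for deferred decoding, sketched with `std::sync::OnceLock`: hold the raw bytes and decode on first access. `LazyOffsets` is a hypothetical name, and the "decoder" here just reads little-endian `u64` values as a stand-in for real thrift parsing.

```rust
use std::sync::{Arc, OnceLock};

/// Hypothetical lazily decoded index: raw bytes now, decoded values on demand.
pub struct LazyOffsets {
    raw: Arc<[u8]>,
    decoded: OnceLock<Vec<u64>>,
}

impl LazyOffsets {
    pub fn new(raw: Arc<[u8]>) -> Self {
        Self {
            raw,
            decoded: OnceLock::new(),
        }
    }

    /// True once the decode has actually run.
    pub fn is_decoded(&self) -> bool {
        self.decoded.get().is_some()
    }

    /// Decode on first call, then return the cached result.
    pub fn offsets(&self) -> &[u64] {
        self.decoded.get_or_init(|| {
            // Stand-in decoder: consecutive little-endian u64 values.
            self.raw
                .chunks_exact(8)
                .map(|chunk| {
                    let mut arr = [0u8; 8];
                    arr.copy_from_slice(chunk);
                    u64::from_le_bytes(arr)
                })
                .collect()
        })
    }
}
```

Readers that never touch the page index would then pay only for holding the buffer.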
   
   > @alamb, not sure how radical you want to go here 😅
   
   I have no preconceived ideas. I have personally always found the 
ColumnIndex representation in Rust (`Vec<Vec<Index>>` as I recall) quite 
complicated to work with, so if we have to change that to improve the 
performance I would be fully in support of it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
