etseidl commented on PR #8072:
URL: https://github.com/apache/arrow-rs/pull/8072#issuecomment-3166365107

   FWIW, here are some benchmark numbers. In a separate branch I've completed 
reading (mostly) directly to rust structures. This is from a modified metadata 
bench.
   "open(default)" is opening a `SerializedFileReader` to get the 
`ParquetMetaData`, "parquet metadata" is just using 
`ParquetMetaDataReader::decode_metadata` to parse the footer (returns a 
`ParquetMetaData` as well), "decode file metadata" uses the thrift code to do a 
raw decode to `format::FileMetaData`, and "decode new" uses the new parser to go 
straight to `ParquetMetaData`. The "(wide)" variants are for a synthetic 
1000-column schema.
   ```
   open(default)           time:   [37.663 µs 37.840 µs 38.011 µs]
   parquet metadata        time:   [36.357 µs 36.453 µs 36.564 µs]
   decode file metadata    time:   [22.177 µs 22.224 µs 22.279 µs]
   decode new              time:   [20.758 µs 20.797 µs 20.836 µs]
   parquet metadata (wide) time:   [219.16 ms 219.86 ms 220.56 ms]
   decode file metadata (wide)
                           time:   [110.76 ms 111.26 ms 111.82 ms]
   decode new (wide)       time:   [78.140 ms 78.468 ms 78.802 ms]
   ```
   It's encouraging to see that the time to fully decode to `ParquetMetaData` 
with the new code is faster than the current decoder going to `format` objects.
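
To put rough numbers on that, here is a small sketch that divides the median (middle) timings from the table above. The `speedup` helper and the comparison labels are made up for illustration; only the timings come from the bench output.

```rust
/// Ratio of two median timings measured in the same unit.
/// Values above 1.0 mean the candidate is faster than the baseline.
fn speedup(baseline_median: f64, candidate_median: f64) -> f64 {
    baseline_median / candidate_median
}

fn main() {
    // (label, baseline median, "decode new" median); each pair shares a unit
    // (µs for the default file, ms for the wide schema).
    let comparisons = [
        ("parquet metadata vs decode new", 36.453, 20.797),
        ("decode file metadata vs decode new", 22.224, 20.797),
        ("parquet metadata (wide) vs decode new (wide)", 219.86, 78.468),
        ("decode file metadata (wide) vs decode new (wide)", 111.26, 78.468),
    ];

    for (label, baseline, candidate) in comparisons {
        println!("{}: {:.2}x", label, speedup(baseline, candidate));
    }
}
```

By these medians the new parser is roughly 1.75x faster than `ParquetMetaDataReader::decode_metadata` on the default file and about 2.8x faster on the wide schema. Against the raw `format::FileMetaData` decode the margin is smaller, about 1.07x and 1.42x respectively.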
   
   I'm finding some unfortunate dependencies on knowledge of the schema that 
lead to still having to parse to intermediate structures before creating the 
final results. It's true that the schema is usually the first thing encoded in 
the metadata, but there is no guarantee that this will be so. Thrift structures 
could conceivably be written out of order, even though this is pretty unlikely. 
I'll probably add a fast path that assumes the schema will be available when 
decoding the row group metadata. That should avoid some more unnecessary 
duplication of effort.
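
As a rough illustration of that idea (not the code in the branch), here is a toy model of a decoder with a fast path and a fallback. Everything in it is hypothetical: `Field`, `RowGroup`, `Decoded`, `resolve`, and `decode` are invented names, a "schema" is just a list of column names, and each row group is treated as its own top-level field, which is a simplification of the real footer layout.

```rust
/// Toy stand-ins for top-level footer fields, in the order they were written.
#[derive(Debug, Clone, PartialEq)]
enum Field {
    Schema(Vec<String>), // column names
    RowGroup(Vec<u64>),  // one raw value per column, meaningless without the schema
}

/// Final, schema-aware representation of a row group.
#[derive(Debug, PartialEq)]
struct RowGroup {
    columns: Vec<(String, u64)>,
}

#[derive(Debug, PartialEq)]
struct Decoded {
    row_groups: Vec<RowGroup>,
    /// How many row groups had to be buffered in intermediate form.
    deferred: usize,
}

/// Attach column names to the raw per-column values.
fn resolve(schema: &[String], raw: &[u64]) -> RowGroup {
    RowGroup {
        columns: schema.iter().cloned().zip(raw.iter().copied()).collect(),
    }
}

/// Returns `None` if no schema field was present at all.
fn decode(fields: Vec<Field>) -> Option<Decoded> {
    let mut schema: Option<Vec<String>> = None;
    let mut pending: Vec<Vec<u64>> = Vec::new(); // row groups seen before the schema
    let mut resolved: Vec<RowGroup> = Vec::new();

    for field in fields {
        match field {
            Field::Schema(s) => schema = Some(s),
            Field::RowGroup(raw) => match &schema {
                // Fast path: schema already decoded, build the final struct directly.
                Some(s) => resolved.push(resolve(s, &raw)),
                // Fallback: keep the intermediate form until the schema shows up.
                None => pending.push(raw),
            },
        }
    }

    let schema = schema?;
    let deferred = pending.len();
    // Buffered row groups preceded the schema, so they go first to preserve order.
    let mut row_groups: Vec<RowGroup> =
        pending.iter().map(|raw| resolve(&schema, raw)).collect();
    row_groups.append(&mut resolved);
    Some(Decoded { row_groups, deferred })
}

fn main() {
    let schema = Field::Schema(vec!["a".to_string(), "b".to_string()]);
    let row_group = Field::RowGroup(vec![10, 20]);

    let usual = decode(vec![schema.clone(), row_group.clone()]).unwrap();
    let unusual = decode(vec![row_group, schema]).unwrap();

    // Same result either way; only the out-of-order input pays for buffering.
    println!("usual order deferred: {}", usual.deferred);
    println!("unusual order deferred: {}", unusual.deferred);
}
```

In this model the common case (schema first) never touches the intermediate buffer, while out-of-order input still decodes correctly at the cost of a second pass over the buffered row groups.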

