etseidl commented on PR #8072:
URL: https://github.com/apache/arrow-rs/pull/8072#issuecomment-3166365107
FWIW, here are some benchmark numbers. In a separate branch I've completed
reading (mostly) directly into Rust structures. These numbers are from a
modified metadata bench.
"open(default)" opens a `SerializedFileReader` to get the `ParquetMetaData`;
"parquet metadata" just uses `ParquetMetaDataReader::decode_metadata` to parse
the footer (also returning a `ParquetMetaData`); "decode file metadata" uses
the existing thrift code to do a raw decode to `format::FileMetaData`; and
"decode new" uses the new parser to go straight to `ParquetMetaData`. The
"(wide)" variants use a synthetic 1000-column schema.
```
open(default)               time: [37.663 µs 37.840 µs 38.011 µs]
parquet metadata            time: [36.357 µs 36.453 µs 36.564 µs]
decode file metadata        time: [22.177 µs 22.224 µs 22.279 µs]
decode new                  time: [20.758 µs 20.797 µs 20.836 µs]
parquet metadata (wide)     time: [219.16 ms 219.86 ms 220.56 ms]
decode file metadata (wide) time: [110.76 ms 111.26 ms 111.82 ms]
decode new (wide)           time: [78.140 ms 78.468 ms 78.802 ms]
```
It's encouraging to see that fully decoding to `ParquetMetaData` with the new
code is faster than the current decoder's raw decode to `format` objects.
I'm finding some unfortunate dependencies on knowledge of the schema that
still force parsing to intermediate structures before creating the final
results. The schema is usually the first thing encoded in the metadata, but
there is no guarantee of that: thrift struct fields may legally appear in any
order, however unlikely that is in practice. I'll probably add a fast path
that assumes the schema is available when decoding the row group
metadata; that will save some more unnecessary duplication of effort.
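To make the fast-path idea concrete, here is a minimal, self-contained sketch
(all names are hypothetical stand-ins, not the actual parquet-rs types): if the
schema has already been decoded when a row group is encountered, build the
final structure directly; otherwise fall back to keeping the raw intermediate
form and convert it once the schema arrives.

```rust
// Stand-in for the parsed Parquet schema: just an ordered list of column names.
#[derive(Debug, Clone, PartialEq)]
struct Schema(Vec<String>);

// Final, schema-aware row group metadata (analogous to the target structure).
#[derive(Debug, PartialEq)]
struct RowGroupMeta {
    column_names: Vec<String>,
    num_rows: i64,
}

// Intermediate form mirroring the raw thrift struct, with no schema context:
// columns are identified only by ordinal position.
#[derive(Debug)]
struct RawRowGroup {
    column_ordinals: Vec<usize>,
    num_rows: i64,
}

// Fast path: the schema is already available, so resolve column names
// immediately and skip the intermediate representation entirely.
// Slow path: the schema hasn't been seen yet, so hand the raw struct back
// to the caller to be converted later (second pass = duplicated effort).
fn decode_row_group(raw: RawRowGroup, schema: Option<&Schema>) -> Result<RowGroupMeta, RawRowGroup> {
    match schema {
        Some(s) => Ok(RowGroupMeta {
            column_names: raw
                .column_ordinals
                .iter()
                .map(|&i| s.0[i].clone())
                .collect(),
            num_rows: raw.num_rows,
        }),
        None => Err(raw),
    }
}

fn main() {
    let schema = Schema(vec!["a".into(), "b".into()]);
    let raw = RawRowGroup { column_ordinals: vec![0, 1], num_rows: 100 };
    let rg = decode_row_group(raw, Some(&schema)).expect("schema was present");
    println!("{:?}", rg); // fast path: names resolved in a single pass
}
```

The point of the fallback `Err(raw)` arm is that correctness is preserved even
for the (unlikely but legal) out-of-order thrift encoding, while the common
schema-first layout pays no intermediate-structure cost.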
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]