etseidl commented on PR #8072: URL: https://github.com/apache/arrow-rs/pull/8072#issuecomment-3166365107
FWIW, here are some benchmark numbers. In a separate branch I've completed reading (mostly) directly to Rust structures. This is from a modified metadata bench. "open(default)" is opening a `SerializedFileReader` to get the `ParquetMetaData`, "parquet metadata" is just using `ParquetMetaDataReader::decode_metadata` to parse the footer (returns a `ParquetMetaData` as well), "decode file metadata" uses the thrift code to do a raw decode to `format::FileMetaData`, and "decode new" uses the new parser to go straight to `ParquetMetaData`. The "(wide)" variants are for a synthetic 1000 column schema.

```
open(default)               time: [37.663 µs 37.840 µs 38.011 µs]
parquet metadata            time: [36.357 µs 36.453 µs 36.564 µs]
decode file metadata        time: [22.177 µs 22.224 µs 22.279 µs]
decode new                  time: [20.758 µs 20.797 µs 20.836 µs]
parquet metadata (wide)     time: [219.16 ms 219.86 ms 220.56 ms]
decode file metadata (wide) time: [110.76 ms 111.26 ms 111.82 ms]
decode new (wide)           time: [78.140 ms 78.468 ms 78.802 ms]
```

It's encouraging to see that the time to fully decode to `ParquetMetaData` with the new code is faster than the current decoder going to `format` objects.

I'm finding some unfortunate dependencies on knowledge of the schema that lead to still having to parse to intermediate structures before creating the final results. It's true that the schema is usually the first thing encoded in the metadata, but there is no guarantee that this will be so. Thrift structures could conceivably be written out of order, even though this is pretty unlikely. I'll probably add a fast path that assumes the schema will be available when decoding the row group metadata; that will save some more unnecessary duplication of effort.
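To make the fast-path idea concrete, here is a minimal, std-only sketch. It is not the code from the branch: `Field`, `RowGroupMeta`, `Decoded`, `resolve`, and `decode` are all hypothetical stand-ins, and the "raw" row group is reduced to one integer per column. It assumes only what is described above: row groups need the schema to be resolved, the schema usually arrives first, and out-of-order input must still decode correctly.

```rust
// Hypothetical, simplified stand-ins; none of these types exist in the parquet crate.
#[derive(Debug, Clone, PartialEq)]
enum Field {
    /// Column names, in schema order.
    Schema(Vec<String>),
    /// One raw value per column (e.g. a byte size), not yet tied to the schema.
    RowGroup(Vec<i64>),
}

#[derive(Debug, PartialEq)]
struct RowGroupMeta {
    columns: Vec<(String, i64)>,
}

#[derive(Debug, Default, PartialEq)]
struct Decoded {
    row_groups: Vec<RowGroupMeta>,
    /// Row groups resolved immediately because the schema was already known.
    fast_path_hits: usize,
    /// Row groups that had to be buffered until the schema showed up.
    deferred: usize,
}

/// Pair each raw column value with its schema entry.
fn resolve(schema: &[String], raw: &[i64]) -> RowGroupMeta {
    RowGroupMeta {
        columns: schema.iter().cloned().zip(raw.iter().copied()).collect(),
    }
}

/// Decode fields in the order they were written. Returns `None` if no schema
/// was ever seen, since the row groups cannot be resolved without it.
fn decode(fields: &[Field]) -> Option<Decoded> {
    let mut schema: Option<&[String]> = None;
    // One slot per row group so the original row group order is preserved.
    let mut slots: Vec<Option<RowGroupMeta>> = Vec::new();
    // (slot index, raw data) for row groups seen before the schema.
    let mut pending: Vec<(usize, &[i64])> = Vec::new();
    let mut out = Decoded::default();

    for field in fields {
        match field {
            Field::Schema(cols) => schema = Some(cols.as_slice()),
            Field::RowGroup(raw) => match schema {
                // Fast path: schema already decoded, build the final struct now.
                Some(s) => {
                    slots.push(Some(resolve(s, raw)));
                    out.fast_path_hits += 1;
                }
                // Slow path: keep the intermediate form and resolve it later.
                None => {
                    pending.push((slots.len(), raw.as_slice()));
                    slots.push(None);
                    out.deferred += 1;
                }
            },
        }
    }

    let s = schema?;
    for (idx, raw) in pending {
        slots[idx] = Some(resolve(s, raw));
    }
    out.row_groups = slots.into_iter().flatten().collect();
    Some(out)
}
```

In the common case where the schema comes first, every row group takes the fast path and nothing is buffered; the deferred path only costs something for the unlikely out-of-order file.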