etseidl commented on PR #8797:
URL: https://github.com/apache/arrow-rs/pull/8797#issuecomment-3549738741
Circling back to the question about skipping the stats (https://github.com/apache/arrow-rs/pull/8797#pullrequestreview-3472757501), I've created several branches based off of this PR and #8714. I've implemented row group, column, and all chunk statistics skipping, all with and without the metadata index. Here are some preliminary benchmark numbers:

```
group                                               index                       skip                        skip_opt
-----                                               -----                       ----                        --------
decode metadata (wide) 10 columns                   1.00  19.2±0.61ms ? ?/sec   2.11  40.6±0.74ms ? ?/sec   1.77  34.0±0.51ms ? ?/sec
decode metadata (wide) 10 columns last row group    1.00   7.9±0.15ms ? ?/sec   3.75  29.7±0.33ms ? ?/sec   2.96  23.4±0.30ms ? ?/sec
decode metadata (wide) last row group               1.00  10.7±0.27ms ? ?/sec   2.81  29.9±0.50ms ? ?/sec   2.27  24.2±0.32ms ? ?/sec
decode metadata (wide) with schema                  1.05  47.5±2.15ms ? ?/sec   1.01  45.7±0.77ms ? ?/sec   1.00  45.3±0.62ms ? ?/sec
decode metadata (wide) with skip PES                1.00  33.2±1.04ms ? ?/sec   1.26  41.9±0.50ms ? ?/sec   1.15  38.0±0.45ms ? ?/sec
decode metadata (wide) with stats mask              1.03  43.9±0.71ms ? ?/sec   1.01  42.8±0.85ms ? ?/sec   1.00  42.5±0.57ms ? ?/sec
decode parquet metadata (wide)                      1.03  52.4±0.70ms ? ?/sec   1.04  52.5±1.05ms ? ?/sec   1.00  50.7±0.93ms ? ?/sec
```

"skip" uses the current `ThriftCompactInputProtocol::skip` code, which still has to parse the thrift data but doesn't materialize it. "skip_opt" uses an optimized version of `skip()` that handles nested thrift structures with a stack rather than recursion. Finally, "index" uses the index introduced in #8714 to enable selective decoding. For the "skip PES" bench I hacked the reader to skip all column metadata stats, not just the page encoding stats.

If a cached schema is used, the "10 columns last row group" case goes down to 2.3ms, which is over 20X faster than "decode parquet metadata (wide)" (52ms). For comparison, that benchmark was 221ms in 56.2.0 and was down to 90ms after the first phase of the remodel.
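To illustrate the stack-versus-recursion idea behind "skip_opt", here is a minimal sketch of skipping a Thrift compact-protocol struct with an explicit stack. It is not the code from either branch: `Frame`, `skip_struct`, `skip_value`, and the other helpers are hypothetical names made up for this example, errors are plain `String`s, and maps are left out for brevity.

```rust
/// Pending work for one level of nesting (hypothetical type for this sketch).
enum Frame {
    /// Inside a struct: keep reading field headers until the STOP byte.
    Struct,
    /// Inside a list or set: `remaining` elements of `elem_type` still to skip.
    List { elem_type: u8, remaining: usize },
}

fn next_byte(buf: &[u8], pos: &mut usize) -> Result<u8, String> {
    let byte = *buf.get(*pos).ok_or("unexpected end of input")?;
    *pos += 1;
    Ok(byte)
}

fn advance(buf: &[u8], pos: &mut usize, n: usize) -> Result<(), String> {
    let end = (*pos)
        .checked_add(n)
        .filter(|&e| e <= buf.len())
        .ok_or("unexpected end of input")?;
    *pos = end;
    Ok(())
}

/// Reads an unsigned LEB128 varint (zigzag decoding is not needed when skipping).
fn read_varint(buf: &[u8], pos: &mut usize) -> Result<u64, String> {
    let (mut result, mut shift) = (0u64, 0u32);
    loop {
        if shift >= 64 {
            return Err("varint too long".into());
        }
        let byte = next_byte(buf, pos)?;
        result |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Ok(result);
        }
        shift += 7;
    }
}

/// Skips one value. Scalars are consumed directly; containers only push a frame.
fn skip_value(ty: u8, in_collection: bool, buf: &[u8], pos: &mut usize,
              stack: &mut Vec<Frame>) -> Result<(), String> {
    match ty {
        // A bool field stores its value in the field header; a bool element takes one byte.
        1 | 2 => if in_collection { advance(buf, pos, 1)? },
        3 => advance(buf, pos, 1)?,              // i8
        4 | 5 | 6 => { read_varint(buf, pos)?; } // i16 / i32 / i64
        7 => advance(buf, pos, 8)?,              // double
        8 => {                                   // binary / string
            let len = read_varint(buf, pos)? as usize;
            advance(buf, pos, len)?;
        }
        9 | 10 => {                              // list / set
            let header = next_byte(buf, pos)?;
            let mut remaining = (header >> 4) as usize;
            if remaining == 15 {
                remaining = read_varint(buf, pos)? as usize;
            }
            stack.push(Frame::List { elem_type: header & 0x0f, remaining });
        }
        12 => stack.push(Frame::Struct),
        other => return Err(format!("unsupported type {other}")),
    }
    Ok(())
}

/// Skips the struct at the start of `buf`; returns the number of bytes consumed.
pub fn skip_struct(buf: &[u8]) -> Result<usize, String> {
    let mut pos = 0;
    let mut stack = vec![Frame::Struct];
    while let Some(frame) = stack.pop() {
        match frame {
            Frame::Struct => {
                let header = next_byte(buf, &mut pos)?;
                if header == 0 {
                    continue; // STOP: this struct is finished
                }
                if header >> 4 == 0 {
                    read_varint(buf, &mut pos)?; // long-form field id
                }
                stack.push(Frame::Struct); // come back for the next field
                skip_value(header & 0x0f, false, buf, &mut pos, &mut stack)?;
            }
            Frame::List { elem_type, remaining } => {
                if remaining == 0 {
                    continue; // list is finished
                }
                stack.push(Frame::List { elem_type, remaining: remaining - 1 });
                skip_value(elem_type, true, buf, &mut pos, &mut stack)?;
            }
        }
    }
    Ok(pos)
}
```

Each nested struct or list costs one `Vec` push instead of a function call, so nesting depth no longer grows the call stack. A real implementation would presumably batch fixed-width list elements rather than popping and pushing a frame per element.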
None of this is usable yet, but it does give some hope for speeding up point lookup queries with projection down the road.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
