etseidl commented on PR #8797:
URL: https://github.com/apache/arrow-rs/pull/8797#issuecomment-3549738741

   Circling back to the question about skipping the stats 
(https://github.com/apache/arrow-rs/pull/8797#pullrequestreview-3472757501), 
I've created several branches based off of this PR and #8714. I've implemented 
row group, column, and all chunk statistics skipping, all with and without the 
metadata index. Here are some preliminary benchmark numbers:
   ```
   group                                               index                                  skip                                   skip_opt
   -----                                               -----                                  ----                                   --------
   decode metadata (wide) 10 columns                   1.00     19.2±0.61ms        ? ?/sec    2.11     40.6±0.74ms        ? ?/sec    1.77     34.0±0.51ms        ? ?/sec
   decode metadata (wide) 10 columns last row group    1.00      7.9±0.15ms        ? ?/sec    3.75     29.7±0.33ms        ? ?/sec    2.96     23.4±0.30ms        ? ?/sec
   decode metadata (wide) last row group               1.00     10.7±0.27ms        ? ?/sec    2.81     29.9±0.50ms        ? ?/sec    2.27     24.2±0.32ms        ? ?/sec
   decode metadata (wide) with schema                  1.05     47.5±2.15ms        ? ?/sec    1.01     45.7±0.77ms        ? ?/sec    1.00     45.3±0.62ms        ? ?/sec
   decode metadata (wide) with skip PES                1.00     33.2±1.04ms        ? ?/sec    1.26     41.9±0.50ms        ? ?/sec    1.15     38.0±0.45ms        ? ?/sec
   decode metadata (wide) with stats mask              1.03     43.9±0.71ms        ? ?/sec    1.01     42.8±0.85ms        ? ?/sec    1.00     42.5±0.57ms        ? ?/sec
   decode parquet metadata (wide)                      1.03     52.4±0.70ms        ? ?/sec    1.04     52.5±1.05ms        ? ?/sec    1.00     50.7±0.93ms        ? ?/sec
   ```
   
   "skip" is using the current `ThriftCompactInputProtocol::skip` code, which 
still has to parse the thrift data, but doesn't materialize it. "skip_opt" uses 
an optimized version of `skip()` that uses a stack rather than recursion to 
handle nested thrift structures. Finally "index" uses the index introduced in 
#8714 to enable selective decoding. For the "skip PES" bench I hacked the 
reader to skip all column metadata stats, not just the page encoding stats.
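
   To illustrate the "stack rather than recursion" idea behind "skip_opt", here is a minimal sketch of skipping one Thrift compact-protocol struct with an explicit work stack. It is not the code from any of these branches: `Frame`, `begin_value`, and `skip_struct` are hypothetical names, and maps and UUIDs are left out to keep it short.

```rust
/// One container that is still open while skipping.
enum Frame {
    /// Inside a struct: read field headers until the STOP byte (0x00).
    Struct,
    /// Inside a list or set: `remaining` elements of `elem_type` are left.
    List { elem_type: u8, remaining: u64 },
}

fn read_byte(buf: &[u8], pos: &mut usize) -> Result<u8, String> {
    let b = *buf.get(*pos).ok_or("unexpected end of input")?;
    *pos += 1;
    Ok(b)
}

/// Reads a ULEB128 varint. Zigzag decoding is not needed just to skip it.
fn read_varint(buf: &[u8], pos: &mut usize) -> Result<u64, String> {
    let (mut value, mut shift) = (0u64, 0u32);
    loop {
        let b = read_byte(buf, pos)?;
        if shift >= 64 {
            return Err("varint too long".to_string());
        }
        value |= u64::from(b & 0x7f) << shift;
        if b & 0x80 == 0 {
            return Ok(value);
        }
        shift += 7;
    }
}

fn advance(buf: &[u8], pos: &mut usize, n: u64) -> Result<(), String> {
    if n > (buf.len() - *pos) as u64 {
        return Err("unexpected end of input".to_string());
    }
    *pos += n as usize;
    Ok(())
}

/// Skips a scalar in place, or pushes a frame for a nested container.
fn begin_value(ty: u8, buf: &[u8], pos: &mut usize, stack: &mut Vec<Frame>) -> Result<(), String> {
    match ty {
        1 | 2 => Ok(()),                                // bool field: value is in the header
        3 => advance(buf, pos, 1),                      // i8
        4 | 5 | 6 => read_varint(buf, pos).map(|_| ()), // i16 / i32 / i64
        7 => advance(buf, pos, 8),                      // double
        8 => {
            let len = read_varint(buf, pos)?;           // binary / string
            advance(buf, pos, len)
        }
        9 | 10 => {
            let header = read_byte(buf, pos)?;          // list / set header
            let remaining = match header >> 4 {
                0x0f => read_varint(buf, pos)?,         // long form: size follows
                n => u64::from(n),
            };
            stack.push(Frame::List { elem_type: header & 0x0f, remaining });
            Ok(())
        }
        12 => {
            stack.push(Frame::Struct);
            Ok(())
        }
        other => Err(format!("unsupported type {}", other)),
    }
}

/// Skips one struct starting at `buf[0]`; returns the number of bytes consumed.
pub fn skip_struct(buf: &[u8]) -> Result<usize, String> {
    let mut pos = 0;
    let mut stack = vec![Frame::Struct];
    while let Some(frame) = stack.pop() {
        match frame {
            Frame::Struct => {
                let header = read_byte(buf, &mut pos)?;
                if header == 0 {
                    continue; // STOP: this struct is finished
                }
                if header >> 4 == 0 {
                    read_varint(buf, &mut pos)?; // long-form field id
                }
                stack.push(Frame::Struct); // more fields may follow
                begin_value(header & 0x0f, buf, &mut pos, &mut stack)?;
            }
            Frame::List { elem_type, remaining } => {
                if remaining == 0 {
                    continue; // list is finished
                }
                stack.push(Frame::List { elem_type, remaining: remaining - 1 });
                if elem_type == 1 || elem_type == 2 {
                    advance(buf, &mut pos, 1)?; // bool list elements take one byte each
                } else {
                    begin_value(elem_type, buf, &mut pos, &mut stack)?;
                }
            }
        }
    }
    Ok(pos)
}
```

   Nesting depth here costs `Vec` capacity rather than call-stack frames, so deeply nested input cannot overflow the call stack. Every byte is still visited, which is consistent with "skip_opt" landing between "skip" and "index" in the numbers above.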
   
   If a cached schema is used, then the "10 columns last row group" case goes 
down to 2.3ms, which is over 20X faster than "decode parquet metadata (wide)" 
(52ms). For comparison, that benchmark was 221ms in 56.2.0, and was down to 
90ms after the first phase of the remodel.
   
   None of this is usable yet, but it does give some hope for speeding up point 
lookup queries with projection down the road.

