alamb opened a new issue, #5854:
URL: https://github.com/apache/arrow-rs/issues/5854

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Part of https://github.com/apache/arrow-rs/issues/5853
   
   Parsing the parquet metadata takes substantial time, and most of that time 
is spent decoding the thrift format (@XiangpengHao is quantifying this in 
https://github.com/apache/arrow-rs/issues/5770).
   
   
   **Describe the solution you'd like**
   Improve the thrift decoder speed
   
   **Describe alternatives you've considered**
   @jhorstmann reports in https://github.com/apache/arrow-rs/issues/5775 that 
he has built a prototype of this:
   
   I had an attack of "not invented here" syndrome the last few 
days 😅 and worked on an alternative code generator for thrift, that would allow 
me to more easily try out some changes to the generated code. The repo can be 
found at <https://github.com/jhorstmann/compact-thrift/> and the output for 
`parquet.thrift` at 
<https://github.com/jhorstmann/compact-thrift/blob/main/src/main/rust/tests/parquet.rs>.
   
   The current output is still doing allocations for string and binary, but 
running the benchmarks from 
<https://github.com/tustvold/arrow-rs/tree/thrift-bench> shows some nice 
improvements. This is the comparison with current arrow-rs code, so both 
versions should be doing the same amount of allocations:
   
   ```
   decode metadata      time:   [32.592 ms 32.645 ms 32.702 ms]
   
   decode metadata new  time:   [17.440 ms 17.476 ms 17.532 ms]
   ```
   
   So incidentally very close to that 2x improvement.
   
   The main difference in the code should be avoiding most of the abstractions 
from `TInputProtocol` and avoiding stack moves by directly writing into 
default-initialized structs instead of moving from local variables.
   
   _Originally posted by @jhorstmann in 
https://github.com/apache/arrow-rs/issues/5775#issuecomment-2131307588_
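
To make the two ideas in the quoted comment concrete, here is a minimal sketch of what such a decoder could look like. It is not taken from `compact-thrift` or from arrow-rs: `MiniMetadata`, `SliceReader`, `DecodeError` and `decode_mini_metadata` are hypothetical names, and the three-field struct is an invented example rather than the real parquet `FileMetaData` layout. It assumes the standard thrift compact encoding (field header byte with a 4-bit id delta and 4-bit type, zigzag varints for integers, length-prefixed binary). The reader works on a plain byte slice instead of going through a `TInputProtocol`-style abstraction, and each field is written straight into a default-initialized struct.

```rust
/// Hypothetical struct for illustration; NOT the real parquet `FileMetaData`.
#[derive(Debug, Default, PartialEq)]
struct MiniMetadata {
    version: i32,               // field id 1, compact type 5 (i32)
    num_rows: i64,              // field id 2, compact type 6 (i64)
    created_by: Option<String>, // field id 3, compact type 8 (binary)
}

#[derive(Debug, PartialEq)]
enum DecodeError {
    UnexpectedEof,
    InvalidVarint,
    InvalidUtf8,
    UnsupportedField(i16, u8),
}

/// Reads directly from a byte slice; no generic protocol layer in between.
struct SliceReader<'a> {
    buf: &'a [u8],
}

impl<'a> SliceReader<'a> {
    fn read_byte(&mut self) -> Result<u8, DecodeError> {
        let buf: &'a [u8] = self.buf;
        let (&b, rest) = buf.split_first().ok_or(DecodeError::UnexpectedEof)?;
        self.buf = rest;
        Ok(b)
    }

    /// Unsigned LEB128 varint.
    fn read_uvarint(&mut self) -> Result<u64, DecodeError> {
        let (mut value, mut shift) = (0u64, 0u32);
        loop {
            if shift >= 64 {
                return Err(DecodeError::InvalidVarint);
            }
            let b = self.read_byte()?;
            value |= u64::from(b & 0x7f) << shift;
            if b & 0x80 == 0 {
                return Ok(value);
            }
            shift += 7;
        }
    }

    /// Zigzag-encoded signed integer on top of the varint.
    fn read_zigzag(&mut self) -> Result<i64, DecodeError> {
        let n = self.read_uvarint()?;
        Ok((n >> 1) as i64 ^ -((n & 1) as i64))
    }

    /// Length-prefixed bytes, borrowed from the input without copying.
    fn read_bytes(&mut self) -> Result<&'a [u8], DecodeError> {
        let len = self.read_uvarint()? as usize;
        let buf: &'a [u8] = self.buf;
        if len > buf.len() {
            return Err(DecodeError::UnexpectedEof);
        }
        let (head, rest) = buf.split_at(len);
        self.buf = rest;
        Ok(head)
    }
}

fn decode_mini_metadata(input: &[u8]) -> Result<MiniMetadata, DecodeError> {
    let mut reader = SliceReader { buf: input };
    // Start from a default-initialized struct and assign fields in place,
    // rather than collecting locals and moving them into the struct at the end.
    let mut out = MiniMetadata::default();
    let mut last_id: i16 = 0;
    loop {
        let header = reader.read_byte()?;
        if header == 0 {
            return Ok(out); // STOP marker ends the struct
        }
        let (delta, ty) = (header >> 4, header & 0x0f);
        let id = if delta == 0 {
            reader.read_zigzag()? as i16 // long form: explicit field id
        } else {
            last_id.wrapping_add(i16::from(delta))
        };
        last_id = id;
        match (id, ty) {
            (1, 5) => out.version = reader.read_zigzag()? as i32,
            (2, 6) => out.num_rows = reader.read_zigzag()?,
            (3, 8) => {
                let s = std::str::from_utf8(reader.read_bytes()?)
                    .map_err(|_| DecodeError::InvalidUtf8)?;
                out.created_by = Some(s.to_owned()); // still allocates, as in the prototype
            }
            // Simplification: a real decoder would skip unknown fields instead.
            _ => return Err(DecodeError::UnsupportedField(id, ty)),
        }
    }
}

fn main() {
    // Hand-encoded input: version = 2, num_rows = 1000, created_by = "rs".
    let bytes = [0x15, 0x04, 0x16, 0xd0, 0x0f, 0x18, 0x02, b'r', b's', 0x00];
    println!("{:?}", decode_mini_metadata(&bytes));
}
```

Whether this shape reproduces the speedup measured above depends on the generated code and the compiler; the sketch only illustrates the structure described in the comment, not its performance.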
               
   
   **Additional context**
   

