XiangpengHao commented on issue #5855:
URL: https://github.com/apache/arrow-rs/issues/5855#issuecomment-2155014991

   Below is the flamegraph of decoding Parquet metadata; allocation itself 
does not show up as the bottleneck.
   
   The term "allocation" is ambiguous: it can refer to allocation operations, 
or it can mean an excessive memory footprint (and the effort to set it up). 
I believe the latter is the bottleneck.
   
   More formally, I believe the time is spent on two types of tasks: (1) 
decoding and interpreting the thrift data, which can be SIMD-accelerated, and 
(2) setting up the in-memory structure, i.e., inflating the 10MB metadata into 
a 100MB in-memory representation, which is solved by skipping columns/row groups.
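   To illustrate point (2), here is a minimal, hypothetical sketch (not the actual arrow-rs API; the struct and function names are made up for illustration) of why skipping columns helps: the expensive part is materializing an owned in-memory struct per column chunk, so decoding only the requested columns avoids most of that setup cost.

   ```rust
   /// Toy stand-in for a decoded column-chunk metadata entry. In real Parquet
   /// metadata each entry carries paths, encodings, statistics, etc., which is
   /// why the in-memory form can be ~10x larger than the serialized thrift.
   #[derive(Debug, Clone)]
   struct ColumnChunkMeta {
       path: String,
       num_values: i64,
       total_byte_size: i64,
   }

   /// Decode only the requested column indices from a pretend "serialized"
   /// form (one tuple per column chunk). Unselected entries are never
   /// materialized, so no owned allocations are made for them.
   fn decode_selected(
       serialized: &[(&str, i64, i64)],
       wanted: &[usize],
   ) -> Vec<ColumnChunkMeta> {
       wanted
           .iter()
           .filter_map(|&i| serialized.get(i))
           .map(|&(path, num_values, total_byte_size)| ColumnChunkMeta {
               // Owned allocation happens only for the kept columns.
               path: path.to_string(),
               num_values,
               total_byte_size,
           })
           .collect()
   }

   fn main() {
       // A wide table: 10_000 column chunks serialized, but the query
       // touches only two of them.
       let serialized: Vec<(&str, i64, i64)> =
           (0..10_000).map(|_| ("col", 1_000, 4_096)).collect();
       let selected = decode_selected(&serialized, &[0, 9_999]);
       assert_eq!(selected.len(), 2);
       println!(
           "materialized {} of {} column chunks",
           selected.len(),
           serialized.len()
       );
   }
   ```

   The same idea applies to row groups: if the reader knows up front which row groups a scan needs, the thrift for the rest can be skipped rather than inflated into structs.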
   
   
![wide_table_19_43_50](https://github.com/apache/arrow-rs/assets/6504314/810e98e3-ed64-4902-94d0-7e094071efac)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
