XiangpengHao commented on issue #5855: URL: https://github.com/apache/arrow-rs/issues/5855#issuecomment-2155014991
Below is the flamegraph of decoding parquet metadata; allocation itself does not show up as the bottleneck. The term "allocation" is ambiguous: it can refer to allocation operations, or to an excessive memory footprint (and the effort to set it up), and I believe the latter is the bottleneck. More formally, I believe the time is spent on two types of tasks:

1. Decoding and interpreting the thrift data, which can be SIMD accelerated.
2. Setting up the in-memory structure, i.e., inflating the 10MB metadata into 100MB of in-memory representation, which is solved by skipping columns/row groups.
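To make the second point concrete, here is a minimal sketch of what skipping row groups at the thrift level could look like. This is not an existing arrow-rs API: `decode_selected_row_groups` and `wanted_row_groups` are hypothetical names, and it assumes both the `parquet` and `thrift` crates as dependencies, with the thrift-generated `FileMetaData` from `parquet::format` and the `TSerializable` trait from `parquet::thrift` (exact signatures vary across versions). The thrift tree is still fully decoded; the savings come from never inflating the unwanted row groups into the much larger in-memory `ParquetMetaData`.

```rust
// Hypothetical sketch, not an existing arrow-rs API: prune row groups from
// the raw thrift metadata before building the (much larger) in-memory
// representation. `wanted_row_groups` is an illustrative parameter.
use parquet::format::FileMetaData as TFileMetaData;
use parquet::thrift::TSerializable;
use thrift::protocol::TCompactInputProtocol;

fn decode_selected_row_groups(
    footer_bytes: &[u8],         // thrift-compact encoded footer payload
    wanted_row_groups: &[usize], // indices of the row groups to keep
) -> thrift::Result<TFileMetaData> {
    // The full thrift tree is still decoded here; the savings come from
    // dropping unwanted row groups before the expensive inflation step.
    let mut prot = TCompactInputProtocol::new(footer_bytes);
    let mut t_meta = TFileMetaData::read_from_in_protocol(&mut prot)?;

    // Retain only the requested row groups.
    t_meta.row_groups = std::mem::take(&mut t_meta.row_groups)
        .into_iter()
        .enumerate()
        .filter(|(i, _)| wanted_row_groups.contains(i))
        .map(|(_, rg)| rg)
        .collect();

    Ok(t_meta)
}
```

A true fix for (1) would go further and avoid materializing the skipped entries during thrift decoding itself, but that requires changes inside the decoder rather than the post-filtering shown here.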
