XiangpengHao commented on issue #5855: URL: https://github.com/apache/arrow-rs/issues/5855#issuecomment-2155014991
Below is the flamegraph of decoding parquet metadata; allocation itself does not show up as the bottleneck. The term "allocation" is ambiguous: it can refer to allocation operations, or to an excessive memory footprint (and the effort to set it up), and I believe the latter is the bottleneck. More formally, I believe the time is spent on two types of tasks:

1. Decoding and interpreting the thrift data, which can be SIMD accelerated.
2. Setting up the in-memory structure, i.e., inflating the 10MB metadata into 100MB of in-memory representation, which is solved by skipping columns/row groups.
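To make the second point concrete, here is a minimal sketch of what skipping row groups at the thrift level could look like. This is not an existing arrow-rs API: `decode_selected_row_groups` and `wanted_row_groups` are hypothetical names, and it assumes both the `parquet` and `thrift` crates as dependencies, with the thrift-generated `FileMetaData` from `parquet::format` and the `TSerializable` trait from `parquet::thrift` (exact signatures vary across versions). The thrift tree is still fully decoded; the savings come from never inflating the unwanted row groups into the much larger in-memory `ParquetMetaData`.

```rust
// Hypothetical sketch, not an existing arrow-rs API: prune row groups from
// the raw thrift metadata before building the (much larger) in-memory
// representation. `wanted_row_groups` is an illustrative parameter.
use parquet::format::FileMetaData as TFileMetaData;
use parquet::thrift::TSerializable;
use thrift::protocol::TCompactInputProtocol;

fn decode_selected_row_groups(
    footer_bytes: &[u8],         // thrift-compact encoded footer payload
    wanted_row_groups: &[usize], // indices of the row groups to keep
) -> thrift::Result<TFileMetaData> {
    // The full thrift tree is still decoded here; the savings come from
    // dropping unwanted row groups before the expensive inflation step.
    let mut prot = TCompactInputProtocol::new(footer_bytes);
    let mut t_meta = TFileMetaData::read_from_in_protocol(&mut prot)?;

    // Retain only the requested row groups.
    t_meta.row_groups = std::mem::take(&mut t_meta.row_groups)
        .into_iter()
        .enumerate()
        .filter(|(i, _)| wanted_row_groups.contains(i))
        .map(|(_, rg)| rg)
        .collect();

    Ok(t_meta)
}
```

A true fix for (1) would go further and avoid materializing the skipped entries during thrift decoding itself, but that requires changes inside the decoder rather than the post-filtering shown here.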
