[I] Improve performance of extracting statistics from parquet files [datafusion]

via GitHub Wed, 22 May 2024 12:45:18 -0700


alamb opened a new issue, #10626:
URL: https://github.com/apache/datafusion/issues/10626


   ### Is your feature request related to a problem or challenge?
   
   Part of https://github.com/apache/datafusion/issues/10453
   
   @Lordworms added a benchmark for extracting statistics from parquet files in 
https://github.com/apache/datafusion/pull/10610
   
   Since this code extracts statistics from parquet files, we would 
like to make sure it is efficient, especially if we are going to extract 
statistics for many files at once.
   
   The idea here is to improve the speed of the statistics extraction
   
   
   
   ### Describe the solution you'd like
   
   Make this benchmark run faster:
   
   ```shell
   cargo bench --bench parquet_statistic
   ```
   
   
   
   ### Describe alternatives you've considered
   
    I did some brief profiling:
   
   ![Screenshot 2024-05-22 at 3 37 30 
PM](https://github.com/apache/datafusion/assets/490673/c53c5a1d-2d06-4d13-bd87-e5d6e51ccb49)
   
   I think the key would be to change these loops so they build the required 
Arrow arrays directly from primitive values rather than from `ScalarValue`:
   
   
https://github.com/apache/datafusion/blob/1bf7112171fd820c101e325822dc4d44dd65b2ff/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L183-L189
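
To illustrate the idea, here is a minimal, std-only sketch and not the actual DataFusion code. `Scalar` and `Int64Column` are hypothetical stand-ins for `ScalarValue` and an Arrow `Int64Array`, and `mins_via_scalars` / `mins_direct` are made-up names. The first function wraps every per-row-group minimum in an enum and then unwraps it again; the second pushes the primitive values straight into the column. In the real code the direct path would presumably use the Arrow array builders or `FromIterator` implementations instead.

```rust
/// Hypothetical stand-in for a dynamically typed scalar such as `ScalarValue`.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int64(Option<i64>),
    Utf8(Option<String>),
}

/// Hypothetical stand-in for a nullable Int64 array:
/// contiguous values plus one validity flag per slot.
#[derive(Debug, Default, PartialEq)]
struct Int64Column {
    values: Vec<i64>,
    validity: Vec<bool>,
}

impl Int64Column {
    fn with_capacity(n: usize) -> Self {
        Self {
            values: Vec::with_capacity(n),
            validity: Vec::with_capacity(n),
        }
    }

    /// Append one slot; nulls store a placeholder 0 and are marked invalid.
    fn push(&mut self, v: Option<i64>) {
        self.values.push(v.unwrap_or_default());
        self.validity.push(v.is_some());
    }
}

/// Indirect path: wrap each statistic in a `Scalar`, then match on the
/// variant again while building the column (extra allocation and branching).
fn mins_via_scalars(mins: &[Option<i64>]) -> Result<Int64Column, String> {
    let scalars: Vec<Scalar> = mins.iter().map(|m| Scalar::Int64(*m)).collect();
    let mut col = Int64Column::with_capacity(scalars.len());
    for s in scalars {
        match s {
            Scalar::Int64(v) => col.push(v),
            other => return Err(format!("unexpected scalar: {other:?}")),
        }
    }
    Ok(col)
}

/// Direct path: the type is known up front, so push primitives directly.
fn mins_direct(mins: &[Option<i64>]) -> Int64Column {
    let mut col = Int64Column::with_capacity(mins.len());
    for m in mins {
        col.push(*m);
    }
    col
}
```

Both functions produce the same column for the same input; the direct version simply skips the intermediate `Vec<Scalar>` and the per-value variant check, which is the kind of overhead the profile above points at.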
   
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

