hengfeiyang opened a new issue, #7573:
URL: https://github.com/apache/arrow-datafusion/issues/7573

   ### Is your feature request related to a problem or challenge?
   
   When I search data on S3, I found that DataFusion fetches Parquet file metadata one file at a time, which is a bit slow when I have many files.
   
   The code is here:
   
   
https://github.com/apache/arrow-datafusion/blob/31.0.0/datafusion/core/src/datasource/listing/table.rs#L960C1-L985
   
   I found that this code uses the stream's `then` combinator, so it fetches the metadata for each file sequentially.
   
   But I found something here:
   
   
https://github.com/apache/arrow-datafusion/blob/31.0.0/datafusion/core/src/datasource/file_format/parquet.rs#L170-L175
   
   When inferring schemas, it uses concurrent requests:
   
   ```Rust
   iter::map().boxed().buffered(SCHEMA_INFERENCE_CONCURRENCY)
   ```
   
   Is it possible to do the same thing here, i.e. use concurrent requests for collecting statistics?
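   For illustration, here is a minimal, self-contained sketch of that `buffered` pattern (assuming the `futures` crate; `fetch_len` is a hypothetical stand-in for an async metadata fetch such as `infer_stats`):
   
   ```Rust
   use futures::stream::{self, StreamExt};
   
   // Hypothetical stand-in for an async per-file fetch.
   async fn fetch_len(i: u32) -> usize {
       format!("file-{i}").len()
   }
   
   fn main() {
       let results: Vec<usize> = futures::executor::block_on(
           stream::iter(0..8u32)
               .map(fetch_len) // build the futures without awaiting them yet
               .buffered(4)    // poll up to 4 at once, results keep input order
               .collect(),
       );
       assert_eq!(results, vec![6; 8]); // every "file-N" name is 6 bytes
   }
   ```
   
   The key difference from `then` is that `map` produces a stream of futures, and `buffered(n)` keeps up to `n` of them in flight while still yielding results in input order.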
   
   ### Describe the solution you'd like
   
   Actually, I tried changing the code:
   
   
https://github.com/apache/arrow-datafusion/blob/31.0.0/datafusion/core/src/datasource/listing/table.rs#L960C1-L985
   
   ```Rust
        let files = file_list.then(|part_file| async {
            let part_file = part_file?;
            let statistics = if self.options.collect_stat {
                match self.collected_statistics.get(&part_file.object_meta) {
                    Some(statistics) => statistics,
                    None => {
                        let statistics = self
                            .options
                            .format
                            .infer_stats(
                                ctx,
                                &store,
                                self.file_schema.clone(),
                                &part_file.object_meta,
                            )
                            .await?;
                        self.collected_statistics
                            .save(part_file.object_meta.clone(), statistics.clone());
                        statistics
                    }
                }
            } else {
                Statistics::default()
            };
            Ok((part_file, statistics)) as Result<(PartitionedFile, Statistics)>
        });
   ```
   
   To this:
   
   ```Rust
        let files = file_list
            .map(|part_file| async {
                let part_file = part_file?;
                let statistics = if self.options.collect_stat {
                    match self.collected_statistics.get(&part_file.object_meta) {
                        Some(statistics) => statistics,
                        None => {
                            let statistics = self
                                .options
                                .format
                                .infer_stats(
                                    ctx,
                                    &store,
                                    self.file_schema.clone(),
                                    &part_file.object_meta,
                                )
                                .await?;
                            self.collected_statistics
                                .save(part_file.object_meta.clone(), statistics.clone());
                            statistics
                        }
                    }
                } else {
                    Statistics::default()
                };
                Ok((part_file, statistics)) as Result<(PartitionedFile, Statistics)>
            })
            .boxed()
            .buffered(COLLECT_STATISTICS_CONCURRENCY);
   ```
   
   And set a const variable:
   
   ```Rust
   const COLLECT_STATISTICS_CONCURRENCY: usize = 32;
   ```
   
   The search speed is much improved in my local testing, because statistics are now collected from the Parquet files concurrently instead of one request at a time.
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
