korowa commented on code in PR #5057:
URL: https://github.com/apache/arrow-datafusion/pull/5057#discussion_r1088883601


##########
datafusion/core/src/physical_plan/file_format/parquet.rs:
##########
@@ -232,6 +241,74 @@ impl ParquetExec {
         self.enable_page_index
             .unwrap_or(config_options.execution.parquet.enable_page_index)
     }
+
+    /// Redistribute files across partitions according to their size
+    pub fn get_repartitioned(&self, target_partitions: usize) -> Self {
+        // Perform redistribution only if all files should be read from beginning to end
+        let has_ranges = self
+            .base_config()
+            .file_groups
+            .iter()
+            .flatten()
+            .any(|f| f.range.is_some());
+        if has_ranges {
+            return self.clone();
+        }
+
+        let total_size = self
+            .base_config()
+            .file_groups
+            .iter()
+            .flatten()
+            .map(|f| f.object_meta.size as i64)
+            .sum::<i64>();
+        let target_partition_size =

Review Comment:
   Sounds reasonable, but it seems to me that the same thing can be achieved via `target_partitions` & `parallel_file_scan` when the file layout is known before running the query. Is this setting intended as a kind of fuse for running queries over arbitrary files (i.e. when neither the number of files nor their sizes is known in advance)?
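
   For reference, here is a minimal standalone sketch of the size-based splitting the diff begins (the truncated `target_partition_size` computation onward). Names like `repartition_by_size` and the `(name, size)` tuples are illustrative only, not the PR's actual API: each partition is filled up to roughly `total_size / target_partitions` bytes, splitting a file into byte ranges when it straddles a partition boundary.

```rust
/// Sketch: split `files` ((name, size) pairs) into `target_partitions`
/// groups of byte ranges so each partition reads roughly
/// `total_size / target_partitions` bytes. Output is per-partition
/// (name, range_start, range_end) triples. Hypothetical helper, not the
/// PR's `get_repartitioned` implementation.
fn repartition_by_size(
    files: &[(&str, u64)],
    target_partitions: u64,
) -> Vec<Vec<(String, u64, u64)>> {
    let total_size: u64 = files.iter().map(|(_, s)| s).sum();
    // Ceiling division so partitions cover all bytes without overflowing
    // into a nonexistent extra partition.
    let target = (total_size + target_partitions - 1) / target_partitions;

    let mut partitions = vec![Vec::new(); target_partitions as usize];
    let mut part = 0usize; // current partition being filled
    let mut filled = 0u64; // bytes already assigned to that partition

    for (name, size) in files {
        let mut offset = 0u64;
        while offset < *size {
            // Take as much of the file as fits in the current partition.
            let take = (target - filled).min(size - offset);
            partitions[part].push((name.to_string(), offset, offset + take));
            offset += take;
            filled += take;
            // Move to the next partition once this one reaches its quota,
            // unless we are already in the last one.
            if filled == target && part + 1 < target_partitions as usize {
                part += 1;
                filled = 0;
            }
        }
    }
    partitions
}

fn main() {
    // Two 100-byte files spread over 4 partitions: each partition gets
    // a 50-byte range, so both files are split in half.
    let parts = repartition_by_size(&[("a.parquet", 100), ("b.parquet", 100)], 4);
    for (i, p) in parts.iter().enumerate() {
        println!("partition {i}: {p:?}");
    }
}
```

   This also shows why the `has_ranges` early return above matters: the splitting logic assumes every file is read whole, so pre-existing ranges would be silently re-split.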



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to