[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #5057: Parquet parallel scan

via GitHub Fri, 27 Jan 2023 03:06:19 -0800


tustvold commented on code in PR #5057:
URL: https://github.com/apache/arrow-datafusion/pull/5057#discussion_r1088827017



##########
datafusion/core/src/physical_optimizer/repartition.rs:
##########
@@ -686,24 +729,6 @@ mod tests {
         Ok(())
     }
 

Review Comment:
   Why has this been removed?



##########
datafusion/core/src/physical_optimizer/repartition.rs:
##########
@@ -846,6 +871,182 @@ mod tests {
         Ok(())
     }
 
+    #[test]

Review Comment:
   Love the test coverage



##########
datafusion/core/src/physical_plan/file_format/parquet.rs:
##########
@@ -232,6 +241,74 @@ impl ParquetExec {
         self.enable_page_index
             .unwrap_or(config_options.execution.parquet.enable_page_index)
     }
+
+    /// Redistribute files across partitions according to their size
+    pub fn get_repartitioned(&self, target_partitions: usize) -> Self {
+        // Perform redistribution only in case all files should be read from 
beginning to end
+        let has_ranges = self
+            .base_config()
+            .file_groups
+            .iter()
+            .flatten()
+            .any(|f| f.range.is_some());
+        if has_ranges {
+            return self.clone();
+        }
+
+        let total_size = self
+            .base_config()
+            .file_groups
+            .iter()
+            .flatten()
+            .map(|f| f.object_meta.size as i64)
+            .sum::<i64>();
+        let target_partition_size =

Review Comment:
   I think we should probably have a minimum file range, anything less than 
10MB is unlikely to yield additional parallelism, as row groups shouldn't be 
this small, and adding more partitions is not necessarily free.
   
   We could make this configurable



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #5057: Parquet parallel scan

Reply via email to