korowa commented on code in PR #5057:
URL: https://github.com/apache/arrow-datafusion/pull/5057#discussion_r1089232836
##########
datafusion/core/src/physical_plan/file_format/parquet.rs:
##########
@@ -232,6 +241,74 @@ impl ParquetExec {
self.enable_page_index
.unwrap_or(config_options.execution.parquet.enable_page_index)
}
+
+ /// Redistribute files across partitions according to their size
+ pub fn get_repartitioned(&self, target_partitions: usize) -> Self {
+ // Perform redistribution only in case all files should be read from
beginning to end
+ let has_ranges = self
+ .base_config()
+ .file_groups
+ .iter()
+ .flatten()
+ .any(|f| f.range.is_some());
+ if has_ranges {
+ return self.clone();
+ }
+
+ let total_size = self
+ .base_config()
+ .file_groups
+ .iter()
+ .flatten()
+ .map(|f| f.object_meta.size as i64)
+ .sum::<i64>();
+ let target_partition_size =
Review Comment:
After some thoughts, I see that it could be just more user-friendly -- cause
this setting will care about preventing unnecessary file splits without any
additional parameters tweaking efforts.
I've also moved both of these new settings from `execution` to `optimizer`
section.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]