Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

via GitHub Sun, 30 Mar 2025 21:46:39 -0700


suremarc commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020364348



##########
datafusion/datasource/src/file_scan_config.rs:
##########
@@ -575,6 +575,95 @@ impl FileScanConfig {
         })
     }
 
+    /// Splits file groups into new groups based on statistics to enable efficient parallel processing.
+    ///
+    /// The method distributes files across a target number of partitions while ensuring
+    /// files within each partition maintain sort order based on their min/max statistics.
+    ///
+    /// The algorithm works by:
+    /// 1. Sorting all files by their minimum values
+    /// 2. Trying to place each file into an existing group where it can maintain sort order
+    /// 3. Creating new groups when necessary if a file cannot fit into existing groups
+    /// 4. Prioritizing smaller groups when multiple suitable groups exist (for load balancing)
+    ///
+    /// # Parameters
+    /// * `table_schema`: Schema containing information about the columns
+    /// * `file_groups`: The original file groups to split
+    /// * `sort_order`: The lexicographical ordering to maintain within each group
+    /// * `target_partitions`: The desired number of output partitions
+    ///
+    /// # Returns
+    /// A new set of file groups, where files within each group are non-overlapping with respect to
+    /// their min/max statistics and maintain the specified sort order.
+    pub fn split_groups_by_statistics_v2(

Review Comment:
   Perhaps we could call it `split_groups_by_statistics_with_target_partitions`?
   
   TBH I am not sure if anyone is using the old code, so I would wager it is safe to replace it with the new implementation. But I agree the old one is probably more useful in certain scenarios, e.g. if you are doing a sort merge above it.
   
   If we were to keep it, I would rather unify the implementations, since the only thing that differs is the policy for selecting the group to insert into. I think we could probably abstract that out into an enum or a generic parameter. (I'm not really sure how common generics are in DataFusion, though.)
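
To make the enum idea concrete, here is a rough, self-contained sketch. It is not the PR's code: `FileRange`, `GroupSelection`, and `split_by_statistics` are hypothetical names, a file's statistics are reduced to a single `i64` min/max pair on the sort key, and pre-allocating `target_partitions` empty groups is an assumption made so that the load-balancing policy has something to balance across.

```rust
/// Hypothetical stand-in for a file's min/max statistics on the sort key.
#[derive(Debug, Clone, Copy, PartialEq)]
struct FileRange {
    min: i64,
    max: i64,
}

/// Hypothetical policy for choosing among the groups that can accept a file.
#[derive(Debug, Clone, Copy)]
enum GroupSelection {
    /// Take the first group that fits (packs files into as few groups as possible).
    FirstFit,
    /// Take the fitting group with the fewest files (load balancing).
    SmallestGroup,
}

/// Sketch of the shared algorithm; only the selection step depends on `policy`.
fn split_by_statistics(
    mut files: Vec<FileRange>,
    policy: GroupSelection,
    target_partitions: usize,
) -> Vec<Vec<FileRange>> {
    // 1. Sort all files by their minimum value.
    files.sort_by_key(|f| f.min);

    // Assumption: start with `target_partitions` empty groups.
    let mut groups: Vec<Vec<FileRange>> = vec![Vec::new(); target_partitions];

    for file in files {
        let chosen = {
            // 2. A group fits if it is empty or its last file ends before this one starts.
            let mut fitting = groups
                .iter()
                .enumerate()
                .filter(|(_, g)| g.last().map_or(true, |last| last.max < file.min));
            // 4. The policy is the only thing that differs between the two variants.
            match policy {
                GroupSelection::FirstFit => fitting.next().map(|(i, _)| i),
                GroupSelection::SmallestGroup => {
                    fitting.min_by_key(|(_, g)| g.len()).map(|(i, _)| i)
                }
            }
        };
        match chosen {
            Some(i) => groups[i].push(file),
            // 3. No existing group can keep the sort order, so open a new one.
            None => groups.push(vec![file]),
        }
    }

    // Drop groups that never received a file.
    groups.retain(|g| !g.is_empty());
    groups
}
```

With four non-overlapping files and `target_partitions = 2`, `FirstFit` in this sketch puts everything into a single group, while `SmallestGroup` spreads the files over two groups of two. A generic parameter (a trait with one selection method) would work the same way; the enum just keeps the signature simpler.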



