NGA-TRAN commented on code in PR #9593:
URL: https://github.com/apache/arrow-datafusion/pull/9593#discussion_r1534510334


##########
datafusion/core/src/datasource/physical_plan/file_scan_config.rs:
##########
@@ -762,6 +836,171 @@ mod tests {
         assert_eq!(projection.fields(), schema.fields());
     }
 
+    #[test]
+    fn test_sort_file_groups() -> Result<()> {
+        use chrono::TimeZone;
+        use datafusion_common::DFSchema;
+        use datafusion_expr::execution_props::ExecutionProps;
+        use object_store::{path::Path, ObjectMeta};
+
+        struct File {
+            name: &'static str,
+            date: &'static str,
+            statistics: Vec<Option<(f64, f64)>>,
+        }
+        impl File {
+            fn new(
+                name: &'static str,
+                date: &'static str,
+                statistics: Vec<Option<(f64, f64)>>,
+            ) -> Self {
+                Self {
+                    name,
+                    date,
+                    statistics,
+                }
+            }
+        }
+
+        struct TestCase {
+            #[allow(unused)]
+            file_schema: Schema,
+            files: Vec<File>,
+            sort: Vec<datafusion_expr::Expr>,
+            expected_result: Result<Vec<Vec<usize>>, &'static str>,
+        }
+
+        use datafusion_expr::col;
+        let cases = vec![
+            TestCase {
+                file_schema: Schema::new(vec![Field::new(
+                    "value".to_string(),
+                    DataType::Float64,
+                    false,
+                )]),
+                files: vec![
+                    File::new("0", "2023-01-01", vec![Some((0.00, 0.49))]),
+                    File::new("1", "2023-01-01", vec![Some((0.50, 1.00))]),
+                    File::new("2", "2023-01-02", vec![Some((0.00, 1.00))]),
+                ],
+                sort: vec![col("value").sort(true, false)],
+                expected_result: Ok(vec![vec![0, 1], vec![2]]),
+            },

Review Comment:
   Can we add more tests here?
   1. The same input but "2" is now in the middle to ensure we sort and group 
them correctly
   2. All three non-overlapped files
   3. All three overlapped files
   4. Empty input



##########
datafusion/core/src/datasource/physical_plan/file_scan_config.rs:
##########
@@ -194,6 +203,71 @@ impl FileScanConfig {
             .with_repartition_file_min_size(repartition_file_min_size)
             .repartition_file_groups(&file_groups)
     }
+
+    /// Attempts to do a bin-packing on files into file groups, such that any 
two files
+    /// in a file group are ordered and non-overlapping with respect to their 
statistics.
+    /// It will produce the smallest number of file groups possible.

Review Comment:
   I understand the purpose of this is to split and group files.  Files in each 
group is sorted and not overlapped. If there are many groups, each of them must 
overlap with at least one other.
   
   Is this the first step for following PRs to do something with these sorted 
file groups? Like files in each group will be read sequentially without going 
thru  SortPreservingMerge; and then data streams of different groups will go 
thru SortPreservingMerge?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to