NGA-TRAN commented on code in PR #9593:
URL: https://github.com/apache/arrow-datafusion/pull/9593#discussion_r1534510334
##########
datafusion/core/src/datasource/physical_plan/file_scan_config.rs:
##########
@@ -762,6 +836,171 @@ mod tests {
assert_eq!(projection.fields(), schema.fields());
}
+ #[test]
+ fn test_sort_file_groups() -> Result<()> {
+ use chrono::TimeZone;
+ use datafusion_common::DFSchema;
+ use datafusion_expr::execution_props::ExecutionProps;
+ use object_store::{path::Path, ObjectMeta};
+
+ struct File {
+ name: &'static str,
+ date: &'static str,
+ statistics: Vec<Option<(f64, f64)>>,
+ }
+ impl File {
+ fn new(
+ name: &'static str,
+ date: &'static str,
+ statistics: Vec<Option<(f64, f64)>>,
+ ) -> Self {
+ Self {
+ name,
+ date,
+ statistics,
+ }
+ }
+ }
+
+ struct TestCase {
+ #[allow(unused)]
+ file_schema: Schema,
+ files: Vec<File>,
+ sort: Vec<datafusion_expr::Expr>,
+ expected_result: Result<Vec<Vec<usize>>, &'static str>,
+ }
+
+ use datafusion_expr::col;
+ let cases = vec![
+ TestCase {
+ file_schema: Schema::new(vec![Field::new(
+ "value".to_string(),
+ DataType::Float64,
+ false,
+ )]),
+ files: vec![
+ File::new("0", "2023-01-01", vec![Some((0.00, 0.49))]),
+ File::new("1", "2023-01-01", vec![Some((0.50, 1.00))]),
+ File::new("2", "2023-01-02", vec![Some((0.00, 1.00))]),
+ ],
+ sort: vec![col("value").sort(true, false)],
+ expected_result: Ok(vec![vec![0, 1], vec![2]]),
+ },
Review Comment:
Can we add more tests here?
1. The same input, but with file "2" in the middle, to ensure we sort and
group the files correctly
2. All three files non-overlapping
3. All three files overlapping
4. Empty input
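
To make the expected outcomes concrete, here is a rough sketch of the four scenarios as `(min, max)` ranges, together with the number of groups each should produce. `min_groups_needed` and `suggested_cases` are hypothetical names that are not in the PR, the ranges in scenarios 2 and 3 are made up for illustration, and the sketch assumes that files sharing an endpoint count as overlapping.

```rust
/// Hypothetical helper (not part of the PR): the smallest number of file
/// groups that any valid grouping needs, i.e. the largest number of files
/// whose (min, max) ranges all contain one common value.
/// Assumption: ranges are closed, so files that share an endpoint overlap.
fn min_groups_needed(ranges: &[(f64, f64)]) -> usize {
    ranges
        .iter()
        .map(|&(start, _)| {
            // Count the files whose range contains this file's minimum.
            ranges
                .iter()
                .filter(|&&(min, max)| min <= start && start <= max)
                .count()
        })
        .max()
        .unwrap_or(0)
}

/// The four suggested scenarios as (label, ranges, expected group count).
/// Only scenario 1 reuses the ranges from the existing test case.
fn suggested_cases() -> Vec<(&'static str, Vec<(f64, f64)>, usize)> {
    vec![
        ("file 2 in the middle", vec![(0.00, 0.49), (0.00, 1.00), (0.50, 1.00)], 2),
        ("no overlap", vec![(0.00, 0.49), (0.50, 0.99), (1.00, 1.50)], 1),
        ("all overlap", vec![(0.00, 1.00), (0.00, 1.00), (0.00, 1.00)], 3),
        ("empty input", vec![], 0),
    ]
}
```

The group count is only a lower bound check. The real tests would still need to assert the exact file indices in each group, as `expected_result` does.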
##########
datafusion/core/src/datasource/physical_plan/file_scan_config.rs:
##########
@@ -194,6 +203,71 @@ impl FileScanConfig {
.with_repartition_file_min_size(repartition_file_min_size)
.repartition_file_groups(&file_groups)
}
+
+ /// Attempts to do a bin-packing on files into file groups, such that any two files
+ /// in a file group are ordered and non-overlapping with respect to their statistics.
+ /// It will produce the smallest number of file groups possible.
Review Comment:
I understand the purpose of this is to split and group files. The files in each
group are sorted and do not overlap. If there are multiple groups, each of them
must overlap with at least one other.
Is this the first step for follow-up PRs to do something with these sorted
file groups? For example, files in each group would be read sequentially without
going through SortPreservingMerge, and then the data streams of the different
groups would go through SortPreservingMerge?
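
For reference, this is how I read the grouping described in the doc comment, as a minimal sketch rather than the PR's actual code. `group_non_overlapping` is a hypothetical name, files are reduced to the `(min, max)` statistics of a single ascending sort column, and first-fit placement is my assumption about how a group is chosen.

```rust
/// Hypothetical helper, not the PR's implementation: packs files, given as
/// (min, max) ranges of the sort column, into groups whose files are ordered
/// and non-overlapping. Returns the input indices of the files in each group.
fn group_non_overlapping(ranges: &[(f64, f64)]) -> Vec<Vec<usize>> {
    // Visit files in ascending order of their minimum value.
    // `sort_by` is stable, so ties keep their input order.
    let mut order: Vec<usize> = (0..ranges.len()).collect();
    order.sort_by(|&a, &b| ranges[a].0.total_cmp(&ranges[b].0));

    let mut groups: Vec<Vec<usize>> = Vec::new();
    for idx in order {
        let (min, _) = ranges[idx];
        // First fit (an assumption): reuse the first group whose last file
        // ends strictly before this file starts.
        let slot = groups.iter_mut().find(|group| {
            let last = *group.last().expect("groups are never empty");
            ranges[last].1 < min
        });
        match slot {
            Some(group) => group.push(idx),
            // Every existing group overlaps this file, so open a new one.
            None => groups.push(vec![idx]),
        }
    }
    groups
}
```

With the ranges from the first test case, `[(0.00, 0.49), (0.50, 1.00), (0.00, 1.00)]`, this returns `[[0, 1], [2]]`, which matches `expected_result`. Because files are visited in order of their minimum, a new group is opened only when the file overlaps the last file of every existing group, which is why the number of groups stays minimal.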