Re: [PR] initial attempt at implementation [datafusion]

via GitHub Thu, 07 Nov 2024 10:20:35 -0800


suremarc commented on code in PR #13296:
URL: https://github.com/apache/datafusion/pull/13296#discussion_r1833111049



##########
datafusion/core/src/datasource/physical_plan/file_scan_config.rs:
##########
@@ -310,22 +310,12 @@ impl FileScanConfig {
         sort_order: &LexOrdering,
     ) -> Result<Vec<Vec<PartitionedFile>>> {
         let flattened_files = file_groups.iter().flatten().collect::<Vec<_>>();
-        // First Fit:
-        // * Choose the first file group that a file can be placed into.
-        // * If it fits into no existing file groups, create a new one.
-        //
-        // By sorting files by min values and then applying first-fit bin 
packing,
-        // we can produce the smallest number of file groups such that
-        // files within a group are in order and non-overlapping.
-        //
-        // Source: Applied Combinatorics (Keller and Trotter), Chapter 6.8
-        // 
https://www.appliedcombinatorics.org/book/s_posets_dilworth-intord.html

Review Comment:
   I moved this and the relevant code into a new method, 
`MinMaxStatistics::first_fit`



##########
datafusion/physical-plan/src/statistics.rs:
##########
@@ -131,14 +95,35 @@ impl MinMaxStatistics {
                 .collect::<Vec<_>>(),
         };
 
+        let projected_statistics = statistics
+            .into_iter()
+            .cloned()
+            .map(|s| s.project(Some(&projection)))
+            .collect::<Vec<_>>();
+
+        // Helper function to get min/max statistics
+        let get_min_max = |i: usize| -> Result<(Vec<ScalarValue>, 
Vec<ScalarValue>)> {
+            Ok(projected_statistics
+                .iter()

Review Comment:
   I moved this code later so it uses the _projected_ statistics, i.e. it only 
relies on stats for sorting columns. I was seeing the code error because some 
columns had unknown statistics. 



##########
datafusion/physical-plan/src/execution_plan.rs:
##########
@@ -397,6 +397,13 @@ pub trait ExecutionPlan: Debug + DisplayAs + Send + Sync {
         Ok(Statistics::new_unknown(&self.schema()))
     }
 
+    fn statistics_by_partition(&self) -> Result<Vec<Statistics>> {
+        Ok(vec![
+            self.statistics()?;
+            self.properties().partitioning.partition_count()
+        ])
+    }
+

Review Comment:
   As stated in the PR description, this is what a proposed API would look like 
for statistics by partition, though it is certainly not final. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] initial attempt at implementation [datafusion]

Reply via email to