[GitHub] [arrow-datafusion] jackwener commented on a diff in pull request #7544: [MINOR]: Unknown input statistics in FilterExec

via GitHub Wed, 13 Sep 2023 10:49:33 -0700


jackwener commented on code in PR #7544:
URL: https://github.com/apache/arrow-datafusion/pull/7544#discussion_r1324866471



##########
datafusion/common/src/stats.rs:
##########
@@ -70,3 +72,29 @@ pub struct ColumnStatistics {
     /// Number of distinct values
     pub distinct_count: Option<usize>,
 }
+
+impl ColumnStatistics {
+    /// Returns the [`Vec<ColumnStatistics>`] corresponding to the given schema by assigning
+    /// infinite bounds to each column in the schema. This is useful when even the input statistics
+    /// are not known, as the current executor can shrink the bounds of some columns.
+    pub fn new_with_unbounded_columns(schema: SchemaRef) -> Vec<ColumnStatistics> {
+        let data_types = schema
+            .fields()
+            .iter()
+            .map(|field| field.data_type())
+            .collect::<Vec<_>>();
+
+        data_types
+            .into_iter()
+            .map(|data_type| {
+                let dt = ScalarValue::try_from(data_type.clone()).ok();
+                ColumnStatistics {
+                    null_count: None,
+                    max_value: dt.clone(),
+                    min_value: dt,

Review Comment:
   🤔 If I understand correctly, `min_value`/`max_value` are set to `ScalarValue<XXType<None>>` so that the column's data type can still be obtained. Is that right?
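
To make the question concrete, here is a minimal sketch of the idea being asked about: a null scalar that holds no value but still records the column's type through its enum variant. `DataType`, `Scalar`, `typed_null`, `data_type`, and `is_null` below are simplified, hypothetical stand-ins written for illustration. They are not DataFusion's actual definitions, and the sketch takes no position on how the PR interprets such a value as a bound.

```rust
// Hypothetical, simplified stand-ins for Arrow's `DataType` and
// DataFusion's `ScalarValue`. Illustration only, not the real types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int32(Option<i32>),
    Utf8(Option<String>),
}

impl Scalar {
    /// Build a null scalar whose variant still carries the column's type.
    fn typed_null(data_type: &DataType) -> Scalar {
        match data_type {
            DataType::Int32 => Scalar::Int32(None),
            DataType::Utf8 => Scalar::Utf8(None),
        }
    }

    /// Recover the type from the variant, even when no value is present.
    fn data_type(&self) -> DataType {
        match self {
            Scalar::Int32(_) => DataType::Int32,
            Scalar::Utf8(_) => DataType::Utf8,
        }
    }

    /// True when the scalar holds no concrete value.
    fn is_null(&self) -> bool {
        match self {
            Scalar::Int32(v) => v.is_none(),
            Scalar::Utf8(v) => v.is_none(),
        }
    }
}

fn main() {
    // One typed null per column of a toy two-column schema.
    let schema = vec![DataType::Int32, DataType::Utf8];
    for dt in &schema {
        let bound = Scalar::typed_null(dt);
        println!(
            "{:?}: null = {}, recovered type = {:?}",
            bound,
            bound.is_null(),
            bound.data_type()
        );
    }
}
```

In this toy model, a `None` payload means "no known value", while the variant keeps the type available to whatever consumes the statistics later.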
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
