alamb commented on code in PR #10852:
URL: https://github.com/apache/datafusion/pull/10852#discussion_r1634019602


##########
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##########
@@ -517,6 +518,72 @@ macro_rules! get_statistics {
         }}}
 }
 
+macro_rules! make_data_page_stats_iterator {
+    ($iterator_type: ident, $func: ident, $index_type: path, $stat_value_type: 
ty) => {
+        struct $iterator_type<'a, I>
+        where
+            I: Iterator<Item = &'a Index>,
+        {
+            iter: I,
+        }
+
+        impl<'a, I> $iterator_type<'a, I>
+        where
+            I: Iterator<Item = &'a Index>,
+        {
+            fn new(iter: I) -> Self {
+                Self { iter }
+            }
+        }
+
+        impl<'a, I> Iterator for $iterator_type<'a, I>
+        where
+            I: Iterator<Item = &'a Index>,
+        {
+            type Item = Vec<Option<$stat_value_type>>;
+
+            fn next(&mut self) -> Option<Self::Item> {
+                let next = self.iter.next();
+                match next {
+                    Some(index) => match index {
+                        $index_type(native_index) => Some(
+                            native_index
+                                .indexes
+                                .iter()
+                                .map(|x| x.$func)
+                                .collect::<Vec<_>>(),
+                        ),
+                        // No matching `Index` found.
+                        // Thus no statistics that can be extracted.
+                        // We return vec![None] to effectively
+                        // create an arrow null-array.
+                        _ => Some(vec![None]),

Review Comment:
   I think returning `vec![None]` would return only a single entry -- I think 
it needs to be the same size as what would come from `native_index`
   
   So I think this should be something like this 
   
   ```suggestion
                           _ => Some(vec![None; native_index.len()]),
   ```
   
   However, in order to have this make a difference we need to make a test that 
has multiple data pages in a row group, which the current setup does not. I can 
help write such a test as a follow on PR



##########
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##########
@@ -517,6 +518,72 @@ macro_rules! get_statistics {
         }}}
 }
 
+macro_rules! make_data_page_stats_iterator {
+    ($iterator_type: ident, $func: ident, $index_type: path, $stat_value_type: 
ty) => {
+        struct $iterator_type<'a, I>
+        where
+            I: Iterator<Item = &'a Index>,
+        {
+            iter: I,
+        }
+
+        impl<'a, I> $iterator_type<'a, I>
+        where
+            I: Iterator<Item = &'a Index>,
+        {
+            fn new(iter: I) -> Self {
+                Self { iter }
+            }
+        }
+
+        impl<'a, I> Iterator for $iterator_type<'a, I>
+        where
+            I: Iterator<Item = &'a Index>,
+        {
+            type Item = Vec<Option<$stat_value_type>>;
+
+            fn next(&mut self) -> Option<Self::Item> {
+                let next = self.iter.next();
+                match next {
+                    Some(index) => match index {
+                        $index_type(native_index) => Some(
+                            native_index
+                                .indexes
+                                .iter()
+                                .map(|x| x.$func)
+                                .collect::<Vec<_>>(),

Review Comment:
   I think to avoid the intermediate Vec we would likely have to keep a 
reference to the current `&Vec<Index>` or something
   
   Since we are doing an intermediate allocation anyways, I think it might be 
simpler to write a function that collects the values as a  `Vec<&$index_type>` 
(aka references to the relevant types, like `Vec<&bool>` rather than going 
through the iterator incantation. We could do this as a follow on PR
   
   It is impressive you got this to work 🎩 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to