alamb commented on code in PR #10852: URL: https://github.com/apache/datafusion/pull/10852#discussion_r1634019602
########## datafusion/core/src/datasource/physical_plan/parquet/statistics.rs: ########## @@ -517,6 +518,72 @@ macro_rules! get_statistics { }}} } +macro_rules! make_data_page_stats_iterator { + ($iterator_type: ident, $func: ident, $index_type: path, $stat_value_type: ty) => { + struct $iterator_type<'a, I> + where + I: Iterator<Item = &'a Index>, + { + iter: I, + } + + impl<'a, I> $iterator_type<'a, I> + where + I: Iterator<Item = &'a Index>, + { + fn new(iter: I) -> Self { + Self { iter } + } + } + + impl<'a, I> Iterator for $iterator_type<'a, I> + where + I: Iterator<Item = &'a Index>, + { + type Item = Vec<Option<$stat_value_type>>; + + fn next(&mut self) -> Option<Self::Item> { + let next = self.iter.next(); + match next { + Some(index) => match index { + $index_type(native_index) => Some( + native_index + .indexes + .iter() + .map(|x| x.$func) + .collect::<Vec<_>>(), + ), + // No matching `Index` found. + // Thus no statistics that can be extracted. + // We return vec![None] to effectively + // create an arrow null-array. + _ => Some(vec![None]), Review Comment: I think returning `vec![None]` would return only a single entry -- I think it needs to be the same size as what would come from `native_index` So I think this should be something like this ```suggestion _ => Some(vec![None; native_index.len()]), ``` However, in order to have this make a difference we need to make a test that has multiple data pages in a row group, which the current setup does not. I can help write such a test as a follow on PR ########## datafusion/core/src/datasource/physical_plan/parquet/statistics.rs: ########## @@ -517,6 +518,72 @@ macro_rules! get_statistics { }}} } +macro_rules! make_data_page_stats_iterator { + ($iterator_type: ident, $func: ident, $index_type: path, $stat_value_type: ty) => { + struct $iterator_type<'a, I> + where + I: Iterator<Item = &'a Index>, + { + iter: I, + } + + impl<'a, I> $iterator_type<'a, I> + where + I: Iterator<Item = &'a Index>, + { + fn new(iter: I) -> Self { + Self { iter } + } + } + + impl<'a, I> Iterator for $iterator_type<'a, I> + where + I: Iterator<Item = &'a Index>, + { + type Item = Vec<Option<$stat_value_type>>; + + fn next(&mut self) -> Option<Self::Item> { + let next = self.iter.next(); + match next { + Some(index) => match index { + $index_type(native_index) => Some( + native_index + .indexes + .iter() + .map(|x| x.$func) + .collect::<Vec<_>>(), Review Comment: I think to avoid the intermediate Vec we would likely have to keep a reference to the current `&Vec<Index>` or something Since we are doing an intermediate allocation anyways, I think it might be simpler to write a function that collects the values as a `Vec<&$index_type>` (aka references to the relevant types, like `Vec<&bool>` rather than going through the iterator incantation. We could do this as a follow on PR It is impressive you got this to work 🎩 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org