marvinlanhenke commented on code in PR #10852: URL: https://github.com/apache/datafusion/pull/10852#discussion_r1634245246
########## datafusion/core/src/datasource/physical_plan/parquet/statistics.rs: ########## @@ -517,6 +518,72 @@ macro_rules! get_statistics { }}} } +macro_rules! make_data_page_stats_iterator { + ($iterator_type: ident, $func: ident, $index_type: path, $stat_value_type: ty) => { + struct $iterator_type<'a, I> + where + I: Iterator<Item = &'a Index>, + { + iter: I, + } + + impl<'a, I> $iterator_type<'a, I> + where + I: Iterator<Item = &'a Index>, + { + fn new(iter: I) -> Self { + Self { iter } + } + } + + impl<'a, I> Iterator for $iterator_type<'a, I> + where + I: Iterator<Item = &'a Index>, + { + type Item = Vec<Option<$stat_value_type>>; + + fn next(&mut self) -> Option<Self::Item> { + let next = self.iter.next(); + match next { + Some(index) => match index { + $index_type(native_index) => Some( + native_index + .indexes + .iter() + .map(|x| x.$func) + .collect::<Vec<_>>(), + ), + // No matching `Index` found. + // Thus no statistics that can be extracted. + // We return vec![None] to effectively + // create an arrow null-array. + _ => Some(vec![None]), Review Comment: Yes, you're right - and we definitely need test coverage here (I'm still confused sometimes (actually all the time) about this logic 🤯). However, I think we need a different approach since we wouldn't have access to `native_index.indexes`, if we cannot match the index, or we encounter the `Index::NONE` variant. Based on the implementation in `page_filter.rs` [here](https://github.com/apache/datafusion/blob/main/datafusion/core/src/physical_optimizer/pruning.rs#L885-L893) we should probably construct a `vec![None; len]` where `len = page_offset_index.len()` and `page_offset_index: Vec<PageLocation>`. I'll try make some changes here (also the API needs to change slightly) and we can discuss further? I'll also try to make a test-case with multiple data_pages per row_group (haven't found the setting, yet...) WDYT @alamb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org