tshauck commented on code in PR #10972:
URL: https://github.com/apache/datafusion/pull/10972#discussion_r1644954474


##########
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##########
@@ -552,6 +552,22 @@ 
make_data_page_stats_iterator!(MinInt32DataPageStatsIterator, min, Index::INT32,
 make_data_page_stats_iterator!(MaxInt32DataPageStatsIterator, max, 
Index::INT32, i32);
 make_data_page_stats_iterator!(MinInt64DataPageStatsIterator, min, 
Index::INT64, i64);
 make_data_page_stats_iterator!(MaxInt64DataPageStatsIterator, max, 
Index::INT64, i64);
+make_data_page_stats_iterator!(

Review Comment:
   Oooff, sorry @tmi, I should've done `take` on the issue. Taking @alamb's 
pointer to look at row groups, I might be wrong, but I think there's a couple 
of differences between it and the data page stats that complicate things...
   
   * `max_bytes`/`min_bytes`, which is used for row groups, does not exist on 
`PageIndex` in the parquet crate but does on `Statistics`, so using it in the 
macro isn't possible at this point.
   * Using `[u8]` as the `stat_value_type` is problematic when combined with 
the iterator macro in that `type Item = Vec<Option<$stat_value_type>>;` can't 
work with `[u8]` because it's unsized. Changing the item def to `type Item = 
Vec<Option<&'a $stat_value_type>>;` works and is similar to the item definition 
of row group (`type Item = Option<&'a $stat_value_type>;`).
   
   f16 is less common, so maybe to have a solution that's similar to Statistics 
and hopefully avoids cloning, we could:
   
   1. Update this PR to do f32 and f64, get it in to support higher priority 
types and is mostly innocuous
   2. Update the parquet crate to support `max_bytes`/`min_bytes` on `PageIndex`
   3. Update the data page stats to use a `&'a $stat_value_type` and add f16 
when that's merged
   
   @tmi, maybe you and I could collaborate on 2 & 3 if that seems reasonable? 
Otherwise, certainly happy to go another route, if there's a better approach.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to