alamb commented on PR #9129:
URL: https://github.com/apache/arrow-datafusion/pull/9129#issuecomment-2066406717

> @alamb @Weijun-H I have plans to pick up #8295 next week unless you both think that this can be completed before then (I haven't looked yet to see whether it makes sense to continue on this PR or make a new one).
>
> Happy to get both of your thoughts!

I don't think I will be able to make it before then, sadly.

Thank you @matthewmturner -- I think this would be a very impactful change.

Part of the challenge is that there are two copies of the statistics extraction code. A first step may be to figure out how to consolidate them.

Here is one copy (used for row group pruning):
https://github.com/apache/arrow-datafusion/blob/19356b26f515149f96f9b6296975a77ac7260149/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L321-L329

Here is the second copy (used for file level statistics):
https://github.com/apache/arrow-datafusion/blob/19356b26f515149f96f9b6296975a77ac7260149/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L179-L196

I think this code eventually belongs in Arrow (see https://github.com/apache/arrow-rs/issues/4328), but getting it working in DataFusion initially is probably the right thing.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
