seddonm1 opened a new pull request #9976: URL: https://github.com/apache/arrow/pull/9976
@alamb @jorgecarleitao @andygrove

This is a WIP start on re-implementing the `input_file_name` function so that it relies on plan traversal instead of the `metadata` of the `RecordBatch`. It was inspired by @jorgecarleitao's comment: https://github.com/apache/arrow/pull/9944#pullrequestreview-632013179

As a first step, I have tried extending the `TableProvider` to capture this information by means of the `Statistics`. This code collects statistics at the **partition** level rather than the **table** level, and provides methods for calculating the table-level statistics from them. It can easily be extended with something like `TableProvider::file_name(partition: usize)` to get the filename for a specific `partition`. Calculating things like `distinct_count` per column will be a fun future problem and may require something like `HyperLogLog`.

This code also assumes a 1:1 mapping of Parquet file to logical partition, but it still retains much of the initial implementation for a future many:one mapping of files to partition. I think it would be fair to fix a 1:1 mapping of file:partition and then, down the line, use a `repartition` operator to merge them. If agreed, I can make those changes.

I don't like to make these changes in a vacuum, and the mailing list is not ideal for demonstrating ideas, so I hoped to get a review here.
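
To make the partition-level idea concrete, here is a minimal, self-contained sketch of how per-partition statistics could be rolled up into table-level statistics, and how a filename lookup by partition index might look. It is not the code in this PR and not DataFusion's actual API: `PartitionStatistics`, `TableStatistics`, `table_statistics` and the free function `file_name` are hypothetical names used only for illustration. It assumes the 1:1 file:partition mapping described above.

```rust
/// Hypothetical per-partition statistics (illustration only, not DataFusion's `Statistics`).
/// Assumes one Parquet file per logical partition.
#[derive(Debug, Clone, PartialEq)]
pub struct PartitionStatistics {
    /// Path of the single file backing this partition.
    pub file_name: String,
    /// Row count, if known.
    pub num_rows: Option<usize>,
    /// Size in bytes, if known.
    pub total_byte_size: Option<usize>,
}

/// Hypothetical table-level statistics derived from the partitions.
#[derive(Debug, Clone, PartialEq)]
pub struct TableStatistics {
    pub num_rows: Option<usize>,
    pub total_byte_size: Option<usize>,
}

/// Roll partition-level statistics up to the table level.
/// Summing `Option`s yields `None` as soon as any partition's value is
/// unknown, so the table-level figure is only reported when it is exact.
pub fn table_statistics(partitions: &[PartitionStatistics]) -> TableStatistics {
    TableStatistics {
        num_rows: partitions.iter().map(|p| p.num_rows).sum(),
        total_byte_size: partitions.iter().map(|p| p.total_byte_size).sum(),
    }
}

/// Look up the filename for a given partition index.
/// Returns `None` when the index is out of range.
pub fn file_name(partitions: &[PartitionStatistics], partition: usize) -> Option<&str> {
    partitions.get(partition).map(|p| p.file_name.as_str())
}
```

Additive statistics such as row counts and byte sizes combine by simple summation. A per-column `distinct_count` does not, because the same value can appear in several partitions, which is why a mergeable sketch such as `HyperLogLog` is mentioned above.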
