seddonm1 opened a new pull request #9976:
URL: https://github.com/apache/arrow/pull/9976


   @alamb @jorgecarleitao @andygrove 
   
   This is a WIP start on re-implementing the `input_file_name` function 
without using the `metadata` of the `RecordBatch`, relying instead on plan 
traversal. Inspired by @jorgecarleitao's comment 
(https://github.com/apache/arrow/pull/9944#pullrequestreview-632013179), I have 
initially tried to extend the `TableProvider` to capture this information by 
means of the `Statistics`.
   
   This code collects statistics at the **partition** level, not the **table** 
level, and provides methods for calculating the table-level statistics. It can 
easily be extended to do things like `TableProvider::file_name(partition: 
usize)` to get a specific `partition` filename. Calculating things like 
`distinct_count` per column will be a fun future problem and may require 
something like `HyperLogLog`.
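
   As a rough illustration of the partition-to-table roll-up described above, 
here is a minimal sketch. It is not the code in this PR: the 
`PartitionStatistics` type, its fields, and the `table_statistics` function 
are hypothetical names chosen for the example. It assumes additive statistics 
only (row count and byte size) and treats a table-level value as unknown as 
soon as any partition does not report it.

```rust
/// Hypothetical per-partition statistics (illustrative names, not the PR's types).
#[derive(Debug, Clone, PartialEq)]
pub struct PartitionStatistics {
    /// Number of rows in the partition, if known.
    pub num_rows: Option<usize>,
    /// Size of the partition in bytes, if known.
    pub total_byte_size: Option<usize>,
}

/// Add two optional values; the result is unknown if either input is unknown.
fn add_opt(a: Option<usize>, b: Option<usize>) -> Option<usize> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x + y),
        _ => None,
    }
}

/// Fold partition-level statistics into a single table-level value.
/// An empty slice yields zero rows and zero bytes.
pub fn table_statistics(partitions: &[PartitionStatistics]) -> PartitionStatistics {
    partitions.iter().fold(
        PartitionStatistics {
            num_rows: Some(0),
            total_byte_size: Some(0),
        },
        |acc, p| PartitionStatistics {
            num_rows: add_opt(acc.num_rows, p.num_rows),
            total_byte_size: add_opt(acc.total_byte_size, p.total_byte_size),
        },
    )
}
```

   Row counts and byte sizes can simply be summed, but a per-column 
`distinct_count` cannot, because the same value may appear in several 
partitions. That is why a mergeable sketch such as `HyperLogLog` is mentioned 
for that case.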
   
   This code also assumes a 1:1 mapping of Parquet file to logical 
partition, but it still contains much of the initial implementation for a 
future many:one mapping of files to partitions. I think it would be fair to fix 
a 1:1 mapping of file:partition and then, down the line, use a `repartition` 
operator to merge them. If agreed I can make those changes.
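
   To make the 1:1 assumption concrete, here is a minimal sketch of a 
per-partition file name lookup. The `FilePartitions` type and its methods are 
hypothetical and stand in for whatever the `TableProvider` would expose; only 
the idea of a `file_name(partition: usize)` accessor comes from the 
description above.

```rust
/// Hypothetical holder for a 1:1 mapping of files to logical partitions.
pub struct FilePartitions {
    /// Index `i` holds the name of the single file backing partition `i`.
    files: Vec<String>,
}

impl FilePartitions {
    /// Build the mapping; partition indices follow the order of `files`.
    pub fn new(files: Vec<String>) -> Self {
        Self { files }
    }

    /// Number of logical partitions, equal to the number of files.
    pub fn partition_count(&self) -> usize {
        self.files.len()
    }

    /// File name backing `partition`, or `None` if the index is out of range.
    pub fn file_name(&self, partition: usize) -> Option<&str> {
        self.files.get(partition).map(String::as_str)
    }
}
```

   With a 1:1 mapping the lookup is a plain index into a vector. A many:one 
mapping would need each partition to hold a list of files instead, and a 
single file name per partition would no longer be well defined.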
   
   I don't like to make these changes in a vacuum, and the mailing list is not 
ideal for demonstrating ideas, so I hoped to get a review here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

