rdblue commented on pull request #1566:
URL: https://github.com/apache/iceberg/pull/1566#issuecomment-721874701


   @shangxinli, I think we probably want to build the filtering into Iceberg. I 
could be convinced otherwise, but it would take a lot of work to use Parquet 
filters.
   
   After my comment yesterday, I was looking at the implementation here and [noticed another problem](https://github.com/apache/iceberg/pull/1566#discussion_r516902119): 
Parquet expects a filter that references columns by the names that were written 
into each data file. Iceberg allows renaming columns, so Iceberg filters 
reference field IDs instead of names; that is needed for correctness. If I 
rename field `a` to `x`, then I have to rewrite the filter `x > 5` to `a > 5` 
when reading an old data file, or else Parquet concludes there are no matching 
records, because column `x` is not present in that file and reads as all nulls.
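   
   To make the rename case concrete, here is a minimal sketch of the kind of 
per-file rewrite I mean. The `FilterApi` calls are Parquet's real predicate 
builders; the `idToFileName` map is a hypothetical lookup from Iceberg field 
IDs to the column names written in a particular file:

```java
import java.util.Map;

import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;

public class FilterRewriteSketch {
  /**
   * Builds "col > value" using the column name that was written into this
   * data file. For a file written before `a` was renamed to `x`, the map
   * would contain fieldId -> "a", so the predicate becomes `a > 5` even
   * though the table-level filter says `x > 5`.
   */
  static FilterPredicate greaterThan(Map<Integer, String> idToFileName,
                                     int fieldId, int value) {
    String nameInFile = idToFileName.get(fieldId);
    return FilterApi.gt(FilterApi.intColumn(nameInFile), value);
  }
}
```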
   
   Parquet also doesn't support customizing the filter in `ReadSupport`, so to 
do this rewrite you have to read each file's footer to get its schema, rewrite 
the filter against that schema, and then open the file again to read it.
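   
   Roughly, without a hook in `ReadSupport`, that turns into a two-pass read 
per file. A sketch under those assumptions (the footer read and reader builder 
are standard Parquet APIs; `rewriteForFile` stands in for the Iceberg-side 
rewrite and is not a real API):

```java
import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class TwoPassReadSketch {
  ParquetReader<GenericRecord> openWithFilter(Path path, Configuration conf,
                                              FilterPredicate tableFilter) throws IOException {
    // Pass 1: open the file just to read its footer and get the schema written in it.
    MessageType fileSchema;
    try (ParquetFileReader footer =
             ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
      fileSchema = footer.getFileMetaData().getSchema();
    }

    // Rewrite the table-level filter against the names in this file's schema.
    FilterPredicate fileFilter = rewriteForFile(tableFilter, fileSchema);

    // Pass 2: open the file again, this time with the rewritten filter applied.
    return AvroParquetReader.<GenericRecord>builder(HadoopInputFile.fromPath(path, conf))
        .withFilter(FilterCompat.get(fileFilter))
        .build();
  }

  private FilterPredicate rewriteForFile(FilterPredicate filter, MessageType fileSchema) {
    // placeholder for the rename-aware rewrite discussed above; not a real API
    return filter;
  }
}
```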
   
   I'm remembering more of the issues with the original filter path now, and I 
think it makes more sense to do the filtering in Iceberg for now, with the 
intent to move that work upstream into a 2.0 Parquet API that fixes some of 
these issues.
   
   What do you think?

