rdblue commented on pull request #1566: URL: https://github.com/apache/iceberg/pull/1566#issuecomment-721874701
@shangxinli, I think we probably want to build the filtering into Iceberg. I could be convinced otherwise, but it would take a lot of work to use Parquet filters.

After my comment yesterday, I was looking at the implementation here and [noticed another problem](https://github.com/apache/iceberg/pull/1566#discussion_r516902119): Parquet expects a filter that references columns using the names that were written into each data file. Iceberg allows renaming, so Iceberg filters are written using field IDs instead of names; that is needed for correctness. If I rename field `a` to `x`, then I have to rewrite the filter `x > 5` to `a > 5` when reading an old data file, or else Parquet thinks there are no matching records because column `x` is not present and is all nulls. Parquet also doesn't support customizing the filter in `ReadSupport`, so to do this you have to read each file's footer to get its schema, rewrite the filter, and then open the file again to read it.

I'm remembering more of the issues with the original filter path now, and I think that it makes more sense to do the work in Iceberg for now, with the intent to get the work upstream in a 2.0 Parquet API that fixes some of these issues. What do you think?
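To make the rename problem concrete, here is a rough sketch of the per-file rewrite described above, using the Iceberg `Schema` API and Parquet's `FilterApi`. This is illustrative only, not code from this PR: the field IDs, the names `x`/`a`, and the literal `5` are made up, and the file's schema is shown as an Iceberg `Schema` for brevity.

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;

public class FilterRewriteSketch {
  public static void main(String[] args) {
    // Current table schema: field ID 1 is now named "x"
    Schema tableSchema = new Schema(
        Types.NestedField.required(1, "x", Types.IntegerType.get()));

    // Schema written into an old data file (shown as an Iceberg Schema for
    // brevity): the same field ID 1 was still named "a" at write time
    Schema fileSchema = new Schema(
        Types.NestedField.required(1, "a", Types.IntegerType.get()));

    // The user filter is "x > 5" against the table schema; resolve the column by ID
    int fieldId = tableSchema.findField("x").fieldId();

    // Look up the name that ID had when the file was written
    String nameInFile = fileSchema.findField(fieldId).name();

    // Build the Parquet predicate with the file-local name; using "x" here would
    // make Parquet treat the column as missing (all nulls) and drop every row group
    FilterPredicate predicate = FilterApi.gt(FilterApi.intColumn(nameInFile), 5);

    System.out.println(predicate); // prints something like: gt(a, 5)
  }
}
```

The catch is the middle step: getting the file's schema means reading the footer before the predicate can even be built, which is why this doesn't fit cleanly into Parquet's current `ReadSupport` hook.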
