laserninja commented on PR #16110: URL: https://github.com/apache/iceberg/pull/16110#issuecomment-4403713933
Agreed on both points. Agree that Parquet.read() with alwaysFalse() directly is user error, not something we need to protect against. The schema evolution scenario is the real bug, and "only push down where the columns are present" is a cleaner fix than what this PR currently has. I see two ways to implement it: In Parquet.java before calling ParquetFilters.convert(fileSchema, filter, caseSensitive), check whether the columns referenced by filter are all present in fileSchema. If any are missing, fall back to FilterCompat.NOOP for that file. The row-level correctness still holds because Parquet will return all rows, and Iceberg's own residual evaluation will handle the rest. In ConvertFilterToParquet when binding a predicate against a column that doesn't exist in the file schema, return null instead of AlwaysFalse.INSTANCE, and propagate nulls through and/or/not to indicate "not pushable" rather than "false". convert() then treats a null result the same as AlwaysFalse.INSTANCE is treated today, fall back to NOOP. I think option 1 is simpler and more transparent. Happy to update the PR with the schema evolution test and the Parquet.java-level column-presence check. Let me know if you (or @gaborkaszab) have a preference. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
