laserninja commented on PR #16110:
URL: https://github.com/apache/iceberg/pull/16110#issuecomment-4403713933

   Agreed on both points.
   
   Agree that Parquet.read() with alwaysFalse() directly is user error, not 
something we need to protect against.
   
   The schema evolution scenario is the real bug, and "only push down where the 
columns are present" is a cleaner fix than what this PR currently has. I see 
two ways to implement it:
   
   In Parquet.java before calling ParquetFilters.convert(fileSchema, filter, 
caseSensitive), check whether the columns referenced by filter are all present 
in fileSchema. If any are missing, fall back to FilterCompat.NOOP for that 
file. The row-level correctness still holds because Parquet will return all 
rows, and Iceberg's own residual evaluation will handle the rest.
   
   In ConvertFilterToParquet when binding a predicate against a column that 
doesn't exist in the file schema, return null instead of AlwaysFalse.INSTANCE, 
and propagate nulls through and/or/not to indicate "not pushable" rather than 
"false". convert() then treats a null result the same as AlwaysFalse.INSTANCE 
is treated today, fall back to NOOP.
   
   I think option 1 is simpler and more transparent. Happy to update the PR 
with the schema evolution test and the Parquet.java-level column-presence 
check. Let me know if you (or @gaborkaszab) have a preference.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to