kbendick commented on pull request #3645:
URL: https://github.com/apache/iceberg/pull/3645#issuecomment-986166519
> Hi @kbendick! I get this error when I'm trying to rewrite an Iceberg table in
> Scala Spark code with a partition filter like this:
> `SparkActions.get().rewriteDataFiles(table)
> .filter(Expressions.startsWith("imp_date", "20211202")).execute()`
> `imp_date` is a time partition field; it contains null values from some
> abnormal rows.
Ohhh, that would explain why Spark isn't injecting an implicit `IS NOT NULL`
check on the filter. We parse the text of the `WHERE` clause from the SQL and
then convert the resulting Spark filter to an Iceberg filter, rather than going
through the LogicalPlan that Spark would generate.
This means potentially all inputs that would get an implicit null check
(string inputs at the least) would likely have this same issue.
Where we parse the `WHERE` clause and convert:
https://github.com/apache/iceberg/blob/b6554fccfac7a0c0ba35ebbcbff60d5f7eb0826d/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/RewriteDataFilesProcedure.java#L120-L131
I'm not sure if we should handle each case individually, or if we want to try
to make that code use the parsed LogicalPlan via
`sqlParser.parsePlan(where)` instead of the current
`sqlParser.parseExpression(where)`.
Using `parsePlan` would provide a `LogicalPlan`, which has a children
attribute of type `Seq[Expression]` which I'm guessing would have the null
checks Spark would normally add.
But it might just be easier to add the null check ourselves instead of
updating that logic.
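
To make "add the null check ourselves" concrete, here is a minimal, self-contained sketch in plain Java. It does not use Iceberg or Spark; `NullSafeFilterSketch`, `startsWith`, `notNullAndStartsWith`, and `keep` are hypothetical names for illustration, and it assumes the unguarded predicate fails on null input the same way a plain `String.startsWith` call does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class NullSafeFilterSketch {

    // Hypothetical stand-in for a prefix filter that assumes non-null input,
    // i.e. a startsWith predicate evaluated without an implicit null check.
    public static Predicate<String> startsWith(String prefix) {
        return value -> value.startsWith(prefix); // NullPointerException if value is null
    }

    // The same filter with the null check added explicitly, mirroring
    // `imp_date IS NOT NULL AND imp_date LIKE '20211202%'`.
    public static Predicate<String> notNullAndStartsWith(String prefix) {
        Predicate<String> notNull = Objects::nonNull;
        // Predicate.and short-circuits, so a null value never reaches startsWith.
        return notNull.and(startsWith(prefix));
    }

    // Applies a filter to a list of partition values and returns the matches.
    public static List<String> keep(List<String> values, Predicate<String> filter) {
        return values.stream().filter(filter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample values; the null models the abnormal rows.
        List<String> impDates = Arrays.asList("2021120201", null, "2021120115", "2021120223");
        System.out.println(keep(impDates, notNullAndStartsWith("20211202")));
    }
}
```

With Iceberg's expression API, the analogous filter would presumably be `Expressions.and(Expressions.notNull("imp_date"), Expressions.startsWith("imp_date", "20211202"))`, though I haven't verified that against this code path.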
cc @RussellSpitzer @karuppayya @flyrain who might have some input on this.
I believe your approach will work @hbgstc123, but there might be a more
robust way, one that lets the normal Spark planning add the
`imp_date IS NOT NULL AND imp_date LIKE '20211202%'` for us.
If I'm correct, then probably a number of these things need to be updated to
handle `null` input (only for this particular code path though).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]