kbendick edited a comment on pull request #3645:
URL: https://github.com/apache/iceberg/pull/3645#issuecomment-986166519


   > Hi @kbendick! I get this error when I'm trying to rewrite iceberg table in 
scala spark code with a partition filter like this: 
`SparkActions.get().rewriteDataFiles(table) 
.filter(Expressions.startsWith("imp_date",'20211202')) .execute()` "imp_date" 
is a time partition field, it contains null value from some abnormal rows.
   
   Ohhh, that would explain why Spark isn't injecting an implicit `IS NOT NULL` 
check on the filter. We parse the text of the `WHERE` clause from the SQL and 
then convert the resulting Spark expression to an Iceberg filter, rather than 
going through the LogicalPlan that Spark would generate.
   
   This means that any input that would normally get an implicit null check 
(string inputs at the least) likely has this same issue.
   
   Where we parse the `WHERE` clause and convert: 
https://github.com/apache/iceberg/blob/b6554fccfac7a0c0ba35ebbcbff60d5f7eb0826d/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/RewriteDataFilesProcedure.java#L120-L131
   
   I'm not sure if we should handle each case individually, or if we want to 
switch to the parsed LogicalPlan via `sqlParser.parsePlan(where)` instead of 
the current `sqlParser.parseExpression(where)`.
   
   Using `parsePlan` would provide a `LogicalPlan`, which has an `expressions` 
attribute of type `Seq[Expression]` that I'm guessing would have the null 
checks Spark would normally add.
   
   But it might just be easier to add the null check ourselves instead of 
updating that logic.
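   
   For what it's worth, here is a minimal sketch of what "adding the null check 
ourselves" amounts to, in plain Java with no Iceberg or Spark dependency. 
`NullGuardSketch` and its method names are hypothetical, and the exact error 
from the report isn't shown in this thread, so the null dereference below is 
only an assumed stand-in for it. On the Iceberg side the equivalent filter 
would presumably combine `Expressions.notNull("imp_date")` with 
`Expressions.startsWith("imp_date", "20211202")` via `Expressions.and(...)`.

```java
// Hypothetical illustration only; not code from the Iceberg repository.
final class NullGuardSketch {

    // Unguarded prefix check: dereferences the value directly, so a null
    // partition value throws a NullPointerException.
    static boolean startsWithUnguarded(String value, String prefix) {
        return value.startsWith(prefix);
    }

    // Guarded prefix check: the explicit null test plays the role of the
    // implicit IS NOT NULL that Spark's planner would normally add.
    static boolean startsWithGuarded(String value, String prefix) {
        return value != null && value.startsWith(prefix);
    }
}
```

   With the guard in place, rows whose `imp_date` is null are simply not 
matched, which mirrors `imp_date IS NOT NULL AND imp_date LIKE '20211202%'`.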
   
   cc @RussellSpitzer @karuppayya @flyrain who might have some input on this.
   
   I believe your approach will work @hbgstc123, but there might be a more 
robust way, so that normal Spark planning would generate 
`imp_date IS NOT NULL AND imp_date LIKE '20211202%'` for us.
   
   If I'm correct, then probably a number of these things need to be updated to 
handle `null` input (only for this particular code path though).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


