HyukjinKwon edited a comment on pull request #28106: URL: https://github.com/apache/spark/pull/28106#issuecomment-625597600
@yaooqinn, I discussed this offline with a few people I know, and I decided to share their concerns here since they look valid and worth addressing.

The main concern is that the scope of `TRY` might be unclear to end users. For example, `TRY(a / MyUDF(b))` catches both the exceptions thrown by `MyUDF` and the division by zero. Should users write `TRY(a / TRY(MyUDF(b)))` or `TRY(a / MyUDF(b))`? Another example is `TRY(SUM(a/b))` vs `TRY(SUM(TRY(a/b)))`.

Subqueries might be a problem as well:

```
TRY(a IN (SELECT ... WHERE a/b > 1)),
```

Errors from `a/b` are propagated all the way up to the `TRY`, and the whole expression is replaced with `NULL`; however, I guess one could also argue that only `a/b` should return `NULL`?

How does it work:

- When the expression requires a shuffle, such as with window functions?
- When a runtime exception occurs in a vectorized pandas UDF: will the exception happen once for a whole batch?

Maybe it's best to check how the two references you pointed out behave in these cases.

It looks like some other vendors chose to add `safe_*` or `try_*` expressions that are clearly scoped. For example:

- https://docs.microsoft.com/en-us/sql/t-sql/functions/try-cast-transact-sql?view=sql-server-ver15
- https://docs.snowflake.com/en/sql-reference/functions/try_cast.html
- https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#safe_prefix

Maybe we should take a step back and think about this a bit more.
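To make the `TRY(SUM(a/b))` vs `TRY(SUM(TRY(a/b)))` question concrete, here is a rough sketch in plain Python (not Spark). It assumes, as the comment does, that `a/b` raises on division by zero, and that `SUM` skips `NULL` inputs as in standard SQL. The helpers `sql_try` and `sql_sum` are hypothetical stand-ins invented for this illustration, not part of any Spark API or of the proposed implementation.

```python
from typing import Callable, Iterable, Optional


def sql_try(thunk: Callable[[], Optional[float]]) -> Optional[float]:
    """Hypothetical stand-in for TRY: evaluate thunk, map any error to NULL (None)."""
    try:
        return thunk()
    except Exception:
        return None


def sql_sum(values: Iterable[Optional[float]]) -> Optional[float]:
    """SQL-style SUM: NULL inputs are skipped; all-NULL or empty input yields NULL."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None


# (a, b) pairs; the second row divides by zero.
rows = [(6, 3), (1, 0), (8, 2)]

# TRY(SUM(a/b)): a single failing row turns the whole aggregate into NULL.
outer_only = sql_try(lambda: sql_sum(a / b for a, b in rows))

# TRY(SUM(TRY(a/b))): only the failing row becomes NULL, and SUM skips it.
nested = sql_try(
    lambda: sql_sum(sql_try(lambda a=a, b=b: a / b) for a, b in rows)
)

print(outer_only)  # None
print(nested)      # 6.0
```

Under these assumptions the two spellings give different answers for the same data, which is the kind of ambiguity that narrowly scoped `try_*` or `safe_*` functions would avoid.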
