ssimeonov commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-596159602

@HeartSaVioR the workaround was simple, unpleasant and unstable:

- **Simple**, because we realized that we should never use `size()`, as its behavior is (a) inconsistent with SQL principles and (b) dangerous for analytics.
- **Unpleasant**, because it involved replacing every single use of the function in our codebase with a `Column` implicit that we created (`size(x)` -> `x.size`). This was easy for IDE-managed code but hard for notebook code. After consideration, we decided that our implicit would return 0 for `null`, because in every single use case we could think of for our data, we would be converting null returns to 0. (I would not recommend this for Spark, as the behavior is inconsistent with SQL.) This is also the reason why we went with an implicit as opposed to, say, using import order to hide `functions.size()` with our own `size()`. Caveat: if we had more people using SparkSQL, we would have gone with `size0(x)` and `sizen(x)` instead of an implicit, for versions that return 0 or null, respectively, for null input.
- **Unstable**, because there is no way to guarantee the problem will not happen again due to user error. It's part of our DOs and DON'Ts of Spark not to use `size()`, but discipline alone does not offer guaranteed protection.

The only way to guarantee the behavior is via a change to Spark. A settings-driven change is an excellent way to have the best of both worlds: sane and consistent behavior for future users and an easy backward-compatibility mode for others.

As for the idea to have the `size()` behavior differ for aggregation vs. non-aggregation processing, I think this would be a bad pattern to introduce into an already very complex system such as Spark.
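
To make the "dangerous for analytics" point concrete, here is a minimal sketch in plain Python (not Spark code, so it runs without a cluster) that models the three behaviors discussed above. It assumes the legacy `size(NULL)` result is -1. `size0` and `sizen` are the names suggested in the comment; `size_legacy`, `sql_sum` and `sql_avg` are hypothetical helpers introduced only for this illustration, with the two aggregates skipping nulls the way SQL aggregates do.

```python
from typing import List, Optional, Sequence


def size_legacy(xs: Optional[Sequence]) -> int:
    """Model of the legacy behavior: null input yields -1 (assumption)."""
    return -1 if xs is None else len(xs)


def size0(xs: Optional[Sequence]) -> int:
    """Variant that maps null input to 0."""
    return 0 if xs is None else len(xs)


def sizen(xs: Optional[Sequence]) -> Optional[int]:
    """Variant that propagates null: null input yields null (None)."""
    return None if xs is None else len(xs)


def sql_sum(values: List[Optional[int]]) -> Optional[int]:
    """SUM that ignores nulls; returns None when every input is null."""
    present = [v for v in values if v is not None]
    return sum(present) if present else None


def sql_avg(values: List[Optional[int]]) -> Optional[float]:
    """AVG that ignores nulls; returns None when every input is null."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# Four rows of an array column; the second row is null.
rows = [["a", "b"], None, [], ["c"]]

legacy = [size_legacy(r) for r in rows]  # [2, -1, 0, 1]
zeroed = [size0(r) for r in rows]        # [2, 0, 0, 1]
nulled = [sizen(r) for r in rows]        # [2, None, 0, 1]

print(sql_sum(legacy), sql_avg(legacy))  # 2 0.5  (the -1 silently shrinks the total)
print(sql_sum(zeroed), sql_avg(zeroed))  # 3 0.75 (null counted as an empty array)
print(sql_sum(nulled), sql_avg(nulled))  # 3 1.0  (null row excluded from the average)
```

The -1 variant produces a wrong total without any error, which is the failure mode that discipline alone cannot rule out. The 0 and null variants agree on the sum but differ on the average, which is why the choice between them depends on the data.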
