[GitHub] [spark] ssimeonov commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default

GitBox Sat, 07 Mar 2020 18:39:11 -0800

ssimeonov commented on issue #27834: revert [SPARK-24640][SQL] Return `NULL` 
from `size(NULL)` by default
URL: https://github.com/apache/spark/pull/27834#issuecomment-596159602
 
 
   @HeartSaVioR the workaround was simple, unpleasant and unstable:
   
   - **Simple**, because we realized that we should never use `size()` as its 
behavior is (a) inconsistent with SQL principles and (b) dangerous for 
analytics.
   
   - **Unpleasant**, because it involved replacing every single use of the 
function in our codebase with a `Column` implicit that we created (`size(x)` -> 
`x.size`). This was easy for IDE-managed code but hard for notebook code. After 
consideration, we decided that our implicit would return 0 for `null`, because 
in every single use case we could think of for our data, we would be converting 
null returns to 0. (I would not recommend this for Spark as the behavior is 
inconsistent with SQL.) This is also the reason why we went with an implicit as 
opposed to, say, using import order to hide `functions.size()` with our own 
`size()`. Caveat: if we had more people using Spark SQL, we would have gone with 
`size0(x)` and `sizen(x)` instead of an implicit, for versions that return 0 and 
null, respectively, for null input.
   
   - **Unstable**, because there is no way to guarantee the problem will not 
happen again due to user error. Not using `size()` is part of our Spark DOs and 
DON'Ts, but discipline alone does not offer guaranteed protection. The 
only way to guarantee the behavior is via a change to Spark. A settings-driven 
change is an excellent way to have the best of both worlds: sane & consistent 
behavior for future users and an easy backward compatibility mode for others.
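
   For illustration, here is a minimal sketch of what such helpers could look like. It is not our actual code: the object name `SizeSyntax` and the class name `ColumnSizeOps` are made up for this example, and the bodies of `size0` and `sizen` are just one plausible way to get the 0-for-null and null-for-null behavior described above.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions
import org.apache.spark.sql.functions.{coalesce, lit, when}

// Hypothetical helper object; names are illustrative only.
object SizeSyntax {

  // Null-preserving variant: `when` without `otherwise` yields NULL when the
  // condition is false, so a NULL input gives NULL regardless of how the
  // built-in size() is configured to treat NULL.
  def sizen(c: Column): Column = when(c.isNotNull, functions.size(c))

  // Zero-for-null variant: replace the NULL produced by sizen with 0.
  def size0(c: Column): Column = coalesce(sizen(c), lit(0))

  // Implicit that enables `x.size` on a Column, returning 0 for NULL input.
  implicit class ColumnSizeOps(val c: Column) extends AnyVal {
    def size: Column = size0(c)
  }
}
```

   With `import SizeSyntax._` in scope, `df.select(col("xs").size)` would use the implicit, where `df` and the array column `xs` are placeholders. Explicit names like `size0` and `sizen` make the null handling visible at the call site, which is why they would scale better to a larger team than an implicit.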
   
   As for the idea to have the `size()` behavior differ for aggregation vs. 
non-aggregation processing, I think this would be a bad pattern to introduce 
into an already very complex system such as Spark.



