asmello commented on issue #23879: [SPARK-26979][SQL] Add missing column name support for SQL functions
URL: https://github.com/apache/spark/pull/23879#issuecomment-466759565

> Yes, it costs maintenance overhead - exposed APIs are getting larger.

OK, this is a fair point and one I really can't dispute. But for such a trivial addition, I find it hard to believe the maintenance cost will ever exceed the cost to API users of not having it.

> Not all SQL functions work like this.

All math functions do, except `abs()`. So do other string functions, like `concat_ws()`. But you are right: now that I've looked more closely, there are many more exceptions than I realised. I ran into this issue from PySpark, where these four functions are pretty much the only ones that can't take a column name - but I see now that that's because PySpark does the conversion itself, under the hood, most of the time. So it's probably better to apply this change on the PySpark side.

> What kind of attrition does it cause to new learners? Inconsistency?

Again, this problem surfaced from using PySpark, so I can see why you don't think it exists in core Spark. But it's very real. Java likes to keep things verbose and repetitive, and maybe that spilled over into Scala a little, but Python is strongly biased towards readability over protections (type, access, etc.). I suspect that is why the PySpark developers made almost all functions support column names: it falls in line with the language's philosophy, makes code easier to read, and saves you from repeating `col()` over and over. So people learning PySpark quickly get into the habit of using column names, and then they hit the few exceptions where they can't. What was once a clear pattern is broken, and without a clear guideline people will sometimes use `col()` and sometimes not, making code stylistically inconsistent and confusing. (There's a minimal sketch of this at the end of this comment.)

> In that case we should rather focus on removing string overridden versions. We should rather focus on reducing APIs rather than adding them.

This, I think, is a bad idea: adding overloads is not a breaking change, but removing them is. The API has been inconsistent like this for a long time, so plenty of code already relies on using column names where they are allowed. I find it hard to justify such a big downside for the sake of consistency.

> Otherwise, all systems will end up with some APIs like sending an email at the end.

This is an exaggeration. All I proposed was something already done elsewhere in the same codebase, not a new use case.

> Spark community had a discussion about it before and people tend to think about rather removing string overridden versions.

If that ever happens, fine - at least it will make the API consistent. Still, there is a reason those overloads exist: they make code less verbose, which I don't think is a bad thing.

With all of this in mind, I'll be happy to close this PR and reapply the change on the PySpark side. Does that sound reasonable?
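To make the inconsistency concrete, here is a minimal sketch. The exact set of affected functions varies by Spark version, so `upper` stands in for the functions that already accept a column name, and the `to_column` helper is a hypothetical illustration of the string-to-Column normalisation PySpark applies internally for most functions - it is not PySpark's actual private helper.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.column import Column

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1, "spark")], ["id", "name"])

# Most PySpark functions convert a column name to a Column internally,
# so these two calls are equivalent:
df.select(F.upper("name")).show()
df.select(F.upper(F.col("name"))).show()

# A handful of functions only accept a Column object, so learners who
# picked up the name-based habit hit an error and must fall back to an
# explicit col() wrapper. The conversion the rest of the API does under
# the hood amounts to something like this (illustrative only):
def to_column(name_or_col):
    return name_or_col if isinstance(name_or_col, Column) else F.col(name_or_col)
```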
