asmello commented on issue #23879: [SPARK-26979][SQL] Add missing column name support for SQL functions
URL: https://github.com/apache/spark/pull/23879#issuecomment-466759565

> Yes, it costs maintenance overhead - exposed APIs are getting larger.

OK, this is a fair point and one I really can't dispute. But for such a trivial addition, I find it hard to believe the maintenance cost will ever exceed the cost to API users of not having it.

> Not all SQL functions work like this.

All math functions do, except `abs()`. So do other string functions, like `concat_ws()`. But you are right: now that I've looked more closely, there are many more exceptions than I realised. I ran into this issue from PySpark, where these four functions are pretty much the only ones that can't take a column name - but I see now that that's because PySpark does the conversion itself, under the hood, most of the time. So it's probably better to apply this change on the PySpark side.

> What kind of attrition does it cause to new learners? Inconsistency?

Again, this problem surfaced from using PySpark, so I can see why you don't think it exists in core Spark. But it's very real. Java likes to keep things verbose and repetitive, and maybe that spilled over into Scala a little, but Python is strongly biased towards readability over protections (type, access, etc.). I suspect that is why the PySpark developers made almost all functions support column names: it falls in line with the language's philosophy, makes code easier to read, and saves you from repeating `col()` over and over. So people learning PySpark quickly get into the habit of using column names, and then they hit the few exceptions where they can't. What was once a clear pattern is broken, and without a clear guideline people will sometimes use `col()` and sometimes not, making code stylistically inconsistent and confusing. (There's a minimal sketch of this at the end of this comment.)

> In that case we should rather focus on removing string overridden versions. We should rather focus on reducing APIs rather than adding them.

This, I think, is a bad idea: adding overloads is not a breaking change, but removing them is. The API has been inconsistent like this for a long time, so plenty of code already relies on using column names where they are allowed. I find it hard to justify such a big downside for the sake of consistency.

> Otherwise, all systems will end up with some APIs like sending an email at the end.

This is an exaggeration. All I proposed was something already done elsewhere in the same codebase, not a new use case.

> Spark community had a discussion about it before and people tend to think about rather removing string overridden versions.

If that ever happens, fine - at least it will make the API consistent. Still, there is a reason those overloads exist: they make code less verbose, which I don't think is a bad thing.

With all of this in mind, I'll be happy to close this PR and reapply the change on the PySpark side. Does that sound reasonable?
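To make the inconsistency concrete, here is a minimal sketch. The exact set of affected functions varies by Spark version, so `upper` stands in for the functions that already accept a column name, and the `to_column` helper is a hypothetical illustration of the string-to-Column normalisation PySpark applies internally for most functions - it is not PySpark's actual private helper.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.column import Column

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1, "spark")], ["id", "name"])

# Most PySpark functions convert a column name to a Column internally,
# so these two calls are equivalent:
df.select(F.upper("name")).show()
df.select(F.upper(F.col("name"))).show()

# A handful of functions only accept a Column object, so learners who
# picked up the name-based habit hit an error and must fall back to an
# explicit col() wrapper. The conversion the rest of the API does under
# the hood amounts to something like this (illustrative only):
def to_column(name_or_col):
    return name_or_col if isinstance(name_or_col, Column) else F.col(name_or_col)
```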
