[ https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663648#comment-16663648 ]
Sean Owen commented on SPARK-25807:
-----------------------------------

The language here is SQL(-like), though, not Python or Java. Column.substr() has to match how SQL substr() works: the method exists because it exists in SQL, and it needs to match SQL's semantics, including the 1-based position. These methods exist to match SQL, not as a library of useful functions designed from first principles. For almost anything else you want to do, it is easier to write your own code or UDFs. In this case, though, I can't see the value in redundant substr functions. The existence of alternatives doesn't even mitigate the issue you describe.

The pos semantics are likewise taken from SQL, and lots of things about SQL are clunky. The pos argument can be negative, which defines a position counted from the end of the string. I don't see the behavior for 0 defined anywhere, but treating it as an error, always returning the empty string, or treating it as a synonym for 1 all seem plausible. Hive does the last of these, and other SQL engines may too, but matching Hive alone is a good enough reason for this behavior.

> Mitigate 1-based substr() confusion
> -----------------------------------
>
>                 Key: SPARK-25807
>                 URL: https://issues.apache.org/jira/browse/SPARK-25807
>             Project: Spark
>          Issue Type: Improvement
>          Components: Java API, PySpark
>    Affects Versions: 1.3.0, 2.3.2, 2.4.0, 3.0.0
>            Reporter: Oron Navon
>            Priority: Minor
>
> The method {{Column.substr()}} is 1-based, conforming with SQL and Hive's
> {{SUBSTRING}}, and contradicting both Python's string slicing and Java's
> {{substring()}}, which are zero-based. Both PySpark users and Java API users
> often naturally expect a 0-based {{substr()}}. Adding to the confusion,
> {{substr()}} currently allows a {{startPos}} value of 0, which returns the
> same result as {{startPos==1}}.
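To make the position semantics discussed above concrete, here is a minimal pure-Python sketch. It is not Spark or Hive code: {{sql_substr}} is a hypothetical helper that only models the behavior described in this thread (1-based positions, 0 treated like 1, negative positions counted from the end). How out-of-range positions are clamped is an assumption of the sketch, not something stated here.

```python
def sql_substr(s: str, pos: int, length: int) -> str:
    """Hypothetical model of SQL-style substr; not the Spark implementation."""
    if length <= 0:
        return ""
    if pos > 0:
        start = pos - 1        # 1-based: position 1 is the first character
    elif pos < 0:
        start = len(s) + pos   # negative: count back from the end of the string
    else:
        start = 0              # 0 behaves like 1 (the Hive behavior noted in the thread)
    end = start + length
    # Assumption: anything that falls before the start of the string is clamped away.
    return s[max(start, 0):max(end, 0)]


word = "Spark"
print(sql_substr(word, 1, 3))   # Spa
print(sql_substr(word, 0, 3))   # Spa (same result as pos=1)
print(sql_substr(word, -3, 2))  # ar
print(word[1:4])                # par (Python's own 0-based slice, for contrast)
```

The last two lines show the source of the confusion: a start value of 1 means the first character under the SQL convention but the second character under Python's slicing convention.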
> Since changing {{substr()}} to 0-based is probably NOT a reasonable option
> here, I suggest making one or more of the following changes:
> # Adding a method {{substr0}}, which would be zero-based
> # Renaming {{substr}} to {{substr1}}
> # Making the existing {{substr()}} throw an exception on {{startPos==0}},
> which should catch and alert most users who expect zero-based behavior.
>
> This is my first discussion on this project, apologies for any faux pas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)