[ 
https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663648#comment-16663648
 ] 

Sean Owen commented on SPARK-25807:
-----------------------------------

The language here is SQL(-like), though, not Python or Java. 
Column.substr() has to match how SQL substr() works. This method exists 
because it exists in SQL, so it needs to match SQL's semantics, including the 
1-based position. These methods exist to match SQL, not as a library of useful 
functions designed from first principles. For most anything else you want to 
do, it's easier to write your own code or UDFs. But in this case, I can't see value in 
redundant substr functions. The existence of alternatives doesn't even mitigate 
the issue you describe.

The pos semantics are again taken from SQL, and lots of things are clunky about 
SQL. This argument can be negative to define a position from the end of the 
string. I don't see behavior for 0 defined anywhere, but treating it as an 
error, always returning the empty string, or treating it as a synonym for "1" 
all seem plausible. Hive does the last of these, and maybe other SQL engines do 
too, but matching Hive alone is a good enough reason for this behavior.
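
To make the pos semantics above concrete, here is a minimal plain-Python 
sketch. It is not Spark or Hive code: sql_substr is a hypothetical helper 
written only for illustration, and clamping a negative pos that reaches past 
the start of the string is an assumption of this sketch, not something taken 
from any particular engine.

```python
from typing import Optional


def sql_substr(s: str, pos: int, length: Optional[int] = None) -> str:
    """Hypothetical helper emulating the SQL-style pos semantics described above."""
    if length is not None and length <= 0:
        return ""
    if pos > 0:
        # 1-based: position 1 is the first character.
        start = pos - 1
    elif pos < 0:
        # Negative pos counts from the end of the string.
        # Assumption: clamp to the start if it reaches too far back.
        start = max(len(s) + pos, 0)
    else:
        # pos == 0 is treated as a synonym for 1.
        start = 0
    end = len(s) if length is None else start + length
    return s[start:end]


print(sql_substr("Spark", 1, 3))   # first three characters
print(sql_substr("Spark", 0, 3))   # same result as pos == 1
print(sql_substr("Spark", -3, 2))  # two characters, starting third from the end
```

The point of the sketch is that pos 0 and pos 1 produce the same result, which 
is the case the issue below is concerned about.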

> Mitigate 1-based substr() confusion
> -----------------------------------
>
>                 Key: SPARK-25807
>                 URL: https://issues.apache.org/jira/browse/SPARK-25807
>             Project: Spark
>          Issue Type: Improvement
>          Components: Java API, PySpark
>    Affects Versions: 1.3.0, 2.3.2, 2.4.0, 3.0.0
>            Reporter: Oron Navon
>            Priority: Minor
>
> The method {{Column.substr()}} is 1-based, conforming with SQL and Hive's 
> {{SUBSTRING}}, and contradicting both Python's {{substr}} and Java's 
> {{substr}}, which are zero-based.  Both PySpark users and Java API users 
> often naturally expect a 0-based {{substr()}}. Adding to the confusion, 
> {{substr()}} currently allows a {{startPos}} value of 0, which returns the 
> same result as {{startPos==1}}.
> Since changing {{substr()}} to 0-based is probably NOT a reasonable option 
> here, I suggest making one or more of the following changes:
>  # Adding a method {{substr0}}, which would be zero-based
>  # Renaming {{substr}} to {{substr1}}
>  # Making the existing {{substr()}} throw an exception on {{startPos==0}}, 
> which should catch and alert most users who expect zero-based behavior.
> This is my first discussion on this project, apologies for any faux pas.


