[
https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663332#comment-16663332
]
kevin yu edited comment on SPARK-25807 at 10/25/18 6:43 AM:
------------------------------------------------------------
[~oron.navon]: You can also try implementing a simple Python UDF to do the
0-based substr(). Here is an example:
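# x: input string, y: 0-based start position, z: length
def substr0(x, y, z):
    # Clamp negative arguments to 0
    if y < 0:
        y = 0
    if z < 0:
        z = 0
    return x[y:z+y]

from pyspark.sql.functions import UserDefinedFunction
# Register the UDF so it can be called from SQL as subStr0
sub_str0 = spark.catalog.registerFunction(
    "subStr0", UserDefinedFunction(lambda x, y, z: substr0(x, y, z)))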
spark.sql("select substr0('kevin', 0, 2)").collect()
[Row(subStr0(kevin, 0, 2)=u'ke')]
spark.sql("select substr0('kevin', 1, 2)").collect()
[Row(subStr0(kevin, 1, 2)=u'ev')]
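
The relationship between the two indexing conventions can also be checked without a Spark session. The following is only a plain-Python sketch: substr1 is a hypothetical model of the 1-based behavior as the issue describes it (a start of 0 treated like a start of 1), not Spark's actual implementation, and negative starts are left out of scope.

```python
def substr0(s, start, length):
    """0-based substring: clamp negative arguments to 0, then slice."""
    start = max(start, 0)
    length = max(length, 0)
    return s[start:start + length]


def substr1(s, start, length):
    """Hypothetical model of a 1-based substring for start >= 0.

    Following the issue description, a start of 0 is treated like a
    start of 1. Negative starts are not modeled in this sketch.
    """
    if start < 0:
        raise ValueError("negative start is outside the scope of this sketch")
    length = max(length, 0)
    begin = max(start - 1, 0)  # shift from 1-based to 0-based
    return s[begin:begin + length]


def same_result(s, start, length):
    """True if the 0-based call matches the 1-based call shifted by one."""
    return substr0(s, start, length) == substr1(s, start + 1, length)
```

Under these assumptions, a 0-based call with start i corresponds to a 1-based call with start i + 1, so shifting the index before calling the built-in substr() is an alternative to registering a UDF.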
> Mitigate 1-based substr() confusion
> -----------------------------------
>
> Key: SPARK-25807
> URL: https://issues.apache.org/jira/browse/SPARK-25807
> Project: Spark
> Issue Type: Improvement
> Components: Java API, PySpark
> Affects Versions: 1.3.0, 2.3.2, 2.4.0, 2.5.0, 3.0.0
> Reporter: Oron Navon
> Priority: Minor
>
> The method {{Column.substr()}} is 1-based, conforming with SQL and Hive's
> {{SUBSTRING}}, and contradicting both Python's string slicing and Java's
> {{String.substring()}}, which are zero-based. Both PySpark users and Java API users
> often naturally expect a 0-based {{substr()}}. Adding to the confusion,
> {{substr()}} currently allows a {{startPos}} value of 0, which returns the
> same result as {{startPos==1}}.
> Since changing {{substr()}} to 0-based is probably NOT a reasonable option
> here, I suggest making one or more of the following changes:
> # Adding a method {{substr0}}, which would be zero-based
> # Renaming {{substr}} to {{substr1}}
> # Making the existing {{substr()}} throw an exception on {{startPos==0}},
> which should catch and alert most users who expect zero-based behavior.
> This is my first discussion on this project, apologies for any faux pas.
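
To make the third suggestion in the description concrete, here is a rough plain-Python sketch of a strict 1-based substring that rejects a start position of 0. The name substr1_strict and the error messages are hypothetical and are not part of any Spark API; handling of negative positions is deliberately left out.

```python
def substr1_strict(s, start_pos, length):
    """Hypothetical strict 1-based substring that rejects start_pos == 0."""
    if start_pos == 0:
        # Fail loudly instead of silently treating 0 like 1
        raise ValueError("start_pos is 1-based; 0 is not a valid position")
    if start_pos < 0:
        raise ValueError("negative start_pos is outside the scope of this sketch")
    length = max(length, 0)
    begin = start_pos - 1  # shift from 1-based to 0-based
    return s[begin:begin + length]
```

A caller who assumes 0-based indexing and passes 0 would get an immediate error rather than a result that happens to match start_pos == 1.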