[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16010349#comment-16010349 ]
Yuming Wang commented on SPARK-19248:
-------------------------------------
The *Scala* API works fine:
{code:java}
scala> val df = spark.createDataFrame(Seq((0, ".. 5. "))).toDF("id","col")
df: org.apache.spark.sql.DataFrame = [id: int, col: string]
scala> df.select(regexp_replace($"col", "[ \\.]*", "")).show()
+-----------------------------+
|regexp_replace(col, [ \.]*, )|
+-----------------------------+
|                            5|
+-----------------------------+
scala> df.select(regexp_replace($"col", "( |\\.)*", "")).show()
+------------------------------+
|regexp_replace(col, ( |\.)*, )|
+------------------------------+
|                             5|
+------------------------------+
{code}
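For comparison, here is a minimal standalone sketch of the kind of plain-Java check the reporter mentions, using java.util.regex directly. The class name RegexReplaceCheck and the helper stripMatches are hypothetical and exist only for this illustration. The third pattern, with the backslash dropped, is an assumption about what the regex would look like if the escape were lost before it reached the regex engine. This thread does not confirm that as the cause, but it would reproduce the empty result reported for 2.0.2.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
public class RegexReplaceCheck {

    // Removes every match of `regex` from `input` using java.util.regex.
    static String stripMatches(String input, String regex) {
        return Pattern.compile(regex).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = ".. 5. ";

        // The two patterns from the report; both leave only the digit.
        System.out.println("[" + stripMatches(input, "[ \\.]*") + "]");   // prints [5]
        System.out.println("[" + stripMatches(input, "( |\\.)*") + "]");  // prints [5]

        // Assumption: the same patterns with the backslash dropped.
        // Inside a character class the dot stays literal, so nothing changes.
        System.out.println("[" + stripMatches(input, "[ .]*") + "]");     // prints [5]
        // Outside a character class the dot matches any character,
        // so the whole input is consumed.
        System.out.println("[" + stripMatches(input, "( |.)*") + "]");    // prints []
    }
}
```

If the escape is lost, only the grouped pattern changes meaning. That would match the report, where the character-class pattern behaves the same in both versions and only the grouped pattern differs.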
> Regex_replace works in 1.6 but not in 2.0
> -----------------------------------------
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Reporter: Lucas Tittmann
>
> We found an error in how Spark 2.0.2 executes regexes. Using PySpark in 1.6.2,
> we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('.. 5. ',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('.. 5. ',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex behaves differently depending on the
> Spark version. We checked both regexes in Java, and both are correct and
> work there. Regex execution in 2.0.2 therefore seems to be faulty. I am
> not able to test against 2.1 at the moment.