[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicholas Chammas updated SPARK-19248:
-------------------------------------
    Labels: correctness  (was: )

Tagging this as a correctness issue since Spark 2+'s output differs both from Python's and from Spark 1.6's.

Python 3.7.4 + Spark 2.4.3:
{code:java}
>>> df = sqlContext.createDataFrame([('.. 5. ',)], ['col'])
>>> df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
[Row(col='5')]
>>> df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
[Row(col='')]  <-- This differs from Python's output as well as Spark 1.6's output.
>>> import re
>>> re.sub(pattern='( |\.)*', repl='', string='.. 5. ')
'5'
{code}

> Regex_replace works in 1.6 but not in 2.0
> -----------------------------------------
>
>                 Key: SPARK-19248
>                 URL: https://issues.apache.org/jira/browse/SPARK-19248
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.2, 2.4.3
>            Reporter: Lucas Tittmann
>            Priority: Major
>              Labels: correctness
>
> We found an error in how Spark 2.0.2 executes regexes. Using PySpark in 1.6.2,
> we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('.. 5. ',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('.. 5. ',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex behaves differently depending on the
> Spark version. We checked the regex in Java, and both should be correct and
> work.
> Therefore, regex execution in 2.0.2 seems to be erroneous. I do not
> have the possibility to confirm in 2.1 at the moment.
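
For anyone who wants to cross-check the two patterns outside Spark, here is a minimal sketch using only Python's standard `re` module. It reproduces the Python side of the comparison only and says nothing about what Spark does internally. The helper name `strip_with` and the constant names are made up for illustration, and the sample string is copied as it appears in this message.

```python
import re

# Sample value as it appears in the report.
SAMPLE = '.. 5. '

# The two patterns from the report: a character class and an alternation group.
CHAR_CLASS = r'[ \.]*'
ALTERNATION = r'( |\.)*'


def strip_with(pattern, text=SAMPLE):
    """Hypothetical helper: delete every match of `pattern` from `text`."""
    return re.sub(pattern, '', text)


# Collect the result for each pattern so the two can be compared directly.
results = {pattern: strip_with(pattern) for pattern in (CHAR_CLASS, ALTERNATION)}

for pattern, result in results.items():
    print(repr(pattern), '->', repr(result))
```

In Python both patterns leave only the digit, which matches the `re.sub` result quoted above and the Spark 1.6 output in the original report. A Spark result of an empty string for the alternation pattern is therefore the outlier.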