[ 
https://issues.apache.org/jira/browse/SPARK-16324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Moroz updated SPARK-16324:
------------------------------
    Description: 
The documentation for regexp_extract isn't clear about how it should behave if 
the regex doesn't match the row. However, the Java documentation it refers to 
for further detail suggests that the return value should be null if the group 
wasn't matched at all, an empty string if the group actually matched an empty 
string, and that an exception should be raised if the entire regex didn't match.

This would be identical to how Python's own re module behaves when 
MatchObject.group() is called.
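
As a rough illustration of the three cases in Python's re module (a minimal sketch; the patterns and variable names are made up for this example). Note that when the whole regex fails, re.search() returns None, so the exception is the AttributeError raised by calling .group() on None.

```python
import re

# Case 1: the regex matches, but the optional group takes no part in the match.
m = re.search(r'a(z)?', 'abc')
unmatched_group = m.group(1)   # None

# Case 2: the regex matches and the group matches the empty string.
m = re.search(r'a(z*)', 'abc')
empty_group = m.group(1)       # ''

# Case 3: the whole regex fails, so re.search() returns None and
# calling .group() on it raises AttributeError.
no_match = re.search(r'(z)', 'abc')
try:
    no_match.group(1)
    raised = False
except AttributeError:
    raised = True
```

Here the unmatched group and the empty-string group stay distinguishable, which is the distinction that an empty-string return value for a failed match would hide.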

However, in practice regexp_extract() returns an empty string when the match 
fails. This seems to be a bug. If it was intended as a feature, it should have 
been documented as such, and it was probably not a good idea, since it can 
result in silent bugs.

{code}
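# Assumes a pyspark shell, where `spark` is an existing SparkSession.
import pyspark.sql.functions as F
df = spark.createDataFrame([['abc']], ['text'])
# The pattern cannot match 'abc', yet the result is '' rather than null or an error.
assert df.select(F.regexp_extract('text', r'(z)', 1)).first()[0] == ''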
{code}

  was:
The documentation for regexp_extract isn't clear about how it should behave if 
the regex doesn't match the row. However, the Java documentation it refers to 
for further detail suggests that the return value should be null if the group 
wasn't matched at all, an empty string if the group actually matched an empty 
string, and that an exception should be raised if the entire regex didn't match.

This would be identical to how Python's own re module behaves when 
MatchObject.group() is called.

However, in practice regexp_extract() returns an empty string when the match 
fails. This seems to be a bug. If it was intended as a feature, it should have 
been documented as such, and it was probably not a good idea, since it can 
result in silent bugs.

{code}
import pyspark.sql.functions as F
df = spark.createDataFrame([['abc']], ['text'])
assert df.select(F.regexp_extract('text', r'z', 1)).first()[0] == ''
{code}


> regexp_extract returns empty string when match fails
> ----------------------------------------------------
>
>                 Key: SPARK-16324
>                 URL: https://issues.apache.org/jira/browse/SPARK-16324
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Max Moroz
>            Priority: Minor
>
> The documentation for regexp_extract isn't clear about how it should behave 
> if the regex doesn't match the row. However, the Java documentation it refers 
> to for further detail suggests that the return value should be null if the 
> group wasn't matched at all, an empty string if the group actually matched an 
> empty string, and that an exception should be raised if the entire regex 
> didn't match.
> This would be identical to how Python's own re module behaves when 
> MatchObject.group() is called.
> However, in practice regexp_extract() returns an empty string when the match 
> fails. This seems to be a bug. If it was intended as a feature, it should 
> have been documented as such, and it was probably not a good idea, since it 
> can result in silent bugs.
> {code}
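> # Assumes a pyspark shell, where `spark` is an existing SparkSession.
> import pyspark.sql.functions as F
> df = spark.createDataFrame([['abc']], ['text'])
> # The pattern cannot match 'abc', yet the result is '' rather than null or an error.
> assert df.select(F.regexp_extract('text', r'(z)', 1)).first()[0] == ''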
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
