[ https://issues.apache.org/jira/browse/SPARK-16324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359361#comment-15359361 ]
Jeff Zhang commented on SPARK-16324:
------------------------------------
I think this is by design. The implementation explicitly returns an empty string when the regex does not match:
{code}
override def nullSafeEval(s: Any, p: Any, r: Any): Any = {
  if (!p.equals(lastRegex)) {
    // regex value changed
    lastRegex = p.asInstanceOf[UTF8String].clone()
    pattern = Pattern.compile(lastRegex.toString)
  }
  val m = pattern.matcher(s.toString)
  if (m.find) {
    val mr: MatchResult = m.toMatchResult
    UTF8String.fromString(mr.group(r.asInstanceOf[Int]))
  } else {
    UTF8String.EMPTY_UTF8
  }
}
{code}
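To make the control flow easier to follow from the PySpark side, here is a rough Python sketch of the same logic. It is only an illustration, not Spark code: the name regexp_extract_sketch and the module-level cache variables are made up for this example, and Python's re syntax differs from java.util.regex in places.

```python
import re

# Hypothetical stand-ins for the lastRegex / pattern fields in the Scala snippet.
_last_regex = None
_pattern = None


def regexp_extract_sketch(s, p, r):
    """Approximate the quoted nullSafeEval logic with Python's re module."""
    global _last_regex, _pattern
    if p != _last_regex:
        # Regex value changed: recompile and remember it.
        _last_regex = p
        _pattern = re.compile(p)
    m = _pattern.search(s)  # like Matcher.find: match anywhere in the input
    if m:
        # May be None if the group did not take part in the match.
        return m.group(r)
    # No match at all: an empty string rather than None or an exception.
    return ""
```

With this sketch, regexp_extract_sketch("abc", r"(z)", 1) takes the else branch and gives an empty string, which corresponds to the behaviour reported in the issue.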
> regexp_extract returns empty string when match fails
> ----------------------------------------------------
>
> Key: SPARK-16324
> URL: https://issues.apache.org/jira/browse/SPARK-16324
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.0
> Reporter: Max Moroz
> Priority: Minor
>
> The documentation for regexp_extract isn't clear about how it should behave
> if the regex doesn't match the row. However, the Java documentation it refers
> to for further detail suggests that the return value should be null if the
> group wasn't matched at all, an empty string if the group actually matched an
> empty string, and that an exception should be raised if the entire regex
> didn't match. This would be identical to how Python's own re module behaves
> when MatchObject.group() is called.
> However, in practice regexp_extract() returns an empty string when the match
> fails. This seems to be a bug; if it was intended as a feature, it should
> have been documented as such, and it was probably not a good idea since it
> can result in silent bugs.
> {code}
> import pyspark.sql.functions as F
> df = spark.createDataFrame([['abc']], ['text'])
> assert df.select(F.regexp_extract('text', r'(z)', 1)).first()[0] == ''
> {code}
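For comparison, the three cases the description distinguishes can be checked with Python's re module alone. This is a small sketch; describe_group is a hypothetical helper written for this example. Note that in Python re.search returns None on failure, so the error only appears when .group() is called on that None.

```python
import re


def describe_group(text, regex, group):
    """Classify the outcome of extracting one group with Python's re."""
    m = re.search(regex, text)
    if m is None:
        # The whole regex failed; m.group(...) here would raise AttributeError.
        return "no match"
    value = m.group(group)
    if value is None:
        # The regex matched, but this group did not take part in the match.
        return "group did not participate"
    # The group matched, possibly an empty string.
    return repr(value)
```

Calling describe_group("abc", r"(z)", 1) falls into the first case, which is the one where regexp_extract returns an empty string instead.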