LuciferYang commented on code in PR #56315:
URL: https://github.com/apache/spark/pull/56315#discussion_r3360116459
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala:
##########
@@ -738,26 +736,8 @@ case class RegExpReplace(subject: Expression, regexp:
Expression, rep: Expressio
lastReplacementInUTF8 = r.asInstanceOf[UTF8String].clone()
lastReplacement = lastReplacementInUTF8.toString
}
- val source = s.toString()
- val position = i.asInstanceOf[Int] - 1
- if (position == 0 || position < source.length) {
- val m = pattern.matcher(source)
- m.region(position, source.length)
- result.delete(0, result.length())
- while (m.find) {
- try {
- m.appendReplacement(result, lastReplacement)
- } catch {
- case NonFatal(e) =>
- throw QueryExecutionErrors.invalidRegexpReplaceError(s.toString,
- p.toString, r.toString, i.asInstanceOf[Int], e)
- }
- }
- m.appendTail(result)
- UTF8String.fromString(result.toString)
- } else {
- s
- }
+ RegExpUtils.replace(pattern, s.asInstanceOf[UTF8String], lastReplacement,
Review Comment:
Follow-up on this, plus one open question for a joint call:
Pushed edcd5c6: I also dropped the `regexp` parameter entirely and now use
`pattern.pattern()` for the error message. Since `compileRegexPattern` compiles
the raw regexp without escaping (collation is carried in the `int` flags, not
the pattern text), `pattern.pattern()` round-trips the original regexp exactly,
and it's only evaluated on the rare error path -- this removes the eager
per-row `regexp.toString()` that the String-param version would otherwise have
added in the generated code.
Open question on the remaining `subject` argument: to make the eval site
fully cast-free I pass it as `String` (`s.toString`). One consequence is that
the out-of-range no-op branch now returns `UTF8String.fromString(source)`
instead of the original `subject` bytes. For valid UTF-8 this is
byte-identical, and it's consistent with the match path (which already builds
its result via `UTF8String.fromString(...)`). The only divergence is for
malformed UTF-8 subjects (e.g. `cast(binary as string)` with invalid bytes),
where the String round-trip substitutes U+FFFD; passing it as `String` also
drops the checked cast, so a hypothetical type mismatch would no longer fail
fast.
I can instead keep `subject: UTF8String` (returning it unchanged on the
out-of-range path -> exact bytes + checked cast) at the cost of one remaining
`.asInstanceOf[UTF8String]` at the call site. Do you prefer (a) the fully
cast-free `String` version as-is, or (b) keeping `subject: UTF8String` for
byte-exact passthrough? Happy to go either way.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]