LuciferYang commented on code in PR #56315:
URL: https://github.com/apache/spark/pull/56315#discussion_r3360116459


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala:
##########
@@ -738,26 +736,8 @@ case class RegExpReplace(subject: Expression, regexp: 
Expression, rep: Expressio
       lastReplacementInUTF8 = r.asInstanceOf[UTF8String].clone()
       lastReplacement = lastReplacementInUTF8.toString
     }
-    val source = s.toString()
-    val position = i.asInstanceOf[Int] - 1
-    if (position == 0 || position < source.length) {
-      val m = pattern.matcher(source)
-      m.region(position, source.length)
-      result.delete(0, result.length())
-      while (m.find) {
-        try {
-          m.appendReplacement(result, lastReplacement)
-        } catch {
-          case NonFatal(e) =>
-            throw QueryExecutionErrors.invalidRegexpReplaceError(s.toString,
-              p.toString, r.toString, i.asInstanceOf[Int], e)
-        }
-      }
-      m.appendTail(result)
-      UTF8String.fromString(result.toString)
-    } else {
-      s
-    }
+    RegExpUtils.replace(pattern, s.asInstanceOf[UTF8String], lastReplacement,

Review Comment:
   Follow-up on this, plus one open question for a joint call:
   
   Pushed edcd5c6: I also dropped the `regexp` parameter entirely and now use 
`pattern.pattern()` for the error message. Since `compileRegexPattern` compiles 
the raw regexp without escaping (collation is carried in the `int` flags, not 
the pattern text), `pattern.pattern()` round-trips the original regexp exactly, 
and it's only evaluated on the rare error path -- this removes the eager 
per-row `regexp.toString()` that the String-param version would otherwise have 
added in the generated code.
   
   Open question on the remaining `subject` argument: to make the eval site 
fully cast-free I pass it as `String` (`s.toString`). One consequence is that 
the out-of-range no-op branch now returns `UTF8String.fromString(source)` 
instead of the original `subject` bytes. For valid UTF-8 this is 
byte-identical, and it's consistent with the match path (which already builds 
its result via `UTF8String.fromString(...)`). The only divergence is for 
malformed UTF-8 subjects (e.g. `cast(binary as string)` with invalid bytes), 
where the String round-trip substitutes U+FFFD; passing it as `String` also 
drops the checked cast, so a hypothetical type mismatch would no longer fail 
fast.
   
   I can instead keep `subject: UTF8String` (returning it unchanged on the 
out-of-range path -> exact bytes + checked cast) at the cost of one remaining 
`.asInstanceOf[UTF8String]` at the call site. Do you prefer (a) the fully 
cast-free `String` version as-is, or (b) keeping `subject: UTF8String` for 
byte-exact passthrough? Happy to go either way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to