Re: [PR] [SPARK-57436][SQL] Support start and occurrence parameters in instr function [spark]

via GitHub Wed, 17 Jun 2026 11:45:18 -0700


cloud-fan commented on code in PR #56498:
URL: https://github.com/apache/spark/pull/56498#discussion_r3430661449



##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1246,6 +1246,117 @@ public int indexOf(UTF8String v, int start) {
     return -1;
   }
 
+  /**
+   * Finds the {@code occurrence}-th occurrence of {@code pattern} in this 
string,
+   * starting the search at the specified position.
+   * When {@code start} is positive, the search proceeds forward from the
+   * {@code start}-th character (1‑based). When {@code start} is negative, the

Review Comment:
   Nit: this uses a U+2011 NON-BREAKING HYPHEN in "1‑based" instead of a plain 
ASCII "-", which violates the project's ASCII-in-comments convention (the 
`scalastyle` nonascii rule that flags this on `.scala`). The same non-breaking 
hyphen also appears at line 1259 (`1‑based start position`) and line 1261 
(`0‑based character index`) — please apply the same one-character fix there.
   ```suggestion
      * {@code start}-th character (1-based). When {@code start} is negative, 
the
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -1701,6 +1730,79 @@ case class StringInstr(str: Expression, substr: 
Expression)
     newLeft: Expression, newRight: Expression): StringInstr = copy(str = 
newLeft, substr = newRight)
 }
 
+/**
+ * A function that returns the position of the specified occurrence of 
`substr` in the given
+ * string, starting the search from position `start`. If `start` is positive, 
the search proceeds
+ * forward; if `start` is negative, the search proceeds backward. `start` = 0 
returns 0. If
+ * `start` is not specified, it defaults to 1. If `occurrence` is specified, 
it determines which
+ * occurrence of `substr` to return; `occurrence` must be a positive integer 
and defaults to 1.
+ *
+ * Returns null if either of the arguments are null and
+ * returns 0 if substr could not be found in str.
+ *
+ * NOTE: that this is not zero based, but 1-based index. The first character 
in str has index 1.
+ */
+case class StringInstrWithOccurrence(
+    str: Expression,
+    sub: Expression,
+    start: Expression,
+    occurrence: Expression,
+    failOnError: Boolean = SQLConf.get.ansiEnabled)

Review Comment:
   Reinforcing @uros-b's open thread below with a sharper rationale: I'd drop 
`failOnError` entirely and **always throw** 
`INVALID_PARAMETER_VALUE.OCCURRENCE` on `occurrence <= 0`, removing the `else 
return null` in `nullSafeEval`, the null branch in `doGenCode`, and the 
non-ANSI test.
   
   Two reasons:
   1. **New signature, nothing to preserve.** The non-ANSI path exists to keep 
legacy behavior, but the 3/4-arg `instr` is brand new and ANSI is on by default 
(`ANSI_ENABLED.createWithDefault(true)` since 4.0). So this branch only changes 
behavior for users who explicitly disable ANSI, with no compatibility reason — 
it's speculative complexity.
   2. **It's a usage error, not a data error.** `occurrence <= 0` is an invalid 
*argument*; `INVALID_PARAMETER_VALUE.*` usage errors throw unconditionally by 
convention (e.g. `BIT_POSITION_RANGE`, `PATTERN`). ANSI-gating is for 
data-dependent runtime errors (overflow, divide-by-zero), not argument 
validation.
   
   Dropping the field also simplifies both `nullSafeEval` and `doGenCode`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-57436][SQL] Support start and occurrence parameters in instr function [spark]

Reply via email to