cloud-fan commented on code in PR #56498:
URL: https://github.com/apache/spark/pull/56498#discussion_r3430661449
##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1246,6 +1246,117 @@ public int indexOf(UTF8String v, int start) {
return -1;
}
+ /**
+ * Finds the {@code occurrence}-th occurrence of {@code pattern} in this
string,
+ * starting the search at the specified position.
+ * When {@code start} is positive, the search proceeds forward from the
+ * {@code start}-th character (1‑based). When {@code start} is negative, the
Review Comment:
Nit: this uses a U+2011 NON-BREAKING HYPHEN in "1‑based" instead of a plain
ASCII "-", which violates the project's ASCII-in-comments convention (the
`scalastyle` nonascii rule that flags this on `.scala`). The same non-breaking
hyphen also appears at line 1259 (`1‑based start position`) and line 1261
(`0‑based character index`) — please apply the same one-character fix there.
```suggestion
* {@code start}-th character (1-based). When {@code start} is negative,
the
```
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -1701,6 +1730,79 @@ case class StringInstr(str: Expression, substr:
Expression)
newLeft: Expression, newRight: Expression): StringInstr = copy(str =
newLeft, substr = newRight)
}
+/**
+ * A function that returns the position of the specified occurrence of
`substr` in the given
+ * string, starting the search from position `start`. If `start` is positive,
the search proceeds
+ * forward; if `start` is negative, the search proceeds backward. `start` = 0
returns 0. If
+ * `start` is not specified, it defaults to 1. If `occurrence` is specified,
it determines which
+ * occurrence of `substr` to return; `occurrence` must be a positive integer
and defaults to 1.
+ *
+ * Returns null if either of the arguments are null and
+ * returns 0 if substr could not be found in str.
+ *
+ * NOTE: that this is not zero based, but 1-based index. The first character
in str has index 1.
+ */
+case class StringInstrWithOccurrence(
+ str: Expression,
+ sub: Expression,
+ start: Expression,
+ occurrence: Expression,
+ failOnError: Boolean = SQLConf.get.ansiEnabled)
Review Comment:
Reinforcing @uros-b's open thread below with a sharper rationale: I'd drop
`failOnError` entirely and **always throw**
`INVALID_PARAMETER_VALUE.OCCURRENCE` on `occurrence <= 0`, removing the `else
return null` in `nullSafeEval`, the null branch in `doGenCode`, and the
non-ANSI test.
Two reasons:
1. **New signature, nothing to preserve.** The non-ANSI path exists to keep
legacy behavior, but the 3/4-arg `instr` is brand new and ANSI is on by default
(`ANSI_ENABLED.createWithDefault(true)` since 4.0). So this branch only changes
behavior for users who explicitly disable ANSI, with no compatibility reason —
it's speculative complexity.
2. **It's a usage error, not a data error.** `occurrence <= 0` is an invalid
*argument*; `INVALID_PARAMETER_VALUE.*` usage errors throw unconditionally by
convention (e.g. `BIT_POSITION_RANGE`, `PATTERN`). ANSI-gating is for
data-dependent runtime errors (overflow, divide-by-zero), not argument
validation.
Dropping the field also simplifies both `nullSafeEval` and `doGenCode`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]