MaxGekk commented on code in PR #56498:
URL: https://github.com/apache/spark/pull/56498#discussion_r3452888315
##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1246,6 +1246,117 @@ public int indexOf(UTF8String v, int start) {
return -1;
}
+ /**
+ * Finds the {@code occurrence}-th occurrence of {@code pattern} in this
string,
+ * starting the search at the specified position.
+ * When {@code start} is positive, the search proceeds forward from the
+ * {@code start}-th character (1-based). When {@code start} is negative, the
+ * search proceeds backward from the {@code |start|}-th character from the
end.
+ * Overlapping matches are supported (e.g. "aa" in "aaa" returns 0, 1, 2 for
+ * occurrence 1, 2, 3 respectively).
+ *
+ * @param pattern the substring to search for
+ * @param start 1-based start position; if negative, search direction
is reversed
+ * @param occurrence which occurrence to return (must be >= 1)
+ * @return 0-based character index of the match, or -1 if not found
+ */
+ public int indexOf(UTF8String pattern, int start, int occurrence) {
+ assert occurrence > 0;
+ if (pattern.numBytes() == 0) {
+ return indexOfEmpty(start);
+ }
+ if (start == 0) {
+ return -1;
+ }
+
+ int charCount = 0;
+ int charsToSkip, byteIdx;
+ if (start > 0) {
+ byteIdx = 0; // position in byte
+ charsToSkip = start - 1; // skip character count
+ while (byteIdx < numBytes && charCount < charsToSkip) {
+ byteIdx += numBytesForFirstByte(getByte(byteIdx));
+ charCount += 1;
+ }
+ } else {
+ // For negative start, byteIdx is the exclusive end boundary of the
search.
+ // We skip (-start - 1) characters to place byteIdx just past the last
+ // character that may participate in the match.
+ charsToSkip = -start - 1;
+ byteIdx = numBytes;
+ while (byteIdx > 0 && charCount < charsToSkip) {
+ byteIdx = prevCharStart(byteIdx);
+ charCount++;
+ }
+ }
+ // If start position is out of range, return -1 immediately
+ if (charCount < charsToSkip) return -1;
+
+ // For forward search, byteIdx is the starting byte offset of the current
character,
+ // and charCount tracks the 0-based character index of that character.
+ // For backward search, byteIdx is the exclusive end boundary (the byte
after the
+ // last character that can participate in the match), and charCount is not
used
+ // for indexing but kept for consistency.
+ if (start > 0) {
+ // Search for the occurrence-th match, starting from the current byteIdx.
+ while (occurrence > 0) {
+ // If byteIdx equals numBytes, it indicates that we have
+ // reached the end of the string, yet we still enter the loop
+ // and return -1 when the 'not found' condition is met.
+ while (byteIdx <= numBytes) {
+ if (pattern.numBytes + byteIdx > numBytes) {
+ return -1;
+ }
+
+ if (ByteArrayMethods.arrayEquals(base, offset + byteIdx,
+ pattern.base, pattern.offset, pattern.numBytes)) {
+ break;
+ }
+ byteIdx += numBytesForFirstByte(getByte(byteIdx));
+ charCount += 1;
+ }
+
+ occurrence--;
+ if (occurrence == 0) return charCount;
+
+ byteIdx += numBytesForFirstByte(getByte(byteIdx));
+ charCount += 1;
+ }
+ } else {
+ // Backward search: the search range is [0, byteIdx) in bytes.
Review Comment:
I think the backward-search boundary is off for multi-character patterns.
For Oracle-style `INSTR`, `position` is the first character of the first
substring to compare even when it is negative (Oracle docs:
https://docs.oracle.com/en/database/oracle/oracle-database/26/sqlrf/INSTR.html).
So `instr('abcde', 'cd', -3, 1)` should return `3`: `-3` points at `c`, and
the first candidate substring is `cd`.
This implementation treats negative `start` as an exclusive end boundary and
first compares the substring ending at `c` (`bc`), then moves left, so it
appears to return 0 for that case. Could you add a test like `instr('abcde',
'cd', -3, 1)` and adjust the backward path to compare candidates by their start
position rather than their end boundary? The UTF8_LCASE slow path seems to have
the same issue via `lowercaseMatchLengthUntil(..., startIdx)`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]