Akanksha-kedia commented on PR #18892:
URL: https://github.com/apache/pinot/pull/18892#issuecomment-4845995511

   cc @xiangfu0 @Jackie-Jiang — requesting your review.
   
   ---
   
   ### Self-review findings
   
   #### skill: review-performance
   
   **MAJOR — Double-scan for negative indices [BUG-PERF / C4.2]**
   
   For a negative `index`, the new code makes two full forward scans:
   1. `countFieldsLimited` — scans the entire string to count fields
   2. `splitPartLimitedForward` — scans the entire string again to extract the 
target field
   
   The 3-arg `splitPart` already has `splitPartNegativeIdxSingleCharDelim` 
which uses a backward scan (O(n), no count pass). The 4-arg overload could 
adopt the same pattern for single-char delimiters: scan backward once, stop 
when the field is found. For multi-char delimiters the double-scan is harder to 
avoid, but the current code applies it to all cases.
   
   Recommendation: add a single-char-delimiter fast path mirroring 
`splitPartNegativeIdxSingleCharDelim` to the 4-arg overload, or document the 
trade-off in a TODO.
   
   **MINOR — No benchmark attached [C4.5]**
   
   The PR claims a GC-pressure improvement but does not attach JMH results. 
Consider running `/bench-compare` on a `StringFunctionsBenchmark` (or 
equivalent) to quantify the per-row gain before merge.
   
   ---
   
   #### skill: review-correctness-nulls
   
   **MINOR — `while (pos <= inputLen)` upper-bound in `splitPartLimitedForward` 
[C5.6]**
   
   The loop `while (pos <= inputLen)` allows `pos == inputLen` to enter the 
body. At that point `input.indexOf(delimiter, inputLen)` always returns `-1`, 
so control falls into the "no more delimiters" branch and returns `"null"` (or 
the field value). This is functionally correct because the trailing-delimiter 
check inside the loop (`if (pos >= inputLen)`) already handles the real 
trailing-empty-field case and `return`s before the loop re-evaluates. But the 
`<=` condition is misleading — a reader has to trace the full control flow to 
convince themselves it is safe. Using `while (pos < inputLen)` with the 
existing trailing-field check would make the invariant self-evident.
   
   **MINOR — `countFieldsLimited` "all separators" return value mismatch with 
effectiveLimit [C5.3]**
   
   When the entire input is separators, `countFieldsLimited` returns `1` 
unconditionally, ignoring `effectiveLimit`. This matches 
`splitByWholeSeparator` semantics (always one empty token), but a reader 
comparing the two methods might expect the limit-reached check to apply. A 
one-line comment (`// all-separator input always yields a single empty trailing 
field regardless of limit`) would pre-empt the question.
   
   ---
   
   #### skill: review-testing
   
   **CRITICAL — Fuzz test does not cover the changed 4-arg overload [BUG-TEST / 
C6.1]**
   
   `testSplitPartRandomized` generates 10 000 random inputs but calls 
`splitPart(input, delimiter, index)` (3-arg). The PR changed the **4-arg** 
`splitPart(input, delimiter, limit, index)`. No equivalent randomised 
cross-check against `splitPartArrayBased` exists for the 4-arg path.
   
   Suggested addition:
   
   ```java
   @Test
   public void testSplitPartLimitedRandomized() {
     String chars = "abcdefg.,:;+-_/";
     String[] delimiters = {".", ",", ":", "::", "++", "ab", "///"};
     int[] limits = {0, 1, 2, 3, 5, 10, Integer.MAX_VALUE};
     Random random = new Random(42);
     for (int iter = 0; iter < 10_000; iter++) {
       int len = random.nextInt(50);
       StringBuilder sb = new StringBuilder(len);
       for (int i = 0; i < len; i++) 
sb.append(chars.charAt(random.nextInt(chars.length())));
       String input = sb.toString();
       String delimiter = delimiters[random.nextInt(delimiters.length)];
       int limit = limits[random.nextInt(limits.length)];
       int index = random.nextInt(21) - 10;
       String expected = StringFunctions.splitPartArrayBased(
           StringUtils.splitByWholeSeparator(input, delimiter, limit <= 0 ? 0 : 
limit), index);
       String actual = StringFunctions.splitPart(input, delimiter, limit, 
index);
       assertEquals(actual, expected,
           String.format("Mismatch input='%s' delim='%s' limit=%d index=%d", 
input, delimiter, limit, index));
     }
   }
   ```
   
   **MINOR — No integration test for the 4-arg overload [C6.9]**
   
   The `splitPart` function is registered in `all-functions.yaml` but 
`StringFunctions.json` has no SQL-level test cases for the `splitPart(col, 
delim, limit, index)` form. A few rows in 
`pinot-query-runtime/src/test/resources/queries/StringFunctions.json` covering 
positive index, negative index, and limit-truncation would close this gap 
without needing a new cluster.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to