yashmayya commented on PR #18513:
URL: https://github.com/apache/pinot/pull/18513#issuecomment-4492234210

   This is the pre-existing flake tracked in #18490, not something this PR 
caused.
   
   **Failure point.** `org.apache.pinot.compat.StreamOp.fetchExistingTotalDocs` 
at line 281:
   
   ```java
   if (response.has(EXCEPTIONS) && !response.get(EXCEPTIONS).isEmpty()) {
     ...
     JsonNode exceptions = response.get(EXCEPTIONS);     // ArrayNode
     JsonNode errorCode  = exceptions.get(ERROR_CODE);   // ArrayNode.get(String) → null
     if (QueryErrorCode.BROKER_INSTANCE_MISSING.getId() == errorCode.asInt()) {  // NPE
   ```
   
   `exceptions` is a JSON array (`[{"errorCode": ..., "message": ...}, ...]`), 
but the code indexes it by a string key, so `errorCode` is always `null` 
whenever the broker returns any non-empty `exceptions` array. That happens 
during the post-upgrade/post-downgrade window where segments are still 
bootstrapping and the broker emits errors like `305=N segments unavailable` 
(visible right above the stack trace in the log). The exceptions branch is 
never entered in the steady state, so the rest of the suite passes; it only 
blows up when the test queries the cluster mid-bootstrap.
   
   The query that crashes is `SELECT count(*) FROM <table>` (StreamOp.java:268) 
— no UNION, so `AggregateUnionTransposeRule` can't even match the plan.
   
   **Why I'm confident it's unrelated to this PR:**
   
   1. The same NPE at the same stack frame hits master itself in run 
[26066589112](https://github.com/apache/pinot/actions/runs/26066589112) (a pure 
`com.google.cloud:libraries-bom` dependency bump), with no MSE-planner changes 
whatsoever.
   2. Across the 5 attempts of this PR's compat check, the failures don't track 
this PR's code — they track the cluster's startup timing:
      - Attempt 1: against-master FAIL, against-release-1.5.0 SUCCESS
      - Attempt 2: both FAIL
      - Attempts 3–5: against-master SUCCESS, against-release-1.5.0 FAIL
   
      If the change in 332f5cf (hint propagation) were the cause, both 
baselines would fail deterministically, but they don't.
   3. Issue #18490 (filed 2026-05-13) describes exactly this NPE, with the same 
root cause and the same fix (`exceptions.get(0).get("errorCode")` plus retry on 
`BROKER_SEGMENT_UNAVAILABLE`).
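
For reference, here is a minimal sketch of the shape of that fix. It is not the actual patch: the response is modeled with a plain `List<Map<String, Object>>` instead of Jackson's `JsonNode` so the example has no dependencies, the class and method names are hypothetical, and the value 305 is taken from the log line quoted above on the assumption that it is the `BROKER_SEGMENT_UNAVAILABLE` code.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;

/** Hypothetical sketch: read the error code from the first array element, then decide on retry. */
public class ExceptionCodeSketch {
  // Assumption: 305 is the code behind "305=N segments unavailable" in the log.
  static final int BROKER_SEGMENT_UNAVAILABLE = 305;

  /** Returns errorCode of the first exception entry, or empty if the array or field is missing. */
  static OptionalInt firstErrorCode(List<Map<String, Object>> exceptions) {
    if (exceptions == null || exceptions.isEmpty()) {
      return OptionalInt.empty();
    }
    // Index the array by position first, then look up the field on that element.
    Object code = exceptions.get(0).get("errorCode");
    return code instanceof Number
        ? OptionalInt.of(((Number) code).intValue())
        : OptionalInt.empty();
  }

  /** True when the failure looks transient (segments still loading) and the query can be retried. */
  static boolean shouldRetry(List<Map<String, Object>> exceptions) {
    OptionalInt code = firstErrorCode(exceptions);
    return code.isPresent() && code.getAsInt() == BROKER_SEGMENT_UNAVAILABLE;
  }

  public static void main(String[] args) {
    Map<String, Object> unavailable =
        Map.of("errorCode", 305, "message", "N segments unavailable");
    List<Map<String, Object>> midBootstrap = List.of(unavailable);

    System.out.println(shouldRetry(midBootstrap)); // prints true
    System.out.println(shouldRetry(List.of()));    // prints false
  }
}
```

The point of the sketch is the order of lookups: element first, field second, with a missing field treated as "no code" rather than dereferenced. In `StreamOp` the equivalent lookup would be the `exceptions.get(0).get("errorCode")` form that #18490 proposes.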
   
   Happy to send a separate small PR fixing #18490 to stop this from blocking 
unrelated PRs, but it's out of scope for this change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

