gortiz commented on PR #18519:
URL: https://github.com/apache/pinot/pull/18519#issuecomment-4519341207

   After the latest round of reviewer-requested fixes, I ran an internal strict 
pre-push review across the cumulative diff. It surfaced three additional issues 
that weren't on any review thread — fixed all three to avoid shipping them.
   
   ### Fixed
   
   1. **`send(Eos)` vs `cancel` race throwing `IllegalStateException`** — 
symmetric to the data-path race that `c39e12f1a4` closed, but on the *closing* 
edge of the stream. Both paths pass their top-of-method `isTerminated()` check 
before either sets `_senderSideClosed`. Cancel wins the `_readyLock`, calls 
`onCompleted()`, releases. The EOS path then takes the lock and calls `onNext` 
/ `onCompleted` on an already-half-closed stream → `IllegalStateException` 
propagates out of `send(Eos)`. Cancel-path swallowed it defensively; EOS-path 
didn't. Fixed in `840c5640ea` with a defensive `try/catch` around the EOS path 
(matching the cancel-side pattern) plus a regression test 
`concurrentSendEosAndCancelDoesNotThrow` that drives `send(EOS) + cancel()` 
through a `CyclicBarrier(2)` for 200 iterations and asserts neither call throws.
   
   2. **`close()` leaving the gRPC sender stream half-open** — `close()` pushes 
an internal-error block via `processAndSend(...bypassReady=true)` but never 
calls `onCompleted()`. The stream sits half-open from the sender side until the 
per-call deadline fires — for multi-minute MSE queries that's exactly the 
allocator pin this PR exists to fix. Also had a `_contentObserver != null` 
check-then-act outside the lock (same shape `ensureContentObserverInitialized` 
exists to prevent). Fixed in `aeaacc893d`: route through 
`ensureContentObserverInitialized`, take `_readyLock`, set `_senderSideClosed = 
true`, call `onCompleted()` with the same defensive `try/catch`.
   
   3. **`wakeWaiters` from Netty event-loop locks `_readyLock`** — 
`setOnReadyHandler` fires on a Netty channel/event-loop thread; `wakeWaiters` 
then acquires `_readyLock`. If the sender thread is mid-`onNext` under that 
lock, the event-loop thread briefly blocks. Intentional and the lock-held 
sections are bounded by a single `onNext`/`onCompleted`, but the pattern is 
non-obvious enough to surprise a maintainer debugging latency spikes. 
Documented in `e367aa697f` so the next reader doesn't reach for a 
`tryLock`+executor restructure without context.
   
   ### Considered and skipped
   
   - `AtomicBoolean` CAS on `_senderSideClosed` to make the close transition 
atomic across EOS / cancel / `close()`. Behavioral effect is now bounded by the 
defensive try/catches; left as a separate concern.
   - Documenting the precondition on `ensureContentObserverInitialized` 
("callers must `isTerminated()`-check first") — pure docs.
   - Bench drainer's `_stop.set(true)` before `_readSignal.release()` ordering 
— benign at JVM shutdown.
   
   Strict review run: `code-reviewer` agent with 8 domain-specialized 
sub-reviewers in parallel (concurrency-state, performance, correctness-nulls, 
testing, architecture, naming-api, process-scope, config-backcompat). The three 
above were everything in the CRITICAL/MAJOR/MINOR-worth-fixing buckets.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to