markrmiller opened a new pull request, #4548:
URL: https://github.com/apache/solr/pull/4548

   https://issues.apache.org/jira/browse/SOLR-18244
   
   SOLR-18244: Fix several concurrency bugs in ParallelHttpShardHandler
   
   - Catch RejectedExecutionException from commExecutor and route through
     recordShardSubmitError so saturation/shutdown does not propagate
     synchronously and abandon already-submitted shard requests; this
     preserves shards.tolerant semantics and returns a 503 instead of 500.
   
   - Synchronize submitFutures registration and removal on the same monitor
     HttpShardHandler uses for responseFutureMap so cancelAll observes a
     consistent view of both maps.
   
   - Skip work in the runAsync runnable when the outer future has already
     been cancelled, avoiding a wasted lbClient.requestAsync call.
   
   - In HttpShardHandler.takeCompletedIncludingErrors, replace the blocking
     responses.take() with a 50ms poll so subclasses that gate
     responsesPending() on an async tracker (e.g. submitFutures) cannot
     cause a lost-wakeup hang when the tracker drains between the pending
     check and the queue wait.
   
   - Wrap the inner whenComplete body in a try/catch so a thrown
     transformResponse (or other failure inside the lambda) still enqueues
     a response and unblocks take() instead of stranding the consumer.
   
   - Expose cancellationLock() and isCanceled() on HttpShardHandler so
     subclasses can synchronize on the same monitor that already guards
     the canceled flag, responseFutureMap, and the responses queue.
   
   - Synchronize ParallelHttpShardHandler.responsesPending() on the
     cancellation monitor. Without the lock, take()'s loop performs three
     unsynchronized isEmpty() reads (responseFutureMap, responses,
     submitFutures) and can transiently observe all three as empty even
     though super.makeShardRequest's put and outer.whenComplete's
     submitFutures.remove are causally ordered. The intermediate state is
     consistent with each individual read but never with the cross-map
     invariant, so take() exits with null and the pending response (which
     arrives moments later via the inner whenComplete) is silently dropped
     by SearchHandler. Acquiring the lock serializes the three reads with
     every state-mutating operation.
   
   - Stop using submitFutures.isEmpty() as the loop guard in
     responsesPending(); use an exact AtomicInteger (inFlightSubmits)
     instead. ConcurrentHashMap.size()/isEmpty() are documented estimates:
     sumCount() = baseCount + sum(counterCells), and under contended
     put/remove from many threads the per-cell deltas can settle at a
     non-zero value while the table is physically empty. When that
     happens, isEmpty() returns false on a logically empty map,
     responsesPending() stays true, and the parent thread parks in
     responses.take() forever. AtomicInteger.get() is exact under any
     concurrency. The counter is incremented before runAsync, decremented
     unconditionally in the runAsync future's whenComplete finally, and
     also decremented in the RejectedExecutionException catch (no future)
     and the canceled-before-track early return (no whenComplete). The
     submitFutures map is retained only as the iteration target for
     cancelAll.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to