Mark Robert Miller created SOLR-18244:
-----------------------------------------
Summary: Fix concurrency bugs in HttpShardHandler /
ParallelHttpShardHandler that can hang searches or crash coordinator
Key: SOLR-18244
URL: https://issues.apache.org/jira/browse/SOLR-18244
Project: Solr
Issue Type: Bug
Reporter: Mark Robert Miller
Fix several concurrency bugs in HttpShardHandler / ParallelHttpShardHandler that
could leave search threads parked in HttpShardHandler.take() indefinitely or
crash the
coordinator under load. (1) ParallelHttpShardHandler.cancelAll and
makeShardRequest
now synchronize their submitFutures operations on the same monitor that
HttpShardHandler uses, so the canceled flag, responseFutureMap, the responses
queue's CANCELLATION_NOTIFICATION, and submitFutures all transition atomically
— matching the base class's cancellation invariant. (2)
RejectedExecutionException from
commExecutor (queue full or executor shutting down) is now caught and routed
through recordShardSubmitError as a SERVICE_UNAVAILABLE shard failure instead
of propagating synchronously out of submit() and crashing SearchHandler's
distributed loop — preserving shards.tolerant=true semantics under thread pool
saturation. (3) The
Parallel runnable now reads its own outer CompletableFuture via an
AtomicReference
holder and short-circuits when cancelled before scheduling, avoiding a wasted
lbClient.requestAsync that super.makeShardRequest would just immediately cancel.
(4) The inner whenComplete in HttpShardHandler.makeShardRequest now wraps
response processing in a try/catch — any throwable (including from a subclass
transformResponse) is recorded on the ShardResponse and queued under the
canceled
monitor, so a thrown response handler can no longer leave responseFutureMap
populated with an entry that nothing will ever enqueue a response for. (5)
HttpShardHandler.take() now polls the responses queue with a 50ms timeout
instead of blocking indefinitely, closing a fast-path deadlock: when a subclass
async tracker like submitFutures drains AFTER the coordinator re-evaluated
responsesPending()==true and entered responses.take(), no further response will
ever arrive to wake it. The polling loop re-checks responsesPending()
periodically and exits cleanly when the tracker drains.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]