[
https://issues.apache.org/jira/browse/SOLR-18244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mark Robert Miller updated SOLR-18244:
--------------------------------------
Description:
Fix several concurrency bugs in HttpShardHandler / ParallelHttpShardHandler
that could leave search threads parked in HttpShardHandler.take() indefinitely
or crash the coordinator under load.
(1) ParallelHttpShardHandler.cancelAll and makeShardRequest now synchronize
their submitFutures operations on the same monitor that HttpShardHandler uses,
so the canceled flag, responseFutureMap, the responses queue's
CANCELLATION_NOTIFICATION, and submitFutures all transition atomically —
matching the base class's cancellation invariant.
(2) RejectedExecutionException from commExecutor (queue full or executor
shutting down) is now caught and routed through recordShardSubmitError as a
SERVICE_UNAVAILABLE shard failure instead of propagating synchronously out of
submit() and crashing SearchHandler's distributed loop — preserving
shards.tolerant=true semantics under thread pool saturation.
(3) The Parallel runnable now reads its own outer CompletableFuture via an
AtomicReference holder and short-circuits when cancelled before scheduling,
avoiding a wasted lbClient.requestAsync that super.makeShardRequest would just
immediately cancel.
(4) The inner whenComplete in HttpShardHandler.makeShardRequest now wraps
response processing in a try/catch — any throwable (including from a subclass
transformResponse) is recorded on the ShardResponse and queued under the
canceled monitor, so a thrown response handler can no longer leave
responseFutureMap populated with an entry that nothing will ever enqueue a
response for. (5) HttpShardHandler.take() now polls the responses queue with a
50ms timeout instead of blocking indefinitely, closing a fast-path deadlock:
when a subclass async tracker like submitFutures drains AFTER the coordinator
re-evaluated responsesPending()==true and entered responses.take(), no further
response will ever arrive to wake it. The polling loop re-checks
responsesPending() periodically and exits cleanly when the tracker drains.
was:
Fix several concurrency bugs in HttpShardHandler / ParallelHttpShardHandler that
could leave search threads parked in HttpShardHandler.take() indefinitely or
crash the
coordinator under load. (1) ParallelHttpShardHandler.cancelAll and
makeShardRequest
now synchronize their submitFutures operations on the same monitor that
HttpShardHandler uses, so the canceled flag, responseFutureMap, the responses
queue's CANCELLATION_NOTIFICATION, and submitFutures all transition atomically
— matching the base class's cancellation invariant. (2)
RejectedExecutionException from
commExecutor (queue full or executor shutting down) is now caught and routed
through recordShardSubmitError as a SERVICE_UNAVAILABLE shard failure instead
of propagating synchronously out of submit() and crashing SearchHandler's
distributed loop — preserving shards.tolerant=true semantics under thread pool
saturation. (3) The
Parallel runnable now reads its own outer CompletableFuture via an
AtomicReference
holder and short-circuits when cancelled before scheduling, avoiding a wasted
lbClient.requestAsync that super.makeShardRequest would just immediately cancel.
(4) The inner whenComplete in HttpShardHandler.makeShardRequest now wraps
response processing in a try/catch — any throwable (including from a subclass
transformResponse) is recorded on the ShardResponse and queued under the
canceled
monitor, so a thrown response handler can no longer leave responseFutureMap
populated with an entry that nothing will ever enqueue a response for. (5)
HttpShardHandler.take() now polls the responses queue with a 50ms timeout
instead of blocking indefinitely, closing a fast-path deadlock: when a subclass
async tracker like submitFutures drains AFTER the coordinator re-evaluated
responsesPending()==true and entered responses.take(), no further response will
ever arrive to wake it. The polling loop re-checks responsesPending()
periodically and exits cleanly when the tracker drains.
> Fix concurrency bugs in HttpShardHandler / ParallelHttpShardHandler that can
> hang searches or crash coordinator
> ---------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-18244
> URL: https://issues.apache.org/jira/browse/SOLR-18244
> Project: Solr
> Issue Type: Bug
> Reporter: Mark Robert Miller
> Priority: Major
>
> Fix several concurrency bugs in HttpShardHandler / ParallelHttpShardHandler
> that could leave search threads parked in HttpShardHandler.take()
> indefinitely or crash the coordinator under load.
> (1) ParallelHttpShardHandler.cancelAll and makeShardRequest now synchronize
> their submitFutures operations on the same monitor that HttpShardHandler
> uses, so the canceled flag, responseFutureMap, the responses queue's
> CANCELLATION_NOTIFICATION, and submitFutures all transition atomically —
> matching the base class's cancellation invariant.
> (2) RejectedExecutionException from commExecutor (queue full or executor
> shutting down) is now caught and routed through recordShardSubmitError as a
> SERVICE_UNAVAILABLE shard failure instead of propagating synchronously out of
> submit() and crashing SearchHandler's distributed loop — preserving
> shards.tolerant=true semantics under thread pool saturation.
> (3) The Parallel runnable now reads its own outer CompletableFuture via an
> AtomicReference holder and short-circuits when cancelled before scheduling,
> avoiding a wasted lbClient.requestAsync that super.makeShardRequest would
> just immediately cancel.
> (4) The inner whenComplete in HttpShardHandler.makeShardRequest now wraps
> response processing in a try/catch — any throwable (including from a subclass
> transformResponse) is recorded on the ShardResponse and queued under the
> canceled monitor, so a thrown response handler can no longer leave
> responseFutureMap populated with an entry that nothing will ever enqueue a
> response for. (5) HttpShardHandler.take() now polls the responses queue with
> a 50ms timeout instead of blocking indefinitely, closing a fast-path
> deadlock: when a subclass async tracker like submitFutures drains AFTER the
> coordinator re-evaluated responsesPending()==true and entered
> responses.take(), no further response will ever arrive to wake it. The
> polling loop re-checks responsesPending() periodically and exits cleanly when
> the tracker drains.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]