[ 
https://issues.apache.org/jira/browse/SOLR-18244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Robert Miller updated SOLR-18244:
--------------------------------------
    Description: 
Fix several concurrency bugs in ParallelHttpShardHandler

- Catch RejectedExecutionException from commExecutor and route through
  recordShardSubmitError so saturation/shutdown does not propagate
  synchronously and abandon already-submitted shard requests; this
  preserves shards.tolerant semantics and returns a 503 instead of 500.

- Synchronize submitFutures registration and removal on the same monitor
  HttpShardHandler uses for responseFutureMap so cancelAll observes a
  consistent view of both maps.

- Skip work in the runAsync runnable when the outer future has already
  been cancelled, avoiding a wasted lbClient.requestAsync call.

- In HttpShardHandler.takeCompletedIncludingErrors, replace the blocking
  responses.take() with a 50ms poll so subclasses that gate
  responsesPending() on an async tracker (e.g. submitFutures) cannot
  cause a lost-wakeup hang when the tracker drains between the pending
  check and the queue wait.

- Wrap the inner whenComplete body in a try/catch so a thrown
  transformResponse (or other failure inside the lambda) still enqueues
  a response and unblocks take() instead of stranding the consumer.

- Expose cancellationLock() and isCanceled() on HttpShardHandler so
  subclasses can synchronize on the same monitor that already guards
  the canceled flag, responseFutureMap, and the responses queue.

- Synchronize ParallelHttpShardHandler.responsesPending() on the
  cancellation monitor. Without the lock, take()'s loop performs three
  unsynchronized isEmpty() reads (responseFutureMap, responses,
  submitFutures) and can transiently observe all three as empty even
  though super.makeShardRequest's put and outer.whenComplete's
  submitFutures.remove are causally ordered. The intermediate state is
  consistent with each individual read but never with the cross-map
  invariant, so take() exits with null and the pending response (which
  arrives moments later via the inner whenComplete) is silently dropped
  by SearchHandler. Acquiring the lock serializes the three reads with
  every state-mutating operation.

  was:
Fix several concurrency bugs in HttpShardHandler / ParallelHttpShardHandler 
that could leave search threads parked in HttpShardHandler.take() indefinitely 
or crash the coordinator under load.

(1) ParallelHttpShardHandler.cancelAll and makeShardRequest now synchronize 
their submitFutures operations on the same monitor that HttpShardHandler uses, 
so the canceled flag, responseFutureMap, the responses queue's 
CANCELLATION_NOTIFICATION, and submitFutures all transition atomically — 
matching the base class's cancellation invariant.

(2) RejectedExecutionException from commExecutor (queue full or executor 
shutting down) is now caught and routed through recordShardSubmitError as a 
SERVICE_UNAVAILABLE shard failure instead of propagating synchronously out of 
submit() and crashing SearchHandler's distributed loop — preserving 
shards.tolerant=true semantics under thread pool saturation.

(3) The Parallel runnable now reads its own outer CompletableFuture via an 
AtomicReference holder and short-circuits when cancelled before scheduling, 
avoiding a wasted lbClient.requestAsync that super.makeShardRequest would just 
immediately cancel.


(4) The inner whenComplete in HttpShardHandler.makeShardRequest now wraps 
response processing in a try/catch — any throwable (including from a subclass
transformResponse) is recorded on the ShardResponse and queued under the 
canceled monitor, so a thrown response handler can no longer leave 
responseFutureMap populated with an entry that nothing will ever enqueue a 
response for. (5) HttpShardHandler.take() now polls the responses queue with a 
50ms timeout instead of blocking indefinitely, closing a fast-path deadlock: 
when a subclass async tracker like submitFutures drains AFTER the coordinator 
re-evaluated responsesPending()==true and entered responses.take(), no further 
response will ever arrive to wake it. The polling loop re-checks 
responsesPending() periodically and exits cleanly when the tracker drains.


> Fix concurrency bugs in HttpShardHandler / ParallelHttpShardHandler that can 
> hang searches or crash coordinator
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-18244
>                 URL: https://issues.apache.org/jira/browse/SOLR-18244
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Mark Robert Miller
>            Priority: Major
>
> Fix several concurrency bugs in ParallelHttpShardHandler
> - Catch RejectedExecutionException from commExecutor and route through
>   recordShardSubmitError so saturation/shutdown does not propagate
>   synchronously and abandon already-submitted shard requests; this
>   preserves shards.tolerant semantics and returns a 503 instead of 500.
> - Synchronize submitFutures registration and removal on the same monitor
>   HttpShardHandler uses for responseFutureMap so cancelAll observes a
>   consistent view of both maps.
> - Skip work in the runAsync runnable when the outer future has already
>   been cancelled, avoiding a wasted lbClient.requestAsync call.
> - In HttpShardHandler.takeCompletedIncludingErrors, replace the blocking
>   responses.take() with a 50ms poll so subclasses that gate
>   responsesPending() on an async tracker (e.g. submitFutures) cannot
>   cause a lost-wakeup hang when the tracker drains between the pending
>   check and the queue wait.
> - Wrap the inner whenComplete body in a try/catch so a thrown
>   transformResponse (or other failure inside the lambda) still enqueues
>   a response and unblocks take() instead of stranding the consumer.
> - Expose cancellationLock() and isCanceled() on HttpShardHandler so
>   subclasses can synchronize on the same monitor that already guards
>   the canceled flag, responseFutureMap, and the responses queue.
> - Synchronize ParallelHttpShardHandler.responsesPending() on the
>   cancellation monitor. Without the lock, take()'s loop performs three
>   unsynchronized isEmpty() reads (responseFutureMap, responses,
>   submitFutures) and can transiently observe all three as empty even
>   though super.makeShardRequest's put and outer.whenComplete's
>   submitFutures.remove are causally ordered. The intermediate state is
>   consistent with each individual read but never with the cross-map
>   invariant, so take() exits with null and the pending response (which
>   arrives moments later via the inner whenComplete) is silently dropped
>   by SearchHandler. Acquiring the lock serializes the three reads with
>   every state-mutating operation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to