[ 
https://issues.apache.org/jira/browse/SOLR-18244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18082206#comment-18082206
 ] 

Mark Robert Miller commented on SOLR-18244:
-------------------------------------------

I found another issue - ConcurrentHashMap#isEmpty is used for flow control and 
can cause an infinite wait on take()

> Fix concurrency bugs in HttpShardHandler / ParallelHttpShardHandler that can 
> hang searches or crash coordinator
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-18244
>                 URL: https://issues.apache.org/jira/browse/SOLR-18244
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Mark Robert Miller
>            Priority: Major
>
> Fix several concurrency bugs in ParallelHttpShardHandler
> - Catch RejectedExecutionException from commExecutor and route through
>   recordShardSubmitError so saturation/shutdown does not propagate
>   synchronously and abandon already-submitted shard requests; this
>   preserves shards.tolerant semantics and returns a 503 instead of 500.
> - Synchronize submitFutures registration and removal on the same monitor
>   HttpShardHandler uses for responseFutureMap so cancelAll observes a
>   consistent view of both maps.
> - Skip work in the runAsync runnable when the outer future has already
>   been cancelled, avoiding a wasted lbClient.requestAsync call.
> - In HttpShardHandler.takeCompletedIncludingErrors, replace the blocking
>   responses.take() with a 50ms poll so subclasses that gate
>   responsesPending() on an async tracker (e.g. submitFutures) cannot
>   cause a lost-wakeup hang when the tracker drains between the pending
>   check and the queue wait.
> - Wrap the inner whenComplete body in a try/catch so a thrown
>   transformResponse (or other failure inside the lambda) still enqueues
>   a response and unblocks take() instead of stranding the consumer.
> - Expose cancellationLock() and isCanceled() on HttpShardHandler so
>   subclasses can synchronize on the same monitor that already guards
>   the canceled flag, responseFutureMap, and the responses queue.
> - Synchronize ParallelHttpShardHandler.responsesPending() on the
>   cancellation monitor. Without the lock, take()'s loop performs three
>   unsynchronized isEmpty() reads (responseFutureMap, responses,
>   submitFutures) and can transiently observe all three as empty even
>   though super.makeShardRequest's put and outer.whenComplete's
>   submitFutures.remove are causally ordered. The intermediate state is
>   consistent with each individual read but never with the cross-map
>   invariant, so take() exits with null and the pending response (which
>   arrives moments later via the inner whenComplete) is silently dropped
>   by SearchHandler. Acquiring the lock serializes the three reads with
>   every state-mutating operation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to