[
https://issues.apache.org/jira/browse/SOLR-18244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18082206#comment-18082206
]
Mark Robert Miller commented on SOLR-18244:
-------------------------------------------
I found another issue - ConcurrentHashMap#isEmpty is used for flow control and
can cause an infinite wait on take()
> Fix concurrency bugs in HttpShardHandler / ParallelHttpShardHandler that can
> hang searches or crash coordinator
> ---------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-18244
> URL: https://issues.apache.org/jira/browse/SOLR-18244
> Project: Solr
> Issue Type: Bug
> Reporter: Mark Robert Miller
> Priority: Major
>
> Fix several concurrency bugs in ParallelHttpShardHandler
> - Catch RejectedExecutionException from commExecutor and route through
> recordShardSubmitError so saturation/shutdown does not propagate
> synchronously and abandon already-submitted shard requests; this
> preserves shards.tolerant semantics and returns a 503 instead of 500.
> - Synchronize submitFutures registration and removal on the same monitor
> HttpShardHandler uses for responseFutureMap so cancelAll observes a
> consistent view of both maps.
> - Skip work in the runAsync runnable when the outer future has already
> been cancelled, avoiding a wasted lbClient.requestAsync call.
> - In HttpShardHandler.takeCompletedIncludingErrors, replace the blocking
> responses.take() with a 50ms poll so subclasses that gate
> responsesPending() on an async tracker (e.g. submitFutures) cannot
> cause a lost-wakeup hang when the tracker drains between the pending
> check and the queue wait.
> - Wrap the inner whenComplete body in a try/catch so a thrown
> transformResponse (or other failure inside the lambda) still enqueues
> a response and unblocks take() instead of stranding the consumer.
> - Expose cancellationLock() and isCanceled() on HttpShardHandler so
> subclasses can synchronize on the same monitor that already guards
> the canceled flag, responseFutureMap, and the responses queue.
> - Synchronize ParallelHttpShardHandler.responsesPending() on the
> cancellation monitor. Without the lock, take()'s loop performs three
> unsynchronized isEmpty() reads (responseFutureMap, responses,
> submitFutures) and can transiently observe all three as empty even
> though super.makeShardRequest's put and outer.whenComplete's
> submitFutures.remove are causally ordered. The intermediate state is
> consistent with each individual read but never with the cross-map
> invariant, so take() exits with null and the pending response (which
> arrives moments later via the inner whenComplete) is silently dropped
> by SearchHandler. Acquiring the lock serializes the three reads with
> every state-mutating operation.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]