[ 
https://issues.apache.org/jira/browse/GEODE-8357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirk Lund updated GEODE-8357:
-----------------------------
    Labels: GeodeOperationAPI  (was: )

> Exhausting the high priority message pool can result in deadlock
> ----------------------------------------------------------------
>
>                 Key: GEODE-8357
>                 URL: https://issues.apache.org/jira/browse/GEODE-8357
>             Project: Geode
>          Issue Type: Bug
>          Components: messaging
>            Reporter: Kirk Lund
>            Assignee: Kirk Lund
>            Priority: Major
>              Labels: GeodeOperationAPI
>
> The system property "DistributionManager.MAX_THREADS" defaults to 100:
> {noformat}
> int MAX_THREADS = Integer.getInteger("DistributionManager.MAX_THREADS", 100);
> {noformat}
> The system property used to be defined in geode-core 
> ClusterDistributionManager and has moved to geode-core OperationExecutors.
> The value is used to limit ClusterOperationExecutors threadPool and 
> highPriorityPool:
> {noformat}
> threadPool =
>     CoreLoggingExecutors.newThreadPoolWithFeedStatistics("Pooled Message Processor ",
>         thread -> stats.incProcessingThreadStarts(), this::doProcessingThread,
>         MAX_THREADS, stats.getNormalPoolHelper(), threadMonitor,
>         INCOMING_QUEUE_LIMIT, stats.getOverflowQueueHelper());
> highPriorityPool = CoreLoggingExecutors.newThreadPoolWithFeedStatistics(
>     "Pooled High Priority Message Processor ",
>     thread -> stats.incHighPriorityThreadStarts(), this::doHighPriorityThread,
>     MAX_THREADS, stats.getHighPriorityPoolHelper(), threadMonitor,
>     INCOMING_QUEUE_LIMIT, stats.getHighPriorityQueueHelper());
> {noformat}
> I have seen server startup hang when recovering lots of expired entries from 
> disk while using PDX. The hang looks like a dlock request for the PDX lock 
> that never receives a response. The distributionStats#highPriorityQueueSize 
> statistic (viewed in VSD) shows the value maxed out and never dropping.
> The dlock response granting the PDX lock is stuck in the highPriorityQueue 
> because no high priority pool threads are available to process it. Stack 
> dumps show every high priority thread running a task, such as recovering a 
> bucket from disk, that is blocked waiting for the PDX lock.
> Several changes could improve this situation, either in conjunction or 
> individually:
> # improve observability so that support can identify when this situation 
> has occurred
> # automatically identify this situation and warn the user with a log statement
> # automatically prevent this situation
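
The failure mode described above can be reproduced outside Geode with a plain java.util.concurrent.ThreadPoolExecutor. The following is only a minimal sketch under stated assumptions, not Geode's implementation: the class and method names (PoolExhaustionSketch, demonstrateStall, isStalled) are hypothetical, a CountDownLatch stands in for the PDX dlock grant, and the stall check (all threads active while the queue is non-empty) is just one possible heuristic for the "identify and warn" idea.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; none of these names exist in Geode.
class PoolExhaustionSketch {

  // Fills a fixed-size pool with tasks that wait for a "grant", then queues
  // the grant itself behind them. Returns true if the grant was never
  // processed and the pool looks stalled.
  static boolean demonstrateStall(int maxThreads, long waitMillis)
      throws InterruptedException {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        maxThreads, maxThreads, 0L, TimeUnit.MILLISECONDS,
        new LinkedBlockingQueue<>());
    CountDownLatch lockGranted = new CountDownLatch(1); // stand-in for the dlock grant
    CountDownLatch allBlocked = new CountDownLatch(maxThreads);
    try {
      for (int i = 0; i < maxThreads; i++) {
        pool.execute(() -> {
          allBlocked.countDown();
          try {
            lockGranted.await(); // like a recovery task waiting for the PDX lock
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      allBlocked.await(); // every pool thread is now occupied

      // The response that would release the waiters lands in the queue,
      // because no thread is free to run it.
      pool.execute(lockGranted::countDown);

      boolean granted = lockGranted.await(waitMillis, TimeUnit.MILLISECONDS);
      return !granted && isStalled(pool, maxThreads);
    } finally {
      pool.shutdownNow(); // interrupt the blocked workers so the JVM can exit
    }
  }

  // Assumed heuristic: every thread is busy and work is still queued.
  static boolean isStalled(ThreadPoolExecutor pool, int maxThreads) {
    return pool.getActiveCount() >= maxThreads && !pool.getQueue().isEmpty();
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("stalled=" + demonstrateStall(4, 200));
  }
}
```

A watchdog that samples a check like isStalled periodically and logs a warning when it stays true across several samples would be one way to approach the second item in the list above. Whether that fits Geode's thread monitoring is not something this sketch establishes.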



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
