Kirk Lund created GEODE-8357:
--------------------------------
Summary: Exhausting the high priority message pool can result in deadlock
Key: GEODE-8357
URL: https://issues.apache.org/jira/browse/GEODE-8357
Project: Geode
Issue Type: Bug
Components: messaging
Reporter: Kirk Lund
The system property "DistributionManager.MAX_THREADS" defaults to 100:
{noformat}
int MAX_THREADS = Integer.getInteger("DistributionManager.MAX_THREADS", 100);
{noformat}
The system property used to be defined in geode-core ClusterDistributionManager
and has moved to geode-core OperationExecutors.
The value is used to limit ClusterOperationExecutors threadPool and
highPriorityPool:
{noformat}
threadPool = CoreLoggingExecutors.newThreadPoolWithFeedStatistics(
    "Pooled Message Processor ",
    thread -> stats.incProcessingThreadStarts(), this::doProcessingThread,
    MAX_THREADS, stats.getNormalPoolHelper(), threadMonitor,
    INCOMING_QUEUE_LIMIT, stats.getOverflowQueueHelper());
highPriorityPool = CoreLoggingExecutors.newThreadPoolWithFeedStatistics(
    "Pooled High Priority Message Processor ",
    thread -> stats.incHighPriorityThreadStarts(), this::doHighPriorityThread,
    MAX_THREADS, stats.getHighPriorityPoolHelper(), threadMonitor,
    INCOMING_QUEUE_LIMIT, stats.getHighPriorityQueueHelper());
{noformat}
I have seen server startup hang when recovering lots of expired entries from
disk while using PDX. The hang looks like a dlock request for the PDX lock is
not receiving a response. Checking the value for the
distributionStats#highPriorityQueueSize statistic (in VSD) shows the value
maxed out and never dropping.
The dlock response granting the PDX lock is stuck in the high priority queue
because no highPriorityPool threads are available to process it. Stack dumps
show that every highPriorityPool thread is running a task, such as recovering a
bucket from disk, that is blocked waiting for the PDX lock.
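The pattern described above can be reproduced outside Geode with a plain JDK
executor. The following is a minimal sketch, not Geode code: the class name
PoolExhaustionDemo, the method grantProcessed, and the latch standing in for
the PDX lock grant are all hypothetical. It assumes only that the pool is
bounded and that the task delivering the grant is queued behind tasks that
wait for it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-alone demo; it does not use any Geode classes.
public class PoolExhaustionDemo {

  /**
   * Submits `waiters` tasks that block until a "lock grant" arrives, then
   * submits the grant itself to the same bounded pool.
   * Returns true if the grant was processed within timeoutMillis.
   */
  static boolean grantProcessed(int poolSize, int waiters, long timeoutMillis)
      throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    // Stands in for the dlock response that grants the PDX lock.
    CountDownLatch lockGranted = new CountDownLatch(1);
    try {
      for (int i = 0; i < waiters; i++) {
        pool.execute(() -> {
          try {
            lockGranted.await(); // blocked waiting for the lock
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      // The grant needs a pool thread too. If every thread is already
      // occupied by a waiter, this task stays in the queue.
      pool.execute(lockGranted::countDown);
      return lockGranted.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } finally {
      pool.shutdownNow(); // interrupt the waiters so the demo can exit
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Every thread is taken by a waiter: the grant is never processed.
    System.out.println("pool=2, waiters=2: " + grantProcessed(2, 2, 500));
    // One spare thread is enough for the grant to get through.
    System.out.println("pool=3, waiters=2: " + grantProcessed(3, 2, 500));
  }
}
```

With as many waiters as threads, the grant sits in the queue until the timeout
expires. With one spare thread it is processed immediately.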
Several changes could improve this situation, either in conjunction or
separately:
# improve observability to enable support to identify that this situation has
occurred
# automatically identify this situation and warn the user with a log statement
# automatically prevent this situation
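As a rough illustration of the second item, the sketch below flags a pool that
is saturated, has a backlog, and has completed no tasks since the previous
check. It is an assumption-laden example built on the JDK's ThreadPoolExecutor
rather than on Geode's executors or statistics; the class StuckPoolDetector and
its methods are hypothetical names, and a real implementation would also need
to choose a sampling interval and a log message.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical detector; not part of Geode.
public class StuckPoolDetector {
  private final ThreadPoolExecutor pool;
  private long lastCompleted = -1; // no baseline before the first sample

  StuckPoolDetector(ThreadPoolExecutor pool) {
    this.pool = pool;
  }

  /**
   * Intended to be called periodically. Returns true when, since the
   * previous call, all threads stayed busy, tasks were queued, and no
   * task completed.
   */
  synchronized boolean looksStuck() {
    long completed = pool.getCompletedTaskCount();
    boolean saturated = pool.getActiveCount() >= pool.getMaximumPoolSize();
    boolean backlog = !pool.getQueue().isEmpty();
    boolean noProgress = completed == lastCompleted;
    lastCompleted = completed;
    return saturated && backlog && noProgress;
  }

  /** Builds an exhausted two-thread pool and samples it twice. */
  static boolean[] demo() throws InterruptedException {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        2, 2, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    CountDownLatch started = new CountDownLatch(2);
    CountDownLatch release = new CountDownLatch(1);
    try {
      for (int i = 0; i < 2; i++) {
        pool.execute(() -> {
          started.countDown();
          try {
            release.await(); // blocked until the queued task runs
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      pool.execute(release::countDown); // queued behind the blocked tasks
      started.await(); // make sure both threads are busy before sampling
      StuckPoolDetector detector = new StuckPoolDetector(pool);
      boolean first = detector.looksStuck();  // establishes the baseline
      boolean second = detector.looksStuck(); // no progress since baseline
      return new boolean[] {first, second};
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    boolean[] samples = demo();
    System.out.println("first sample stuck:  " + samples[0]);
    System.out.println("second sample stuck: " + samples[1]);
  }
}
```

The first sample only records a baseline, so a warning would be raised no
earlier than the second consecutive sample that shows no progress. A saturated
pool with long-running but healthy tasks would look the same, so this check
alone cannot distinguish slow progress from deadlock.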