Donal Evans created GEODE-10330:
-----------------------------------

             Summary: Resource issues lead to "MemberDisconnectedException: Member isn't responding to heartbeat requests"
                 Key: GEODE-10330
                 URL: https://issues.apache.org/jira/browse/GEODE-10330
             Project: Geode
          Issue Type: Bug
    Affects Versions: 1.16.0
            Reporter: Donal Evans


DistributedMulticastRegionWithUDPSecurityDUnitTest > 
testMulticastAfterReconnect failed its suspect-string check because of a 
fatal-level log entry containing "Membership service failure: Member isn't 
responding to heartbeat requests".

Investigating the logs showed all members reporting long statistics sampling 
wakeup delays, indicating resource issues:

 
{code:java}
[vm3] [warn 2022/05/21 07:28:16.251 UTC LocatorWithMcast <StatSampler> 
tid=0xb8] Statistics sampling thread detected a wakeup delay of 4760 ms, 
indicating a possible resource issue. Check the GC, memory, and CPU statistics.

...

[locator] [warn 2022/05/21 07:28:20.288 UTC  <StatSampler> tid=0x3b] Statistics 
sampling thread detected a wakeup delay of 12400 ms, indicating a possible 
resource issue. Check the GC, memory, and CPU statistics.

...

[vm1] [warn 2022/05/21 07:28:20.969 UTC vm1 <StatSampler> tid=0xda] Statistics 
sampling thread detected a wakeup delay of 13738 ms, indicating a possible 
resource issue. Check the GC, memory, and CPU statistics.

...

[vm0] [warn 2022/05/21 07:28:22.226 UTC vm0 <StatSampler> tid=0xa9] Statistics 
sampling thread detected a wakeup delay of 15110 ms, indicating a possible 
resource issue. Check the GC, memory, and CPU statistics.
{code}
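To compare delays across members, the warnings above can be pulled out of a combined log with a short script. The following is a minimal sketch, assuming the log lines have the same shape as the excerpt (a bracketed member prefix, then a warn-level entry that may wrap across lines). The function name `find_wakeup_delays` and the `min_delay_ms` parameter are hypothetical and are not part of Geode or its tooling.

```python
import re

# Matches the StatSampler warning in the shape shown in the excerpt above.
# [^\[]*? keeps a match from running past the start of the next log entry,
# while still allowing the message to wrap across lines.
WAKEUP_RE = re.compile(
    r"\[(?P<member>[^\]]+)\]\s+\[warn\s+"
    r"(?P<timestamp>\d{4}/\d{2}/\d{2}\s+\d{2}:\d{2}:\d{2}\.\d{3})\s+\w+"
    r"[^\[]*?wakeup\s+delay\s+of\s+(?P<delay_ms>\d+)\s+ms"
)


def find_wakeup_delays(log_text, min_delay_ms=0):
    """Return (member, timestamp, delay_ms) for each wakeup-delay warning.

    Hypothetical helper: only warnings with a delay of at least
    min_delay_ms are returned, in the order they appear in log_text.
    """
    results = []
    for match in WAKEUP_RE.finditer(log_text):
        delay = int(match.group("delay_ms"))
        if delay >= min_delay_ms:
            results.append(
                (match.group("member"), match.group("timestamp"), delay)
            )
    return results
```

The earliest timestamp returned is a reasonable starting point to pass to the progress tool when looking up which tests were running at the time.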
Using the progress tool from the dev-tools directory in the Geode repository, 
the following tests were found to be running during the resource issues, 
possibly indicating that one or more of them are particularly 
resource-intensive:
{noformat}
$> progress -r '2022-05-21 07:28:16.251 -0000' | grep org | sort
{noformat}
{code:java}
org.apache.geode.cache.PRCacheListenerWithInterestPolicyAllDistributedTest.afterUpdateIsInvokedInEveryMember[0: redundancy=0]
org.apache.geode.cache.lucene.LuceneQueriesReindexDUnitTest.recreateIndexWithDifferentFieldsShouldFail(PARTITION_OVERFLOW_TO_DISK) [2]
org.apache.geode.cache.query.cq.dunit.CqDataUsingPoolOptimizedExecuteDUnitTest.testCQHAWithState
org.apache.geode.cache.query.cq.dunit.PartitionedRegionCqQueryDUnitTest.testPartitionedCqOnAccessorBridgeServer
org.apache.geode.cache30.CallbackArgDUnitTest.testForCA
org.apache.geode.cache30.DistributedMulticastRegionWithUDPSecurityDUnitTest.testMulticastAfterReconnect
org.apache.geode.cache30.DistributedNoAckRegionCCEOffHeapDUnitTest.testDistributedInvalidate
org.apache.geode.cache30.GlobalRegionOffHeapDUnitTest.testOrderedUpdates
org.apache.geode.cache30.ReconnectWithClusterConfigurationDUnitTest.testReconnectAfterMeltdown
org.apache.geode.distributed.internal.P2PMessagingConcurrencyDUnitTest.testP2PMessaging(true, false, 32768, 65536) [6]
org.apache.geode.disttx.PRDistTXDUnitTest.testSimulaneousChildRegionCreation
org.apache.geode.internal.cache.ClientServerTransactionCCEDUnitTest.testClientCommitFunctionWithFailure
org.apache.geode.internal.cache.eviction.OffHeapEvictionStatsDUnitTest.testHeapLruCounter
org.apache.geode.internal.cache.wan.concurrent.ConcurrentParallelGatewaySenderOperation_1_DUnitTest.testParallelPropagationSenderStartAfterStopOnAccessorNode
org.apache.geode.internal.cache.wan.offheap.ParallelGatewaySenderOperationsOffHeapDistributedTest.testParallelGatewaySenderStartOnAccessorNode
org.apache.geode.internal.cache.wan.serial.SerialWANPropagation_PartitionedRegionDUnitTest.testPartitionedSerialPropagationHA
org.apache.geode.internal.tcp.TCPConduitDUnitTest.basicAcceptConnection[0]
org.apache.geode.management.internal.configuration.ClusterConfigImportDUnitTest.importFailWithExistingRegion
org.apache.geode.rest.internal.web.controllers.RestAPIsOnGroupsFunctionExecutionDUnitTest.testBasicP2PFunctionSelectedGroup[1]
org.apache.geode.session.tests.Jetty9CachingClientServerTest.failureShouldStillAllowOtherContainersDataAccess
org.apache.geode.session.tests.Tomcat8ClientServerCustomCacheXmlTest.containersShouldExpireInSetTimeframe
org.apache.geode.session.tests.Tomcat8Test.containersShouldReplicateCookies
org.apache.geode.session.tests.Tomcat9ClientServerTest.invalidationShouldRemoveValueAccessForAllContainers
{code}
Reports of future failures caused by this sort of resource issue should also 
list the concurrently running tests, so that tests appearing repeatedly across 
reports can be identified as likely culprits.
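Once several such lists have been collected, the repeat appearances can be tallied mechanically. This is a minimal sketch, assuming each failure's list of concurrently running tests has already been gathered as a list of fully qualified test names. The function name `rank_repeat_tests` and the sample names in the usage example are hypothetical placeholders, not data from real failures.

```python
from collections import Counter


def rank_repeat_tests(failures):
    """Rank tests by how many failure reports they appear in.

    failures: an iterable where each item is the collection of test names
    that were running during one resource-related failure.
    Returns (test_name, count) pairs for tests seen in more than one
    report, most frequent first, with ties broken alphabetically.
    """
    counts = Counter()
    for running_tests in failures:
        # Use a set so a test is counted at most once per failure report.
        counts.update(set(running_tests))
    repeats = [(name, n) for name, n in counts.items() if n > 1]
    return sorted(repeats, key=lambda item: (-item[1], item[0]))


# Usage with placeholder names (not taken from real failure reports):
example = [
    ["pkg.TestA.one", "pkg.TestB.two", "pkg.TestC.three"],
    ["pkg.TestB.two", "pkg.TestD.four"],
    ["pkg.TestB.two", "pkg.TestC.three"],
]
for name, n in rank_repeat_tests(example):
    print(f"{n} {name}")
```

A high count only shows that a test was present during several resource issues; it does not by itself show that the test caused them.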

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
