Bruce Schuchardt created GEODE-1236:
---------------------------------------
Summary: GFSH shutdown of a cache server causes JVM to exit before
completing shutdown
Key: GEODE-1236
URL: https://issues.apache.org/jira/browse/GEODE-1236
Project: Geode
Issue Type: Bug
Components: gfsh, management
Reporter: Bruce Schuchardt
Logs of a cache server being shut down by gfsh show that the DistributedSystem
is not completing shutdown before the JVM exits. This is probably due to all
non-daemon threads being terminated before shutdown completes. Other members
see the "shut down" as a crash and if too many servers are "shut down" in this
way at the same time it can cause other members to declare a loss of quorum and
shut themselves down.
Here is the log from a server shut down by gfsh:
{noformat}
[info 2016/04/15 14:45:20.614 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Received GFSH shutdown. Shutting down member
10.0.16.168(cacheserver4:7984)<v6>:14494
[info 2016/04/15 14:45:20.664 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] GemFireCache[id = 965586344; isClosing = true; isShutDownAll = false;
closingGatewayHubsByShutdownAll = false; created = Fri Apr 15 14:42:17 UTC
2016; server = true; copyOnRead = false; lockLease = 120; lockTimeout = 60]:
Now closing.
[info 2016/04/15 14:45:20.664 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Resetting original MemoryPoolMXBean heap threshold bytes 0 on pool
CMS Old Gen
[info 2016/04/15 14:45:20.669 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Bridge server on port 40,404 is shutting down.
[info 2016/04/15 14:45:20.744 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Rest Server on port 9,090 is shutting down
[info 2016/04/15 14:45:20.745 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Stopping the HTTP service...
[info 2016/04/15 14:45:20.748 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Stopped ServerConnector@a8c6d1d{HTTP/1.1}{0.0.0.0:9090}
[info 2016/04/15 14:45:20.748 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Destroying Spring FrameworkServlet 'gemfire-api'
[info 2016/04/15 14:45:20.749 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Closing WebApplicationContext for namespace 'gemfire-api-servlet':
startup date [Fri Apr 15 14:42:20 UTC 2016]; root of context hierarchy
[info 2016/04/15 14:45:20.750 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Destroying singletons in
org.springframework.beans.factory.support.DefaultListableBeanFactory@a7f0ab6:
defining beans
[org.springframework.context.annotation.internalConfigurationAnnotationProcessor,org.springframework.context.annotation.internalAutowiredAnnotationProcessor,org.springframework.context.annotation.internalRequiredAnnotationProcessor,org.springframework.context.annotation.internalCommonAnnotationProcessor,baseControllerAdvice,queryController,functionController,pdxCrudController,swaggerConfig,org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerMapping#0,org.springframework.format.support.FormattingConversionServiceFactoryBean#0,org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter#0,org.springframework.web.servlet.handler.MappedInterceptor#0,org.springframework.web.servlet.mvc.method.annotation.ExceptionHandlerExceptionResolver#0,org.springframework.web.servlet.mvc.annotation.ResponseStatusExceptionResolver#0,org.springframework.web.servlet.mvc.support.DefaultHandlerExceptionResolver#0,org.springframework.web.servlet.handler.BeanNameUrlHandlerMapping,org.springframework.web.servlet.mvc.HttpRequestHandlerAdapter,org.springframework.web.servlet.mvc.SimpleControllerHandlerAdapter,org.springframework.web.servlet.resource.DefaultServletHttpRequestHandler#0,org.springframework.web.servlet.handler.SimpleUrlHandlerMapping#0,contentNegotiationManager,objectMapper,org.springframework.context.annotation.ConfigurationClassPostProcessor.importAwareProcessor,classOrApiAnnotationResourceGrouping,swaggerCache,defaultSwaggerController,springSwaggerConfig,springSwaggerModelConfig,swaggerRequestMappingHandlerMappings,defaultResourceGroupingStrategy,defaultExcludeAnnotations,defaultSwaggerPathProvider,defaultIgnorableParameterTypes,defaultResponseMessages,defaultParameterDataTypes,jacksonScalaSupport,swaggerApiResourceListing,swaggerGlobalSettings,apiListingReferenceScanner,apiPathProvider];
root of factory hierarchy
[info 2016/04/15 14:45:20.794 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Stopped
o.e.j.w.WebAppContext@4afbb6c2{/gemfire-api,file:/var/vcap/store/server/GemFire_root/services/http/0.0.0.0_9090_gemfire-api/webapp/,UNAVAILABLE}{/var/vcap/packages/gemfire/Pivotal_GemFire/tools/Extensions/gemfire-api.war}
{noformat}
This does not contain the DistributionManager's shutdown message, which should
look like this:
DistributionManager stopped in 1035ms.
In the system that this log is from there were three servers and all were "shut
down" at the same time. This caused the locator to declare a network partition
as it did not receive shutdown messages from the servers:
{noformat}
[info 2016/04/15 14:45:21.155 UTC locator3 <UDP ucast receiver> tid=0x27]
Received Suspect notification for member(s)
[10.0.16.168(cacheserver4:7984)<v6>:14494] from
10.0.16.156(locator2:7421:locator)<ec><v1>:3350. Reason=Could not set up ping
socket
[info 2016/04/15 14:45:21.193 UTC locator3 <UDP ucast receiver> tid=0x27]
Received Suspect notification for member(s)
[10.0.16.167(cacheserver3:7984)<v7>:13905] from
10.0.16.156(locator2:7421:locator)<ec><v1>:3350. Reason=Socket was not closed
nicely
[info 2016/04/15 14:45:26.150 UTC locator3 <UDP ucast receiver> tid=0x27]
Received Suspect notification for member(s)
[10.0.16.168(cacheserver4:7984)<v6>:14494,
10.0.16.167(cacheserver3:7984)<v7>:13905] from
10.0.16.156(locator2:7421:locator)<ec><v1>:3350. Reason=still being suspected
[info 2016/04/15 14:45:26.165 UTC locator3 <VERIFY_SUSPECT.TimerThread>
tid=0xe8] 10.0.16.157(locator3:7369:locator)<ec><v0>:11977: No suspect
verification response received from 10.0.16.168(cacheserver4:7984)<v6>:14494 in
5000 milliseconds: I believe it is gone.
[info 2016/04/15 14:45:26.195 UTC locator3 <VERIFY_SUSPECT.TimerThread>
tid=0xe8] 10.0.16.157(locator3:7369:locator)<ec><v0>:11977: No suspect
verification response received from 10.0.16.167(cacheserver3:7984)<v7>:13905 in
5002 milliseconds: I believe it is gone.
[info 2016/04/15 14:45:26.318 UTC locator3 <ViewHandler> tid=0x7d] Membership:
sending new view [[10.0.16.157(locator3:7369:locator)<ec><v0>:11977|8]
[10.0.16.157(locator3:7369:locator)<ec><v0>:11977/48436,
10.0.16.156(locator2:7421:locator)<ec><v1>:3350/13719,
10.0.16.169(cacheserver5:7959)<v7>:19055/60393] crashed mbrs:
[10.0.16.168(cacheserver4:7984)<v6>:14494/5420,
10.0.16.167(cacheserver3:7984)<v7>:13905/30231]] (3 mbrs)
[info 2016/04/15 14:45:26.320 UTC locator3 <ViewHandler> tid=0x7d]
member 10.0.16.157(locator3:7369:locator)<ec><v0>:11977 has a weight of 3
member 10.0.16.156(locator2:7421:locator)<ec><v1>:3350 has a weight of 3
member 10.0.16.168(cacheserver4:7984)<v6>:14494 has a weight of 15
member 10.0.16.167(cacheserver3:7984)<v7>:13905 has a weight of 10
member 10.0.16.169(cacheserver5:7959)<v7>:19055 has a weight of 10
[info 2016/04/15 14:45:26.320 UTC locator3 <ViewHandler> tid=0x7d] quorum
weight calculation: oldWeight=41 failureWeight=25 threshold=51% (failure
weight must be < 21)
[info 2016/04/15 14:45:26.342 UTC locator3 <FD_SOCK Ping thread> tid=0x7f]
GemFire failure detection is now monitoring
10.0.16.156(locator2:7421:locator)<ec><v1>:3350
[severe 2016/04/15 14:45:26.354 UTC locator3 <Quorum Lost Notification>
tid=0xf4] Possible loss of quorum due to the loss of 2 cache processes:
[10.0.16.168(cacheserver4:7984)<v6>:14494,
10.0.16.167(cacheserver3:7984)<v7>:13905]
[severe 2016/04/15 14:45:26.360 UTC locator3 <CloserThread> tid=0xf1]
Membership service failure: Channel closed:
com.gemstone.gemfire.ForcedDisconnectException: Membership coordinator
10.0.16.157(locator3:7369:locator)<ec><v0>:11977 has declared that a network
partition has occurred
{noformat}
One way to correct this problem would be to change ShutDownFunction to create a
non-daemon thread to perform the shutdown and have the function execution
thread wait for the non-daemon thread to finish before returning.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)