[
https://issues.apache.org/jira/browse/GEODE-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243386#comment-15243386
]
Bruce Schuchardt commented on GEODE-1236:
-----------------------------------------
This also affects GemFire 8.x and should be fixed there as well.
> GFSH shutdown of a cache server causes JVM to exit before completing shutdown
> -----------------------------------------------------------------------------
>
> Key: GEODE-1236
> URL: https://issues.apache.org/jira/browse/GEODE-1236
> Project: Geode
> Issue Type: Bug
> Components: gfsh, management
> Reporter: Bruce Schuchardt
>
> Logs of a cache server being shut down by gfsh show that the
> DistributedSystem is not completing shutdown before the JVM exits. This is
> probably due to all non-daemon threads being terminated before shutdown
> completes. Other members see the "shut down" as a crash and if too many
> servers are "shut down" in this way at the same time it can cause other
> members to declare a loss of quorum and shut themselves down.
> Here is the log from a server shut down by gfsh:
> {noformat}
> [info 2016/04/15 14:45:20.614 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] Received GFSH shutdown. Shutting down member
> 10.0.16.168(cacheserver4:7984)<v6>:14494
> [info 2016/04/15 14:45:20.664 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] GemFireCache[id = 965586344; isClosing = true;
> isShutDownAll = false; closingGatewayHubsByShutdownAll = false; created = Fri
> Apr 15 14:42:17 UTC 2016; server = true; copyOnRead = false; lockLease = 120;
> lockTimeout = 60]: Now closing.
> [info 2016/04/15 14:45:20.664 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] Resetting original MemoryPoolMXBean heap threshold
> bytes 0 on pool CMS Old Gen
> [info 2016/04/15 14:45:20.669 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] Bridge server on port 40,404 is shutting down.
> [info 2016/04/15 14:45:20.744 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] Rest Server on port 9,090 is shutting down
> [info 2016/04/15 14:45:20.745 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] Stopping the HTTP service...
> [info 2016/04/15 14:45:20.748 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] Stopped ServerConnector@a8c6d1d{HTTP/1.1}{0.0.0.0:9090}
> [info 2016/04/15 14:45:20.748 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] Destroying Spring FrameworkServlet 'gemfire-api'
> [info 2016/04/15 14:45:20.749 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] Closing WebApplicationContext for namespace
> 'gemfire-api-servlet': startup date [Fri Apr 15 14:42:20 UTC 2016]; root of
> context hierarchy
> [info 2016/04/15 14:45:20.750 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] Destroying singletons in
> org.springframework.beans.factory.support.DefaultListableBeanFactory@a7f0ab6:
> defining beans
> [org.springframework.context.annotation.internalConfigurationAnnotationProcessor,org.springframework.context.annotation.internalAutowiredAnnotationProcessor,org.springframework.context.annotation.internalRequiredAnnotationProcessor,org.springframework.context.annotation.internalCommonAnnotationProcessor,baseControllerAdvice,queryController,functionController,pdxCrudController,swaggerConfig,org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerMapping#0,org.springframework.format.support.FormattingConversionServiceFactoryBean#0,org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter#0,org.springframework.web.servlet.handler.MappedInterceptor#0,org.springframework.web.servlet.mvc.method.annotation.ExceptionHandlerExceptionResolver#0,org.springframework.web.servlet.mvc.annotation.ResponseStatusExceptionResolver#0,org.springframework.web.servlet.mvc.support.DefaultHandlerExceptionResolver#0,org.springframework.web.servlet.handler.BeanNameUrlHandlerMapping,org.springframework.web.servlet.mvc.HttpRequestHandlerAdapter,org.springframework.web.servlet.mvc.SimpleControllerHandlerAdapter,org.springframework.web.servlet.resource.DefaultServletHttpRequestHandler#0,org.springframework.web.servlet.handler.SimpleUrlHandlerMapping#0,contentNegotiationManager,objectMapper,org.springframework.context.annotation.ConfigurationClassPostProcessor.importAwareProcessor,classOrApiAnnotationResourceGrouping,swaggerCache,defaultSwaggerController,springSwaggerConfig,springSwaggerModelConfig,swaggerRequestMappingHandlerMappings,defaultResourceGroupingStrategy,defaultExcludeAnnotations,defaultSwaggerPathProvider,defaultIgnorableParameterTypes,defaultResponseMessages,defaultParameterDataTypes,jacksonScalaSupport,swaggerApiResourceListing,swaggerGlobalSettings,apiListingReferenceScanner,apiPathProvider];
> root of factory hierarchy
> [info 2016/04/15 14:45:20.794 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] Stopped
> o.e.j.w.WebAppContext@4afbb6c2{/gemfire-api,file:/var/vcap/store/server/GemFire_root/services/http/0.0.0.0_9090_gemfire-api/webapp/,UNAVAILABLE}{/var/vcap/packages/gemfire/Pivotal_GemFire/tools/Extensions/gemfire-api.war}
> {noformat}
> This does not contain the DistributionManager's shutdown message, which
> should look like this:
> DistributionManager stopped in 1035ms.
> In the system that this log is from there were three servers and all were
> "shut down" at the same time. This caused the locator to declare a network
> partition as it did not receive shutdown messages from the servers:
> {noformat}
> [info 2016/04/15 14:45:21.155 UTC locator3 <UDP ucast receiver> tid=0x27]
> Received Suspect notification for member(s)
> [10.0.16.168(cacheserver4:7984)<v6>:14494] from
> 10.0.16.156(locator2:7421:locator)<ec><v1>:3350. Reason=Could not set up
> ping socket
> [info 2016/04/15 14:45:21.193 UTC locator3 <UDP ucast receiver> tid=0x27]
> Received Suspect notification for member(s)
> [10.0.16.167(cacheserver3:7984)<v7>:13905] from
> 10.0.16.156(locator2:7421:locator)<ec><v1>:3350. Reason=Socket was not
> closed nicely
> [info 2016/04/15 14:45:26.150 UTC locator3 <UDP ucast receiver> tid=0x27]
> Received Suspect notification for member(s)
> [10.0.16.168(cacheserver4:7984)<v6>:14494,
> 10.0.16.167(cacheserver3:7984)<v7>:13905] from
> 10.0.16.156(locator2:7421:locator)<ec><v1>:3350. Reason=still being suspected
> [info 2016/04/15 14:45:26.165 UTC locator3 <VERIFY_SUSPECT.TimerThread>
> tid=0xe8] 10.0.16.157(locator3:7369:locator)<ec><v0>:11977: No suspect
> verification response received from 10.0.16.168(cacheserver4:7984)<v6>:14494
> in 5000 milliseconds: I believe it is gone.
> [info 2016/04/15 14:45:26.195 UTC locator3 <VERIFY_SUSPECT.TimerThread>
> tid=0xe8] 10.0.16.157(locator3:7369:locator)<ec><v0>:11977: No suspect
> verification response received from 10.0.16.167(cacheserver3:7984)<v7>:13905
> in 5002 milliseconds: I believe it is gone.
> [info 2016/04/15 14:45:26.318 UTC locator3 <ViewHandler> tid=0x7d]
> Membership: sending new view
> [[10.0.16.157(locator3:7369:locator)<ec><v0>:11977|8]
> [10.0.16.157(locator3:7369:locator)<ec><v0>:11977/48436,
> 10.0.16.156(locator2:7421:locator)<ec><v1>:3350/13719,
> 10.0.16.169(cacheserver5:7959)<v7>:19055/60393] crashed mbrs:
> [10.0.16.168(cacheserver4:7984)<v6>:14494/5420,
> 10.0.16.167(cacheserver3:7984)<v7>:13905/30231]] (3 mbrs)
>
> [info 2016/04/15 14:45:26.320 UTC locator3 <ViewHandler> tid=0x7d]
> member 10.0.16.157(locator3:7369:locator)<ec><v0>:11977 has a weight of 3
> member 10.0.16.156(locator2:7421:locator)<ec><v1>:3350 has a weight of 3
> member 10.0.16.168(cacheserver4:7984)<v6>:14494 has a weight of 15
> member 10.0.16.167(cacheserver3:7984)<v7>:13905 has a weight of 10
> member 10.0.16.169(cacheserver5:7959)<v7>:19055 has a weight of 10
> [info 2016/04/15 14:45:26.320 UTC locator3 <ViewHandler> tid=0x7d] quorum
> weight calculation: oldWeight=41 failureWeight=25 threshold=51% (failure
> weight must be < 21)
> [info 2016/04/15 14:45:26.342 UTC locator3 <FD_SOCK Ping thread> tid=0x7f]
> GemFire failure detection is now monitoring
> 10.0.16.156(locator2:7421:locator)<ec><v1>:3350
> [severe 2016/04/15 14:45:26.354 UTC locator3 <Quorum Lost Notification>
> tid=0xf4] Possible loss of quorum due to the loss of 2 cache processes:
> [10.0.16.168(cacheserver4:7984)<v6>:14494,
> 10.0.16.167(cacheserver3:7984)<v7>:13905]
> [severe 2016/04/15 14:45:26.360 UTC locator3 <CloserThread> tid=0xf1]
> Membership service failure: Channel closed:
> com.gemstone.gemfire.ForcedDisconnectException: Membership coordinator
> 10.0.16.157(locator3:7369:locator)<ec><v0>:11977 has declared that a network
> partition has occurred
> {noformat}
> One way to correct this problem would be to change ShutDownFunction to create
> a non-daemon thread to perform the shutdown and have the function execution
> thread wait for the non-daemon thread to finish before returning.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)