[jira] [Commented] (GEODE-1236) GFSH shutdown of a cache server causes JVM to exit before completing shutdown

Bruce Schuchardt (JIRA) Fri, 15 Apr 2016 11:37:29 -0700

    [ 
https://issues.apache.org/jira/browse/GEODE-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243386#comment-15243386
 ]


Bruce Schuchardt commented on GEODE-1236:
-----------------------------------------

This also affects GemFire 8.x and should be fixed there as well.

> GFSH shutdown of a cache server causes JVM to exit before completing shutdown
> -----------------------------------------------------------------------------
>
>                 Key: GEODE-1236
>                 URL: https://issues.apache.org/jira/browse/GEODE-1236
>             Project: Geode
>          Issue Type: Bug
>          Components: gfsh, management
>            Reporter: Bruce Schuchardt
>
> Logs of a cache server being shut down by gfsh show that the 
> DistributedSystem is not completing shutdown before the JVM exits.  This is 
> probably due to all non-daemon threads being terminated before shutdown 
> completes.  Other members see the "shut down" as a crash and if too many 
> servers are "shut down" in this way at the same time it can cause other 
> members to declare a loss of quorum and shut themselves down.
> Here is the log from a server shut down by gfsh:
> {noformat}
> [info 2016/04/15 14:45:20.614 UTC cacheserver4 <Function Execution 
> Processor2> tid=0x67] Received GFSH shutdown. Shutting down member 
> 10.0.16.168(cacheserver4:7984)<v6>:14494
> [info 2016/04/15 14:45:20.664 UTC cacheserver4 <Function Execution 
> Processor2> tid=0x67] GemFireCache[id = 965586344; isClosing = true; 
> isShutDownAll = false; closingGatewayHubsByShutdownAll = false; created = Fri 
> Apr 15 14:42:17 UTC 2016; server = true; copyOnRead = false; lockLease = 120; 
> lockTimeout = 60]: Now closing.
> [info 2016/04/15 14:45:20.664 UTC cacheserver4 <Function Execution 
> Processor2> tid=0x67] Resetting original MemoryPoolMXBean heap threshold 
> bytes 0 on pool CMS Old Gen
> [info 2016/04/15 14:45:20.669 UTC cacheserver4 <Function Execution 
> Processor2> tid=0x67] Bridge server on port 40,404 is shutting down.
> [info 2016/04/15 14:45:20.744 UTC cacheserver4 <Function Execution 
> Processor2> tid=0x67] Rest Server on port 9,090 is shutting down
> [info 2016/04/15 14:45:20.745 UTC cacheserver4 <Function Execution 
> Processor2> tid=0x67] Stopping the HTTP service...
> [info 2016/04/15 14:45:20.748 UTC cacheserver4 <Function Execution 
> Processor2> tid=0x67] Stopped ServerConnector@a8c6d1d{HTTP/1.1}{0.0.0.0:9090}
> [info 2016/04/15 14:45:20.748 UTC cacheserver4 <Function Execution 
> Processor2> tid=0x67] Destroying Spring FrameworkServlet 'gemfire-api'
> [info 2016/04/15 14:45:20.749 UTC cacheserver4 <Function Execution 
> Processor2> tid=0x67] Closing WebApplicationContext for namespace 
> 'gemfire-api-servlet': startup date [Fri Apr 15 14:42:20 UTC 2016]; root of 
> context hierarchy
> [info 2016/04/15 14:45:20.750 UTC cacheserver4 <Function Execution 
> Processor2> tid=0x67] Destroying singletons in 
> org.springframework.beans.factory.support.DefaultListableBeanFactory@a7f0ab6: 
> defining beans 
> [org.springframework.context.annotation.internalConfigurationAnnotationProcessor,org.springframework.context.annotation.internalAutowiredAnnotationProcessor,org.springframework.context.annotation.internalRequiredAnnotationProcessor,org.springframework.context.annotation.internalCommonAnnotationProcessor,baseControllerAdvice,queryController,functionController,pdxCrudController,swaggerConfig,org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerMapping#0,org.springframework.format.support.FormattingConversionServiceFactoryBean#0,org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter#0,org.springframework.web.servlet.handler.MappedInterceptor#0,org.springframework.web.servlet.mvc.method.annotation.ExceptionHandlerExceptionResolver#0,org.springframework.web.servlet.mvc.annotation.ResponseStatusExceptionResolver#0,org.springframework.web.servlet.mvc.support.DefaultHandlerExceptionResolver#0,org.springframework.web.servlet.handler.BeanNameUrlHandlerMapping,org.springframework.web.servlet.mvc.HttpRequestHandlerAdapter,org.springframework.web.servlet.mvc.SimpleControllerHandlerAdapter,org.springframework.web.servlet.resource.DefaultServletHttpRequestHandler#0,org.springframework.web.servlet.handler.SimpleUrlHandlerMapping#0,contentNegotiationManager,objectMapper,org.springframework.context.annotation.ConfigurationClassPostProcessor.importAwareProcessor,classOrApiAnnotationResourceGrouping,swaggerCache,defaultSwaggerController,springSwaggerConfig,springSwaggerModelConfig,swaggerRequestMappingHandlerMappings,defaultResourceGroupingStrategy,defaultExcludeAnnotations,defaultSwaggerPathProvider,defaultIgnorableParameterTypes,defaultResponseMessages,defaultParameterDataTypes,jacksonScalaSupport,swaggerApiResourceListing,swaggerGlobalSettings,apiListingReferenceScanner,apiPathProvider];
>  root of factory hierarchy
> [info 2016/04/15 14:45:20.794 UTC cacheserver4 <Function Execution 
> Processor2> tid=0x67] Stopped 
> o.e.j.w.WebAppContext@4afbb6c2{/gemfire-api,file:/var/vcap/store/server/GemFire_root/services/http/0.0.0.0_9090_gemfire-api/webapp/,UNAVAILABLE}{/var/vcap/packages/gemfire/Pivotal_GemFire/tools/Extensions/gemfire-api.war}
> {noformat}
> This does not contain the DistributionManager's shutdown message, which 
> should look like this:
> DistributionManager stopped in 1035ms.
> In the system that this log is from there were three servers and all were 
> "shut down" at the same time.  This caused the locator to declare a network 
> partition as it did not receive shutdown messages from the servers:
> {noformat}
> [info 2016/04/15 14:45:21.155 UTC locator3 <UDP ucast receiver> tid=0x27] 
> Received Suspect notification for member(s) 
> [10.0.16.168(cacheserver4:7984)<v6>:14494] from 
> 10.0.16.156(locator2:7421:locator)<ec><v1>:3350.  Reason=Could not set up 
> ping socket
> [info 2016/04/15 14:45:21.193 UTC locator3 <UDP ucast receiver> tid=0x27] 
> Received Suspect notification for member(s) 
> [10.0.16.167(cacheserver3:7984)<v7>:13905] from 
> 10.0.16.156(locator2:7421:locator)<ec><v1>:3350.  Reason=Socket was not 
> closed nicely
> [info 2016/04/15 14:45:26.150 UTC locator3 <UDP ucast receiver> tid=0x27] 
> Received Suspect notification for member(s) 
> [10.0.16.168(cacheserver4:7984)<v6>:14494, 
> 10.0.16.167(cacheserver3:7984)<v7>:13905] from 
> 10.0.16.156(locator2:7421:locator)<ec><v1>:3350.  Reason=still being suspected
> [info 2016/04/15 14:45:26.165 UTC locator3 <VERIFY_SUSPECT.TimerThread> 
> tid=0xe8] 10.0.16.157(locator3:7369:locator)<ec><v0>:11977: No suspect 
> verification response received from 10.0.16.168(cacheserver4:7984)<v6>:14494 
> in 5000 milliseconds: I believe it is gone.
> [info 2016/04/15 14:45:26.195 UTC locator3 <VERIFY_SUSPECT.TimerThread> 
> tid=0xe8] 10.0.16.157(locator3:7369:locator)<ec><v0>:11977: No suspect 
> verification response received from 10.0.16.167(cacheserver3:7984)<v7>:13905 
> in 5002 milliseconds: I believe it is gone.
> [info 2016/04/15 14:45:26.318 UTC locator3 <ViewHandler> tid=0x7d] 
> Membership: sending new view 
> [[10.0.16.157(locator3:7369:locator)<ec><v0>:11977|8] 
> [10.0.16.157(locator3:7369:locator)<ec><v0>:11977/48436, 
> 10.0.16.156(locator2:7421:locator)<ec><v1>:3350/13719, 
> 10.0.16.169(cacheserver5:7959)<v7>:19055/60393] crashed mbrs: 
> [10.0.16.168(cacheserver4:7984)<v6>:14494/5420, 
> 10.0.16.167(cacheserver3:7984)<v7>:13905/30231]] (3 mbrs)
>   
> [info 2016/04/15 14:45:26.320 UTC locator3 <ViewHandler> tid=0x7d] 
>      member 10.0.16.157(locator3:7369:locator)<ec><v0>:11977 has a weight of 3
>      member 10.0.16.156(locator2:7421:locator)<ec><v1>:3350 has a weight of 3
>      member 10.0.16.168(cacheserver4:7984)<v6>:14494 has a weight of 15
>      member 10.0.16.167(cacheserver3:7984)<v7>:13905 has a weight of 10
>      member 10.0.16.169(cacheserver5:7959)<v7>:19055 has a weight of 10
> [info 2016/04/15 14:45:26.320 UTC locator3 <ViewHandler> tid=0x7d] quorum 
> weight calculation:  oldWeight=41 failureWeight=25 threshold=51% (failure 
> weight must be < 21)
> [info 2016/04/15 14:45:26.342 UTC locator3 <FD_SOCK Ping thread> tid=0x7f] 
> GemFire failure detection is now monitoring 
> 10.0.16.156(locator2:7421:locator)<ec><v1>:3350
> [severe 2016/04/15 14:45:26.354 UTC locator3 <Quorum Lost Notification> 
> tid=0xf4] Possible loss of quorum due to the loss of 2 cache processes: 
> [10.0.16.168(cacheserver4:7984)<v6>:14494, 
> 10.0.16.167(cacheserver3:7984)<v7>:13905]
> [severe 2016/04/15 14:45:26.360 UTC locator3 <CloserThread> tid=0xf1] 
> Membership service failure: Channel closed: 
> com.gemstone.gemfire.ForcedDisconnectException: Membership coordinator 
> 10.0.16.157(locator3:7369:locator)<ec><v0>:11977 has declared that a network 
> partition has occurred
> {noformat}
> One way to correct this problem would be to change ShutDownFunction to create 
> a non-daemon thread to perform the shutdown and have the function execution 
> thread wait for the non-daemon thread to finish before returning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (GEODE-1236) GFSH shutdown of a cache server causes JVM to exit before completing shutdown

Reply via email to