Bruce Schuchardt created GEODE-1236:
---------------------------------------

             Summary: GFSH shutdown of a cache server causes JVM to exit before 
completing shutdown
                 Key: GEODE-1236
                 URL: https://issues.apache.org/jira/browse/GEODE-1236
             Project: Geode
          Issue Type: Bug
          Components: gfsh, management
            Reporter: Bruce Schuchardt


Logs of a cache server being shut down by gfsh show that the DistributedSystem 
is not completing shutdown before the JVM exits.  This is probably due to all 
non-daemon threads being terminated before shutdown completes.  Other members 
see the "shut down" as a crash and if too many servers are "shut down" in this 
way at the same time it can cause other members to declare a loss of quorum and 
shut themselves down.

Here is the log from a server shut down by gfsh:

{noformat}
[info 2016/04/15 14:45:20.614 UTC cacheserver4 <Function Execution Processor2> 
tid=0x67] Received GFSH shutdown. Shutting down member 
10.0.16.168(cacheserver4:7984)<v6>:14494

[info 2016/04/15 14:45:20.664 UTC cacheserver4 <Function Execution Processor2> 
tid=0x67] GemFireCache[id = 965586344; isClosing = true; isShutDownAll = false; 
closingGatewayHubsByShutdownAll = false; created = Fri Apr 15 14:42:17 UTC 
2016; server = true; copyOnRead = false; lockLease = 120; lockTimeout = 60]: 
Now closing.

[info 2016/04/15 14:45:20.664 UTC cacheserver4 <Function Execution Processor2> 
tid=0x67] Resetting original MemoryPoolMXBean heap threshold bytes 0 on pool 
CMS Old Gen

[info 2016/04/15 14:45:20.669 UTC cacheserver4 <Function Execution Processor2> 
tid=0x67] Bridge server on port 40,404 is shutting down.

[info 2016/04/15 14:45:20.744 UTC cacheserver4 <Function Execution Processor2> 
tid=0x67] Rest Server on port 9,090 is shutting down

[info 2016/04/15 14:45:20.745 UTC cacheserver4 <Function Execution Processor2> 
tid=0x67] Stopping the HTTP service...

[info 2016/04/15 14:45:20.748 UTC cacheserver4 <Function Execution Processor2> 
tid=0x67] Stopped ServerConnector@a8c6d1d{HTTP/1.1}{0.0.0.0:9090}

[info 2016/04/15 14:45:20.748 UTC cacheserver4 <Function Execution Processor2> 
tid=0x67] Destroying Spring FrameworkServlet 'gemfire-api'

[info 2016/04/15 14:45:20.749 UTC cacheserver4 <Function Execution Processor2> 
tid=0x67] Closing WebApplicationContext for namespace 'gemfire-api-servlet': 
startup date [Fri Apr 15 14:42:20 UTC 2016]; root of context hierarchy

[info 2016/04/15 14:45:20.750 UTC cacheserver4 <Function Execution Processor2> 
tid=0x67] Destroying singletons in 
org.springframework.beans.factory.support.DefaultListableBeanFactory@a7f0ab6: 
defining beans 
[org.springframework.context.annotation.internalConfigurationAnnotationProcessor,org.springframework.context.annotation.internalAutowiredAnnotationProcessor,org.springframework.context.annotation.internalRequiredAnnotationProcessor,org.springframework.context.annotation.internalCommonAnnotationProcessor,baseControllerAdvice,queryController,functionController,pdxCrudController,swaggerConfig,org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerMapping#0,org.springframework.format.support.FormattingConversionServiceFactoryBean#0,org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter#0,org.springframework.web.servlet.handler.MappedInterceptor#0,org.springframework.web.servlet.mvc.method.annotation.ExceptionHandlerExceptionResolver#0,org.springframework.web.servlet.mvc.annotation.ResponseStatusExceptionResolver#0,org.springframework.web.servlet.mvc.support.DefaultHandlerExceptionResolver#0,org.springframework.web.servlet.handler.BeanNameUrlHandlerMapping,org.springframework.web.servlet.mvc.HttpRequestHandlerAdapter,org.springframework.web.servlet.mvc.SimpleControllerHandlerAdapter,org.springframework.web.servlet.resource.DefaultServletHttpRequestHandler#0,org.springframework.web.servlet.handler.SimpleUrlHandlerMapping#0,contentNegotiationManager,objectMapper,org.springframework.context.annotation.ConfigurationClassPostProcessor.importAwareProcessor,classOrApiAnnotationResourceGrouping,swaggerCache,defaultSwaggerController,springSwaggerConfig,springSwaggerModelConfig,swaggerRequestMappingHandlerMappings,defaultResourceGroupingStrategy,defaultExcludeAnnotations,defaultSwaggerPathProvider,defaultIgnorableParameterTypes,defaultResponseMessages,defaultParameterDataTypes,jacksonScalaSupport,swaggerApiResourceListing,swaggerGlobalSettings,apiListingReferenceScanner,apiPathProvider];
 root of factory hierarchy

[info 2016/04/15 14:45:20.794 UTC cacheserver4 <Function Execution Processor2> 
tid=0x67] Stopped 
o.e.j.w.WebAppContext@4afbb6c2{/gemfire-api,file:/var/vcap/store/server/GemFire_root/services/http/0.0.0.0_9090_gemfire-api/webapp/,UNAVAILABLE}{/var/vcap/packages/gemfire/Pivotal_GemFire/tools/Extensions/gemfire-api.war}
{noformat}

This does not contain the DistributionManager's shutdown message, which should 
look like this:

DistributionManager stopped in 1035ms.

In the system that this log is from there were three servers and all were "shut 
down" at the same time.  This caused the locator to declare a network partition 
as it did not receive shutdown messages from the servers:

{noformat}
[info 2016/04/15 14:45:21.155 UTC locator3 <UDP ucast receiver> tid=0x27] 
Received Suspect notification for member(s) 
[10.0.16.168(cacheserver4:7984)<v6>:14494] from 
10.0.16.156(locator2:7421:locator)<ec><v1>:3350.  Reason=Could not set up ping 
socket

[info 2016/04/15 14:45:21.193 UTC locator3 <UDP ucast receiver> tid=0x27] 
Received Suspect notification for member(s) 
[10.0.16.167(cacheserver3:7984)<v7>:13905] from 
10.0.16.156(locator2:7421:locator)<ec><v1>:3350.  Reason=Socket was not closed 
nicely

[info 2016/04/15 14:45:26.150 UTC locator3 <UDP ucast receiver> tid=0x27] 
Received Suspect notification for member(s) 
[10.0.16.168(cacheserver4:7984)<v6>:14494, 
10.0.16.167(cacheserver3:7984)<v7>:13905] from 
10.0.16.156(locator2:7421:locator)<ec><v1>:3350.  Reason=still being suspected

[info 2016/04/15 14:45:26.165 UTC locator3 <VERIFY_SUSPECT.TimerThread> 
tid=0xe8] 10.0.16.157(locator3:7369:locator)<ec><v0>:11977: No suspect 
verification response received from 10.0.16.168(cacheserver4:7984)<v6>:14494 in 
5000 milliseconds: I believe it is gone.

[info 2016/04/15 14:45:26.195 UTC locator3 <VERIFY_SUSPECT.TimerThread> 
tid=0xe8] 10.0.16.157(locator3:7369:locator)<ec><v0>:11977: No suspect 
verification response received from 10.0.16.167(cacheserver3:7984)<v7>:13905 in 
5002 milliseconds: I believe it is gone.

[info 2016/04/15 14:45:26.318 UTC locator3 <ViewHandler> tid=0x7d] Membership: 
sending new view [[10.0.16.157(locator3:7369:locator)<ec><v0>:11977|8] 
[10.0.16.157(locator3:7369:locator)<ec><v0>:11977/48436, 
10.0.16.156(locator2:7421:locator)<ec><v1>:3350/13719, 
10.0.16.169(cacheserver5:7959)<v7>:19055/60393] crashed mbrs: 
[10.0.16.168(cacheserver4:7984)<v6>:14494/5420, 
10.0.16.167(cacheserver3:7984)<v7>:13905/30231]] (3 mbrs)
  

[info 2016/04/15 14:45:26.320 UTC locator3 <ViewHandler> tid=0x7d] 
     member 10.0.16.157(locator3:7369:locator)<ec><v0>:11977 has a weight of 3
     member 10.0.16.156(locator2:7421:locator)<ec><v1>:3350 has a weight of 3
     member 10.0.16.168(cacheserver4:7984)<v6>:14494 has a weight of 15
     member 10.0.16.167(cacheserver3:7984)<v7>:13905 has a weight of 10
     member 10.0.16.169(cacheserver5:7959)<v7>:19055 has a weight of 10

[info 2016/04/15 14:45:26.320 UTC locator3 <ViewHandler> tid=0x7d] quorum 
weight calculation:  oldWeight=41 failureWeight=25 threshold=51% (failure 
weight must be < 21)

[info 2016/04/15 14:45:26.342 UTC locator3 <FD_SOCK Ping thread> tid=0x7f] 
GemFire failure detection is now monitoring 
10.0.16.156(locator2:7421:locator)<ec><v1>:3350

[severe 2016/04/15 14:45:26.354 UTC locator3 <Quorum Lost Notification> 
tid=0xf4] Possible loss of quorum due to the loss of 2 cache processes: 
[10.0.16.168(cacheserver4:7984)<v6>:14494, 
10.0.16.167(cacheserver3:7984)<v7>:13905]

[severe 2016/04/15 14:45:26.360 UTC locator3 <CloserThread> tid=0xf1] 
Membership service failure: Channel closed: 
com.gemstone.gemfire.ForcedDisconnectException: Membership coordinator 
10.0.16.157(locator3:7369:locator)<ec><v0>:11977 has declared that a network 
partition has occurred
{noformat}

One way to correct this problem would be to change ShutDownFunction to create a 
non-daemon thread to perform the shutdown and have the function execution 
thread wait for the non-daemon thread to finish before returning.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to