[
https://issues.apache.org/jira/browse/GEODE-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250786#comment-15250786
]
ASF subversion and git services commented on GEODE-1236:
--------------------------------------------------------
Commit ea97a536e9175f36c4bc8d69a89d079649d44f82 in incubator-geode's branch
refs/heads/develop from [~jens.deppe]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-geode.git;h=ea97a53 ]
GEODE-1236 GEODE-1248: Fix gfsh sutdown call
- This fixes two issues when using gfsh 'shutdown' command
- One is that the JVM can exit prematurely because all remaining threads
are daemon threads. When coupled with network partition detection this
can result in member departed events causing split brain scenarios -
[GEODE-1236].
- The other issue is that when a member is starting up it may have
synchronized on the CacheFactory class waiting on disk store recovery.
This prevented gfsh shutdown to run as it would also try and
synchronize on the CacheFactory and be blocked.
> GFSH shutdown of a cache server causes JVM to exit before completing shutdown
> -----------------------------------------------------------------------------
>
> Key: GEODE-1236
> URL: https://issues.apache.org/jira/browse/GEODE-1236
> Project: Geode
> Issue Type: Bug
> Components: gfsh, management
> Reporter: Bruce Schuchardt
>
> Logs of a cache server being shut down by gfsh show that the
> DistributedSystem is not completing shutdown before the JVM exits. This is
> probably due to all non-daemon threads being terminated before shutdown
> completes. Other members see the "shut down" as a crash and if too many
> servers are "shut down" in this way at the same time it can cause other
> members to declare a loss of quorum and shut themselves down.
> Here is the log from a server shut down by gfsh:
> {noformat}
> [info 2016/04/15 14:45:20.614 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] Received GFSH shutdown. Shutting down member
> 10.0.16.168(cacheserver4:7984)<v6>:14494
> [info 2016/04/15 14:45:20.664 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] GemFireCache[id = 965586344; isClosing = true;
> isShutDownAll = false; closingGatewayHubsByShutdownAll = false; created = Fri
> Apr 15 14:42:17 UTC 2016; server = true; copyOnRead = false; lockLease = 120;
> lockTimeout = 60]: Now closing.
> [info 2016/04/15 14:45:20.664 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] Resetting original MemoryPoolMXBean heap threshold
> bytes 0 on pool CMS Old Gen
> [info 2016/04/15 14:45:20.669 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] Bridge server on port 40,404 is shutting down.
> [info 2016/04/15 14:45:20.744 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] Rest Server on port 9,090 is shutting down
> [info 2016/04/15 14:45:20.745 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] Stopping the HTTP service...
> [info 2016/04/15 14:45:20.748 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] Stopped ServerConnector@a8c6d1d{HTTP/1.1}{0.0.0.0:9090}
> [info 2016/04/15 14:45:20.748 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] Destroying Spring FrameworkServlet 'gemfire-api'
> [info 2016/04/15 14:45:20.749 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] Closing WebApplicationContext for namespace
> 'gemfire-api-servlet': startup date [Fri Apr 15 14:42:20 UTC 2016]; root of
> context hierarchy
> [info 2016/04/15 14:45:20.750 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] Destroying singletons in
> org.springframework.beans.factory.support.DefaultListableBeanFactory@a7f0ab6:
> defining beans
> [org.springframework.context.annotation.internalConfigurationAnnotationProcessor,org.springframework.context.annotation.internalAutowiredAnnotationProcessor,org.springframework.context.annotation.internalRequiredAnnotationProcessor,org.springframework.context.annotation.internalCommonAnnotationProcessor,baseControllerAdvice,queryController,functionController,pdxCrudController,swaggerConfig,org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerMapping#0,org.springframework.format.support.FormattingConversionServiceFactoryBean#0,org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter#0,org.springframework.web.servlet.handler.MappedInterceptor#0,org.springframework.web.servlet.mvc.method.annotation.ExceptionHandlerExceptionResolver#0,org.springframework.web.servlet.mvc.annotation.ResponseStatusExceptionResolver#0,org.springframework.web.servlet.mvc.support.DefaultHandlerExceptionResolver#0,org.springframework.web.servlet.handler.BeanNameUrlHandlerMapping,org.springframework.web.servlet.mvc.HttpRequestHandlerAdapter,org.springframework.web.servlet.mvc.SimpleControllerHandlerAdapter,org.springframework.web.servlet.resource.DefaultServletHttpRequestHandler#0,org.springframework.web.servlet.handler.SimpleUrlHandlerMapping#0,contentNegotiationManager,objectMapper,org.springframework.context.annotation.ConfigurationClassPostProcessor.importAwareProcessor,classOrApiAnnotationResourceGrouping,swaggerCache,defaultSwaggerController,springSwaggerConfig,springSwaggerModelConfig,swaggerRequestMappingHandlerMappings,defaultResourceGroupingStrategy,defaultExcludeAnnotations,defaultSwaggerPathProvider,defaultIgnorableParameterTypes,defaultResponseMessages,defaultParameterDataTypes,jacksonScalaSupport,swaggerApiResourceListing,swaggerGlobalSettings,apiListingReferenceScanner,apiPathProvider];
> root of factory hierarchy
> [info 2016/04/15 14:45:20.794 UTC cacheserver4 <Function Execution
> Processor2> tid=0x67] Stopped
> o.e.j.w.WebAppContext@4afbb6c2{/gemfire-api,file:/var/vcap/store/server/GemFire_root/services/http/0.0.0.0_9090_gemfire-api/webapp/,UNAVAILABLE}{/var/vcap/packages/gemfire/Pivotal_GemFire/tools/Extensions/gemfire-api.war}
> {noformat}
> This does not contain the DistributionManager's shutdown message, which
> should look like this:
> DistributionManager stopped in 1035ms.
> In the system that this log is from there were three servers and all were
> "shut down" at the same time. This caused the locator to declare a network
> partition as it did not receive shutdown messages from the servers:
> {noformat}
> [info 2016/04/15 14:45:21.155 UTC locator3 <UDP ucast receiver> tid=0x27]
> Received Suspect notification for member(s)
> [10.0.16.168(cacheserver4:7984)<v6>:14494] from
> 10.0.16.156(locator2:7421:locator)<ec><v1>:3350. Reason=Could not set up
> ping socket
> [info 2016/04/15 14:45:21.193 UTC locator3 <UDP ucast receiver> tid=0x27]
> Received Suspect notification for member(s)
> [10.0.16.167(cacheserver3:7984)<v7>:13905] from
> 10.0.16.156(locator2:7421:locator)<ec><v1>:3350. Reason=Socket was not
> closed nicely
> [info 2016/04/15 14:45:26.150 UTC locator3 <UDP ucast receiver> tid=0x27]
> Received Suspect notification for member(s)
> [10.0.16.168(cacheserver4:7984)<v6>:14494,
> 10.0.16.167(cacheserver3:7984)<v7>:13905] from
> 10.0.16.156(locator2:7421:locator)<ec><v1>:3350. Reason=still being suspected
> [info 2016/04/15 14:45:26.165 UTC locator3 <VERIFY_SUSPECT.TimerThread>
> tid=0xe8] 10.0.16.157(locator3:7369:locator)<ec><v0>:11977: No suspect
> verification response received from 10.0.16.168(cacheserver4:7984)<v6>:14494
> in 5000 milliseconds: I believe it is gone.
> [info 2016/04/15 14:45:26.195 UTC locator3 <VERIFY_SUSPECT.TimerThread>
> tid=0xe8] 10.0.16.157(locator3:7369:locator)<ec><v0>:11977: No suspect
> verification response received from 10.0.16.167(cacheserver3:7984)<v7>:13905
> in 5002 milliseconds: I believe it is gone.
> [info 2016/04/15 14:45:26.318 UTC locator3 <ViewHandler> tid=0x7d]
> Membership: sending new view
> [[10.0.16.157(locator3:7369:locator)<ec><v0>:11977|8]
> [10.0.16.157(locator3:7369:locator)<ec><v0>:11977/48436,
> 10.0.16.156(locator2:7421:locator)<ec><v1>:3350/13719,
> 10.0.16.169(cacheserver5:7959)<v7>:19055/60393] crashed mbrs:
> [10.0.16.168(cacheserver4:7984)<v6>:14494/5420,
> 10.0.16.167(cacheserver3:7984)<v7>:13905/30231]] (3 mbrs)
>
> [info 2016/04/15 14:45:26.320 UTC locator3 <ViewHandler> tid=0x7d]
> member 10.0.16.157(locator3:7369:locator)<ec><v0>:11977 has a weight of 3
> member 10.0.16.156(locator2:7421:locator)<ec><v1>:3350 has a weight of 3
> member 10.0.16.168(cacheserver4:7984)<v6>:14494 has a weight of 15
> member 10.0.16.167(cacheserver3:7984)<v7>:13905 has a weight of 10
> member 10.0.16.169(cacheserver5:7959)<v7>:19055 has a weight of 10
> [info 2016/04/15 14:45:26.320 UTC locator3 <ViewHandler> tid=0x7d] quorum
> weight calculation: oldWeight=41 failureWeight=25 threshold=51% (failure
> weight must be < 21)
> [info 2016/04/15 14:45:26.342 UTC locator3 <FD_SOCK Ping thread> tid=0x7f]
> GemFire failure detection is now monitoring
> 10.0.16.156(locator2:7421:locator)<ec><v1>:3350
> [severe 2016/04/15 14:45:26.354 UTC locator3 <Quorum Lost Notification>
> tid=0xf4] Possible loss of quorum due to the loss of 2 cache processes:
> [10.0.16.168(cacheserver4:7984)<v6>:14494,
> 10.0.16.167(cacheserver3:7984)<v7>:13905]
> [severe 2016/04/15 14:45:26.360 UTC locator3 <CloserThread> tid=0xf1]
> Membership service failure: Channel closed:
> com.gemstone.gemfire.ForcedDisconnectException: Membership coordinator
> 10.0.16.157(locator3:7369:locator)<ec><v0>:11977 has declared that a network
> partition has occurred
> {noformat}
> One way to correct this problem would be to change ShutDownFunction to create
> a non-daemon thread to perform the shutdown and have the function execution
> thread wait for the non-daemon thread to finish before returning.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)