[ https://issues.apache.org/jira/browse/GEODE-4990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Barbara Pruijn updated GEODE-4990: ---------------------------------- Fix Version/s: (was: 1.6.0) 1.7.0 > Race Condition When Using Cluster Configuration Service > ------------------------------------------------------- > > Key: GEODE-4990 > URL: https://issues.apache.org/jira/browse/GEODE-4990 > Project: Geode > Issue Type: Bug > Components: configuration, locator > Affects Versions: 1.4.0 > Reporter: Juan José Ramos Cassella > Assignee: Juan José Ramos Cassella > Priority: Major > Labels: pull-request-available > Fix For: 1.7.0 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > There's a small window of time on which the locator > ({{enable-cluster-configuration=true}}) is receiving TCP requests from > starting up servers ({{use-cluster-configuration=true}}) but the cluster > configuration service is not yet fully loaded, because of this the servers > fail to startup and throw the following exception: > {noformat} > Exception in thread "main" org.apache.geode.cache.execute.FunctionException: > java.lang.NullPointerException > at > org.apache.geode.internal.cache.execute.FunctionStreamingResultCollector.getResult(FunctionStreamingResultCollector.java:232) > at > org.apache.geode.internal.cache.ClusterConfigurationLoader.requestConfigurationFromLocators(ClusterConfigurationLoader.java:285) > at > org.apache.geode.internal.cache.GemFireCacheImpl.requestSharedConfiguration(GemFireCacheImpl.java:1036) > at > org.apache.geode.internal.cache.GemFireCacheImpl.<init>(GemFireCacheImpl.java:835) > at > org.apache.geode.internal.cache.GemFireCacheImpl.basicCreate(GemFireCacheImpl.java:775) > at > org.apache.geode.internal.cache.GemFireCacheImpl.create(GemFireCacheImpl.java:764) > at org.apache.geode.cache.CacheFactory.create(CacheFactory.java:175) > at org.apache.geode.cache.CacheFactory.create(CacheFactory.java:222) > at > org.apache.geode.distributed.internal.DefaultServerLauncherCacheProvider.createCache(DefaultServerLauncherCacheProvider.java:52) > at > org.apache.geode.distributed.ServerLauncher.createCache(ServerLauncher.java:844) > at > org.apache.geode.distributed.ServerLauncher.start(ServerLauncher.java:762) > at > org.apache.geode.distributed.ServerLauncher.run(ServerLauncher.java:692) > at > org.apache.geode.distributed.ServerLauncher.main(ServerLauncher.java:226) > Caused by: java.lang.NullPointerException > at > org.apache.geode.management.internal.configuration.functions.GetClusterConfigurationFunction.execute(GetClusterConfigurationFunction.java:55) > at > org.apache.geode.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:186) > at > org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:382) > at > org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:448) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at > org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:1099) > at > org.apache.geode.distributed.internal.DistributionManager.access$000(DistributionManager.java:108) > at > org.apache.geode.distributed.internal.DistributionManager$9$1.run(DistributionManager.java:970) > at java.lang.Thread.run(Thread.java:748) > {noformat} > The logs from the locator, when using {{log-level=fine}} show the following: > {noformat} > [fine 2018/04/03 15:24:40.501 IST locator1 <Function Execution Processor1> > tid=0x51] Exception occurred on remote member while executing Function: > org.apache.geode.management.internal.configuration.functions.GetClusterConfigurationFunction > java.lang.NullPointerException > at > org.apache.geode.management.internal.configuration.functions.GetClusterConfigurationFunction.execute(GetClusterConfigurationFunction.java:55) > at > org.apache.geode.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:186) > at > org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:382) > at > org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:448) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at > org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:1099) > at > org.apache.geode.distributed.internal.DistributionManager.access$000(DistributionManager.java:108) > at > org.apache.geode.distributed.internal.DistributionManager$9$1.run(DistributionManager.java:970) > at java.lang.Thread.run(Thread.java:748) > [fine 2018/04/03 15:24:40.502 IST locator1 <Function Execution Processor1> > tid=0x51] Replying with exception: ReplyMessage processorId=2 from null with > exception org.apache.geode.distributed.internal.ReplyException: > java.lang.NullPointerException > org.apache.geode.distributed.internal.ReplyException: > java.lang.NullPointerException > at > org.apache.geode.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:216) > at > org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:382) > at > org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:448) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at > org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:1099) > at > org.apache.geode.distributed.internal.DistributionManager.access$000(DistributionManager.java:108) > at > org.apache.geode.distributed.internal.DistributionManager$9$1.run(DistributionManager.java:970) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.NullPointerException > at > org.apache.geode.management.internal.configuration.functions.GetClusterConfigurationFunction.execute(GetClusterConfigurationFunction.java:55) > at > org.apache.geode.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:186) > ... 8 more > {noformat} > It doesn't matter wether the user has configured a value for the parameter > {{locator-wait-time}} or not, the value is only used when establishing the > TCP connection and, since the locator is already running, the server > continues the startup process and tries to get the cluster configuration > service, failing afterwards due to the {{NullPointerException}}. > The problematic code can be seen below: > {code:title=GetClusterConfigurationFunction.java|borderStyle=solid} > public class GetClusterConfigurationFunction implements InternalFunction { > private static final Logger logger = LogService.getLogger(); > @Override > public void execute(FunctionContext context) { > InternalClusterConfigurationService clusterConfigurationService = > InternalLocator.getLocator().getSharedConfiguration(); > Set<String> groups = (Set<String>) context.getArguments(); > logger.info("Received request for configuration : {}", groups); > try { > // [XXXXX]: clusterConfigurationService is null as it hasn't been loaded > yet. > ConfigurationResponse response = > clusterConfigurationService.createConfigurationResponse(groups); > context.getResultSender().lastResult(response); > } catch (IOException e) { > logger.error("Unable to retrieve the cluster configuration", e); > context.getResultSender().lastResult(e); > } > } > } > {code} > The servers, on the other hand, already have the logic in place to wait and > retry the retrieval of the cluster configuration service when the response is > {{null}}: > {code:title=ClusterConfigurationLoader.java|borderStyle=solid} > public ConfigurationResponse requestConfigurationFromLocators(String > groupList, > Set<InternalDistributedMember> locatorList) > throws ClusterConfigurationNotAvailableException, UnknownHostException { > Set<String> groups = getGroups(groupList); > GetClusterConfigurationFunction function = new > GetClusterConfigurationFunction(); > ConfigurationResponse response = null; > int attempts = 6; > OUTER: while (attempts > 0) { > for (InternalDistributedMember locator : locatorList) { > logger.info("Attempting to retrieve cluster configuration from {} - > {} attempts remaining", > locator.getName(), attempts); > response = requestConfigurationFromOneLocator(locator, groups); > if (response != null) { > break OUTER; > } > } > try { > Thread.sleep(10000); > } catch (InterruptedException ex) { > break; > } > attempts--; > } > // if the response is null > if (response == null) { > throw new ClusterConfigurationNotAvailableException( > "Unable to retrieve cluster configuration from the locator."); > } > return response; > } > {code} > That said, the fix seems to be quite simple and it implies modifying > {{GetClusterConfigurationFunction}} to return {{null}} whenever the cluster > configuration is enabled but not fully loaded yet, and throw an exception > whenever the cluster configuration service is not enabled on the locator > (allowing us to fail fast when the server is requesting the cluster > configuration from a locator that doesn't have it enabled). -- This message was sent by Atlassian JIRA (v7.6.3#76005)