[
https://issues.apache.org/jira/browse/GEODE-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dick Cavender closed GEODE-6977.
--------------------------------
> WAN Auto Discovery Failure
> --------------------------
>
> Key: GEODE-6977
> URL: https://issues.apache.org/jira/browse/GEODE-6977
> Project: Geode
> Issue Type: Improvement
> Components: membership, wan
> Reporter: Juan Ramos
> Assignee: Juan Ramos
> Priority: Major
> Labels: GeodeCommons
> Fix For: 1.10.0
>
> Time Spent: 2.5h
> Remaining Estimate: 0h
>
> The topology for the test is 4 clusters with 4 locators and 4 servers each,
> the sites are connected using a ring topology (cluster1 knows about cluster2,
> cluster2 knows about cluster3, cluster3 knows about cluster4 and cluster4
> knows about cluster1).
> After the startup finished, it was found out that one locator (out of
> sixteen) didn't know about another locator in other site.
> According to the docs:
> {noformat}
> Discovery for Multi-Site Systems
> Each Geode cluster in a WAN configuration uses locators to discover remote
> clusters as well as local members.
> Each locator in a WAN configuration defines a unique distributed-system-id
> property that identifies the local cluster to which it belongs. A locator
> uses the remote-locators property to define the addresses of one or more
> locators in remote clusters to use for WAN distribution.
> When a locator starts up, it contacts each locator that is configured in the
> remote-locators property to exchange information about the available locators
> and gateway receivers in the cluster. The locator also shares information
> about locators and gateway receivers in any other Geode clusters that have
> connected to the cluster. Connected clusters can then use the shared gateway
> receiver information to distribute region events according to their
> configured gateway senders.
> Each time a new locator starts up or an existing locator shuts down, the
> changed information is broadcast to other connected Geode clusters.
> {noformat}
> I've added some extra logging to the {{LocatorMembershipListenerImpl}} class
> and reproduced the issue from scratch. In this particular run {{locator_1_1}}
> ({{rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}}) was not found in
> {{locator_3_3}} ({{rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}}). The
> member {{locator 1_1}} is the last one to startup:
> {noformat}
> [info 2019/07/12 12:10:47.737 PDT <vm_34_thr_66_locator_1_3_host1_6234>
> tid=0x15] Locator started on 10.32.111.135[22488]
> [info 2019/07/12 12:10:47.756 PDT <vm_36_thr_68_locator_2_1_host2_8759>
> tid=0x17] Locator started on 10.32.109.20[23639]
> [info 2019/07/12 12:10:47.809 PDT <vm_44_thr_76_locator_4_1_host2_8840>
> tid=0x17] Locator started on 10.32.109.20[24645]
> [info 2019/07/12 12:10:48.800 PDT <vm_41_thr_73_locator_3_2_host1_6306>
> tid=0x15] Locator started on 10.32.111.135[23280]
> [info 2019/07/12 12:10:50.042 PDT <vm_47_thr_79_locator_4_4_host2_8863>
> tid=0x15] Locator started on 10.32.109.20[29961]
> [info 2019/07/12 12:10:50.162 PDT <vm_38_thr_70_locator_2_3_host2_8780>
> tid=0x15] Locator started on 10.32.109.20[27708]
> [info 2019/07/12 12:10:50.172 PDT <vm_46_thr_78_locator_4_3_host2_8860>
> tid=0x15] Locator started on 10.32.109.20[22615]
> [info 2019/07/12 12:10:50.493 PDT <vm_39_thr_71_locator_2_4_host2_8797>
> tid=0x17] Locator started on 10.32.109.20[25209]
> [info 2019/07/12 12:10:51.052 PDT <vm_37_thr_69_locator_2_2_host2_8766>
> tid=0x15] Locator started on 10.32.109.20[24073]
> [info 2019/07/12 12:10:51.067 PDT <vm_35_thr_67_locator_1_4_host1_6246>
> tid=0x15] Locator started on 10.32.111.135[28694]
> [info 2019/07/12 12:10:51.082 PDT <vm_45_thr_77_locator_4_2_host2_8849>
> tid=0x15] Locator started on 10.32.109.20[20182]
> [info 2019/07/12 12:10:51.112 PDT <vm_42_thr_74_locator_3_3_host1_6314>
> tid=0x15] Locator started on 10.32.111.135[22655]
> [info 2019/07/12 12:10:51.709 PDT <vm_33_thr_65_locator_1_2_host1_6229>
> tid=0x15] Locator started on 10.32.111.135[28293]
> [info 2019/07/12 12:10:52.007 PDT <vm_40_thr_72_locator_3_1_host1_6297>
> tid=0x15] Locator started on 10.32.111.135[22923]
> [info 2019/07/12 12:10:52.625 PDT <vm_43_thr_75_locator_3_4_host1_6326>
> tid=0x15] Locator started on 10.32.111.135[26144]
> [info 2019/07/12 12:10:56.183 PDT <vm_32_thr_64_locator_1_1_host1_6225>
> tid=0x15] Locator started on 10.32.111.135[23955]
> {noformat}
> After {{locator_1_1}} joins the cluster, every locator informs the other
> locators about this new addition (and also about the other locators known so
> far), as expected.
> The logs below show that all locators tried to share the addition of locator
> {{locator_1_1 rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}} with
> {{locator_3_3 rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}}:
> {noformat}
> /locatorgemfire_1_2_6229/system.log:[debug 2019/07/12 12:10:57.153 PDT
> <Thread-26> tid=0x7c] TcpClient sending
> LocatorJoinMessage{distributedSystemId=1
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
> rs-GEM-2603-1105a2i3large-hydra-client-7[28293]} to
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_1_3_6234/system.log:[debug 2019/07/12 12:10:57.192 PDT
> <Thread-24> tid=0x6c] TcpClient sending
> LocatorJoinMessage{distributedSystemId=1
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
> rs-GEM-2603-1105a2i3large-hydra-client-7[22488]} to
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_1_4_6246/system.log:[debug 2019/07/12 12:10:58.011 PDT
> <Thread-44> tid=0xa7] TcpClient sending
> LocatorJoinMessage{distributedSystemId=1
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
> rs-GEM-2603-1105a2i3large-hydra-client-7[28694]} to
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_2_1_8759/system.log:[debug 2019/07/12 12:10:57.796 PDT
> <Thread-25> tid=0x6d] TcpClient sending
> LocatorJoinMessage{distributedSystemId=1
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
> rs-GEM-2603-1105a2i3large-hydra-client-41[23639]} to
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_2_2_8766/system.log:[debug 2019/07/12 12:10:58.334 PDT
> <Thread-64> tid=0xa2] TcpClient sending
> LocatorJoinMessage{distributedSystemId=1
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
> rs-GEM-2603-1105a2i3large-hydra-client-41[24073]} to
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_2_3_8780/system.log:[debug 2019/07/12 12:10:57.368 PDT
> <Thread-29> tid=0x78] TcpClient sending
> LocatorJoinMessage{distributedSystemId=1
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
> rs-GEM-2603-1105a2i3large-hydra-client-41[27708]} to
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_2_4_8797/system.log:[debug 2019/07/12 12:10:57.086 PDT
> <Thread-42> tid=0x93] TcpClient sending
> LocatorJoinMessage{distributedSystemId=1
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
> rs-GEM-2603-1105a2i3large-hydra-client-41[25209]} to
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_3_1_6297/system.log:[debug 2019/07/12 12:10:59.059 PDT
> <Thread-23> tid=0x76] TcpClient sending
> LocatorJoinMessage{distributedSystemId=1
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
> rs-GEM-2603-1105a2i3large-hydra-client-7[22923]} to
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_3_2_6306/system.log:[debug 2019/07/12 12:10:58.027 PDT
> <Thread-25> tid=0x6c] TcpClient sending
> LocatorJoinMessage{distributedSystemId=1
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
> rs-GEM-2603-1105a2i3large-hydra-client-7[23280]} to
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_3_4_6326/system.log:[debug 2019/07/12 12:10:58.890 PDT
> <Thread-30> tid=0x7f] TcpClient sending
> LocatorJoinMessage{distributedSystemId=1
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
> rs-GEM-2603-1105a2i3large-hydra-client-7[26144]} to
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_4_1_8840/system.log:[debug 2019/07/12 12:10:57.506 PDT
> <Thread-23> tid=0x74] TcpClient sending
> LocatorJoinMessage{distributedSystemId=1
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
> rs-GEM-2603-1105a2i3large-hydra-client-41[24645]} to
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_4_2_8849/system.log:[debug 2019/07/12 12:10:57.179 PDT
> <Thread-25> tid=0x77] TcpClient sending
> LocatorJoinMessage{distributedSystemId=1
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
> rs-GEM-2603-1105a2i3large-hydra-client-41[20182]} to
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_4_3_8860/system.log:[debug 2019/07/12 12:10:59.006 PDT
> <Thread-53> tid=0xb1] TcpClient sending
> LocatorJoinMessage{distributedSystemId=1
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
> rs-GEM-2603-1105a2i3large-hydra-client-41[22615]} to
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_4_4_8863/system.log:[debug 2019/07/12 12:10:57.507 PDT
> <Thread-28> tid=0xa1] TcpClient sending
> LocatorJoinMessage{distributedSystemId=1
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
> rs-GEM-2603-1105a2i3large-hydra-client-41[29961]} to
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> {noformat}
> One second afterwards, every single locator also logs that it failed to share
> the information about {{locator_1_1
> rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}} with {{locator_3_3
> rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}}:
> {noformat}
> ./locatorgemfire_1_2_6229/system.log:[debug 2019/07/12 12:10:58.155 PDT
> <Thread-26> tid=0x7c] Locator Membership listener could not exchange locator
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_1_3_6234/system.log:[debug 2019/07/12 12:10:58.196 PDT
> <Thread-24> tid=0x6c] Locator Membership listener could not exchange locator
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_1_4_6246/system.log:[debug 2019/07/12 12:10:59.012 PDT
> <Thread-44> tid=0xa7] Locator Membership listener could not exchange locator
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_2_1_8759/system.log:[debug 2019/07/12 12:10:58.798 PDT
> <Thread-25> tid=0x6d] Locator Membership listener could not exchange locator
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_2_2_8766/system.log:[debug 2019/07/12 12:10:59.336 PDT
> <Thread-64> tid=0xa2] Locator Membership listener could not exchange locator
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_2_3_8780/system.log:[debug 2019/07/12 12:10:58.371 PDT
> <Thread-29> tid=0x78] Locator Membership listener could not exchange locator
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_2_4_8797/system.log:[debug 2019/07/12 12:10:58.093 PDT
> <Thread-42> tid=0x93] Locator Membership listener could not exchange locator
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_3_1_6297/system.log:[debug 2019/07/12 12:11:00.061 PDT
> <Thread-23> tid=0x76] Locator Membership listener could not exchange locator
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_3_2_6306/system.log:[debug 2019/07/12 12:10:59.029 PDT
> <Thread-25> tid=0x6c] Locator Membership listener could not exchange locator
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_3_4_6326/system.log:[debug 2019/07/12 12:10:59.903 PDT
> <Thread-30> tid=0x7f] Locator Membership listener could not exchange locator
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_4_1_8840/system.log:[debug 2019/07/12 12:10:58.514 PDT
> <Thread-23> tid=0x74] Locator Membership listener could not exchange locator
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_4_2_8849/system.log:[debug 2019/07/12 12:10:58.368 PDT
> <Thread-25> tid=0x77] Locator Membership listener could not exchange locator
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_4_3_8860/system.log:[debug 2019/07/12 12:11:00.007 PDT
> <Thread-53> tid=0xb1] Locator Membership listener could not exchange locator
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_4_4_8863/system.log:[debug 2019/07/12 12:10:58.515 PDT
> <Thread-28> tid=0xa1] Locator Membership listener could not exchange locator
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> {noformat}
> This is also backed up by the fact that {{locator_3_3}}
> {{rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}} is the only one that
> never logs the reception of the {{LocatorJoinMessage}} for {{locator_1_1
> rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}}:
> {noformat}
> ##### grep -lR "LocatorJoinMessage{distributedSystemId=1
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7\[23955\]" . | sort
> ./locatorgemfire_1_1_6225/system.log
> ./locatorgemfire_1_2_6229/system.log
> ./locatorgemfire_1_3_6234/system.log
> ./locatorgemfire_1_4_6246/system.log
> ./locatorgemfire_2_1_8759/system.log
> ./locatorgemfire_2_2_8766/system.log
> ./locatorgemfire_2_3_8780/system.log
> ./locatorgemfire_2_4_8797/system.log
> ./locatorgemfire_3_1_6297/system.log
> ./locatorgemfire_3_2_6306/system.log
> ./locatorgemfire_3_4_6326/system.log
> ./locatorgemfire_4_1_8840/system.log
> ./locatorgemfire_4_2_8849/system.log
> ./locatorgemfire_4_3_8860/system.log
> ./locatorgemfire_4_4_8863/system.log
> {noformat}
> The exception when trying to exchange the locator information
> {{rs-GEM-2603-1105a2i3large-hydra-client-7:23955}} with
> {{rs-GEM-2603-1105a2i3large-hydra-client-7:22655}} is always the same on
> every single locator:
> {noformat}
> java.net.SocketTimeoutException: connect timed out
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
> at
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
> at
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at
> org.apache.geode.internal.net.SocketCreator.connect(SocketCreator.java:898)
> at
> org.apache.geode.internal.net.SocketCreator.connect(SocketCreator.java:839)
> at
> org.apache.geode.internal.net.SocketCreator.connect(SocketCreator.java:828)
> at
> org.apache.geode.distributed.internal.tcpserver.TcpClient.requestToServer(TcpClient.java:205)
> at
> org.apache.geode.cache.client.internal.locator.wan.LocatorMembershipListenerImpl$1.run(LocatorMembershipListenerImpl.java:120)
> {noformat}
> Even though there are no errors on the logs for {{locator_3_3}} within the
> relevant time frame, the statistics show that the cpu is at 100% and that
> there are no samples taken for around 5 seconds, probably due to a short JVM
> pause or resource contention, that's probably the reason why the connection
> from other locators failed. That said, there were no more additions/deletions
> of extra locators within any of the clusters either (which would have fired
> another round of messages from every locator to every other known locator
> about the addition/deletion), so in the end "it is expected" that
> {{locator_3_3}} didn't have the information about {{locator_1_1}} in its
> internal map of known locators.
> There's no retry logic within {{LocatorJoinMessage.locatorJoined}} and the
> timeout of 1 second seems to be too short: any locator might be busy for a
> short period of time and, as such, the information about the new locator will
> be missed until there's another locator membership change; this ticket is to
> change that: increase the timeout and retry a couple of times by default.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)