Juan José Ramos Cassella created GEODE-6977:
-----------------------------------------------

             Summary: WAN Auto Discovery Failure
                 Key: GEODE-6977
                 URL: https://issues.apache.org/jira/browse/GEODE-6977
             Project: Geode
          Issue Type: Improvement
          Components: membership, wan
            Reporter: Juan José Ramos Cassella


The topology for the test is 4 clusters with 4 locators and 4 servers each, the 
sites are connected using a ring topology (cluster1 knows about cluster2, 
cluster2 knows about cluster3, cluster3 knows about cluster4 and cluster4 knows 
about cluster1).
After the startup finished, it was found out that one locator (out of sixteen) 
didn't know about another locator in other site.

According to the docs:
{noformat}
Discovery for Multi-Site Systems
Each Geode cluster in a WAN configuration uses locators to discover remote 
clusters as well as local members.
Each locator in a WAN configuration defines a unique distributed-system-id 
property that identifies the local cluster to which it belongs. A locator uses 
the remote-locators property to define the addresses of one or more locators in 
remote clusters to use for WAN distribution.
When a locator starts up, it contacts each locator that is configured in the 
remote-locators property to exchange information about the available locators 
and gateway receivers in the cluster. The locator also shares information about 
locators and gateway receivers in any other Geode clusters that have connected 
to the cluster. Connected clusters can then use the shared gateway receiver 
information to distribute region events according to their configured gateway 
senders.
Each time a new locator starts up or an existing locator shuts down, the 
changed information is broadcast to other connected Geode clusters.
{noformat}

I've added some extra logging to the {{LocatorMembershipListenerImpl}} class 
and reproduced the issue from scratch. In this particular run {{locator_1_1}} 
({{rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}}) was not found in 
{{locator_3_3}} ({{rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}}). The 
member {{locator 1_1}} is the last one to startup:
{noformat}
[info 2019/07/12 12:10:47.737 PDT <vm_34_thr_66_locator_1_3_host1_6234> 
tid=0x15] Locator started on 10.32.111.135[22488]
[info 2019/07/12 12:10:47.756 PDT <vm_36_thr_68_locator_2_1_host2_8759> 
tid=0x17] Locator started on 10.32.109.20[23639]
[info 2019/07/12 12:10:47.809 PDT <vm_44_thr_76_locator_4_1_host2_8840> 
tid=0x17] Locator started on 10.32.109.20[24645]
[info 2019/07/12 12:10:48.800 PDT <vm_41_thr_73_locator_3_2_host1_6306> 
tid=0x15] Locator started on 10.32.111.135[23280]
[info 2019/07/12 12:10:50.042 PDT <vm_47_thr_79_locator_4_4_host2_8863> 
tid=0x15] Locator started on 10.32.109.20[29961]
[info 2019/07/12 12:10:50.162 PDT <vm_38_thr_70_locator_2_3_host2_8780> 
tid=0x15] Locator started on 10.32.109.20[27708]
[info 2019/07/12 12:10:50.172 PDT <vm_46_thr_78_locator_4_3_host2_8860> 
tid=0x15] Locator started on 10.32.109.20[22615]
[info 2019/07/12 12:10:50.493 PDT <vm_39_thr_71_locator_2_4_host2_8797> 
tid=0x17] Locator started on 10.32.109.20[25209]
[info 2019/07/12 12:10:51.052 PDT <vm_37_thr_69_locator_2_2_host2_8766> 
tid=0x15] Locator started on 10.32.109.20[24073]
[info 2019/07/12 12:10:51.067 PDT <vm_35_thr_67_locator_1_4_host1_6246> 
tid=0x15] Locator started on 10.32.111.135[28694]
[info 2019/07/12 12:10:51.082 PDT <vm_45_thr_77_locator_4_2_host2_8849> 
tid=0x15] Locator started on 10.32.109.20[20182]
[info 2019/07/12 12:10:51.112 PDT <vm_42_thr_74_locator_3_3_host1_6314> 
tid=0x15] Locator started on 10.32.111.135[22655]
[info 2019/07/12 12:10:51.709 PDT <vm_33_thr_65_locator_1_2_host1_6229> 
tid=0x15] Locator started on 10.32.111.135[28293]
[info 2019/07/12 12:10:52.007 PDT <vm_40_thr_72_locator_3_1_host1_6297> 
tid=0x15] Locator started on 10.32.111.135[22923]
[info 2019/07/12 12:10:52.625 PDT <vm_43_thr_75_locator_3_4_host1_6326> 
tid=0x15] Locator started on 10.32.111.135[26144]
[info 2019/07/12 12:10:56.183 PDT <vm_32_thr_64_locator_1_1_host1_6225> 
tid=0x15] Locator started on 10.32.111.135[23955]
{noformat}

After {{locator_1_1}} joins the cluster, every locator informs the other 
locators about this new addition (and also about the other locators known so 
far), as expected.
The logs below show that all locators tried to share the addition of locator 
{{locator_1_1 rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}} with 
{{locator_3_3 rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}}:
{noformat}
/locatorgemfire_1_2_6229/system.log:[debug 2019/07/12 12:10:57.153 PDT 
<Thread-26> tid=0x7c] TcpClient sending 
LocatorJoinMessage{distributedSystemId=1 
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
rs-GEM-2603-1105a2i3large-hydra-client-7[28293]} to 
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_1_3_6234/system.log:[debug 2019/07/12 12:10:57.192 PDT 
<Thread-24> tid=0x6c] TcpClient sending 
LocatorJoinMessage{distributedSystemId=1 
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
rs-GEM-2603-1105a2i3large-hydra-client-7[22488]} to 
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_1_4_6246/system.log:[debug 2019/07/12 12:10:58.011 PDT 
<Thread-44> tid=0xa7] TcpClient sending 
LocatorJoinMessage{distributedSystemId=1 
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
rs-GEM-2603-1105a2i3large-hydra-client-7[28694]} to 
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_2_1_8759/system.log:[debug 2019/07/12 12:10:57.796 PDT 
<Thread-25> tid=0x6d] TcpClient sending 
LocatorJoinMessage{distributedSystemId=1 
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
rs-GEM-2603-1105a2i3large-hydra-client-41[23639]} to 
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_2_2_8766/system.log:[debug 2019/07/12 12:10:58.334 PDT 
<Thread-64> tid=0xa2] TcpClient sending 
LocatorJoinMessage{distributedSystemId=1 
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
rs-GEM-2603-1105a2i3large-hydra-client-41[24073]} to 
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_2_3_8780/system.log:[debug 2019/07/12 12:10:57.368 PDT 
<Thread-29> tid=0x78] TcpClient sending 
LocatorJoinMessage{distributedSystemId=1 
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
rs-GEM-2603-1105a2i3large-hydra-client-41[27708]} to 
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_2_4_8797/system.log:[debug 2019/07/12 12:10:57.086 PDT 
<Thread-42> tid=0x93] TcpClient sending 
LocatorJoinMessage{distributedSystemId=1 
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
rs-GEM-2603-1105a2i3large-hydra-client-41[25209]} to 
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_3_1_6297/system.log:[debug 2019/07/12 12:10:59.059 PDT 
<Thread-23> tid=0x76] TcpClient sending 
LocatorJoinMessage{distributedSystemId=1 
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
rs-GEM-2603-1105a2i3large-hydra-client-7[22923]} to 
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_3_2_6306/system.log:[debug 2019/07/12 12:10:58.027 PDT 
<Thread-25> tid=0x6c] TcpClient sending 
LocatorJoinMessage{distributedSystemId=1 
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
rs-GEM-2603-1105a2i3large-hydra-client-7[23280]} to 
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_3_4_6326/system.log:[debug 2019/07/12 12:10:58.890 PDT 
<Thread-30> tid=0x7f] TcpClient sending 
LocatorJoinMessage{distributedSystemId=1 
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
rs-GEM-2603-1105a2i3large-hydra-client-7[26144]} to 
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_4_1_8840/system.log:[debug 2019/07/12 12:10:57.506 PDT 
<Thread-23> tid=0x74] TcpClient sending 
LocatorJoinMessage{distributedSystemId=1 
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
rs-GEM-2603-1105a2i3large-hydra-client-41[24645]} to 
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_4_2_8849/system.log:[debug 2019/07/12 12:10:57.179 PDT 
<Thread-25> tid=0x77] TcpClient sending 
LocatorJoinMessage{distributedSystemId=1 
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
rs-GEM-2603-1105a2i3large-hydra-client-41[20182]} to 
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_4_3_8860/system.log:[debug 2019/07/12 12:10:59.006 PDT 
<Thread-53> tid=0xb1] TcpClient sending 
LocatorJoinMessage{distributedSystemId=1 
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
rs-GEM-2603-1105a2i3large-hydra-client-41[22615]} to 
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_4_4_8863/system.log:[debug 2019/07/12 12:10:57.507 PDT 
<Thread-28> tid=0xa1] TcpClient sending 
LocatorJoinMessage{distributedSystemId=1 
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
rs-GEM-2603-1105a2i3large-hydra-client-41[29961]} to 
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
{noformat}

One second afterwards, every single locator also logs that it failed to share 
the information about {{locator_1_1 
rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}} with {{locator_3_3 
rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}}:
{noformat}
./locatorgemfire_1_2_6229/system.log:[debug 2019/07/12 12:10:58.155 PDT 
<Thread-26> tid=0x7c] Locator Membership listener could not exchange locator 
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_1_3_6234/system.log:[debug 2019/07/12 12:10:58.196 PDT 
<Thread-24> tid=0x6c] Locator Membership listener could not exchange locator 
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_1_4_6246/system.log:[debug 2019/07/12 12:10:59.012 PDT 
<Thread-44> tid=0xa7] Locator Membership listener could not exchange locator 
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_2_1_8759/system.log:[debug 2019/07/12 12:10:58.798 PDT 
<Thread-25> tid=0x6d] Locator Membership listener could not exchange locator 
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_2_2_8766/system.log:[debug 2019/07/12 12:10:59.336 PDT 
<Thread-64> tid=0xa2] Locator Membership listener could not exchange locator 
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_2_3_8780/system.log:[debug 2019/07/12 12:10:58.371 PDT 
<Thread-29> tid=0x78] Locator Membership listener could not exchange locator 
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_2_4_8797/system.log:[debug 2019/07/12 12:10:58.093 PDT 
<Thread-42> tid=0x93] Locator Membership listener could not exchange locator 
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_3_1_6297/system.log:[debug 2019/07/12 12:11:00.061 PDT 
<Thread-23> tid=0x76] Locator Membership listener could not exchange locator 
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_3_2_6306/system.log:[debug 2019/07/12 12:10:59.029 PDT 
<Thread-25> tid=0x6c] Locator Membership listener could not exchange locator 
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_3_4_6326/system.log:[debug 2019/07/12 12:10:59.903 PDT 
<Thread-30> tid=0x7f] Locator Membership listener could not exchange locator 
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_4_1_8840/system.log:[debug 2019/07/12 12:10:58.514 PDT 
<Thread-23> tid=0x74] Locator Membership listener could not exchange locator 
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_4_2_8849/system.log:[debug 2019/07/12 12:10:58.368 PDT 
<Thread-25> tid=0x77] Locator Membership listener could not exchange locator 
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_4_3_8860/system.log:[debug 2019/07/12 12:11:00.007 PDT 
<Thread-53> tid=0xb1] Locator Membership listener could not exchange locator 
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_4_4_8863/system.log:[debug 2019/07/12 12:10:58.515 PDT 
<Thread-28> tid=0xa1] Locator Membership listener could not exchange locator 
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
{noformat}

This is also backed up by the fact that {{locator_3_3}} 
{{rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}} is the only one that never 
logs the reception of the {{LocatorJoinMessage}} for {{locator_1_1 
rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}}:
{noformat}
##### grep -lR "LocatorJoinMessage{distributedSystemId=1 
locators=rs-GEM-2603-1105a2i3large-hydra-client-7\[23955\]" . | sort
./locatorgemfire_1_1_6225/system.log
./locatorgemfire_1_2_6229/system.log
./locatorgemfire_1_3_6234/system.log
./locatorgemfire_1_4_6246/system.log
./locatorgemfire_2_1_8759/system.log
./locatorgemfire_2_2_8766/system.log
./locatorgemfire_2_3_8780/system.log
./locatorgemfire_2_4_8797/system.log
./locatorgemfire_3_1_6297/system.log
./locatorgemfire_3_2_6306/system.log
./locatorgemfire_3_4_6326/system.log
./locatorgemfire_4_1_8840/system.log
./locatorgemfire_4_2_8849/system.log
./locatorgemfire_4_3_8860/system.log
./locatorgemfire_4_4_8863/system.log
{noformat}

The exception when trying to exchange the locator information 
{{rs-GEM-2603-1105a2i3large-hydra-client-7:23955}} with 
{{rs-GEM-2603-1105a2i3large-hydra-client-7:22655}} is always the same on every 
single locator:
{noformat}
java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at 
org.apache.geode.internal.net.SocketCreator.connect(SocketCreator.java:898)
        at 
org.apache.geode.internal.net.SocketCreator.connect(SocketCreator.java:839)
        at 
org.apache.geode.internal.net.SocketCreator.connect(SocketCreator.java:828)
        at 
org.apache.geode.distributed.internal.tcpserver.TcpClient.requestToServer(TcpClient.java:205)
        at 
org.apache.geode.cache.client.internal.locator.wan.LocatorMembershipListenerImpl$1.run(LocatorMembershipListenerImpl.java:120)
{noformat}

Even though there are no errors on the logs for {{locator_3_3}} within the 
relevant time frame, the statistics show that the cpu is at 100% and that there 
are no samples taken for around 5 seconds, probably due to a short JVM pause or 
resource contention, that's probably the reason why the connection from other 
locators failed. That said, there were no more additions/deletions of extra 
locators within any of the clusters either (which would have fired another 
round of messages from every locator to every other known locator about the 
addition/deletion), so in the end "it is expected" that {{locator_3_3}} didn't 
have the information about {{locator_1_1}} in its internal map of known 
locators. 
There's no retry logic within {{LocatorJoinMessage.locatorJoined}} and the 
timeout of 1 second seems to be too short: any locator might be busy for a 
short period of time and, as such, the information about the new locator will 
be missed until there's another locator membership change; this ticket is to 
change that: increase the timeout and retry a couple of times by default.





--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

Reply via email to