Juan José Ramos Cassella created GEODE-6977:
-----------------------------------------------
Summary: WAN Auto Discovery Failure
Key: GEODE-6977
URL: https://issues.apache.org/jira/browse/GEODE-6977
Project: Geode
Issue Type: Improvement
Components: membership, wan
Reporter: Juan José Ramos Cassella
The topology for the test is 4 clusters with 4 locators and 4 servers each, the
sites are connected using a ring topology (cluster1 knows about cluster2,
cluster2 knows about cluster3, cluster3 knows about cluster4 and cluster4 knows
about cluster1).
After the startup finished, it was found out that one locator (out of sixteen)
didn't know about another locator in other site.
According to the docs:
{noformat}
Discovery for Multi-Site Systems
Each Geode cluster in a WAN configuration uses locators to discover remote
clusters as well as local members.
Each locator in a WAN configuration defines a unique distributed-system-id
property that identifies the local cluster to which it belongs. A locator uses
the remote-locators property to define the addresses of one or more locators in
remote clusters to use for WAN distribution.
When a locator starts up, it contacts each locator that is configured in the
remote-locators property to exchange information about the available locators
and gateway receivers in the cluster. The locator also shares information about
locators and gateway receivers in any other Geode clusters that have connected
to the cluster. Connected clusters can then use the shared gateway receiver
information to distribute region events according to their configured gateway
senders.
Each time a new locator starts up or an existing locator shuts down, the
changed information is broadcast to other connected Geode clusters.
{noformat}
I've added some extra logging to the {{LocatorMembershipListenerImpl}} class
and reproduced the issue from scratch. In this particular run {{locator_1_1}}
({{rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}}) was not found in
{{locator_3_3}} ({{rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}}). The
member {{locator 1_1}} is the last one to startup:
{noformat}
[info 2019/07/12 12:10:47.737 PDT <vm_34_thr_66_locator_1_3_host1_6234>
tid=0x15] Locator started on 10.32.111.135[22488]
[info 2019/07/12 12:10:47.756 PDT <vm_36_thr_68_locator_2_1_host2_8759>
tid=0x17] Locator started on 10.32.109.20[23639]
[info 2019/07/12 12:10:47.809 PDT <vm_44_thr_76_locator_4_1_host2_8840>
tid=0x17] Locator started on 10.32.109.20[24645]
[info 2019/07/12 12:10:48.800 PDT <vm_41_thr_73_locator_3_2_host1_6306>
tid=0x15] Locator started on 10.32.111.135[23280]
[info 2019/07/12 12:10:50.042 PDT <vm_47_thr_79_locator_4_4_host2_8863>
tid=0x15] Locator started on 10.32.109.20[29961]
[info 2019/07/12 12:10:50.162 PDT <vm_38_thr_70_locator_2_3_host2_8780>
tid=0x15] Locator started on 10.32.109.20[27708]
[info 2019/07/12 12:10:50.172 PDT <vm_46_thr_78_locator_4_3_host2_8860>
tid=0x15] Locator started on 10.32.109.20[22615]
[info 2019/07/12 12:10:50.493 PDT <vm_39_thr_71_locator_2_4_host2_8797>
tid=0x17] Locator started on 10.32.109.20[25209]
[info 2019/07/12 12:10:51.052 PDT <vm_37_thr_69_locator_2_2_host2_8766>
tid=0x15] Locator started on 10.32.109.20[24073]
[info 2019/07/12 12:10:51.067 PDT <vm_35_thr_67_locator_1_4_host1_6246>
tid=0x15] Locator started on 10.32.111.135[28694]
[info 2019/07/12 12:10:51.082 PDT <vm_45_thr_77_locator_4_2_host2_8849>
tid=0x15] Locator started on 10.32.109.20[20182]
[info 2019/07/12 12:10:51.112 PDT <vm_42_thr_74_locator_3_3_host1_6314>
tid=0x15] Locator started on 10.32.111.135[22655]
[info 2019/07/12 12:10:51.709 PDT <vm_33_thr_65_locator_1_2_host1_6229>
tid=0x15] Locator started on 10.32.111.135[28293]
[info 2019/07/12 12:10:52.007 PDT <vm_40_thr_72_locator_3_1_host1_6297>
tid=0x15] Locator started on 10.32.111.135[22923]
[info 2019/07/12 12:10:52.625 PDT <vm_43_thr_75_locator_3_4_host1_6326>
tid=0x15] Locator started on 10.32.111.135[26144]
[info 2019/07/12 12:10:56.183 PDT <vm_32_thr_64_locator_1_1_host1_6225>
tid=0x15] Locator started on 10.32.111.135[23955]
{noformat}
After {{locator_1_1}} joins the cluster, every locator informs the other
locators about this new addition (and also about the other locators known so
far), as expected.
The logs below show that all locators tried to share the addition of locator
{{locator_1_1 rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}} with
{{locator_3_3 rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}}:
{noformat}
/locatorgemfire_1_2_6229/system.log:[debug 2019/07/12 12:10:57.153 PDT
<Thread-26> tid=0x7c] TcpClient sending
LocatorJoinMessage{distributedSystemId=1
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
rs-GEM-2603-1105a2i3large-hydra-client-7[28293]} to
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_1_3_6234/system.log:[debug 2019/07/12 12:10:57.192 PDT
<Thread-24> tid=0x6c] TcpClient sending
LocatorJoinMessage{distributedSystemId=1
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
rs-GEM-2603-1105a2i3large-hydra-client-7[22488]} to
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_1_4_6246/system.log:[debug 2019/07/12 12:10:58.011 PDT
<Thread-44> tid=0xa7] TcpClient sending
LocatorJoinMessage{distributedSystemId=1
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
rs-GEM-2603-1105a2i3large-hydra-client-7[28694]} to
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_2_1_8759/system.log:[debug 2019/07/12 12:10:57.796 PDT
<Thread-25> tid=0x6d] TcpClient sending
LocatorJoinMessage{distributedSystemId=1
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
rs-GEM-2603-1105a2i3large-hydra-client-41[23639]} to
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_2_2_8766/system.log:[debug 2019/07/12 12:10:58.334 PDT
<Thread-64> tid=0xa2] TcpClient sending
LocatorJoinMessage{distributedSystemId=1
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
rs-GEM-2603-1105a2i3large-hydra-client-41[24073]} to
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_2_3_8780/system.log:[debug 2019/07/12 12:10:57.368 PDT
<Thread-29> tid=0x78] TcpClient sending
LocatorJoinMessage{distributedSystemId=1
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
rs-GEM-2603-1105a2i3large-hydra-client-41[27708]} to
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_2_4_8797/system.log:[debug 2019/07/12 12:10:57.086 PDT
<Thread-42> tid=0x93] TcpClient sending
LocatorJoinMessage{distributedSystemId=1
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
rs-GEM-2603-1105a2i3large-hydra-client-41[25209]} to
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_3_1_6297/system.log:[debug 2019/07/12 12:10:59.059 PDT
<Thread-23> tid=0x76] TcpClient sending
LocatorJoinMessage{distributedSystemId=1
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
rs-GEM-2603-1105a2i3large-hydra-client-7[22923]} to
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_3_2_6306/system.log:[debug 2019/07/12 12:10:58.027 PDT
<Thread-25> tid=0x6c] TcpClient sending
LocatorJoinMessage{distributedSystemId=1
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
rs-GEM-2603-1105a2i3large-hydra-client-7[23280]} to
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_3_4_6326/system.log:[debug 2019/07/12 12:10:58.890 PDT
<Thread-30> tid=0x7f] TcpClient sending
LocatorJoinMessage{distributedSystemId=1
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
rs-GEM-2603-1105a2i3large-hydra-client-7[26144]} to
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_4_1_8840/system.log:[debug 2019/07/12 12:10:57.506 PDT
<Thread-23> tid=0x74] TcpClient sending
LocatorJoinMessage{distributedSystemId=1
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
rs-GEM-2603-1105a2i3large-hydra-client-41[24645]} to
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_4_2_8849/system.log:[debug 2019/07/12 12:10:57.179 PDT
<Thread-25> tid=0x77] TcpClient sending
LocatorJoinMessage{distributedSystemId=1
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
rs-GEM-2603-1105a2i3large-hydra-client-41[20182]} to
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_4_3_8860/system.log:[debug 2019/07/12 12:10:59.006 PDT
<Thread-53> tid=0xb1] TcpClient sending
LocatorJoinMessage{distributedSystemId=1
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
rs-GEM-2603-1105a2i3large-hydra-client-41[22615]} to
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
./locatorgemfire_4_4_8863/system.log:[debug 2019/07/12 12:10:57.507 PDT
<Thread-28> tid=0xa1] TcpClient sending
LocatorJoinMessage{distributedSystemId=1
locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator :
rs-GEM-2603-1105a2i3large-hydra-client-41[29961]} to
rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
{noformat}
One second afterwards, every single locator also logs that it failed to share
the information about {{locator_1_1
rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}} with {{locator_3_3
rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}}:
{noformat}
./locatorgemfire_1_2_6229/system.log:[debug 2019/07/12 12:10:58.155 PDT
<Thread-26> tid=0x7c] Locator Membership listener could not exchange locator
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_1_3_6234/system.log:[debug 2019/07/12 12:10:58.196 PDT
<Thread-24> tid=0x6c] Locator Membership listener could not exchange locator
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_1_4_6246/system.log:[debug 2019/07/12 12:10:59.012 PDT
<Thread-44> tid=0xa7] Locator Membership listener could not exchange locator
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_2_1_8759/system.log:[debug 2019/07/12 12:10:58.798 PDT
<Thread-25> tid=0x6d] Locator Membership listener could not exchange locator
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_2_2_8766/system.log:[debug 2019/07/12 12:10:59.336 PDT
<Thread-64> tid=0xa2] Locator Membership listener could not exchange locator
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_2_3_8780/system.log:[debug 2019/07/12 12:10:58.371 PDT
<Thread-29> tid=0x78] Locator Membership listener could not exchange locator
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_2_4_8797/system.log:[debug 2019/07/12 12:10:58.093 PDT
<Thread-42> tid=0x93] Locator Membership listener could not exchange locator
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_3_1_6297/system.log:[debug 2019/07/12 12:11:00.061 PDT
<Thread-23> tid=0x76] Locator Membership listener could not exchange locator
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_3_2_6306/system.log:[debug 2019/07/12 12:10:59.029 PDT
<Thread-25> tid=0x6c] Locator Membership listener could not exchange locator
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_3_4_6326/system.log:[debug 2019/07/12 12:10:59.903 PDT
<Thread-30> tid=0x7f] Locator Membership listener could not exchange locator
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_4_1_8840/system.log:[debug 2019/07/12 12:10:58.514 PDT
<Thread-23> tid=0x74] Locator Membership listener could not exchange locator
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_4_2_8849/system.log:[debug 2019/07/12 12:10:58.368 PDT
<Thread-25> tid=0x77] Locator Membership listener could not exchange locator
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_4_3_8860/system.log:[debug 2019/07/12 12:11:00.007 PDT
<Thread-53> tid=0xb1] Locator Membership listener could not exchange locator
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
./locatorgemfire_4_4_8863/system.log:[debug 2019/07/12 12:10:58.515 PDT
<Thread-28> tid=0xa1] Locator Membership listener could not exchange locator
information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with
rs-GEM-2603-1105a2i3large-hydra-client-7:22655
{noformat}
This is also backed up by the fact that {{locator_3_3}}
{{rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}} is the only one that never
logs the reception of the {{LocatorJoinMessage}} for {{locator_1_1
rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}}:
{noformat}
##### grep -lR "LocatorJoinMessage{distributedSystemId=1
locators=rs-GEM-2603-1105a2i3large-hydra-client-7\[23955\]" . | sort
./locatorgemfire_1_1_6225/system.log
./locatorgemfire_1_2_6229/system.log
./locatorgemfire_1_3_6234/system.log
./locatorgemfire_1_4_6246/system.log
./locatorgemfire_2_1_8759/system.log
./locatorgemfire_2_2_8766/system.log
./locatorgemfire_2_3_8780/system.log
./locatorgemfire_2_4_8797/system.log
./locatorgemfire_3_1_6297/system.log
./locatorgemfire_3_2_6306/system.log
./locatorgemfire_3_4_6326/system.log
./locatorgemfire_4_1_8840/system.log
./locatorgemfire_4_2_8849/system.log
./locatorgemfire_4_3_8860/system.log
./locatorgemfire_4_4_8863/system.log
{noformat}
The exception when trying to exchange the locator information
{{rs-GEM-2603-1105a2i3large-hydra-client-7:23955}} with
{{rs-GEM-2603-1105a2i3large-hydra-client-7:22655}} is always the same on every
single locator:
{noformat}
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at
org.apache.geode.internal.net.SocketCreator.connect(SocketCreator.java:898)
at
org.apache.geode.internal.net.SocketCreator.connect(SocketCreator.java:839)
at
org.apache.geode.internal.net.SocketCreator.connect(SocketCreator.java:828)
at
org.apache.geode.distributed.internal.tcpserver.TcpClient.requestToServer(TcpClient.java:205)
at
org.apache.geode.cache.client.internal.locator.wan.LocatorMembershipListenerImpl$1.run(LocatorMembershipListenerImpl.java:120)
{noformat}
Even though there are no errors on the logs for {{locator_3_3}} within the
relevant time frame, the statistics show that the cpu is at 100% and that there
are no samples taken for around 5 seconds, probably due to a short JVM pause or
resource contention, that's probably the reason why the connection from other
locators failed. That said, there were no more additions/deletions of extra
locators within any of the clusters either (which would have fired another
round of messages from every locator to every other known locator about the
addition/deletion), so in the end "it is expected" that {{locator_3_3}} didn't
have the information about {{locator_1_1}} in its internal map of known
locators.
There's no retry logic within {{LocatorJoinMessage.locatorJoined}} and the
timeout of 1 second seems to be too short: any locator might be busy for a
short period of time and, as such, the information about the new locator will
be missed until there's another locator membership change; this ticket is to
change that: increase the timeout and retry a couple of times by default.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)