[
https://issues.apache.org/jira/browse/GEODE-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278313#comment-17278313
]
Donal Evans edited comment on GEODE-8890 at 2/3/21, 7:30 PM:
-------------------------------------------------------------
In the logs associated with the original failure, we see that an earlier call
to {{TcpClient.requestToServer()}} in
{{LocatorDiscovery.exchangeRemoteLocators()}} with a
{{RemoteLocatorJoinRequest}} succeeds, as the following output is logged:
{code:java}
[vm2] [info 2020/12/25 05:33:18.209 GMT locator-2 <WAN Locator Discovery
Thread1> tid=0x52] Locator discovery task exchanged locator information
7a2096ae9648[44531] with localhost[45273]: {1=[7a2096ae9648[45273]],
2=[7a2096ae9648[44531]]}.{code}
Next, the code sleeps for {{WAN_LOCATOR_PING_INTERVAL}} ms (default value is 10
seconds), then attempts to ping the remote locator with a call to
{{TcpClient.requestToServer()}} with a {{RemoteLocatorPingRequest}}. It is this
second call that results in the {{ClassCastException}}, as can be seen by the
time stamp on the fatal log ({{2020/12/25 05:33:28.218 GMT}}), which is almost
exactly 10 seconds after the first log.
What is strange is that the test which logged the exception started after the
first logged message from the WAN Locator Discovery Thread:
{code:java}
[vm3] [info 2020/12/25 05:33:26.063 GMT <Geode Failure Detection Server thread
1> tid=0x33] GMSHealthMonitor server thread exiting [setup] START TEST
org.apache.geode.internal.cache.wan.serial.WANHostNameVerificationDistributedTest.enableHostNameValidationAcrossAllComponents
[vm0] [info 2020/12/25 05:33:26.065 GMT locator-0 <RMI TCP
Connection(5)-172.17.0.10> tid=0x5a] Received method:
org.apache.geode.test.dunit.rules.DistributedRestoreSystemProperties$$Lambda$19/39844442.run
with 0 args on object:{code}
which implies that the thread doing the logging was started by a previous test.
When searching the logs for the ports used by the locators in the first logged
message (44531 and 45273), we see that both locators that are mentioned stop
before the test that logged the exception started:
{code:java}
[vm2] [info 2020/12/25 05:33:25.705 GMT locator-2 <RMI TCP
Connection(1)-172.17.0.10> tid=0x20] Distribution Locator on
7a2096ae9648[44531] is stopped
{code}
{code:java}
[vm0] [info 2020/12/25 05:33:25.989 GMT locator-0 <RMI TCP
Connection(5)-172.17.0.10> tid=0x5a] Distribution Locator on
7a2096ae9648[45273] is stopped {code}
but also see that a locator with the same port (45273) is started shortly
before the test logged the exception:
{code:java}
[vm0] [info 2020/12/25 05:33:27.188 GMT locator-0 <RMI TCP
Connection(5)-172.17.0.10> tid=0x5a] DistributionManager
172.17.0.10(locator-0:245:locator)<ec><v0>:41001 started on
7a2096ae9648[45273]. There were 0 other DMs. others: [] (took 637 ms) (locator)
{code}
which may be relevant.
was (Author: donalevans):
In the logs associated with the original failure, we see that an earlier call
to {{TcpClient.requestToServer()}} in
{{LocatorDiscovery.exchangeRemoteLocators()}} with a
{{RemoteLocatorJoinRequest}} succeeds, as the following output is logged:
{code:java}
[vm2] [info 2020/12/25 05:33:18.209 GMT locator-2 <WAN Locator Discovery
Thread1> tid=0x52] Locator discovery task exchanged locator information
7a2096ae9648[44531] with localhost[45273]: {1=[7a2096ae9648[45273]],
2=[7a2096ae9648[44531]]}.{code}
Next, the code sleeps for {{WAN_LOCATOR_PING_INTERVAL}} ms (default value is 10
seconds), then attempts to ping the remote locator with a call to
{{TcpClient.requestToServer()}} with a {{RemoteLocatorPingRequest}}. It is this
second call that results in the {{ClassCastException}}, as can be seen by the
time stamp on the fatal log ({{2020/12/25 05:33:28.218 GMT}}), which is almost
exactly 10 seconds after the first log.
What is strange is that the test which logged the exception started after the
first logged message from the WAN Locator Discovery Thread:
{code:java}
[vm3] [info 2020/12/25 05:33:26.063 GMT <Geode Failure Detection Server thread
1> tid=0x33] GMSHealthMonitor server thread exiting [setup] START TEST
org.apache.geode.internal.cache.wan.serial.WANHostNameVerificationDistributedTest.enableHostNameValidationAcrossAllComponents
[vm0] [info 2020/12/25 05:33:26.065 GMT locator-0 <RMI TCP
Connection(5)-172.17.0.10> tid=0x5a] Received method:
org.apache.geode.test.dunit.rules.DistributedRestoreSystemProperties$$Lambda$19/39844442.run
with 0 args on object:{code}
which implies that the thread doing the logging was started by a previous test.
When searching the logs for the ports used by the locators in the first logged
message (44531 and 45273), we see that both locators that are mentioned stop
before the test that logged the exception started:
{code:java}
[vm2] [info 2020/12/25 05:33:25.705 GMT locator-2 <RMI TCP
Connection(1)-172.17.0.10> tid=0x20] Distribution Locator on
7a2096ae9648[44531] is stopped
{code}
{code:java}
[vm0] [info 2020/12/25 05:33:25.989 GMT locator-0 <RMI TCP
Connection(5)-172.17.0.10> tid=0x5a] Distribution Locator on
7a2096ae9648[45273] is stopped {code}
but also see that a locator with the same port (45273) is started shortly
before the test logged the exception:
{code:java}
[vm0] [info 2020/12/25 05:33:27.188 GMT locator-0 <RMI TCP
Connection(5)-172.17.0.10> tid=0x5a] DistributionManager
172.17.0.10(locator-0:245:locator)<ec><v0>:41001 started on
7a2096ae9648[45273]. There were 0 other DMs. others: [] (took 637 ms) (locator)
{code}
which may be relevant.
> ClassCastException in LocatorDiscovery.exchangeRemoteLocators
> -------------------------------------------------------------
>
> Key: GEODE-8890
> URL: https://issues.apache.org/jira/browse/GEODE-8890
> Project: Geode
> Issue Type: Bug
> Components: wan
> Affects Versions: 1.10.0, 1.11.0, 1.12.0, 1.13.1, 1.14.0
> Reporter: Donal Evans
> Assignee: Donal Evans
> Priority: Major
> Labels: blocks-1.14.0, pull-request-available
>
> {noformat}
> org.apache.geode.internal.cache.wan.serial.WANHostNameVerificationDistributedTest
> > enableHostNameValidationAcrossAllComponents FAILED
> java.lang.AssertionError: Suspicious strings were written to the log
> during this run.
> Fix the strings or use IgnoredException.addIgnoredException to ignore.
> -----------------------------------------------------------------------
> Found suspect string in log4j at line 682
> [fatal 2020/12/25 05:33:28.218 GMT <WAN Locator Discovery Thread1>
> tid=82] Uncaught exception in thread Thread[WAN Locator Discovery
> Thread1,5,RMI Runtime]
> java.lang.ClassCastException: java.lang.Class cannot be cast to
> org.apache.geode.cache.client.internal.locator.wan.RemoteLocatorPingResponse
> at
> org.apache.geode.cache.client.internal.locator.wan.LocatorDiscovery.exchangeRemoteLocators(LocatorDiscovery.java:191)
> at
> org.apache.geode.cache.client.internal.locator.wan.LocatorDiscovery$RemoteLocatorDiscovery.run(LocatorDiscovery.java:121)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)