Juan Ramos created GEODE-9512:
---------------------------------

             Summary: java.lang.IllegalStateException: Detected old version 
(pre 5.0.1) of GemFire or non-GemFire during handshake due to initial byte 
being 1
                 Key: GEODE-9512
                 URL: https://issues.apache.org/jira/browse/GEODE-9512
             Project: Geode
          Issue Type: Bug
          Components: membership
    Affects Versions: 1.15.0
            Reporter: Juan Ramos


I've hit this issue while executing some chaos testing over a GemFire cluster 
using 2 locators and 3 servers; {{SSL}} is enabled, and a dummy 
{{SecurityManager}} is configured which authenticates any user and always 
returns {{true}} within the {{authorize}} method. There are 3 
{{PARTITION_REDUNDANT}} regions configured, one per client, each with 1 
redundant copy. Once the cluster is up and running, 3 clients continuously 
execute {{Region.put}} operations on a well known set of keys for its own 
{{Region}} (created with {{PROXY}} type), and another process executes the 
following logic in parallel (pseudocode):
{noformat}
for server in ${servers}
do
        # Pause the JVM for 30 seconds to simulate a stop the world GC
        kill -STOP server 
        sleep 30

        # Unpause the JVM, wait for member to reconnect and regions to recover 
redundancy configured
        kill -CONT "${SERVER_PID}"
        waitForReconnectcompletedInServerLog
        waitForNumBucketsWithoutRedundancyToBeZeroInGfshShowRegionMetrics
done
{noformat}
The test works fine most of the time, but randomly fails due to an unexpected 
exception logged within the logs for at least one locator. The exception is 
always reported from a {{P2P message reader}} thread on the locator for a 
server member that has just returned to life, as an example:
{noformat}
#### LOCATOR-0
[warn 2021/08/17 05:20:45.166 GMT system-test-gemfire-locator-0 <P2P message 
reader for 
system-test-gemfire-server-2(system-test-gemfire-server-2:1)<v4>:41000 unshared 
ordered sender uid=61 dom #1 local port=48141 remote port=46174> tid=0x6f] P2P 
message reader@354fac47 timed out during a membership check.

[fatal 2021/08/17 05:20:45.166 GMT system-test-gemfire-locator-0 <P2P message 
reader for 
system-test-gemfire-server-2(system-test-gemfire-server-2:1)<v4>:41000 unshared 
ordered sender uid=61 dom #1 local port=48141 remote port=46174> tid=0x6f] 
Error deserializing P2P handshake message
java.lang.IllegalStateException: Detected old version (pre 5.0.1) of GemFire or 
non-GemFire during handshake due to initial byte being 1
        at 
org.apache.geode.internal.tcp.Connection.readHandshakeForReceiver(Connection.java:2875)
        at 
org.apache.geode.internal.tcp.Connection.processInputBuffer(Connection.java:2825)
        at 
org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1649)
        at org.apache.geode.internal.tcp.Connection.run(Connection.java:1489)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)


#### SERVER-2
[warn 2021/08/17 05:20:44.012 GMT system-test-gemfire-server-2 <StatSampler> 
tid=0x35] Statistics sampling thread detected a wakeup delay of 29070 ms, 
indicating a possible resource issue. Check the GC, memory, and CPU statistics.

[warn 2021/08/17 05:20:44.015 GMT system-test-gemfire-server-2 <Geode Heartbeat 
Sender> tid=0x23] Failure detection heartbeat-generation thread overslept by 
more than a full period. Asleep time: 31,175,291,931 nanoseconds. Period: 
2,500,000,000 nanoseconds.

[info 2021/08/17 05:20:44.143 GMT system-test-gemfire-server-2 <unicast 
receiver,system-test-gemfire-server-2-25596> tid=0x1e] saving cache server 
configuration for use with the cluster-configuration service on reconnect

[info 2021/08/17 05:20:44.143 GMT system-test-gemfire-server-2 <unicast 
receiver,system-test-gemfire-server-2-25596> tid=0x1e] cache server 
configuration saved

[info 2021/08/17 05:20:44.233 GMT system-test-gemfire-server-2 
<DisconnectThread> tid=0xe5] Stopping membership services

[info 2021/08/17 05:20:44.455 GMT system-test-gemfire-server-2 
<ReconnectThread> tid=0xe5] Disconnecting old DistributedSystem to prepare for 
a reconnect attempt

[info 2021/08/17 05:20:44.463 GMT system-test-gemfire-server-2 
<ReconnectThread> tid=0xe5] GemFireCache[id = 252990056; isClosing = true; 
isShutDownAll = false; created = Tue Aug 17 05:11:50 GMT 2021; server = true; 
copyOnRead = false; lockLease = 120; lockTimeout = 60]: Now closing.

[info 2021/08/17 05:20:44.544 GMT system-test-gemfire-server-2 
<ReconnectThread> tid=0xe5] Cache server on port 40404 is shutting down.

[info 2021/08/17 05:20:44.565 GMT system-test-gemfire-server-2 <Queue Removal 
Thread> tid=0x5e] The QueueRemovalThread is done.
{noformat}
Contrary to what the exception message states, it's worth noticing that *_all 
members within the cluster (including clients) are using the same Geode 
version._*

—

Below are some extra logs from when I was able to reproduce the issue with 
{{log-level=fine}} configured on all members:
{noformat}
#### LOCATOR-0
[debug 2021/08/16 20:40:22.858 GMT system-test-gemfire-locator-0 <P2P message 
reader for 
system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 unshared 
ordered sender uid=47 dom #1 local port=58373 remote port=57818> tid=0x87] P2P 
handshake remoteAddr is 
system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 (GEODE 
1.15.0)

[warn 2021/08/16 20:40:23.858 GMT system-test-gemfire-locator-0 <P2P message 
reader for 
system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 unshared 
ordered sender uid=47 dom #1 local port=58373 remote port=57818> tid=0x87] P2P 
message reader@6673b475 timed out during a membership check.

[fatal 2021/08/16 20:40:23.858 GMT system-test-gemfire-locator-0 <P2P message 
reader for 
system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 unshared 
ordered sender uid=47 dom #1 local port=58373 remote port=57818> tid=0x87] 
Error deserializing P2P handshake message
java.lang.IllegalStateException: Detected old version (pre 5.0.1) of GemFire or 
non-GemFire during handshake due to initial byte being 1
        at 
org.apache.geode.internal.tcp.Connection.readHandshakeForReceiver(Connection.java:2875)
        at 
org.apache.geode.internal.tcp.Connection.processInputBuffer(Connection.java:2825)
        at 
org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1649)
        at org.apache.geode.internal.tcp.Connection.run(Connection.java:1489)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

[debug 2021/08/16 20:40:23.863 GMT system-test-gemfire-locator-0 <P2P message 
reader for 
system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 unshared 
ordered sender uid=47 dom #1 local port=58373 remote port=57818> tid=0x87] 
readMessages terminated id=/10.12.0.9:58373 from 
system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 
isHandshakeReader=false

[debug 2021/08/16 20:40:23.863 GMT system-test-gemfire-locator-0 <P2P message 
reader for 
system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 unshared 
ordered sender uid=47 dom #1 local port=58373 remote port=57818> tid=0x87] 
Stopping P2P message reader@6673b475 for 
system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000


#### SERVER-1
[warn 2021/08/16 20:40:22.698 GMT system-test-gemfire-server-1 <StatSampler> 
tid=0x35] Statistics sampling thread detected a wakeup delay of 29219 ms, 
indicating a possible resource issue. Check the GC, memory, and CPU statistics.

...

[debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 
<ClientMembership Event Invoker1> tid=0x58] Starting thread for P2P handshake 
reader@6161c139-47

[debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <P2P handshake 
reader@6161c139-47> tid=0x33] Starting P2P handshake reader@6161c139-47 on 
Socket[addr=/10.12.0.9,port=58373,localport=57818]

[debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <P2P handshake 
reader@6161c139-47> tid=0x33] readMessages terminated id=/10.12.1.6:44219 from 
system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000
 isHandshakeReader=false

[debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <P2P handshake 
reader@6161c139-47> tid=0x33] Stopping P2P handshake reader@6161c139-47 for 
system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000

[debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 
<ClientMembership Event Invoker1> tid=0x58] ConnectionTable: created an ordered 
connection: 
system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000(uid=47)

[debug 2021/08/16 20:40:23.059 GMT system-test-gemfire-server-1 
<ClientMembership Event Invoker1> tid=0x58] Sending on these 1 connections: 
[system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000(uid=47)]

[debug 2021/08/16 20:40:23.065 GMT system-test-gemfire-server-1 
<ClientMembership Event Invoker1> tid=0x58] distribution of message aborted by 
shutdown: 
UpdateOperation(EntryEventImpl[op=CREATE;region=/_notificationRegion_10.12.1.6<v3>41000;key=org.apache.geode.management.internal.NotificationKey@9b5d432b;callbackArg=null;originRemote=false;originMember=system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000;callbacksInvoked;id=EventID[system-test-gemfire-server-1(system-test-gemfire-server-1)<v3>:41000;threadID=2;sequenceID=5]])
{noformat}
I'm not familiar with this part of the code but, from the above logs, it looks 
like the same reader ({{reader@6161c139-47}}) is being used by two different 
threads at the same time?, is this expected?, might that be the cause of the 
exception we're seeing?.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to