[
https://issues.apache.org/jira/browse/IGNITE-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546206#comment-15546206
]
Anton Vinogradov commented on IGNITE-1924:
------------------------------------------
Situation:
Node configured to use SSL.
Node 1 sends message to node 2
This cause client creation at node 1
This cause sending HandshakeMessage from node 1 to node 2
This cause client creation at node 2 and RecoveryLastReceivedMessage sending to
node 1 (which should finish handshale with node 1).
Current implementation finishes client future at node 2 right after sending
RecoveryLastReceivedMessage and allows to use it.
In case TcpCommunicationSpi.sendMessage0() will be called at node 2 right after
client created, this message can be mixed with RecoveryLastReceivedMessage
(They both will be sent as a one SSL message).
At node 1 at safeHandshake we can see buf = ByteBuffer.allocate(1000); which
will read both messages and second message will be lost.
So, this should be fixed.
checked following solutions:
1) Do not allow to mix handshake messages with another by not finishing client
future untill first message received from node 1.
Implemented and checked it works (at my tests)
Can cause hang while client at node 1 creating.
2) Improved code to send RecoveryLastReceivedMessage as a separate SSL message.
This allow to read both message from socket, then somehow (for example increase
by 1 byte until decoded) find size of first message,
then gain second SSL message, save it to session meta and handle it under
NioClientWorker.processSelectedKeysOptimized
Problem that ssl message can't be decoded twice, this cause
javax.net.ssl.SSLException: bad record MAC
3) Gain both messages (decoded) and save second to session meta.
After that it can be added to decoded buffer at SslHandler.onMessageReceived
In this case sometimes I see java.lang.AssertionError: Missed message future,
when not whole message read from socket.
> Incomplete marshaller cache rebalancing causes Grid hangs
> ---------------------------------------------------------
>
> Key: IGNITE-1924
> URL: https://issues.apache.org/jira/browse/IGNITE-1924
> Project: Ignite
> Issue Type: Bug
> Reporter: Anton Vinogradov
> Priority: Critical
> Labels: Muted_test
> Fix For: 1.8
>
>
> End of the log.
> [11:49:32] : [org.apache.ignite:ignite-core] [11:49:32,947][INFO
> ][exchange-worker-#220584%tcp.IgniteCacheSslStartStopSelfTest3%][GridDhtPartitionDemander]
> <ignite-marshaller-sys-cache> Starting rebalancing
> [cache=ignite-marshaller-sys-cache, mode=SYNC,
> fromNode=108bffdb-1c1e-49aa-9525-b434784fa001, partitionsCount=7,
> topology=AffinityTopologyVersion [topVer=594, minorTopVer=0], updateSeq=1]
> [11:49:32] : [org.apache.ignite:ignite-core] [11:49:32,962][INFO
> ][exchange-worker-#220584%tcp.IgniteCacheSslStartStopSelfTest3%][GridDhtPartitionDemander]
> <ignite-marshaller-sys-cache> Starting rebalancing
> [cache=ignite-marshaller-sys-cache, mode=SYNC,
> fromNode=20660c29-91a1-4279-9dc1-88d192bc6002, partitionsCount=6,
> topology=AffinityTopologyVersion [topVer=594, minorTopVer=0], updateSeq=1]
> [11:49:32] : [org.apache.ignite:ignite-core] [11:49:32,962][INFO
> ][exchange-worker-#220584%tcp.IgniteCacheSslStartStopSelfTest3%][GridDhtPartitionDemander]
> <ignite-marshaller-sys-cache> Starting rebalancing
> [cache=ignite-marshaller-sys-cache, mode=SYNC,
> fromNode=00b3a75a-074d-46a5-a158-3956c0ec4000, partitionsCount=7,
> topology=AffinityTopologyVersion [topVer=594, minorTopVer=0], updateSeq=1]
> [11:49:32] : [org.apache.ignite:ignite-core] [11:49:32,963][INFO
> ][ignite-#220587%marshaller-cache-tcp.IgniteCacheSslStartStopSelfTest3%][GridDhtPartitionDemander]
> <ignite-marshaller-sys-cache> Completed rebalancing
> [cache=ignite-marshaller-sys-cache,
> fromNode=00b3a75a-074d-46a5-a158-3956c0ec4000,
> topology=AffinityTopologyVersion [topVer=594, minorTopVer=0], time=21 ms]
> [11:49:32] : [org.apache.ignite:ignite-core] [11:49:32,963][INFO
> ][ignite-#220586%marshaller-cache-tcp.IgniteCacheSslStartStopSelfTest3%][GridDhtPartitionDemander]
> <ignite-marshaller-sys-cache> Completed rebalancing
> [cache=ignite-marshaller-sys-cache,
> fromNode=108bffdb-1c1e-49aa-9525-b434784fa001,
> topology=AffinityTopologyVersion [topVer=594, minorTopVer=0], time=21 ms]
> Hang on:
> [11:51:56] : [org.apache.ignite:ignite-core] Thread
> [name="ignite-#220562%sys-tcp.IgniteCacheSslStartStopSelfTest3%", id=287517,
> state=WAITING, blockCnt=0, waitCnt=3]
> [11:51:56] : [org.apache.ignite:ignite-core] Lock
> [object=o.a.i.i.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander$RebalanceFuture@b402f89,
> ownerName=null, ownerId=-1]
> [11:51:56] : [org.apache.ignite:ignite-core] at
> sun.misc.Unsafe.park(Native Method)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:157)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:115)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.waitForCacheRebalancing(GridDhtPartitionDemander.java:265)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.access$400(GridDhtPartitionDemander.java:85)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander$3.call(GridDhtPartitionDemander.java:323)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander$3.call(GridDhtPartitionDemander.java:320)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker$1.call(GridCachePartitionExchangeManager.java:1386)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker$1.call(GridCachePartitionExchangeManager.java:1377)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6371)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:929)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.util.worker.GridWorker.run(GridWorker.java:110)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> java.lang.Thread.run(Thread.java:745)
> [11:51:56] : [org.apache.ignite:ignite-core]
> [11:51:56] : [org.apache.ignite:ignite-core] Locked
> synchronizers:
> [11:51:56] : [org.apache.ignite:ignite-core]
> java.util.concurrent.ThreadPoolExecutor$Worker@7a9245cc
> [11:51:56] : [org.apache.ignite:ignite-core] Thread
> [name="ignite-#220561%sys-tcp.IgniteCacheSslStartStopSelfTest3%", id=287516,
> state=WAITING, blockCnt=0, waitCnt=2]
> [11:51:56] : [org.apache.ignite:ignite-core] Lock
> [object=java.util.concurrent.CountDownLatch$Sync@22f0d124, ownerName=null,
> ownerId=-1]
> [11:51:56] : [org.apache.ignite:ignite-core] at
> sun.misc.Unsafe.park(Native Method)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> java.util.concurrent.CountDownLatch.await(CountDownLatch.java:236)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.util.IgniteUtils.awaitQuiet(IgniteUtils.java:7201)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.MarshallerContextImpl.className(MarshallerContextImpl.java:143)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.MarshallerContextAdapter.getClass(MarshallerContextAdapter.java:174)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.marshaller.optimized.OptimizedMarshallerUtils.classDescriptor(OptimizedMarshallerUtils.java:257)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.marshaller.optimized.OptimizedObjectInputStream.readObjectOverride(OptimizedObjectInputStream.java:309)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:364)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.marshaller.optimized.OptimizedMarshaller.unmarshal(OptimizedMarshaller.java:248)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.processors.cache.GridCacheMessage.unmarshalCollection(GridCacheMessage.java:599)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.processors.cache.distributed.dht.atomic.GridNearAtomicUpdateRequest.finishUnmarshal(GridNearAtomicUpdateRequest.java:584)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.processors.cache.GridCacheIoManager.unmarshall(GridCacheIoManager.java:996)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:268)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:197)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.processors.cache.GridCacheIoManager.access$000(GridCacheIoManager.java:76)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.processors.cache.GridCacheIoManager$1$1$1.run(GridCacheIoManager.java:150)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6427)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.processors.closure.GridClosureProcessor$1.body(GridClosureProcessor.java:788)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> o.a.i.i.util.worker.GridWorker.run(GridWorker.java:110)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> [11:51:56] : [org.apache.ignite:ignite-core] at
> java.lang.Thread.run(Thread.java:745)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)