[
https://issues.apache.org/jira/browse/IGNITE-4003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15780621#comment-15780621
]
Semen Boikov commented on IGNITE-4003:
--------------------------------------
Hi Andrey,
I started the review and so far have found two issues (I left TODOs in the
branch); please take a look:
- The 'onTimeout' method is called from a dedicated thread (one thread per
node), so it is not safe to retry the connect from this callback.
- When the connect is cancelled by ConnectionTimeoutObject, you call
'recoveryDesc.release()' from the comm worker. At that moment the session may
already be created and in use, so calling 'release' can break something. The
descriptor release needs to move to the NIO thread.
Thanks!
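
A minimal sketch of the hand-off pattern behind the two points above, using
only java.util.concurrent. It is not Ignite code: 'Descriptor', 'runDemo', and
the thread names 'timeout-thread' and 'owner-thread' are hypothetical
stand-ins for the recovery descriptor, the dedicated timeout thread, and the
NIO thread. The timeout callback does no work itself; it only queues the
release on the single thread that owns the session.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class TimeoutHandoffSketch {
    /** Hypothetical stand-in for a recovery descriptor; records the releasing thread. */
    static final class Descriptor {
        final AtomicReference<String> releasedBy = new AtomicReference<>();

        void release() {
            releasedBy.set(Thread.currentThread().getName());
        }
    }

    /** Fires a timeout on one thread and returns the name of the thread that released. */
    static String runDemo() {
        // Assumption: one thread owns the session state (the NIO thread's role).
        ExecutorService owner =
            Executors.newSingleThreadExecutor(r -> new Thread(r, "owner-thread"));
        // Assumption: one dedicated thread fires all timeout callbacks.
        ScheduledExecutorService timeouts =
            Executors.newSingleThreadScheduledExecutor(r -> new Thread(r, "timeout-thread"));

        Descriptor desc = new Descriptor();
        CountDownLatch released = new CountDownLatch(1);

        try {
            // The callback stays short: no retry, no release, only a hand-off.
            Runnable onTimeout = () -> owner.execute(() -> {
                desc.release();
                released.countDown();
            });

            timeouts.schedule(onTimeout, 10, TimeUnit.MILLISECONDS);

            if (!released.await(5, TimeUnit.SECONDS))
                throw new IllegalStateException("Release was not executed in time");

            return desc.releasedBy.get();
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        finally {
            timeouts.shutdownNow();
            owner.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("released on " + runDemo());
    }
}
```

Because the release always runs on the owner thread, it cannot race with code
on that thread that is already using the session, and the timeout thread is
never blocked by a connect retry.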
> Slow or faulty client can stall the whole cluster.
> --------------------------------------------------
>
> Key: IGNITE-4003
> URL: https://issues.apache.org/jira/browse/IGNITE-4003
> Project: Ignite
> Issue Type: Bug
> Components: cache, general
> Affects Versions: 1.7
> Reporter: Vladimir Ozerov
> Assignee: Andrey Gura
> Priority: Critical
> Fix For: 2.0
>
>
> Steps to reproduce:
> 1) Start two server nodes and put some data into the cache.
> 2) Start a client from a Docker subnet that is not visible from the outside.
> The client will join the cluster.
> 3) Try to put something into the cache or start another node to force a
> rebalance. The cluster gets stuck at this point. Root cause: the servers
> constantly try to establish an outgoing connection to the client but fail
> because the Docker subnet is not visible from the outside. This may stall
> virtually all cluster operations.
> Typical thread dump:
> {code}
> org.apache.ignite.IgniteCheckedException: Failed to send message (node may
> have left the grid or TCP connection cannot be established due to firewall
> issues) [node=TcpDiscoveryNode [id=a15d74c2-1ec2-4349-9640-aeacd70d8714,
> addrs=[127.0.0.1, 172.17.0.6], sockAddrs=[/127.0.0.1:0, /127.0.0.1:0,
> /172.17.0.6:0], discPort=0, order=7241, intOrder=3707,
> lastExchangeTime=1474096941045, loc=false, ver=1.5.23#20160526-sha1:259146da,
> isClient=true], topic=T4 [topic=TOPIC_CACHE,
> id1=949732fd-1360-3a58-8d9e-0ff6ea6182cc,
> id2=a15d74c2-1ec2-4349-9640-aeacd70d8714, id3=2], msg=GridContinuousMessage
> [type=MSG_EVT_NOTIFICATION, routineId=7e13c48e-6933-48b2-9f15-8d92007930db,
> data=null, futId=null], policy=2]
> at
> org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1129)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:1347)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1227)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1198)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1180)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendNotification(GridContinuousProcessor.java:841)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:800)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryHandler.onEntryUpdate(CacheContinuousQueryHandler.java:787)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryHandler.access$700(CacheContinuousQueryHandler.java:91)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryHandler$1.onEntryUpdated(CacheContinuousQueryHandler.java:412)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryManager.onEntryUpdated(CacheContinuousQueryManager.java:343)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryManager.onEntryUpdated(CacheContinuousQueryManager.java:250)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.GridCacheMapEntry.initialValue(GridCacheMapEntry.java:3476)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtForceKeysFuture$MiniFuture.onResult(GridDhtForceKeysFuture.java:548)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtForceKeysFuture.onResult(GridDhtForceKeysFuture.java:207)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.processForceKeyResponse(GridDhtPreloader.java:636)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.access$1000(GridDhtPreloader.java:81)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.onMessage(GridDhtPreloader.java:202)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.onMessage(GridDhtPreloader.java:200)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$MessageHandler.apply(GridDhtPreloader.java:877)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$MessageHandler.apply(GridDhtPreloader.java:859)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:582)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:280)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:204)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$000(GridCacheIoManager.java:80)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:163)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1058)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:836)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.managers.communication.GridIoManager.access$1700(GridIoManager.java:104)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.managers.communication.GridIoManager$5.run(GridIoManager.java:799)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [na:1.8.0_51]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [na:1.8.0_51]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_51]
> Caused by: org.apache.ignite.spi.IgniteSpiException: Failed to send message
> to remote node: TcpDiscoveryNode [id=a15d74c2-1ec2-4349-9640-aeacd70d8714,
> addrs=[127.0.0.1, 172.17.0.6], sockAddrs=[/127.0.0.1:0, /127.0.0.1:0,
> /172.17.0.6:0], discPort=0, order=7241, intOrder=3707,
> lastExchangeTime=1474096941045, loc=false, ver=1.5.23#20160526-sha1:259146da,
> isClient=true]
> at
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:1986)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:1926)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1124)
> [ignite-core-1.5.23.jar:1.5.23]
> ... 32 common frames omitted
> Caused by: org.apache.ignite.IgniteCheckedException: Failed to connect to
> node (is node still alive?). Make sure that each GridComputeTask and
> GridCacheTransaction has a timeout set in order to prevent parties from
> waiting forever in case of network issues
> [nodeId=a15d74c2-1ec2-4349-9640-aeacd70d8714, addrs=[/172.17.0.6:47100,
> /127.0.0.1:47100]]
> at
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2489)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2130)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2024)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:1960)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:1926)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1124)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:1347)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1227)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1198)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1180)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendNotification(GridContinuousProcessor.java:841)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:800)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryHandler.onEntryUpdate(CacheContinuousQueryHandler.java:787)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryHandler.access$700(CacheContinuousQueryHandler.java:91)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryHandler$1.onEntryUpdated(CacheContinuousQueryHandler.java:412)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryManager.onEntryUpdated(CacheContinuousQueryManager.java:343)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.query.continuous.CacheContinuousQueryManager.onEntryUpdated(CacheContinuousQueryManager.java:250)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.GridCacheMapEntry.initialValue(GridCacheMapEntry.java:3476)
> [ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture$MiniFuture.onResult(GridDhtLockFuture.java:1213)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture.onResult(GridDhtLockFuture.java:529)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTransactionalCacheAdapter.processDhtLockResponse(GridDhtTransactionalCacheAdapter.java:639)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTransactionalCacheAdapter.access$100(GridDhtTransactionalCacheAdapter.java:89)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTransactionalCacheAdapter$5.apply(GridDhtTransactionalCacheAdapter.java:151)
> ~[ignite-core-1.5.23.jar:1.5.23]
> at
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTransactionalCacheAdapter$5.apply(GridDhtTransactionalCacheAdapter.java:149)
> ~[ignite-core-1.5.23.jar:1.5.23]
> ... 12 common frames omitted
> Suppressed: org.apache.ignite.IgniteCheckedException: Failed to connect
> to address: /172.17.0.6:47100
> at
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2494)
> ~[ignite-core-1.5.23.jar:1.5.23]
> ... 35 common frames omitted
> Caused by: java.net.SocketTimeoutException: null
> at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:118)
> at
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2353)
> ... 35 common frames omitted
> Suppressed: org.apache.ignite.IgniteCheckedException: Failed to connect
> to address: /127.0.0.1:47100
> at
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2494)
> ~[ignite-core-1.5.23.jar:1.5.23]
> ... 35 common frames omitted
> Caused by: org.apache.ignite.IgniteCheckedException: Remote node ID is
> not as expected [expected=a15d74c2-1ec2-4349-9640-aeacd70d8714,
> rcvd=48cccf25-7c29-4048-bd52-704acdb552e6]
> at
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.safeHandshake(TcpCommunicationSpi.java:2604)
> at
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2361)
> ... 35 common frames omitted
> {code}