Ilya Shishkov created IGNITE-20514:
--------------------------------------

             Summary: Transaction becomes stuck after GridNearTxFinishRequest was lost
                 Key: IGNITE-20514
                 URL: https://issues.apache.org/jira/browse/IGNITE-20514
             Project: Ignite
          Issue Type: Bug
            Reporter: Ilya Shishkov


In case of network failures, we can get into a situation where a 
{{GridNearTxFinishRequest}} sent from the transaction coordinator (near node) 
is lost. 
For example:
{noformat:title=Near node - handshake failed}
2023-09-19 11:49:55.504 [WARN ] 
[org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi] 
[tcp-comm-worker-#1%Node%NodeName%-#138%Node%NodeName%] - Handshake timed out 
(will stop attempts to perform the handshake)
...
addr=/10.10.10.9:47100, failureDetectionTimeoutEnabled=true, timeout=28441]
{noformat}
{noformat:title=Near node - failed to send message}
2023-09-19 11:49:55.539 [ERROR] 
[org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi] 
[sys-stripe-39-#40%Node%NodeName%] - Failed to send message to remote node 
[node=TcpDiscoveryNode [id=537f0a80-cef0-44df-a082-2fd6652e3eee, 
consistentId=host.name, addrs=ArrayList [10.10.10.9, 127.0.0.1],
...
msg=GridNearTxFinishRequest
...
ver=GridCacheVersion [topVer=306492927, order=1695095952984, nodeOrder=63, 
dataCenterId=0]
...
org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node 
still alive?). Make sure that each ComputeTask and cache Transaction has a 
timeout set in order to prevent parties from waiting forever in case of network 
issues [nodeId=537f0a80-cef0-44df-a082-2fd6652e3eee, addrs=[/10.10.10.9:47100]]
        at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createNioSession(GridNioServerWrapper.java:565)
        at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createTcpClient(GridNioServerWrapper.java:693)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:1181)
        at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createTcpClient(GridNioServerWrapper.java:691)
        at org.apache.ignite.spi.communication.tcp.internal.ConnectionClientPool.createCommunicationClient(ConnectionClientPool.java:442)
        at org.apache.ignite.spi.communication.tcp.internal.ConnectionClientPool.reserveClient(ConnectionClientPool.java:231)
        at org.apache.ignite.spi.communication.tcp.internal.CommunicationWorker.processDisconnect(CommunicationWorker.java:376)
        at org.apache.ignite.spi.communication.tcp.internal.CommunicationWorker.body(CommunicationWorker.java:174)
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125)
        at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$3.body(TcpCommunicationSpi.java:848)
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)
Caused by: org.apache.ignite.spi.IgniteSpiOperationTimeoutException: Failed to 
perform handshake due to timeout (consider increasing 'connectionTimeout' 
configuration property).
        at org.apache.ignite.spi.communication.tcp.internal.CommunicationTcpUtils.handshakeTimeoutException(CommunicationTcpUtils.java:156)
        at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.safeTcpHandshake(GridNioServerWrapper.java:1197)
        at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createNioSession(GridNioServerWrapper.java:485)
        ... 10 common frames omitted
{noformat}
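After such a failure, a long-running transaction (LRT) remains on the primary 
and backup nodes and never rolls itself back. To stop the transaction, *_you 
have to kill it explicitly_* via {{control.sh}}. In the example below, the 
transaction is still in the {{PREPARED}} state with {{timedOut=false}}, even 
though its duration is far greater than its timeout.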
{noformat:title=LRT on primary}
2023-09-19 12:23:39.915 [WARN ][sys-#115589][org.apache.ignite.internal.diagnostic] >>> Transaction 
[startTime=11:49:16,483, curTime=12:23:39,913, tx=GridDhtTxLocal 
...
nearXidVer=GridCacheVersion [topVer=306492927, order=1695095952984, 
nodeOrder=63, dataCenterId=0]
...
isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, 
timeout=300000
...
state=PREPARED, 
timedOut=false, 
...
duration=2063430ms
...
{noformat}
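
The two log excerpts above can be cross-checked mechanically. The sketch below is only an illustration and is not part of Ignite: the class {{StuckTxCheck}} and its helper methods are hypothetical names. It assumes that {{startTime}} and {{curTime}} fall on the same day and that {{timeout}} is expressed in milliseconds. It checks that the {{ver}} of the lost {{GridNearTxFinishRequest}} matches the {{nearXidVer}} of the stuck transaction, and that the transaction has outlived its timeout.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for correlating the log excerpts; not an Ignite class. */
final class StuckTxCheck {
    /** Time-of-day format of startTime/curTime in the diagnostic entry, e.g. 11:49:16,483. */
    private static final DateTimeFormatter LOG_TIME =
        DateTimeFormatter.ofPattern("HH:mm:ss,SSS");

    /** Fields of a GridCacheVersion as printed in the logs; \s* tolerates wrapped lines. */
    private static final Pattern VERSION =
        Pattern.compile("topVer=(\\d+),\\s*order=(\\d+),\\s*nodeOrder=(\\d+)");

    /** Milliseconds between two log timestamps (assumption: both are on the same day). */
    public static long elapsedMillis(String startTime, String curTime) {
        LocalTime start = LocalTime.parse(startTime, LOG_TIME);
        LocalTime cur = LocalTime.parse(curTime, LOG_TIME);
        return Duration.between(start, cur).toMillis();
    }

    /** Builds a comparable key "topVer/order/nodeOrder" from a logged version string. */
    public static String versionKey(String logText) {
        Matcher m = VERSION.matcher(logText);
        if (!m.find())
            throw new IllegalArgumentException("No GridCacheVersion in: " + logText);
        return m.group(1) + "/" + m.group(2) + "/" + m.group(3);
    }

    public static void main(String[] args) {
        // Values copied from the near node and primary node excerpts above.
        String nearVer = "ver=GridCacheVersion [topVer=306492927, "
            + "order=1695095952984, nodeOrder=63, dataCenterId=0]";
        String primaryVer = "nearXidVer=GridCacheVersion [topVer=306492927, "
            + "order=1695095952984, nodeOrder=63, dataCenterId=0]";
        long timeoutMs = 300000L; // assumption: the logged timeout is in milliseconds

        long elapsed = elapsedMillis("11:49:16,483", "12:23:39,913");

        System.out.println("same transaction: "
            + versionKey(nearVer).equals(versionKey(primaryVer)));
        System.out.println("elapsed ms: " + elapsed);
        System.out.println("timeout exceeded: " + (elapsed > timeoutMs));
    }
}
```

For the values above, the computed elapsed time equals the logged {{duration}} of 2063430 ms, which is well beyond the 300000 ms timeout, while the log still reports {{timedOut=false}}.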



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
