[ 
https://issues.apache.org/jira/browse/IGNITE-20514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Shishkov updated IGNITE-20514:
-----------------------------------
    Labels:   (was: ise)

> Transaction becomes stuck after GridNearTxFinishRequest was lost
> ----------------------------------------------------------------
>
>                 Key: IGNITE-20514
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20514
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.15
>            Reporter: Ilya Shishkov
>            Priority: Major
>         Attachments: IGNITE-20514_NearFinishRequestDelayTest.patch
>
>
> A network failure can cause the {{GridNearTxFinishRequest}} sent by the 
> transaction coordinator (near node) to be lost. 
> For example:
> {code:title=Near node - handshake failed}
> 2023-09-19 11:49:55.504 [WARN ] 
> [org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi] 
> [tcp-comm-worker-#1%Node%NodeName%-#138%Node%NodeName%] - Handshake timed out 
> (will stop attempts to perform the handshake)
> ...
> addr=/10.10.10.9:47100, failureDetectionTimeoutEnabled=true, timeout=28441]
> {code}
> {code:title=Near node - failed to send message}
> 2023-09-19 11:49:55.539 [ERROR] 
> [org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi] 
> [sys-stripe-39-#40%Node%NodeName%] - Failed to send message to remote node 
> [node=TcpDiscoveryNode [id=537f0a80-cef0-44df-a082-2fd6652e3eee, 
> consistentId=host.name, addrs=ArrayList [10.10.10.9, 127.0.0.1],
> ...
> msg=GridNearTxFinishRequest
> ...
> ver=GridCacheVersion [topVer=306492927, order=1695095952984, nodeOrder=63, 
> dataCenterId=0]
> ...
> org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node 
> still alive?). Make sure that each ComputeTask and cache Transaction has a 
> timeout set in order to prevent parties from waiting forever in case of 
> network issues [nodeId=537f0a80-cef0-44df-a082-2fd6652e3eee, 
> addrs=[/10.10.10.9:47100]]
>       at 
> org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createNioSession(GridNioServerWrapper.java:565)
>       at 
> org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createTcpClient(GridNioServerWrapper.java:693)
>       at 
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:1181)
>       at 
> org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createTcpClient(GridNioServerWrapper.java:691)
>       at 
> org.apache.ignite.spi.communication.tcp.internal.ConnectionClientPool.createCommunicationClient(ConnectionClientPool.java:442)
>       at 
> org.apache.ignite.spi.communication.tcp.internal.ConnectionClientPool.reserveClient(ConnectionClientPool.java:231)
>       at 
> org.apache.ignite.spi.communication.tcp.internal.CommunicationWorker.processDisconnect(CommunicationWorker.java:376)
>       at 
> org.apache.ignite.spi.communication.tcp.internal.CommunicationWorker.body(CommunicationWorker.java:174)
>       at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125)
>       at 
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$3.body(TcpCommunicationSpi.java:848)
>       at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)
> Caused by: org.apache.ignite.spi.IgniteSpiOperationTimeoutException: Failed 
> to perform handshake due to timeout (consider increasing 'connectionTimeout' 
> configuration property).
>       at 
> org.apache.ignite.spi.communication.tcp.internal.CommunicationTcpUtils.handshakeTimeoutException(CommunicationTcpUtils.java:156)
>       at 
> org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.safeTcpHandshake(GridNioServerWrapper.java:1197)
>       at 
> org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createNioSession(GridNioServerWrapper.java:485)
>       ... 10 common frames omitted
> {code}
> After such a message is logged, a long-running transaction (LRT) remains on 
> the primary and backup nodes and does not roll back on its own. To stop the 
> transaction, *_you have to kill it explicitly_* via {{control.sh}}.
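> For illustration, killing the transaction might look like the sketch below. 
> It assumes the {{--tx}} command of {{control.sh}} with its {{--min-duration}}, 
> {{--xid}} and {{--kill}} options; {{<xid>}} is a placeholder for the 
> identifier of the stuck transaction and has to be taken from the listing.
> {code:title=Killing the stuck transaction (sketch)}
> # List transactions running longer than 300 seconds to find the stuck one
> control.sh --tx --min-duration 300
> # Kill it by XID (<xid> is a placeholder, not a real value)
> control.sh --tx --xid <xid> --kill
> {code}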
> {code:title=LRT on primary}
> 2023-09-19 12:23:39.915 [WARN 
> ][sys-#115589][org.apache.ignite.internal.diagnostic] >>> Transaction 
> [startTime=11:49:16,483, curTime=12:23:39,913, tx=GridDhtTxLocal 
> ...
> nearXidVer=GridCacheVersion [topVer=306492927, order=1695095952984, 
> nodeOrder=63, dataCenterId=0]
> ...
> isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, 
> timeout=300000
> ...
> state=PREPARED, 
> timedOut=false, 
> ...
> duration=2063430ms
> ...
> {code}
> ----
> *Some points:*
> # The transaction is stuck in the PREPARED state.
> # The transaction was not rolled back on timeout during the finish phase.
> # The LRT goes away if the near node restarts, because of two-phase commit 
> recovery.
> *Reproducer:*  [^IGNITE-20514_NearFinishRequestDelayTest.patch] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
