[
https://issues.apache.org/jira/browse/IGNITE-20514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ilya Shishkov updated IGNITE-20514:
-----------------------------------
Labels: (was: ise)
> Transaction becomes stuck after GridNearTxFinishRequest was lost
> ----------------------------------------------------------------
>
> Key: IGNITE-20514
> URL: https://issues.apache.org/jira/browse/IGNITE-20514
> Project: Ignite
> Issue Type: Bug
> Affects Versions: 2.15
> Reporter: Ilya Shishkov
> Priority: Major
> Attachments: IGNITE-20514_NearFinishRequestDelayTest.patch
>
>
> In case of network failures, we can get into a situation where a
> {{GridNearTxFinishRequest}} sent from the transaction coordinator (near
> node) is lost.
> For example:
> {code:title=Near node - handshake failed}
> 2023-09-19 11:49:55.504 [WARN ]
> [org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi]
> [tcp-comm-worker-#1%Node%NodeName%-#138%Node%NodeName%] - Handshake timed out
> (will stop attempts to perform the handshake)
> ...
> addr=/10.10.10.9:47100, failureDetectionTimeoutEnabled=true, timeout=28441]
> {code}
> {code:title=Near node - failed to send message}
> 2023-09-19 11:49:55.539 [ERROR]
> [org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi]
> [sys-stripe-39-#40%Node%NodeName%] - Failed to send message to remote node
> [node=TcpDiscoveryNode [id=537f0a80-cef0-44df-a082-2fd6652e3eee,
> consistentId=host.name, addrs=ArrayList [10.10.10.9, 127.0.0.1],
> ...
> msg=GridNearTxFinishRequest
> ...
> ver=GridCacheVersion [topVer=306492927, order=1695095952984, nodeOrder=63,
> dataCenterId=0]
> ...
> org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node
> still alive?). Make sure that each ComputeTask and cache Transaction has a
> timeout set in order to prevent parties from waiting forever in case of
> network issues [nodeId=537f0a80-cef0-44df-a082-2fd6652e3eee,
> addrs=[/10.10.10.9:47100]]
> at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createNioSession(GridNioServerWrapper.java:565)
> at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createTcpClient(GridNioServerWrapper.java:693)
> at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:1181)
> at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createTcpClient(GridNioServerWrapper.java:691)
> at org.apache.ignite.spi.communication.tcp.internal.ConnectionClientPool.createCommunicationClient(ConnectionClientPool.java:442)
> at org.apache.ignite.spi.communication.tcp.internal.ConnectionClientPool.reserveClient(ConnectionClientPool.java:231)
> at org.apache.ignite.spi.communication.tcp.internal.CommunicationWorker.processDisconnect(CommunicationWorker.java:376)
> at org.apache.ignite.spi.communication.tcp.internal.CommunicationWorker.body(CommunicationWorker.java:174)
> at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125)
> at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$3.body(TcpCommunicationSpi.java:848)
> at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)
> Caused by: org.apache.ignite.spi.IgniteSpiOperationTimeoutException: Failed
> to perform handshake due to timeout (consider increasing 'connectionTimeout'
> configuration property).
> at org.apache.ignite.spi.communication.tcp.internal.CommunicationTcpUtils.handshakeTimeoutException(CommunicationTcpUtils.java:156)
> at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.safeTcpHandshake(GridNioServerWrapper.java:1197)
> at org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createNioSession(GridNioServerWrapper.java:485)
> ... 10 common frames omitted
> {code}
> After such a message, you will get a long-running transaction (LRT) on the
> primary and backups that will not roll back by itself. To stop the
> transaction, *_you have to kill it explicitly_* via {{control.sh}}.
> {code:title=LRT on primary}
> 2023-09-19 12:23:39.915 [WARN ][sys-#115589][org.apache.ignite.internal.diagnostic] >>> Transaction
> [startTime=11:49:16,483, curTime=12:23:39,913, tx=GridDhtTxLocal
> ...
> nearXidVer=GridCacheVersion [topVer=306492927, order=1695095952984,
> nodeOrder=63, dataCenterId=0]
> ...
> isolation=REPEATABLE_READ, concurrency=PESSIMISTIC,
> timeout=300000
> ...
> state=PREPARED,
> timedOut=false,
> ...
> duration=2063430ms
> ...
> {code}
> ----
> *Some points:*
> # The transaction is stuck in the PREPARED state.
> # The transaction was not rolled back after its timeout expired during the
> finish phase.
> # The LRT goes away if the near node restarts, because of two-phase commit
> recovery.
> *Reproducer:* [^IGNITE-20514_NearFinishRequestDelayTest.patch]
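
The numbers in the "LRT on primary" log can be cross-checked with a little arithmetic. The sketch below is only an illustration and is not part of the attached reproducer: the class name {{LrtDurationCheck}} and the helper {{elapsedMillis}} are hypothetical, and it assumes that {{startTime}} and {{curTime}} fall on the same day and use the {{HH:mm:ss,SSS}} layout seen in the log. It compares the elapsed time with the configured {{timeout=300000}}.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

public class LrtDurationCheck {
    // Time layout as printed in the diagnostic log above (assumed: HH:mm:ss,SSS).
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("HH:mm:ss,SSS");

    // Hypothetical helper: milliseconds between two same-day timestamps.
    static long elapsedMillis(String startTime, String curTime) {
        LocalTime start = LocalTime.parse(startTime, FMT);
        LocalTime cur = LocalTime.parse(curTime, FMT);
        return Duration.between(start, cur).toMillis();
    }

    public static void main(String[] args) {
        long timeoutMs = 300_000L; // timeout=300000 from the log

        // startTime and curTime values copied from the log.
        long elapsed = elapsedMillis("11:49:16,483", "12:23:39,913");

        System.out.println("elapsed=" + elapsed + "ms"
            + ", timeout=" + timeoutMs + "ms"
            + ", exceeded=" + (elapsed > timeoutMs));
    }
}
```

The computed value agrees with {{duration=2063430ms}} in the log, which is far beyond the 300000 ms timeout, while the transaction still reports {{timedOut=false}}.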