[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800425#comment-17800425 ]
Vipul Thakur commented on IGNITE-21059: --------------------------------------- Hi [~zstan] *PFB for another such scenario from the logs file. This is the same kind of logs.[^ignite-server-nohup.out] , may be if you search for it in this file you will get more context for you to observe.* [2023-11-30T14:21:46,945][ERROR][grid-timeout-worker-#326%EVENT_PROCESSING%][GridDhtColocatedCache] <EventCustomerCache> Failed to acquire lock for request: GridNearLockRequest [topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0], miniId=1, dhtVers=GridCacheVersion[] [null], taskNameHash=0, createTtl=-1, accessTtl=-1, flags=3, txLbl=null, filter=null, super=GridDistributedLockRequest [nodeId=62fdf256-6130-4ef3-842c-b2078f6e6c07, nearXidVer=GridCacheVersion [topVer=312674007, order=1701333641522, nodeOrder=53, dataCenterId=0], threadId=567, futId=13db6212c81-c17f568a-3419-42a6-9042-7a1f3281301c, timeout=30000, isInTx=true, isInvalidate=false, isRead=true, isolation=REPEATABLE_READ, retVals=[true], txSize=0, flags=0, keysCnt=1, super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=312674007, order=1701333641522, nodeOrder=53, dataCenterId=0], committedVers=null, rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=-885490198, super=GridCacheMessage [msgId=55445052, depInfo=null, lastAffChangedTopVer=AffinityTopologyVersion [topVer=53, minorTopVer=0], err=null, skipPrepare=false]]]]] org.apache.ignite.internal.transactions.IgniteTxTimeoutCheckedException: Failed to acquire lock within provided timeout for transaction [timeout=30000, tx=GridDhtTxLocal[xid=5f4b66f1c81-00000000-12a3-06d7-0000-000000000001, xidVersion=GridCacheVersion [topVer=312674007, order=1701333873909, nodeOrder=1, dataCenterId=0], nearXidVersion=GridCacheVersion [topVer=312674007, order=1701333641522, nodeOrder=53, dataCenterId=0], concurrency=PESSIMISTIC, isolation=REPEATABLE_READ, state=MARKED_ROLLBACK, invalidate=false, rollbackOnly=true, nodeId=f751efe5-c44c-4b3c-bcd3-dd5866ec0bdd, timeout=30000, startTime=1701334276938, duration=30003]] at org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter$PostLockClosure1.apply(IgniteTxLocalAdapter.java:1798) ~[ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter$PostLockClosure1.apply(IgniteTxLocalAdapter.java:1746) ~[ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridEmbeddedFuture$2.applyx(GridEmbeddedFuture.java:86) ~[ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridEmbeddedFuture$AsyncListener1.apply(GridEmbeddedFuture.java:292) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridEmbeddedFuture$AsyncListener1.apply(GridEmbeddedFuture.java:285) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:464) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridFutureAdapter.unblock(GridFutureAdapter.java:348) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridFutureAdapter.unblockAll(GridFutureAdapter.java:336) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:576) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.processors.cache.GridCacheCompoundIdentityFuture.onDone(GridCacheCompoundIdentityFuture.java:56) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:555) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture.onComplete(GridDhtLockFuture.java:807) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture.access$900(GridDhtLockFuture.java:93) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture$LockTimeoutObject.onTimeout(GridDhtLockFuture.java:1207) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:234) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125) [ignite-core-2.14.0.jar:2.14.0] at java.lang.Thread.run(Thread.java:750) [?:1.8.0_351] 2023-11-30T14:21:46,945][ERROR][grid-timeout-worker-#326%EVENT_PROCESSING%][GridDhtColocatedCache] <EventCustomerCache> Failed to acquire lock for request: GridNearLockRequest [topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0], miniId=1, dhtVers=GridCacheVersion[] [null], taskNameHash=0, createTtl=-1, accessTtl=-1, flags=3, txLbl=null, filter=null, super=GridDistributedLockRequest [nodeId=62fdf256-6130-4ef3-842c-b2078f6e6c07, nearXidVer=GridCacheVersion [topVer=312674007, order=1701333641522, nodeOrder=53, dataCenterId=0], threadId=567, futId=13db6212c81-c17f568a-3419-42a6-9042-7a1f3281301c, timeout=30000, isInTx=true, isInvalidate=false, isRead=true, isolation=REPEATABLE_READ, retVals=[true], txSize=0, flags=0, keysCnt=1, super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=312674007, order=1701333641522, nodeOrder=53, dataCenterId=0], committedVers=null, rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=-885490198, super=GridCacheMessage [msgId=55445052, depInfo=null, lastAffChangedTopVer=AffinityTopologyVersion [topVer=53, minorTopVer=0], err=null, skipPrepare=false]]]]] 2023-11-30T14:21:48,343][INFO ][db-checkpoint-thread-#762%EVENT_PROCESSING%][Checkpointer] Skipping checkpoint (no pages were modified) [checkpointBeforeLockTime=1ms, checkpointLockWait=0ms, checkpointListenersExecuteTime=1ms, checkpointLockHoldTime=4ms, reason='timeout'] 2023-11-30T14:21:55,776][WARN ][sys-#18981%EVENT_PROCESSING%][diagnostic] First 10 long running transactions [total=15] 2023-11-30T14:21:55,776][WARN ][sys-#18981%EVENT_PROCESSING%][diagnostic] >>> Transaction [startTime=14:08:02,187, curTime=14:21:55,743, tx=GridDhtTxRemote [nearNodeId=46de26cf-a0d2-4a99-b278-734eb44f1330, rmtFutId=6f2623f1c81-c87b49b4-cfa3-4140-b9e7-69e2a57b6830, nearXidVer=GridCacheVersion [topVer=312674007, order=1701333633815, nodeOrder=52, dataCenterId=0], storeWriteThrough=false, super=GridDistributedTxRemoteAdapter [explicitVers=null, started=true, commitAllowed=0, txState=IgniteTxRemoteSingleStateImpl [entry=IgniteTxEntry [txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=81, val=9022054169, hasValBytes=true], cacheId=1760547175], val=TxEntryValueHolder [val=com.jio.digitalapi.eventprocessing.persistent.datamodel.ServiceEntity [idHash=23146844, hash=-1542268073, serviceState=TERMINATED, accountId=100962392318, timeLastModifed=2023-11-30T14:08:02.188, customerId=1487385033, entityId=9022054169, timeCreated=2020-05-06T01:11:23.682, serviceId=9022054169, productType=1, serviceCharingIds=919022054169|405864090495748|405864144962514], op=UPDATE], prevVal=TxEntryValueHolder [val=null, op=NOOP], oldVal=TxEntryValueHolder [val=null, op=NOOP], entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null, filters=CacheEntryPredicate[] [], filtersPassed=false, filtersSet=false, entry=GridDhtCacheEntry [rdrs=ReaderId[] [], part=81, super=GridDistributedCacheEntry [super=GridCacheMapEntry [key=KeyCacheObjectImpl [part=81, val=9022054169, hasValBytes=true], val=com.jio.digitalapi.eventprocessing.persistent.datamodel.ServiceEntity [idHash=814654353, hash=-1899253163, serviceState=SUSPENDED, accountId=100962392318, timeLastModifed=2023-11-21T13:00:14.527, customerId=1487385033, entityId=9022054169, timeCreated=2020-05-06T01:11:23.682, serviceId=9022054169, productType=1, serviceCharingIds=919022054169|405864090495748|405864144962514], ver=GridCacheVersion [topVer=310345733, order=1701229328778, nodeOrder=4, dataCenterId=0], hash=621622236, extras=GridCacheMvccEntryExtras [mvcc=GridCacheMvcc [locs=null, rmts=LinkedList [GridCacheMvccCandidate [nodeId=ca92e2c7-b8ed-42c8-a89b-748815640c78, ver=GridCacheVersion [topVer=312674007, order=1701333639730, nodeOrder=4, dataCenterId=0], threadId=117, id=47917537, topVer=AffinityTopologyVersion [topVer=-1, minorTopVer=0], reentry=null, otherNodeId=46de26cf-a0d2-4a99-b278-734eb44f1330, otherVer=null, mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null, serOrder=null, key=KeyCacheObjectImpl [part=81, val=9022054169, hasValBytes=true], masks=local=0|owner=0|ready=0|reentry=0|used=0|tx=1|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0, prevVer=null, nextVer=null]]]], flags=2]]], prepared=1, locked=false, nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false, flags=0, partUpdateCntr=26980702, serReadVer=null, xidVer=null]], txLbl=null, super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=312674007, order=1701333639730, nodeOrder=4, dataCenterId=0], writeVer=GridCacheVersion [topVer=312674007, order=1701333639731, nodeOrder=4, dataCenterId=0], implicit=false, loc=false, threadId=117, startTime=1701333482187, nodeId=ca92e2c7-b8ed-42c8-a89b-748815640c78, isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, timeout=30000, sysInvalidate=false, sys=false, plc=2, commitVer=null, finalizing=NONE, invalidParts=null, state=PREPARED, timedOut=false, topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0], mvccSnapshot=null, skipCompletedVers=false, parentTx=null, duration=833556ms, onePhaseCommit=false]]]] > We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running > cache operations > -------------------------------------------------------------------------------------------- > > Key: IGNITE-21059 > URL: https://issues.apache.org/jira/browse/IGNITE-21059 > Project: Ignite > Issue Type: Bug > Components: binary, clients > Affects Versions: 2.14 > Reporter: Vipul Thakur > Priority: Critical > Attachments: Ignite_server_logs.zip, cache-config-1.xml, > client-service.zip, > digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, > digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, > digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, > digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, > digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, > ignite-server-nohup-1.out, ignite-server-nohup.out, long_txn_.png > > > We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in > production environment where cluster would go in hang state due to partition > map exchange. > Please find the below ticket which i created a while back for ignite 2.7.6 > https://issues.apache.org/jira/browse/IGNITE-13298 > So we migrated the apache ignite version to 2.14 and upgrade happened > smoothly but on the third day we could see cluster traffic dip again. > We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 > TB SDD. > PFB for the attached config.[I have added it as attachment for review] > I have also added the server logs from the same time when issue happened. > We have set txn timeout as well as socket timeout both at server and client > end for our write operations but seems like sometimes cluster goes into hang > state and all our get calls are stuck and slowly everything starts to freeze > our jms listener threads and every thread reaches a choked up state in > sometime. > Due to which our read services which does not even use txn to retrieve data > also starts to choke. Ultimately leading to end user traffic dip. > We were hoping product upgrade will help but that has not been the case till > now. > > > > > > -- This message was sent by Atlassian Jira (v8.20.10#820010)