[ 
https://issues.apache.org/jira/browse/IGNITE-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754021#comment-17754021
 ] 

Denis Chudov commented on IGNITE-16700:
---------------------------------------

This test creates 2 * CPU_COUNT threads, and each thread repeatedly runs 
transactions that transfer money from one account to another. The number of 
accounts is close to the number of threads. In effect, it is a load test, 
since it exposes performance problems. The test fails with replication 
timeout exceptions, but those exceptions have different causes:
 * upsert operation timeouts: these are caused by long waits for lock 
acquisition under high contention, combined with locks being released after 
cleanup. Each key can therefore have a queue of waiters trying to acquire its 
lock, each of which waits for tx cleanup.

 * timeouts of any command: there seem to be problems with the RocksDB log 
storage and the storage flush in RocksDbSharedLogStorage#commitWriteBatch. With 
a batch size of several hundred bytes, the db put operation can take over a 
second. After adding logging for flushes that took over 100 ms, I see many such 
records in the log.
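
The kind of timing used to spot slow flushes can be sketched as follows. This 
is a minimal illustration, not the code added to RocksDbSharedLogStorage: the 
SlowOpTimer class and its timed method are hypothetical names, and the 
callback stands in for whatever logger is used.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;

/** Hypothetical helper (not part of Ignite): reports operations slower than a threshold. */
final class SlowOpTimer {
    /** Threshold mentioned in the comment: report anything that takes over 100 ms. */
    static final long DEFAULT_THRESHOLD_MS = 100;

    /**
     * Runs the action and returns its duration in milliseconds.
     * The onSlow callback receives the duration only if it exceeds thresholdMs.
     */
    static long timed(Runnable action, long thresholdMs, LongConsumer onSlow) {
        long start = System.nanoTime();
        action.run();
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        if (elapsedMs > thresholdMs) {
            onSlow.accept(elapsedMs);
        }
        return elapsedMs;
    }

    public static void main(String[] args) {
        // Thread.sleep stands in for a slow storage write.
        timed(() -> {
            try {
                Thread.sleep(150);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, DEFAULT_THRESHOLD_MS, ms -> System.out.println("Slow operation took " + ms + " ms"));
    }
}
```

Wrapping the write call in such a timer keeps the measurement local to the 
suspected operation, so slow records can be told apart from waits elsewhere in 
the pipeline.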

If I turn off fsync for the Raft log and increase the number of accounts 
tenfold, the failure rate of the test drops drastically (no failures in 600 
runs, compared with 1 failure per ~25 runs without the fixes). The problem with 
Raft storage needs a separate ticket.
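
To illustrate why the number of accounts matters, here is a plain-Java 
stand-in for the transfer workload. It is only a sketch under assumptions: it 
uses ReentrantLock per account instead of Ignite transactions, and the class 
name, initial balance and transfer count are made up for the example. It 
counts how often a thread finds a lock already taken.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

/** Plain-Java stand-in for the transfer workload; it does not use Ignite transactions. */
final class TransferContentionSketch {
    /** Returns {total balance after the run, number of lock acquisitions that had to wait}. */
    static long[] run(int accounts, int threads, int transfersPerThread) {
        long[] balances = new long[accounts];
        ReentrantLock[] locks = new ReentrantLock[accounts];
        for (int i = 0; i < accounts; i++) {
            balances[i] = 1_000; // assumed initial balance
            locks[i] = new ReentrantLock();
        }
        AtomicLong contended = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                ThreadLocalRandom rnd = ThreadLocalRandom.current();
                for (int i = 0; i < transfersPerThread; i++) {
                    int from = rnd.nextInt(accounts);
                    int to = rnd.nextInt(accounts);
                    if (from == to) {
                        continue;
                    }
                    // Lock in index order so that two transfers cannot deadlock.
                    ReentrantLock first = locks[Math.min(from, to)];
                    ReentrantLock second = locks[Math.max(from, to)];
                    lock(first, contended);
                    lock(second, contended);
                    try {
                        balances[from] -= 1;
                        balances[to] += 1;
                    } finally {
                        second.unlock();
                        first.unlock();
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workload did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        long total = 0;
        for (long balance : balances) {
            total += balance;
        }
        return new long[] {total, contended.get()};
    }

    /** Acquires the lock, counting the acquisition as contended if it was not free. */
    private static void lock(ReentrantLock lock, AtomicLong contended) {
        if (!lock.tryLock()) {
            contended.incrementAndGet();
            lock.lock();
        }
    }

    public static void main(String[] args) {
        int threads = 2 * Runtime.getRuntime().availableProcessors();
        long[] few = run(threads, threads, 100_000);
        long[] many = run(10 * threads, threads, 100_000);
        System.out.println("accounts = threads:      contended acquisitions = " + few[1]);
        System.out.println("accounts = 10 * threads: contended acquisitions = " + many[1]);
    }
}
```

With ten times more accounts, two concurrent transfers are less likely to pick 
the same key, so the contended count is typically much lower. The total 
balance stays constant in both runs, which is the invariant that testBalance 
checks.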

> ItTxDistributedTestThreeNodesThreeReplicas#testBalance is flaky
> ---------------------------------------------------------------
>
>                 Key: IGNITE-16700
>                 URL: https://issues.apache.org/jira/browse/IGNITE-16700
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mirza Aliev
>            Assignee: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>         Attachments: _Integration_Tests_Module_Table_2055.log, 
> _Integration_Tests_Module_Table_2098.log
>
>
> {{ItTxDistributedTestThreeNodesThreeReplicas#testBalance}} periodically fails 
> with 
> {noformat}
> org.apache.ignite.lang.IgniteException
> org.apache.ignite.lang.IgniteException: java.util.concurrent.TimeoutException 
> ==> expected: <true> but was: <false>
> {noformat}
> We've noticed that the test became flaky after IGNITE-16393 was merged. 
> Probably, the current problem is related to the problem with stopping 
> executors for the network's user object serialization threads (IGNITE-16699), 
> since the logs are full of warnings from IGNITE-16699.
> The plan for this ticket is to wait for IGNITE-16699 to be fixed and check 
> whether this issue is still reproducible. 
> https://ci.ignite.apache.org/buildConfiguration/ignite3_Test_IntegrationTests_ModuleTable/6466138
> UPD: Ticket IGNITE-16699 has been fixed, but the current ticket is still 
> reproducible, so the problem is not related to IGNITE-16699.
> The logs contain a suspicious message; we need to investigate whether it is 
> related to the problem. Actual run: 
> https://ci.ignite.apache.org/buildConfiguration/ignite3_Test_RunAllTests/6470268
> The actual logs are attached.
> {noformat}
> 2022-03-18 10:29:33:399 +0300 [INFO][%ItTxDistributedTestSingleNode_null_20000%JRaft-FSMCaller-Disruptor-_stripe_35-0][ActionRequestProcessor] Error occurred on a user's state machine
> class org.apache.ignite.tx.TransactionException: Failed to enlist a key into a transaction, state=ABORTED
>   at org.apache.ignite.internal.table.distributed.raft.PartitionListener.tryEnlistIntoTransaction(PartitionListener.java:196)
>   at org.apache.ignite.internal.table.distributed.raft.PartitionListener.lambda$onWrite$1(PartitionListener.java:134)
>   at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
>   at org.apache.ignite.internal.table.distributed.raft.PartitionListener.onWrite(PartitionListener.java:131)
>   at org.apache.ignite.internal.raft.server.impl.JraftServerImpl$DelegatingStateMachine.onApply(JraftServerImpl.java:415)
>   at org.apache.ignite.raft.jraft.core.FSMCallerImpl.doApplyTasks(FSMCallerImpl.java:539)
>   at org.apache.ignite.raft.jraft.core.FSMCallerImpl.doCommitted(FSMCallerImpl.java:507)
>   at org.apache.ignite.raft.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:437)
>   at org.apache.ignite.raft.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:134)
>   at org.apache.ignite.raft.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:128)
>   at org.apache.ignite.raft.jraft.disruptor.StripedDisruptor$StripeEntryHandler.onEvent(StripedDisruptor.java:215)
>   at org.apache.ignite.raft.jraft.disruptor.StripedDisruptor$StripeEntryHandler.onEvent(StripedDisruptor.java:179)
>   at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
>   at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
