[
https://issues.apache.org/jira/browse/HDDS-6115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17775623#comment-17775623
]
Raju Balpande edited comment on HDDS-6115 at 10/16/23 9:28 AM:
---------------------------------------------------------------
After I added the logs I observed a pattern in failure cases that a method
RaftLogIndex.updateIncreasingly() which is part of ratis library is not been
called or can say missed or can say been delayed to call.
I tried to debug the flow and observed it been called by a separate thread
which is triggered but doesn't have surety to complete before our code gets
finished.
The call log I see is as follows:
{code:java}
updateIncreasingly:68, RaftLogIndex (org.apache.ratis.server.raftlog)
updateFlushedIndexIncreasingly:429, SegmentedRaftLogWorker
(org.apache.ratis.server.raftlog.segmented)
updateFlushedIndexIncreasingly:424, SegmentedRaftLogWorker
(org.apache.ratis.server.raftlog.segmented)
flushIfNecessary:391, SegmentedRaftLogWorker
(org.apache.ratis.server.raftlog.segmented)
access$1200:61, SegmentedRaftLogWorker
(org.apache.ratis.server.raftlog.segmented)
execute:566, SegmentedRaftLogWorker$WriteLog
(org.apache.ratis.server.raftlog.segmented)
run:316, SegmentedRaftLogWorker (org.apache.ratis.server.raftlog.segmented)
run:-1, 0x0000000000000000
(org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$$Lambda$780)
call:511, Executors$RunnableAdapter (java.util.concurrent)
run$$$capture:266, FutureTask (java.util.concurrent)
run:-1, FutureTask (java.util.concurrent)
- Async stack trace
<init>:151, FutureTask (java.util.concurrent)
newTaskFor:87, AbstractExecutorService (java.util.concurrent)
submit:111, AbstractExecutorService (java.util.concurrent)
submit:678, Executors$DelegatedExecutorService (java.util.concurrent)
start:242, SegmentedRaftLogWorker (org.apache.ratis.server.raftlog.segmented)
openImpl:216, SegmentedRaftLog (org.apache.ratis.server.raftlog.segmented)
open:251, RaftLogBase (org.apache.ratis.server.raftlog)
initRaftLog:239, ServerState (org.apache.ratis.server.impl)
initRaftLog:220, ServerState (org.apache.ratis.server.impl)
lambda$new$5:161, ServerState (org.apache.ratis.server.impl)
get:-1, ServerState$$Lambda$678/0x0000000000000000
(org.apache.ratis.server.impl)
get:62, MemoizedSupplier (org.apache.ratis.util)
initialize:177, ServerState (org.apache.ratis.server.impl)
start:338, RaftServerImpl (org.apache.ratis.server.impl)
accept:-1, RaftServerProxy$$Lambda$723/0x0000000000000000
(org.apache.ratis.server.impl)
accept:188, ConcurrentUtils (org.apache.ratis.util)
lambda$null$4:180, ConcurrentUtils (org.apache.ratis.util)
run:-1, ConcurrentUtils$$Lambda$645/0x0000000000000000 (org.apache.ratis.util)
runWorker:1149, ThreadPoolExecutor (java.util.concurrent)
run:624, ThreadPoolExecutor$Worker (java.util.concurrent)
run:823, Thread (java.lang) {code}
To prove my assumption I introduced some wait in main thread and as I see it
worked fine for 10splits*4iterations in
[https://github.com/raju-balpande/apache_ozone/actions/runs/6511303376]
was (Author: JIRAUSER296391):
After I added the logs I observed a pattern in failure cases that a method
RaftLogIndex.updateIncreasingly() which is part of ratis library is not been
called or can say missed or can say been delayed to call.
I tried to debug the flow and observed it been called by a separate thread
which is triggered but doesn't have surety to complete before our code gets
finished.
The call log I see is as follows:
{code:java}
call:511, Executors$RunnableAdapter (java.util.concurrent)
run$$$capture:266, FutureTask (java.util.concurrent)
run:-1, FutureTask (java.util.concurrent)
- Async stack trace
<init>:151, FutureTask (java.util.concurrent)
newTaskFor:87, AbstractExecutorService (java.util.concurrent)
submit:111, AbstractExecutorService (java.util.concurrent)
submit:678, Executors$DelegatedExecutorService (java.util.concurrent)
start:242, SegmentedRaftLogWorker (org.apache.ratis.server.raftlog.segmented)
openImpl:216, SegmentedRaftLog (org.apache.ratis.server.raftlog.segmented)
open:251, RaftLogBase (org.apache.ratis.server.raftlog)
initRaftLog:239, ServerState (org.apache.ratis.server.impl)
initRaftLog:220, ServerState (org.apache.ratis.server.impl)
lambda$new$5:161, ServerState (org.apache.ratis.server.impl)
get:-1, ServerState$$Lambda$678/0x0000000000000000
(org.apache.ratis.server.impl)
get:62, MemoizedSupplier (org.apache.ratis.util)
initialize:177, ServerState (org.apache.ratis.server.impl)
start:338, RaftServerImpl (org.apache.ratis.server.impl)
accept:-1, RaftServerProxy$$Lambda$723/0x0000000000000000
(org.apache.ratis.server.impl)
accept:188, ConcurrentUtils (org.apache.ratis.util)
lambda$null$4:180, ConcurrentUtils (org.apache.ratis.util)
run:-1, ConcurrentUtils$$Lambda$645/0x0000000000000000 (org.apache.ratis.util)
runWorker:1149, ThreadPoolExecutor (java.util.concurrent)
run:624, ThreadPoolExecutor$Worker (java.util.concurrent)
run:823, Thread (java.lang) {code}
To prove my assumption I introduced some wait in main thread and as I see it
worked fine for 10splits*4iterations in
[https://github.com/raju-balpande/apache_ozone/actions/runs/6511303376]
> Intermittent failure in
> TestContainerStateMachineFailures#testApplyTransactionIdempotencyWithClosedContainer
> ------------------------------------------------------------------------------------------------------------
>
> Key: HDDS-6115
> URL: https://issues.apache.org/jira/browse/HDDS-6115
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: test
> Reporter: Attila Doroszlai
> Assignee: Raju Balpande
> Priority: Major
>
> {code:title=https://github.com/elek/ozone-build-results/blob/master/2021/12/08/11979/it-client/hadoop-ozone/integration-test/org.apache.hadoop.ozone.client.rpc.TestContainerStateMachineFailures.txt}
> Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 138.65 s <<<
> FAILURE! - in
> org.apache.hadoop.ozone.client.rpc.TestContainerStateMachineFailures
> testApplyTransactionIdempotencyWithClosedContainer Time elapsed: 1.13 s <<<
> FAILURE!
> java.lang.AssertionError
> at org.junit.Assert.fail(Assert.java:87)
> at org.junit.Assert.assertTrue(Assert.java:42)
> at org.junit.Assert.assertFalse(Assert.java:65)
> at org.junit.Assert.assertFalse(Assert.java:75)
> at
> org.apache.hadoop.ozone.client.rpc.TestContainerStateMachineFailures.testApplyTransactionIdempotencyWithClosedContainer(TestContainerStateMachineFailures.java:465)
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]