[
https://issues.apache.org/jira/browse/HDDS-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17121130#comment-17121130
]
Isa Hekmatizadeh commented on HDDS-3379:
----------------------------------------
This is my understanding of the problem; please correct me if I'm wrong:
* The "write" method in KeyOutputStream (line 187) calls "handleWrite" with
retry=false, so it does not expect handleWrite to retry on failure.
(Is that OK? Should this method always pass retry as false?)
* The handleWrite method in KeyOutputStream (line 204) calls the cluster at two
points. The first is:
{code:java}
BlockOutputStreamEntry current =
blockOutputStreamEntryPool.allocateBlockIfNeeded();{code}
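which asks the OM to allocate a new block if needed (the stack trace below
shows the call going through
OzoneManagerProtocolClientSideTranslatorPB.allocateBlock), and the second is: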
{code:java}
int writtenLength =
writeToOutputStream(current, retry, len, b, expectedWriteLen,
off, currentPos);{code}
which writes the byte array into the buffer and passes it on to the DataNode
if needed.
The second call handles retries internally in the writeToOutputStream method,
but the first one does not retry if it hits an exception while allocating a
new block.
This raises two questions:
* Is the write method in KeyOutputStream supposed to retry, or should it pass
any exception to its client to handle?
* Is allocateBlock in the handleWrite method within the scope of retry? In
other words, should handleWrite retry block allocation when its retry
parameter is true?
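To make the second question concrete, here is a minimal sketch of what
retrying the allocation could look like. It is not Ozone code:
AllocateRetrySketch, IoCall and callWithRetry are hypothetical names made up
for illustration, and a real fix would presumably reuse the client's existing
retry policy (with backoff and failover) rather than a plain loop.

```java
import java.io.IOException;

/** Hypothetical sketch only; none of these names exist in Ozone. */
final class AllocateRetrySketch {

  /** Stand-in for a call such as allocateBlockIfNeeded() that may throw. */
  @FunctionalInterface
  interface IoCall<T> {
    T call() throws IOException;
  }

  /**
   * Runs the call once. When retry is true, repeats it on IOException up to
   * maxAttempts attempts in total. When retry is false, the first failure is
   * passed straight to the caller, which matches the behaviour described
   * above for the allocation step.
   */
  static <T> T callWithRetry(IoCall<T> call, boolean retry, int maxAttempts)
      throws IOException {
    int attempts = retry ? Math.max(1, maxAttempts) : 1;
    IOException last = null;
    for (int i = 0; i < attempts; i++) {
      try {
        return call.call();
      } catch (IOException e) {
        last = e; // remember the failure; loop again if attempts remain
      }
    }
    throw last; // every attempt failed, surface the most recent exception
  }
}
```

With something like this, handleWrite could wrap the allocation as
callWithRetry(() -> pool.allocateBlockIfNeeded(), retry, n), where pool and n
are again placeholders, so that the retry parameter covers both calls.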
I would be glad to work on this issue if someone can answer these questions
and make things clearer to me.
> Clients unable to failover after the OzoneManager leader is restart in
> MiniOzoneChaosCluster
> --------------------------------------------------------------------------------------------
>
> Key: HDDS-3379
> URL: https://issues.apache.org/jira/browse/HDDS-3379
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: Ozone Manager
> Reporter: Mukul Kumar Singh
> Priority: Major
> Labels: MiniOzoneChaosCluster
>
> Clients unable to failover after the OzoneManager leader is restart in
> MiniOzoneChaosCluster.
> This happens after the following restart events.
> {code}
> ➜ chaos-2020-04-11-21-51-52-IST egrep "iniOzoneHAClusterImp|Failures"
> complete.log
> 2020-04-11 21:52:08,296
> [org.apache.hadoop.ozone.TestMiniChaosOzoneCluster.main()] INFO
> ozone.MiniOzoneHAClusterImpl
> (MiniOzoneHAClusterImpl.java:createOMService(373)) - Started OzoneManager RPC
> server at localhost/127.0.0.1:10804
> 2020-04-11 21:52:08,387
> [org.apache.hadoop.ozone.TestMiniChaosOzoneCluster.main()] INFO
> ozone.MiniOzoneHAClusterImpl
> (MiniOzoneHAClusterImpl.java:createOMService(373)) - Started OzoneManager RPC
> server at localhost/127.0.0.1:10810
> 2020-04-11 21:52:08,485
> [org.apache.hadoop.ozone.TestMiniChaosOzoneCluster.main()] INFO
> ozone.MiniOzoneHAClusterImpl
> (MiniOzoneHAClusterImpl.java:createOMService(373)) - Started OzoneManager RPC
> server at localhost/127.0.0.1:10816
> 2020-04-11 21:52:22,845
> [org.apache.hadoop.ozone.TestMiniChaosOzoneCluster.main()] INFO
> failure.Failures (FailureManager.java:start(66)) - starting failure manager
> 60 60 SECONDS
> 2020-04-11 21:53:22,850 [pool-59-thread-1] INFO failure.Failures
> (FailureManager.java:fail(56)) - time failure with OzoneManagerRestartFailure
> 2020-04-11 21:53:22,853 [pool-59-thread-1] INFO ozone.MiniOzoneHAClusterImpl
> (MiniOzoneHAClusterImpl.java:shutdownOzoneManager(211)) - Shutting down
> OzoneManager omNode-3
> 2020-04-11 21:53:22,988 [pool-59-thread-1] INFO ozone.MiniOzoneHAClusterImpl
> (MiniOzoneHAClusterImpl.java:restartOzoneManager(228)) - Restarting
> OzoneManager omNode-3
> at
> org.apache.hadoop.ozone.MiniOzoneHAClusterImpl.restartOzoneManager(MiniOzoneHAClusterImpl.java:229)
> at
> org.apache.hadoop.ozone.MiniOzoneHAClusterImpl.restartOzoneManager(MiniOzoneHAClusterImpl.java:223)
> at
> org.apache.hadoop.ozone.failure.Failures$OzoneManagerRestartFailure.lambda$fail$0(Failures.java:101)
> at
> org.apache.hadoop.ozone.failure.Failures$OzoneManagerRestartFailure.fail(Failures.java:98)
> 2020-04-11 21:54:22,849 [pool-59-thread-1] INFO failure.Failures
> (FailureManager.java:fail(56)) - time failure with OzoneManagerRestartFailure
> 2020-04-11 21:54:22,850 [pool-59-thread-1] INFO ozone.MiniOzoneHAClusterImpl
> (MiniOzoneHAClusterImpl.java:shutdownOzoneManager(211)) - Shutting down
> OzoneManager omNode-1
> 2020-04-11 21:54:22,895 [pool-59-thread-1] INFO ozone.MiniOzoneHAClusterImpl
> (MiniOzoneHAClusterImpl.java:restartOzoneManager(228)) - Restarting
> OzoneManager omNode-1
> at
> org.apache.hadoop.ozone.MiniOzoneHAClusterImpl.restartOzoneManager(MiniOzoneHAClusterImpl.java:229)
> at
> org.apache.hadoop.ozone.MiniOzoneHAClusterImpl.restartOzoneManager(MiniOzoneHAClusterImpl.java:223)
> at
> org.apache.hadoop.ozone.failure.Failures$OzoneManagerRestartFailure.lambda$fail$0(Failures.java:101)
> at
> org.apache.hadoop.ozone.failure.Failures$OzoneManagerRestartFailure.fail(Failures.java:98)
> ➜ chaos-2020-04-11-21-51-52-IST
> {code}
> This results in the following exception.
> {code}
> 2020-04-11 21:54:24,201 [pool-360-thread-4] ERROR
> loadgenerators.LoadExecutors (LoadExecutors.java:load(67)) -
> FilesystemLoadGenerator LOADGEN: Exiting due to exception
> java.io.IOException: java.io.IOException: Could not determine or connect to
> OM Leader.
> at
> org.apache.hadoop.ozone.client.io.KeyOutputStream.handleWrite(KeyOutputStream.java:229)
> at
> org.apache.hadoop.ozone.client.io.KeyOutputStream.write(KeyOutputStream.java:199)
> at
> org.apache.hadoop.fs.ozone.OzoneFSOutputStream.write(OzoneFSOutputStream.java:46)
> at
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:57)
> at java.io.DataOutputStream.write(DataOutputStream.java:107)
> at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
> at
> org.apache.hadoop.ozone.utils.LoadBucket$WriteOp.doPostOp(LoadBucket.java:176)
> at
> org.apache.hadoop.ozone.utils.LoadBucket$Op.execute(LoadBucket.java:132)
> at
> org.apache.hadoop.ozone.utils.LoadBucket$WriteOp.execute(LoadBucket.java:153)
> at
> org.apache.hadoop.ozone.utils.LoadBucket.writeKey(LoadBucket.java:76)
> at
> org.apache.hadoop.ozone.loadgenerators.FilesystemLoadGenerator.generateLoad(FilesystemLoadGenerator.java:47)
> at
> org.apache.hadoop.ozone.loadgenerators.LoadExecutors.load(LoadExecutors.java:65)
> at
> org.apache.hadoop.ozone.loadgenerators.LoadExecutors.lambda$startLoad$0(LoadExecutors.java:89)
> at
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Could not determine or connect to OM Leader.
> at
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.submitRequest(OzoneManagerProtocolClientSideTranslatorPB.java:429)
> at
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.allocateBlock(OzoneManagerProtocolClientSideTranslatorPB.java:843)
> at sun.reflect.GeneratedMethodAccessor80.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:71)
> at com.sun.proxy.$Proxy65.allocateBlock(Unknown Source)
> at
> org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.allocateNewBlock(BlockOutputStreamEntryPool.java:281)
> at
> org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.allocateBlockIfNeeded(BlockOutputStreamEntryPool.java:327)
> at
> org.apache.hadoop.ozone.client.io.KeyOutputStream.handleWrite(KeyOutputStream.java:208)
> {code}