[jira] [Commented] (HDDS-3379) Clients unable to failover after the OzoneManager leader is restart in MiniOzoneChaosCluster

Isa Hekmatizadeh (Jira) Mon, 01 Jun 2020 09:08:23 -0700


    [ 
https://issues.apache.org/jira/browse/HDDS-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17121130#comment-17121130
 ]


Isa Hekmatizadeh commented on HDDS-3379:
----------------------------------------

This is my understanding of the problem; please correct me if I'm wrong:
 * The "write" method in KeyOutputStream (line 187) calls "handleWrite" with 
retry=false, so it does not expect handleWrite to retry on failure 
(is that OK? Should the retry parameter always be false in this method?)
 * The handleWrite method in KeyOutputStream (line 204) calls the cluster at 
two points. The first:

{code:java}
BlockOutputStreamEntry current =
    blockOutputStreamEntryPool.allocateBlockIfNeeded();
{code}
which asks the OzoneManager (and, through it, the SCM) to allocate a new block 
if needed, and the second:
{code:java}
int writtenLength =
    writeToOutputStream(current, retry, len, b, expectedWriteLen,
        off, currentPos);
{code}
which writes the byte array into the buffer and, if needed, passes it on to 
the DataNode.

The second call handles retries internally in the writeToOutputStream method, 
but the first one does not retry if it hits an exception while allocating a 
new block.

This raises several questions:
 * Is the write method in KeyOutputStream supposed to retry, or should it pass 
any exception to its caller to handle?
 * Is allocateBlock in the handleWrite method within the scope of the retry? 
In other words, should handleWrite retry block allocation when its retry 
parameter is true?
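
To make the second question concrete, here is a minimal, self-contained sketch 
of the two behaviours. It is not the Ozone code: RetryScopeSketch, Call, 
FlakyAllocator, withRetry and the retryAllocate flag are all hypothetical 
names, and the real client would presumably need a failover-aware retry policy 
with backoff rather than this plain loop.

```java
import java.io.IOException;

public class RetryScopeSketch {

    /** Hypothetical stand-in for a remote call that may fail transiently. */
    interface Call<T> {
        T run() throws IOException;
    }

    /** Hypothetical allocator: fails the first N invocations, then succeeds. */
    static final class FlakyAllocator implements Call<String> {
        private int remainingFailures;
        int attempts = 0;

        FlakyAllocator(int failures) {
            this.remainingFailures = failures;
        }

        @Override
        public String run() throws IOException {
            attempts++;
            if (remainingFailures > 0) {
                remainingFailures--;
                throw new IOException("Could not determine or connect to OM Leader.");
            }
            return "block-1";
        }
    }

    /** Runs the call up to maxAttempts times and rethrows the last failure. */
    static <T> T withRetry(Call<T> call, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return call.run();
            } catch (IOException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }

    /**
     * Simplified model of handleWrite: allocate a block, then write to it.
     * With retryAllocate=false, an allocation failure propagates immediately,
     * which is the behaviour described in the comment above.
     */
    static String handleWrite(Call<String> allocator, boolean retryAllocate,
                              int maxAttempts) throws IOException {
        String block = retryAllocate
            ? withRetry(allocator, maxAttempts)
            : allocator.run();
        return "wrote to " + block;
    }

    public static void main(String[] args) throws IOException {
        try {
            handleWrite(new FlakyAllocator(1), false, 3);
        } catch (IOException e) {
            System.out.println("no retry: " + e.getMessage());
        }
        System.out.println("with retry: "
            + handleWrite(new FlakyAllocator(1), true, 3));
    }
}
```

In this sketch a single transient failure surfaces to the caller when 
allocation is outside the retry scope, and is absorbed when it is inside.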

I would be glad to work on this issue if someone can answer these questions 
and make things clearer for me.

> Clients unable to failover after the OzoneManager leader is restart in 
> MiniOzoneChaosCluster
> --------------------------------------------------------------------------------------------
>
>                 Key: HDDS-3379
>                 URL: https://issues.apache.org/jira/browse/HDDS-3379
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Manager
>            Reporter: Mukul Kumar Singh
>            Priority: Major
>              Labels: MiniOzoneChaosCluster
>
> Clients unable to failover after the OzoneManager leader is restart in 
> MiniOzoneChaosCluster.
> This happens after the following restart events.
> {code}
> ➜  chaos-2020-04-11-21-51-52-IST egrep "iniOzoneHAClusterImp|Failures" 
> complete.log
> 2020-04-11 21:52:08,296 
> [org.apache.hadoop.ozone.TestMiniChaosOzoneCluster.main()] INFO  
> ozone.MiniOzoneHAClusterImpl 
> (MiniOzoneHAClusterImpl.java:createOMService(373)) - Started OzoneManager RPC 
> server at localhost/127.0.0.1:10804
> 2020-04-11 21:52:08,387 
> [org.apache.hadoop.ozone.TestMiniChaosOzoneCluster.main()] INFO  
> ozone.MiniOzoneHAClusterImpl 
> (MiniOzoneHAClusterImpl.java:createOMService(373)) - Started OzoneManager RPC 
> server at localhost/127.0.0.1:10810
> 2020-04-11 21:52:08,485 
> [org.apache.hadoop.ozone.TestMiniChaosOzoneCluster.main()] INFO  
> ozone.MiniOzoneHAClusterImpl 
> (MiniOzoneHAClusterImpl.java:createOMService(373)) - Started OzoneManager RPC 
> server at localhost/127.0.0.1:10816
> 2020-04-11 21:52:22,845 
> [org.apache.hadoop.ozone.TestMiniChaosOzoneCluster.main()] INFO  
> failure.Failures (FailureManager.java:start(66)) - starting failure manager 
> 60 60 SECONDS
> 2020-04-11 21:53:22,850 [pool-59-thread-1] INFO  failure.Failures 
> (FailureManager.java:fail(56)) - time failure with OzoneManagerRestartFailure
> 2020-04-11 21:53:22,853 [pool-59-thread-1] INFO  ozone.MiniOzoneHAClusterImpl 
> (MiniOzoneHAClusterImpl.java:shutdownOzoneManager(211)) - Shutting down 
> OzoneManager omNode-3
> 2020-04-11 21:53:22,988 [pool-59-thread-1] INFO  ozone.MiniOzoneHAClusterImpl 
> (MiniOzoneHAClusterImpl.java:restartOzoneManager(228)) - Restarting 
> OzoneManager omNode-3
>       at 
> org.apache.hadoop.ozone.MiniOzoneHAClusterImpl.restartOzoneManager(MiniOzoneHAClusterImpl.java:229)
>       at 
> org.apache.hadoop.ozone.MiniOzoneHAClusterImpl.restartOzoneManager(MiniOzoneHAClusterImpl.java:223)
>       at 
> org.apache.hadoop.ozone.failure.Failures$OzoneManagerRestartFailure.lambda$fail$0(Failures.java:101)
>       at 
> org.apache.hadoop.ozone.failure.Failures$OzoneManagerRestartFailure.fail(Failures.java:98)
> 2020-04-11 21:54:22,849 [pool-59-thread-1] INFO  failure.Failures 
> (FailureManager.java:fail(56)) - time failure with OzoneManagerRestartFailure
> 2020-04-11 21:54:22,850 [pool-59-thread-1] INFO  ozone.MiniOzoneHAClusterImpl 
> (MiniOzoneHAClusterImpl.java:shutdownOzoneManager(211)) - Shutting down 
> OzoneManager omNode-1
> 2020-04-11 21:54:22,895 [pool-59-thread-1] INFO  ozone.MiniOzoneHAClusterImpl 
> (MiniOzoneHAClusterImpl.java:restartOzoneManager(228)) - Restarting 
> OzoneManager omNode-1
>       at 
> org.apache.hadoop.ozone.MiniOzoneHAClusterImpl.restartOzoneManager(MiniOzoneHAClusterImpl.java:229)
>       at 
> org.apache.hadoop.ozone.MiniOzoneHAClusterImpl.restartOzoneManager(MiniOzoneHAClusterImpl.java:223)
>       at 
> org.apache.hadoop.ozone.failure.Failures$OzoneManagerRestartFailure.lambda$fail$0(Failures.java:101)
>       at 
> org.apache.hadoop.ozone.failure.Failures$OzoneManagerRestartFailure.fail(Failures.java:98)
> ➜  chaos-2020-04-11-21-51-52-IST
> {code}
> This results in the following exception.
> {code}
> 2020-04-11 21:54:24,201 [pool-360-thread-4] ERROR 
> loadgenerators.LoadExecutors (LoadExecutors.java:load(67)) - 
> FilesystemLoadGenerator LOADGEN: Exiting due to exception
> java.io.IOException: java.io.IOException: Could not determine or connect to 
> OM Leader.
>         at 
> org.apache.hadoop.ozone.client.io.KeyOutputStream.handleWrite(KeyOutputStream.java:229)
>         at 
> org.apache.hadoop.ozone.client.io.KeyOutputStream.write(KeyOutputStream.java:199)
>         at 
> org.apache.hadoop.fs.ozone.OzoneFSOutputStream.write(OzoneFSOutputStream.java:46)
>         at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:57)
>         at java.io.DataOutputStream.write(DataOutputStream.java:107)
>         at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
>         at 
> org.apache.hadoop.ozone.utils.LoadBucket$WriteOp.doPostOp(LoadBucket.java:176)
>         at 
> org.apache.hadoop.ozone.utils.LoadBucket$Op.execute(LoadBucket.java:132)
>         at 
> org.apache.hadoop.ozone.utils.LoadBucket$WriteOp.execute(LoadBucket.java:153)
>         at 
> org.apache.hadoop.ozone.utils.LoadBucket.writeKey(LoadBucket.java:76)
>         at 
> org.apache.hadoop.ozone.loadgenerators.FilesystemLoadGenerator.generateLoad(FilesystemLoadGenerator.java:47)
>         at 
> org.apache.hadoop.ozone.loadgenerators.LoadExecutors.load(LoadExecutors.java:65)
>         at 
> org.apache.hadoop.ozone.loadgenerators.LoadExecutors.lambda$startLoad$0(LoadExecutors.java:89)
>         at 
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Could not determine or connect to OM Leader.
>         at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.submitRequest(OzoneManagerProtocolClientSideTranslatorPB.java:429)
>         at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.allocateBlock(OzoneManagerProtocolClientSideTranslatorPB.java:843)
>         at sun.reflect.GeneratedMethodAccessor80.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:71)
>         at com.sun.proxy.$Proxy65.allocateBlock(Unknown Source)
>         at 
> org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.allocateNewBlock(BlockOutputStreamEntryPool.java:281)
>         at 
> org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.allocateBlockIfNeeded(BlockOutputStreamEntryPool.java:327)
>         at 
> org.apache.hadoop.ozone.client.io.KeyOutputStream.handleWrite(KeyOutputStream.java:208)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
