[ 
https://issues.apache.org/jira/browse/HDFS-9079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16151980#comment-16151980
 ] 

Kai Zheng commented on HDFS-9079:
---------------------------------

Thanks [~zhz] for the ping.

It looks like the latest patch is broken now, so the first step would be to 
rebase your work so far.

Did you notice any failure cases related to this? Or is there a way to 
reproduce the issue we're trying to fix here?

I've noticed quite a few failures in trunk recently that look like the one 
below, and I'm not sure whether they're related to this. Could you help check? 
Thanks!

{noformat}
Regression

org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure060.testBlockTokenExpired

Failing for the past 1 build (Since Failed#20973 )
Took 3.2 sec.
Error Message

expected:<1001> but was:<1002>
Stacktrace

java.lang.AssertionError: expected:<1001> but was:<1002>
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.failNotEquals(Assert.java:743)
        at org.junit.Assert.assertEquals(Assert.java:118)
        at org.junit.Assert.assertEquals(Assert.java:555)
        at org.junit.Assert.assertEquals(Assert.java:542)
        at org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure.runTest(TestDFSStripedOutputStreamWithFailure.java:517)
        at org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure.testBlockTokenExpired(TestDFSStripedOutputStreamWithFailure.java:273)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{noformat}

> Erasure coding: preallocate multiple generation stamps and serialize updates 
> from data streamers
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-9079
>                 URL: https://issues.apache.org/jira/browse/HDFS-9079
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: erasure-coding
>    Affects Versions: HDFS-7285
>            Reporter: Zhe Zhang
>              Labels: hdfs-ec-3.0-nice-to-have
>         Attachments: HDFS-9079.01.patch, HDFS-9079.02.patch, 
> HDFS-9079.03.patch, HDFS-9079.04.patch, HDFS-9079.05.patch, 
> HDFS-9079.06.patch, HDFS-9079.07.patch, HDFS-9079.08.patch, 
> HDFS-9079.09.patch, HDFS-9079.10.patch, HDFS-9079.11.patch, 
> HDFS-9079.12.patch, HDFS-9079.13.patch, HDFS-9079.14.patch, 
> HDFS-9079.15.patch, HDFS-9079-HDFS-7285.00.patch
>
>
> A non-striped DataStreamer goes through the following steps in error handling:
> {code}
> 1) Finds error => 2) Asks NN for new GS => 3) Gets new GS from NN => 4) 
> Applies new GS to DN (createBlockOutputStream) => 5) Ack from DN => 6) 
> Updates block on NN
> {code}
> With multiple streamer threads running in parallel, we need to correctly handle a 
> large number of possible combinations of interleaved thread events. For 
> example, {{streamer_B}} starts step 2 in between events {{streamer_A.2}} and 
> {{streamer_A.3}}.
> HDFS-9040 moves steps 1, 2, 3, 6 from streamer to {{DFSStripedOutputStream}}. 
> This JIRA proposes some further optimizations based on HDFS-9040:
> # We can preallocate GS when NN creates a new striped block group 
> ({{FSN#createNewBlock}}). For each new striped block group we can reserve 
> {{NUM_PARITY_BLOCKS}} GS's. If more than {{NUM_PARITY_BLOCKS}} errors have 
> happened we shouldn't try to further recover anyway.
> # We can use a dedicated event processor to offload the error handling logic 
> from {{DFSStripedOutputStream}}, which is not a long running daemon.
> # We can limit the lifespan of a streamer to be a single block. A streamer 
> ends either after finishing the current block or when encountering a DN 
> failure.
> With the proposed change, a {{StripedDataStreamer}}'s flow becomes:
> {code}
> 1) Finds DN error => 2) Notifies coordinator (async, not waiting for response) 
> => terminates
> 1) Finds external error => 2) Applies new GS to DN (createBlockOutputStream) 
> => 3) Ack from DN => 4) Notifies coordinator (async, not waiting for response)
> {code}
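
To make the quoted proposal easier to discuss, here is a minimal, self-contained sketch of the "preallocate GS and serialize updates" idea. It is not HDFS code: the class name {{PreallocatedGenStamps}}, its methods, and the assumption that the reserved generation stamps form a contiguous range starting right after the block group's initial GS are all hypothetical and chosen only for illustration. A single-threaded executor stands in for the coordinator, so failure reports from concurrent streamers are applied one at a time.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch (not HDFS code): per-block-group pool of reserved generation stamps. */
class PreallocatedGenStamps {
  private final long firstGs;   // GS assigned when the block group was created
  private final int reserved;   // extra GS's reserved up front, e.g. NUM_PARITY_BLOCKS
  private int used = 0;         // only touched on the coordinator thread

  // One thread processes all reports, which serializes updates from streamers.
  private final ExecutorService coordinator = Executors.newSingleThreadExecutor();

  PreallocatedGenStamps(long firstGs, int reserved) {
    this.firstGs = firstGs;
    this.reserved = reserved;
  }

  /**
   * Called by a streamer that hit a DN failure. The streamer does not have to
   * wait on the returned Future; it can terminate right after reporting.
   * Assumption: reserved GS's are firstGs + 1 .. firstGs + reserved.
   */
  Future<Long> reportFailure(int streamerIndex) {
    return coordinator.submit(() -> {
      if (used >= reserved) {
        // More failures than parity blocks: further recovery is pointless.
        throw new IllegalStateException(
            "streamer " + streamerIndex + ": no reserved GS left");
      }
      used++;
      return firstGs + used;
    });
  }

  /** Blocks until all earlier reports are processed, then returns the latest GS in use. */
  long latestGs() {
    try {
      return coordinator.submit(() -> firstGs + used).get();
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    }
  }

  void shutdown() {
    coordinator.shutdown();
  }
}
```

With three reserved stamps, two failure reports advance the GS twice without any NameNode round trip, and a fourth report would be rejected. Whether the real patch hands out stamps this way is something the patch itself would have to confirm.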



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
