[ https://issues.apache.org/jira/browse/HDFS-7678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524353#comment-14524353 ]

Andrew Wang commented on HDFS-7678:
-----------------------------------

Nits / comment requests:

* Extra (unused) imports in DFSStripedInputStream
* I noticed these config keys aren't present in hdfs-default.xml, which is 
where we document the behavior of config keys. I bring it up because I was 
thinking about the semantic difference between the "overall timeout" and the 
"per-op timeout". Since there are a couple of EC keys that need documentation, 
I'm okay punting this to a quick follow-on.
* These same keys could also use javadoc on the DFSClientConf getters, so we 
don't have to hunt through hdfs-default.xml or read how the code uses the 
configs (a sketch of what I mean follows the ascii art below)
* The importance of the largest read portion is never explained, and there's 
no high-level comment on the overall flow of the recovery logic. Some ascii 
art would help clarify all this; I drew something like this:

{noformat}
      +--+  +--+  |  +--+  +--+
+--+  |  |  |  |  |  |  |  |  |
|  |  |  |  +--+  |  |  |  |  |
+--+  +--+        |  +--+  +--+
                  |
 d1    d2    d3   |   p1    p2
{noformat}

This way it becomes obvious that if d1 fails, you need to lengthen the read to 
d3 in addition to grabbing the parity. The TODO to optimize the read is also 
clarified: we always calculate the max span and always do max-span reads for 
recovery, even though we might only need the part of d3 that's unfetched.

Related: I could see printing the "plan" like this being nice for debugging. 
It would also be a nice illustration for the test cases.
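
For instance, a throwaway sketch of such a printer (class, method, and schema 
here are all invented for illustration, not taken from the patch):

{noformat}
/** Sketch of a debug printer for a striped read "plan". */
class ReadPlanPrinter {
  /**
   * spans[i] = number of cells to read from internal block i,
   * or -1 if that block is dead/skipped.
   */
  static String render(int[] spans, int dataBlocks) {
    int maxSpan = 0;
    for (int s : spans) {
      maxSpan = Math.max(maxSpan, s);
    }
    StringBuilder sb = new StringBuilder();
    for (int row = 0; row < maxSpan; row++) {
      for (int i = 0; i < spans.length; i++) {
        if (i == dataBlocks) {
          sb.append(" | ");  // separate data blocks from parity
        }
        sb.append(spans[i] < 0 ? " XX " : row < spans[i] ? " ## " : " .. ");
      }
      sb.append('\n');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // The recovery case above: d1 dead, two cells each from d2, d3, p1.
    System.out.print(render(new int[] {-1, 2, 2, 2, 0}, 3));
  }
}
{noformat}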
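
And back on the javadoc nit: something this small on the DFSClientConf getters 
would be enough. Getter names, key semantics, and defaults below are invented 
for illustration, not the patch's actual ones:

{noformat}
/** Stand-in for the DFSClientConf additions; names are made up. */
class DFSClientConfSketch {
  private final long stripedReadOverallTimeoutMs = 9000;
  private final long stripedReadPerOpTimeoutMs = 3000;

  /**
   * @return Overall time budget in ms for one striped read, covering the
   * initial reads plus any recovery reads. Once it elapses we should fail
   * the read rather than schedule more recovery.
   */
  long getStripedReadOverallTimeoutMs() {
    return stripedReadOverallTimeoutMs;
  }

  /**
   * @return Time budget in ms for a single read of one internal block.
   * On expiry we move on to a recovery read while the overall clock
   * keeps running.
   */
  long getStripedReadPerOpTimeoutMs() {
    return stripedReadPerOpTimeoutMs;
  }
}
{noformat}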

Code:
* AFAICT we don't respect deadNodes in scheduleOneStripedRead, or during the 
initial planning phase. I mentioned this a bit in person, but we also need to 
think about the dead node marking policy for EC reads, since the timeout is 
aggressive.
* I don't understand the overall timeout handling: even if the overall timeout 
has elapsed, we still go through the switch and schedule recovery reads. Seems 
like something we should unit test.
* We also probably still want a per-operation timeout in addition to the 
overall timeout: if the per-op timeout fires, we keep trying to recover until 
the overall timeout runs out (or we run out of blocks). See the sketch after 
this list.
* The InterruptedException catch should probably rethrow an 
InterruptedIOException or similar; the idea is to get out of this function 
ASAP while still cleaning up state.
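
To make the last two bullets concrete, the control flow I'd expect looks 
roughly like this. A sketch only, under invented names; the real patch has 
its own threadpool and result types:

{noformat}
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class StripedReadLoopSketch {
  void waitForReads(CompletionService<Void> readService, int pending,
      long overallTimeoutMs, long perOpTimeoutMs) throws IOException {
    long deadline = System.currentTimeMillis() + overallTimeoutMs;
    while (pending > 0) {
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        // Overall budget exhausted: fail here instead of falling
        // through and scheduling more recovery reads.
        throw new IOException("striped read timed out");
      }
      try {
        // Per-op timeout, capped by whatever overall budget remains.
        Future<Void> done = readService.poll(
            Math.min(perOpTimeoutMs, remaining), TimeUnit.MILLISECONDS);
        if (done == null) {
          // Per-op timeout fired: schedule a recovery read (skipping
          // anything in deadNodes) and keep waiting under the same
          // overall deadline.
          pending += scheduleRecoveryRead();
          continue;
        }
        done.get();
        pending--;
      } catch (ExecutionException e) {
        // A read failed outright; that's also a recovery case.
        pending += scheduleRecoveryRead() - 1;
      } catch (InterruptedException e) {
        // Get out ASAP, but as an IOException subtype so callers
        // still unwind and clean up state normally.
        Thread.currentThread().interrupt();
        throw new InterruptedIOException("interrupted while reading stripe");
      }
    }
  }

  private int scheduleRecoveryRead() {
    return 0; // placeholder; the real version must respect deadNodes
  }
}
{noformat}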

Recommended follow-on work:
* fetchBlockByteRange is an outsized function and could use some splitting.
* There's no abstraction here for cells or stripes; we're mucking directly 
with internal blocks and byte ranges. IMO all the logic should happen on these 
higher-level abstractions, and we turn them into blocks and byte ranges 
afterwards. Max-span, for instance, is a hack: we should calculate recovery 
reads cell-by-cell based on what we already have, then combine the cells for 
an internal block to form the actual read (first sketch after this list).
* One thought regarding the threadpool: what we really want is to limit 
concurrency per DN, i.e. a threadpool per DN. This works nicely with the read 
combining mentioned above, and goes one better, since it can coalesce recovery 
reads with already-queued work. This is like how disk IO queues work (second 
sketch after this list).
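
On the cell-by-cell planning, something like this is the shape I have in mind 
(sketch only; class and method names are invented):

{noformat}
import java.util.ArrayList;
import java.util.List;

/** Sketch: plan recovery reads cell-by-cell, then merge contiguous
 *  cells within one internal block into a single read. */
class RecoveryReadPlanner {
  /**
   * needed[c] is true iff cell c of this internal block must be
   * fetched for decoding; returns merged [start, end) cell ranges,
   * one read per range.
   */
  static List<int[]> mergeCells(boolean[] needed) {
    List<int[]> reads = new ArrayList<int[]>();
    int start = -1;
    for (int c = 0; c <= needed.length; c++) {
      boolean want = c < needed.length && needed[c];
      if (want && start < 0) {
        start = c;                         // open a new range
      } else if (!want && start >= 0) {
        reads.add(new int[] {start, c});   // close the current range
        start = -1;
      }
    }
    return reads;
  }
}
{noformat}

E.g. needed = {F, T, T, F, T} becomes two reads, cells [1, 3) and [4, 5), 
instead of one max-span read over everything.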
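
And for the per-DN threadpool, a minimal sketch (again, names invented):

{noformat}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Sketch: one small executor per DN, so per-DN concurrency is
 *  bounded and each DN's queued work is visible for coalescing. */
class PerDatanodeExecutors {
  private final ConcurrentMap<String, ExecutorService> pools =
      new ConcurrentHashMap<String, ExecutorService>();

  ExecutorService poolFor(String dnId) {
    // One thread per DN ~= one outstanding read per "disk queue";
    // recovery reads land in the same queue as already-planned
    // reads, so they can be combined before being issued.
    ExecutorService pool = pools.get(dnId);
    if (pool == null) {
      pool = Executors.newSingleThreadExecutor();
      ExecutorService prev = pools.putIfAbsent(dnId, pool);
      if (prev != null) {
        pool.shutdown();
        pool = prev;
      }
    }
    return pool;
  }
}
{noformat}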

> Erasure coding: DFSInputStream with decode functionality
> --------------------------------------------------------
>
>                 Key: HDFS-7678
>                 URL: https://issues.apache.org/jira/browse/HDFS-7678
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: HDFS-7285
>            Reporter: Li Bo
>            Assignee: Zhe Zhang
>         Attachments: BlockGroupReader.patch, HDFS-7678-HDFS-7285.002.patch, 
> HDFS-7678-HDFS-7285.003.patch, HDFS-7678-HDFS-7285.004.patch, 
> HDFS-7678-HDFS-7285.005.patch, HDFS-7678-HDFS-7285.006.patch, 
> HDFS-7678.000.patch, HDFS-7678.001.patch
>
>
> A block group reader will read data from a BlockGroup in either striped or 
> contiguous layout. Corrupt blocks may be known before reading (reported by 
> the namenode) or discovered during reading. The block group reader needs to 
> do decoding work when some blocks are found to be corrupt.


