[ 
https://issues.apache.org/jira/browse/HDFS-7678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531592#comment-14531592
 ] 

Zhe Zhang commented on HDFS-7678:
---------------------------------

Thanks Jing for the in-depth comments! The discussion is very helpful at this 
stage and I'm glad we are uncovering some of the fundamental trade-offs.

bq. The recovery + MaxPortion logic may have some issue since pread requires us 
to provide precise reading range to the BlockReader.
This is a very good catch. A simple fix is to sanity-check the read range 
against the internal block length before sending the request to the 
BlockReader; I will include it in the next rev.
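As an illustration of the sanity check described above (a hedged sketch only; the class and method names are hypothetical, not from the patch), the requested range can be clamped against the internal block's length so the BlockReader never receives an out-of-range request:

```java
// Hypothetical sketch: clamp a pread request against the internal block
// length before handing it to the BlockReader.
public class ReadRangeCheck {
  /**
   * Returns how many bytes can actually be read starting at startOffset,
   * given the internal block's length. Zero means the requested range lies
   * entirely beyond this internal block and no request should be sent.
   */
  public static long clampReadLength(long startOffset, long requestedLen,
                                     long internalBlockLen) {
    if (startOffset >= internalBlockLen) {
      return 0; // nothing to read from this internal block
    }
    // Never ask for more bytes than the block actually holds.
    return Math.min(requestedLen, internalBlockLen - startOffset);
  }
}
```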

bq. Currently I'm thinking it may be easier to use cell and stripe as the basic 
reading and recovery unit to avoid complexity.
As we discussed under HDFS-8281, this indeed simplifies the implementation. 
But my concern is that preads tend to be even more random than stateful 
reads, so the overhead of always reading an entire stripe will be more 
significant. Under HDFS-7782 we specifically optimized away an additional 
memory copy from the intermediate buffer to the application buffer. We also 
changed the signature of {{DFSInputStream#actualGetFromOneDataNode}} to 
avoid creating the block reader multiple times.

So my proposal is that we use a _stripe_ as the read unit, but a _stripe_ is 
not necessarily {{dataBlkNum}} cells. Instead, it may cover multiple cells on 
each internal block, as long as it has the same span on all internal blocks. 
For example (referring to the header of {{DFSStripedInputStream}}), a 
_stripe_ can be cell_3 ~ cell_8. And we don't always have to allocate an 
intermediate buffer: if the application read range happens to form a stripe 
(which should be fairly common), we read directly into it. I believe this 
doesn't add much complexity over using the current {{StripeRange}} as the 
reading unit, but please let me know your opinion. For example, if we apply 
this idea to the current patch, we would first calculate {{maxPortion}} as 
full cells. We then always read {{maxPortion}}, even before detecting any 
failures. Finally, we use an intermediate buffer only if the pread range is 
smaller than the {{maxPortion}} stripe.
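The cell-alignment step above can be sketched as follows (an illustrative sketch only; the class and method names are made up, and the real {{maxPortion}} calculation in the patch may differ): round the pread range out to full-cell boundaries, and allocate an intermediate buffer only when the requested range is strictly smaller than the aligned stripe.

```java
// Hypothetical sketch of rounding a pread range out to full-cell boundaries
// so the read unit spans whole cells on every internal block.
public class StripeAlign {
  /** Returns {alignedStart, alignedEnd}: start rounded down and end rounded
   *  up to the nearest cell boundary. */
  public static long[] alignToCells(long start, long end, long cellSize) {
    long alignedStart = (start / cellSize) * cellSize;
    long alignedEnd = ((end + cellSize - 1) / cellSize) * cellSize;
    return new long[] { alignedStart, alignedEnd };
  }

  /** An intermediate buffer is needed only when the application's range is
   *  strictly smaller than the cell-aligned stripe; otherwise we can read
   *  directly into the application buffer. */
  public static boolean needsIntermediateBuffer(long start, long end,
                                                long cellSize) {
    long[] aligned = alignToCells(start, end, cellSize);
    return aligned[0] != start || aligned[1] != end;
  }
}
```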

bq. In the following code, suppose we already have one failure and one success 
reading tasks before, xxx it will still be better to avoid the unnecessary 
operations here.
The first {{if}} and the first {{else if}} are used to exit the loop quickly 
once the read has definitely succeeded or definitely failed. I can change 
them to 3 independent {{if}} statements.
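The control flow being discussed can be sketched roughly as below (a hedged sketch only; the class, method, and threshold logic are hypothetical stand-ins for the actual patch code): each exit condition becomes its own independent check instead of a chained if / else-if.

```java
// Hypothetical sketch: decide whether the pread task loop can exit early.
// In (dataBlkNum, parityBlkNum) striping, any dataBlkNum successful internal
// reads suffice, while more than parityBlkNum failures make decode impossible.
public class ReadLoopExit {
  public static String checkExit(int successes, int failures,
                                 int dataBlkNum, int parityBlkNum) {
    if (successes >= dataBlkNum) {
      return "SUCCESS";   // enough internal blocks to serve/reconstruct
    }
    if (failures > parityBlkNum) {
      return "FAIL";      // more failures than parity can tolerate
    }
    return "CONTINUE";    // outcome still undecided; keep waiting on tasks
  }
}
```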

bq. But I strongly suggest to have a separate jira to test all the internal 
logic inside of DFSStripedInputStream and fix possible bugs.
This is a great point. A related minor issue is that our current 
{{TestDFSStripedInputStream}} is actually an end-to-end test; it should be 
named {{TestWriteReadStripedFile}} and should eventually subclass 
{{TestWriteRead}}. The current {{TestReadStripedFile}} is actually the 
internal unit-test class for {{DFSStripedInputStream}}. I filed HDFS-8334 to 
address this. Another related issue is that we never test the parity data 
writing logic in {{DFSStripedOutputStream}}: 
{{TestDFSStripedOutputStream#verifyParity}} doesn't actually fetch the stored 
parity blocks.

I agree with the other issues and will address them in the next rev.

> Erasure coding: DFSInputStream with decode functionality (pread)
> ----------------------------------------------------------------
>
>                 Key: HDFS-7678
>                 URL: https://issues.apache.org/jira/browse/HDFS-7678
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: HDFS-7285
>            Reporter: Li Bo
>            Assignee: Zhe Zhang
>         Attachments: BlockGroupReader.patch, HDFS-7678-HDFS-7285.002.patch, 
> HDFS-7678-HDFS-7285.003.patch, HDFS-7678-HDFS-7285.004.patch, 
> HDFS-7678-HDFS-7285.005.patch, HDFS-7678-HDFS-7285.006.patch, 
> HDFS-7678-HDFS-7285.007.patch, HDFS-7678-HDFS-7285.008.patch, 
> HDFS-7678-HDFS-7285.009.patch, HDFS-7678-HDFS-7285.010.patch, 
> HDFS-7678.000.patch, HDFS-7678.001.patch
>
>
> A block group reader will read data from a BlockGroup, whether it is in 
> striping layout or contiguous layout. The corrupt blocks may be known 
> before reading (told by the namenode), or found during reading. The block 
> group reader needs to do decoding work when some blocks are found corrupt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
