[
https://issues.apache.org/jira/browse/HDFS-7678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531592#comment-14531592
]
Zhe Zhang commented on HDFS-7678:
---------------------------------
Thanks Jing for the in-depth comments! The discussion is very helpful at this
stage and I'm glad we are uncovering some of the fundamental trade-offs.
bq. The recovery + MaxPortion logic may have some issue since pread requires us
to provide precise reading range to the BlockReader.
This is a very good catch. A simple fix is to sanity-check the read range
against the internal block before sending the request; I will include it in
the next rev.
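As a minimal sketch of the sanity check described above (names like {{clampReadLength}} and {{internalBlockLen}} are illustrative, not from the actual patch):

```java
// Hypothetical sketch: trim a pread request so that it never asks the
// BlockReader for bytes beyond the end of the internal block.
public class ReadRangeCheck {
  /**
   * Clamp a requested read length so that [startOffset, startOffset + len)
   * stays inside an internal block of the given length.
   */
  static long clampReadLength(long startOffset, long requestedLen,
      long internalBlockLen) {
    if (startOffset < 0 || startOffset >= internalBlockLen) {
      throw new IllegalArgumentException("offset outside internal block");
    }
    return Math.min(requestedLen, internalBlockLen - startOffset);
  }

  public static void main(String[] args) {
    // A request that overruns the internal block is trimmed before the
    // BlockReader request is sent; an in-range request is left alone.
    System.out.println(clampReadLength(900, 200, 1024)); // prints 124
    System.out.println(clampReadLength(0, 512, 1024));   // prints 512
  }
}
```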
bq. Currently I'm thinking it may be easier to use cell and stripe as the basic
reading and recovery unit to avoid complexity.
As we discussed under HDFS-8281, this indeed simplifies the implementation.
But my concern is that pread access patterns should be even more random than
stateful reads, so the overhead of always reading an entire stripe will be
more significant. Under HDFS-7782 we specifically made optimizations to avoid
an additional memory copy from the intermediate buffer to the application
buffer. We also changed the signature of
{{DFSInputStream#actualGetFromOneDataNode}} to avoid creating the block
reader multiple times.
So my proposal is that we use _stripe_ as the read unit, but a _stripe_ is
not necessarily {{dataBlkNum}} cells. Instead, it may cover multiple cells on
each internal block, as long as it has the same span on all internal blocks.
For example (referring to the header of {{DFSStripedInputStream}}), a
_stripe_ can be cell_3 ~ cell_8. And we don't always have to allocate an
intermediate buffer: if the application read range happens to form a stripe
(which should be fairly common), we read directly into the application
buffer. I believe this doesn't add much complexity over using the current
{{StripeRange}} as the reading unit, but please let me know your opinion.
For example, if we apply this idea to the current patch, we would first
round {{maxPortion}} out to full cells, then always read {{maxPortion}} even
before detecting any failures, and finally use an intermediate buffer only
if the pread range is smaller than the {{maxPortion}} stripe.
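To make the rounding step concrete, here is a hypothetical sketch of widening a pread range to full cells (the helper {{alignToCells}} is illustrative; only {{cellSize}} mirrors the striping terminology):

```java
// Hypothetical sketch: round a pread range out to whole cells so that the
// resulting stripe has the same span on every internal block. When the
// original range is already cell-aligned, no intermediate buffer is needed.
public class StripeAlign {
  /** Return {alignedStart, alignedEnd} covering [start, end) in full cells. */
  static long[] alignToCells(long start, long end, int cellSize) {
    long alignedStart = (start / cellSize) * cellSize;              // round down
    long alignedEnd = ((end + cellSize - 1) / cellSize) * cellSize; // round up
    return new long[] { alignedStart, alignedEnd };
  }

  public static void main(String[] args) {
    // With 64KB cells, a pread of logical bytes [100000, 200000) is widened
    // to the cell-aligned span [65536, 262144).
    long[] r = alignToCells(100_000, 200_000, 64 * 1024);
    System.out.println(r[0] + " " + r[1]); // prints 65536 262144
  }
}
```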
bq. In the following code, suppose we already have one failure and one success
reading tasks before, xxx it will still be better to avoid the unnecessary
operations here.
The first {{if}} and the first {{else if}} are used to exit the loop quickly
when the read is certain to have succeeded or failed. I can change them to
three independent {{if}} statements.
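For clarity, a minimal sketch of the early-exit structure being discussed (the method and thresholds are hypothetical, not the actual patch):

```java
// Illustrative sketch of polling read results with independent checks:
// return as soon as the outcome is certain, otherwise keep waiting.
public class EarlyExit {
  static String poll(int successes, int failures, int needed, int maxFailures) {
    // Independent checks, as suggested, instead of an if / else-if chain.
    if (successes >= needed) {
      return "SUCCESS";   // surely successful: enough chunks fetched
    }
    if (failures > maxFailures) {
      return "FAILED";    // surely failed: too many chunks lost to decode
    }
    return "PENDING";     // otherwise wait on the remaining read tasks
  }

  public static void main(String[] args) {
    System.out.println(poll(6, 0, 6, 3)); // prints SUCCESS
    System.out.println(poll(2, 4, 6, 3)); // prints FAILED
    System.out.println(poll(2, 1, 6, 3)); // prints PENDING
  }
}
```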
bq. But I strongly suggest to have a separate jira to test all the internal
logic inside of DFSStripedInputStream and fix possible bugs.
This is a great point. A related minor issue is that our current
{{TestDFSStripedInputStream}} is actually an end-to-end test; it should be
named {{TestWriteReadStripedFile}} and should eventually subclass
{{TestWriteRead}}. The current {{TestReadStripedFile}} is actually the
internal unit test class for {{DFSStripedInputStream}}. I filed HDFS-8334 to
address this. Another related issue is that we never test the parity-writing
logic in {{DFSStripedOutputStream}}:
{{TestDFSStripedOutputStream#verifyParity}} doesn't actually fetch the stored
parity blocks.
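To illustrate the shape a real parity check would take, here is a hypothetical sketch that re-encodes the data locally and compares against the parity as stored; it uses XOR parity as a stand-in for the actual erasure encoder, and all names are illustrative:

```java
import java.util.Arrays;

// Hypothetical sketch: verify stored parity by re-encoding the data cells
// locally and comparing byte-for-byte. XOR parity stands in for the real
// encoder; the point is that the stored bytes themselves must be compared.
public class ParityCheck {
  /** XOR parity over the data cells (all cells must have equal length). */
  static byte[] xorParity(byte[][] dataCells) {
    byte[] parity = new byte[dataCells[0].length];
    for (byte[] cell : dataCells) {
      for (int i = 0; i < parity.length; i++) {
        parity[i] ^= cell[i];
      }
    }
    return parity;
  }

  /** Compare the parity actually stored against a local re-encoding. */
  static boolean verifyParity(byte[][] dataCells, byte[] storedParity) {
    return Arrays.equals(xorParity(dataCells), storedParity);
  }

  public static void main(String[] args) {
    byte[][] data = { {1, 2, 3}, {4, 5, 6} };
    byte[] stored = { 1 ^ 4, 2 ^ 5, 3 ^ 6 };
    System.out.println(verifyParity(data, stored)); // prints true
  }
}
```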
I agree with the other issues and will address them in the next rev.
> Erasure coding: DFSInputStream with decode functionality (pread)
> ----------------------------------------------------------------
>
> Key: HDFS-7678
> URL: https://issues.apache.org/jira/browse/HDFS-7678
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Affects Versions: HDFS-7285
> Reporter: Li Bo
> Assignee: Zhe Zhang
> Attachments: BlockGroupReader.patch, HDFS-7678-HDFS-7285.002.patch,
> HDFS-7678-HDFS-7285.003.patch, HDFS-7678-HDFS-7285.004.patch,
> HDFS-7678-HDFS-7285.005.patch, HDFS-7678-HDFS-7285.006.patch,
> HDFS-7678-HDFS-7285.007.patch, HDFS-7678-HDFS-7285.008.patch,
> HDFS-7678-HDFS-7285.009.patch, HDFS-7678-HDFS-7285.010.patch,
> HDFS-7678.000.patch, HDFS-7678.001.patch
>
>
> A block group reader will read data from a BlockGroup, whether in striping
> layout or contiguous layout. Corrupt blocks may be known before reading
> (reported by the namenode) or discovered during reading. The block group
> reader needs to do decoding work when some blocks are found corrupt.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)