[
https://issues.apache.org/jira/browse/HDFS-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271930#comment-14271930
]
Kai Zheng commented on HDFS-7337:
---------------------------------
Hi Zhe, let me address your comments. We have discussed quite a bit offline so
I'm here to summarize and clarify further. If anything I missed please comment,
thanks.
bq. I like the idea of creating an ec package under org.apache.hadoop.hdfs. It
is a good place to host all codec classes.
Glad you like it. I guess we could put all the central EC constructs and
facilities here that aren't specific to the client, namenode, or datanode.
Currently the codec-related classes are the best examples.
bq. I think the ec package should focus on codec calculation based on a packet
unit. Below is how I think the functions should be logically divided:
In this work RawErasureCoder focuses on calculation based on a packet unit or
chunk. It won't be much code; it simply implements how to encode/decode a
group of chunks. But why would we not stop here, and instead ask for a higher
level construct like ErasureCodec? As I explained to you, it is better to have
a central place to maintain all the codec-specific logic. The effect is that a
customer only needs to plug in at one place (ErasureCodec) instead of many,
which avoids possible inconsistency; to add support for a new erasure code
algorithm, implementing an ErasureCodec is all that's needed, and we don't
have to modify ECManager, ECWorker and ECClient in many places. So the
question becomes which aspects are covered, and how, when supporting a new
code algorithm:
1) how to calculate with a group of bytes, units or chunks, which is covered
by ErasureCoder and RawErasureCoder;
2) how to lay out/order the group of chunks, which is covered by BlockGrouper.
Both ErasureCoder and BlockGrouper are abstracted and can be extended
according to a code algorithm or codec. So to add support for a new code, it's
expected to:
1) add a new ErasureCoder;
2) add a new BlockGrouper;
3) add a new ErasureCodec using the former two;
4) update hdfs-site.xml or whatever place registers the new ErasureCodec with
a name (a sketch of this plugin surface is shown below).
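To make the plugin surface concrete, here is a minimal sketch under my own
assumptions: everything except the ErasureCodec, ErasureCoder and BlockGrouper
names (method signatures, the RS stand-ins) is made up for illustration, not
taken from the patch.
{code:java}
// Sketch only: the codec is the single plugin point and hands out the
// two aspects a new code must provide. All signatures are hypothetical.
interface ErasureCoder {                 // aspect 1: chunk calculation
  void encode(byte[][] dataChunks, byte[][] parityChunks);
}

interface BlockGrouper {                 // aspect 2: chunk layout/order
  int[] makeBlockGroup(int numDataBlocks, int numParityBlocks);
}

abstract class ErasureCodec {            // the one place to plug in
  public abstract ErasureCoder createErasureCoder();
  public abstract BlockGrouper createBlockGrouper();
}

// Supporting a new code = one codec class built from its two parts.
class RSErasureCodec extends ErasureCodec {
  @Override public ErasureCoder createErasureCoder() {
    return (data, parity) -> { /* delegate to a RawErasureCoder */ };
  }
  @Override public BlockGrouper createBlockGrouper() {
    return (d, p) -> new int[d + p];     // trivial placeholder layout
  }
}
{code}
Step 4 would then just map a codec name to such a class in configuration, so a
schema can reference the codec purely by name; the exact configuration key is
still open.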
Then a customer would simply configure/create a new EC schema by referencing
the new codec name; using the schema, an EC file system zone can be created,
and so on. So, with all the code-specific logic extracted into such an
ErasureCodec construct, how is it called by and how does it interact with
ECManager, ECWorker and ECClient? It all starts by assuming a schema is known
by whatever means. Using the schema, the ErasureCodec can be instantiated;
from the codec instance, the BlockGrouper can be created and utilized by
ECManager to create a BlockGroup given the necessary information, and the
ErasureCoder can be created and then utilized by ECWorker or ECClient to
perform encoding/decoding given a group of chunks.
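Reusing the hypothetical types from the sketch above, the interaction could
read roughly like this; how the schema is resolved and how the components are
wired is simplified away.
{code:java}
// Rough flow sketch built on the hypothetical types above.
class EcWorkflowSketch {
  static void demo(ErasureCodec codec) {
    // ECManager side: the BlockGrouper forms a BlockGroup.
    BlockGrouper grouper = codec.createBlockGrouper();
    int[] blockGroup = grouper.makeBlockGroup(6, 3); // e.g. a (6,3) code

    // ECWorker/ECClient side: the ErasureCoder works on chunk groups.
    ErasureCoder coder = codec.createErasureCoder();
    byte[][] data = new byte[6][1024];
    byte[][] parity = new byte[3][1024];
    coder.encode(data, parity);
  }

  public static void main(String[] args) {
    demo(new RSErasureCodec()); // codec instanced from a known schema
  }
}
{code}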
Sure, it won't be that easy. Zhe pointed out a hurdle: a codec may have to be
hard-coded in order to be efficiently maintained by/associated with an inode,
so adding support for a new code may also involve changing code in some places
outside of the codec framework. I will investigate such cases. Still, it would
be ideal to avoid changing or adding code in many places besides the new codec
itself.
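Purely as my own illustration of that inode concern, not anything decided: one
direction is a small registry mapping each codec name to a compact id the
inode can store cheaply, so only the registry changes when a code is added.
{code:java}
import java.util.HashMap;
import java.util.Map;

// Illustration only: an inode could store a 1-byte codec id instead of
// the codec name or class. The registry itself is made up for this sketch.
class CodecRegistrySketch {
  private final Map<String, Byte> nameToId = new HashMap<>();
  private final Map<Byte, String> idToName = new HashMap<>();

  void register(String codecName, byte id) {
    nameToId.put(codecName, id);
    idToName.put(id, codecName);
  }

  byte idFor(String codecName) { return nameToId.get(codecName); }
  String nameFor(byte id) { return idToName.get(id); }
}
{code}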
To demonstrate how the codec framework works, as Zhe suggested, we will come
up with more than one codec so that we can compare and see more clearly.
Currently only the RS codec is implemented, with a test case and a sample;
we're working on another one using the XOR code, though it may never be used
in production.
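For reference, the XOR code's calculation side is tiny, which is what makes it
a good second codec for comparison. A minimal encode sketch (the signature is
my assumption, not the patch's RawErasureCoder API): one parity chunk is the
XOR of all data chunks, so any single lost data chunk can be rebuilt the same
way.
{code:java}
class XorEncodeSketch {
  // XOR-fold every data chunk into the single parity chunk.
  static void encode(byte[][] dataChunks, byte[] parityChunk) {
    java.util.Arrays.fill(parityChunk, (byte) 0);
    for (byte[] chunk : dataChunks) {
      for (int i = 0; i < parityChunk.length; i++) {
        parityChunk[i] ^= chunk[i];
      }
    }
  }

  public static void main(String[] args) {
    byte[][] data = { {1, 2}, {3, 4}, {5, 6} };
    byte[] parity = new byte[2];
    encode(data, parity); // parity = {1^3^5, 2^4^6} = {7, 0}
  }
}
{code}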
> Configurable and pluggable Erasure Codec and schema
> ---------------------------------------------------
>
> Key: HDFS-7337
> URL: https://issues.apache.org/jira/browse/HDFS-7337
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Zhe Zhang
> Assignee: Kai Zheng
> Attachments: HDFS-7337-prototype-v1.patch,
> HDFS-7337-prototype-v2.zip, HDFS-7337-prototype-v3.zip,
> PluggableErasureCodec.pdf
>
>
> According to HDFS-7285 and the design, this proposes to support multiple
> Erasure Codecs via a pluggable approach. It allows defining and configuring
> multiple codec schemas with different coding algorithms and parameters. The
> resultant codec schemas can be utilized and specified via a command tool for
> different file folders. While designing and implementing such a pluggable
> framework, we also implement a concrete default codec (Reed-Solomon) to prove
> the framework is useful and workable. A separate JIRA could be opened for the
> RS codec implementation.
> Note HDFS-7353 will focus on the very low level codec API and implementation
> to make concrete vendor libraries transparent to the upper layer. This JIRA
> focuses on the high-level parts that interact with configuration, schema, etc.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)