[
https://issues.apache.org/jira/browse/HDFS-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271930#comment-14271930
]
Kai Zheng commented on HDFS-7337:
---------------------------------
Hi Zhe, let me address your comments. We have discussed quite a bit offline so
I'm here to summarize and clarify further. If anything I missed please comment,
thanks.
bq. I like the idea of creating an ec package under org.apache.hadoop.hdfs. It
is a good place to host all codec classes.
Glad you like it. I guess we could put all the central EC constructs and
facilities here that aren't specific to the client, namenode, or datanode.
Currently the codec-related classes are the best examples.
bq. I think the ec package should focus on codec calculation based on a packet
unit. Below is how I think the functions should be logically divided:
In this work RawErasureCoder focuses on calculation based on a packet unit or
chunk. It won't be much code; it simply implements how to encode/decode a
group of chunks. But why would we not stop here, and instead ask for a higher
level construct like ErasureCodec? As I explained to you, it is better to have
a central place to maintain all the codec-specific logic. The effect is that a
customer only needs to plug in at one place (ErasureCodec) instead of many,
which avoids possible inconsistency; to add support for a new erasure code
algorithm, implementing an ErasureCodec is all that's needed, and we don't
have to modify ECManager, ECWorker and ECClient in many places. So the
question becomes which aspects are covered, and how, when supporting a new
code algorithm:
1) how to calculate with a group of bytes, units or chunks, which is covered
by ErasureCoder and RawErasureCoder;
2) how to lay out/order the group of chunks, which is covered by BlockGrouper.
Both ErasureCoder and BlockGrouper are abstracted and can be extended
according to a code algorithm or codec. So to add support for a new code, it's
expected to:
1) add a new ErasureCoder;
2) add a new BlockGrouper;
3) add a new ErasureCodec using the former two;
4) update hdfs-site.xml or whatever place registers the new ErasureCodec with
a name (a sketch of this plugin surface is shown below).
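To make the plugin surface concrete, here is a minimal sketch under my own
assumptions: everything except the ErasureCodec, ErasureCoder and BlockGrouper
names (method signatures, the RS stand-ins) is made up for illustration, not
taken from the patch.
{code:java}
// Sketch only: the codec is the single plugin point and hands out the
// two aspects a new code must provide. All signatures are hypothetical.
interface ErasureCoder {                 // aspect 1: chunk calculation
  void encode(byte[][] dataChunks, byte[][] parityChunks);
}

interface BlockGrouper {                 // aspect 2: chunk layout/order
  int[] makeBlockGroup(int numDataBlocks, int numParityBlocks);
}

abstract class ErasureCodec {            // the one place to plug in
  public abstract ErasureCoder createErasureCoder();
  public abstract BlockGrouper createBlockGrouper();
}

// Supporting a new code = one codec class built from its two parts.
class RSErasureCodec extends ErasureCodec {
  @Override public ErasureCoder createErasureCoder() {
    return (data, parity) -> { /* delegate to a RawErasureCoder */ };
  }
  @Override public BlockGrouper createBlockGrouper() {
    return (d, p) -> new int[d + p];     // trivial placeholder layout
  }
}
{code}
Step 4 would then just map a codec name to such a class in configuration, so a
schema can reference the codec purely by name; the exact configuration key is
still open.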
Then a customer would simply configure/create a new EC schema by referencing
the new codec name; using the schema, an EC file system zone can be created,
and so on. So, with all the code-specific logic extracted into such an
ErasureCodec construct, how is it called by and how does it interact with
ECManager, ECWorker and ECClient? It all starts by assuming a schema is known
by whatever means. Using the schema, the ErasureCodec can be instantiated;
from the codec instance, the BlockGrouper can be created and utilized by
ECManager to create a BlockGroup given the necessary information, and the
ErasureCoder can be created and then utilized by ECWorker or ECClient to
perform encoding/decoding given a group of chunks.
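Reusing the hypothetical types from the sketch above, the interaction could
read roughly like this; how the schema is resolved and how the components are
wired is simplified away.
{code:java}
// Rough flow sketch built on the hypothetical types above.
class EcWorkflowSketch {
  static void demo(ErasureCodec codec) {
    // ECManager side: the BlockGrouper forms a BlockGroup.
    BlockGrouper grouper = codec.createBlockGrouper();
    int[] blockGroup = grouper.makeBlockGroup(6, 3); // e.g. a (6,3) code

    // ECWorker/ECClient side: the ErasureCoder works on chunk groups.
    ErasureCoder coder = codec.createErasureCoder();
    byte[][] data = new byte[6][1024];
    byte[][] parity = new byte[3][1024];
    coder.encode(data, parity);
  }

  public static void main(String[] args) {
    demo(new RSErasureCodec()); // codec instanced from a known schema
  }
}
{code}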
Sure, it won't be that easy. Zhe pointed out a hurdle: a codec may have to be
hard-coded in order to be efficiently maintained by/associated with an inode,
so adding support for a new code may also involve changing code in some places
outside of the codec framework. I will investigate such cases. Still, it would
be ideal to avoid changing or adding code in many places besides the new codec
itself.
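Purely as my own illustration of that inode concern, not anything decided: one
direction is a small registry mapping each codec name to a compact id the
inode can store cheaply, so only the registry changes when a code is added.
{code:java}
import java.util.HashMap;
import java.util.Map;

// Illustration only: an inode could store a 1-byte codec id instead of
// the codec name or class. The registry itself is made up for this sketch.
class CodecRegistrySketch {
  private final Map<String, Byte> nameToId = new HashMap<>();
  private final Map<Byte, String> idToName = new HashMap<>();

  void register(String codecName, byte id) {
    nameToId.put(codecName, id);
    idToName.put(id, codecName);
  }

  byte idFor(String codecName) { return nameToId.get(codecName); }
  String nameFor(byte id) { return idToName.get(id); }
}
{code}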
To demonstrate how the codec framework works, as Zhe suggested, we will come
up with more than one codec so that we can compare and see more clearly.
Currently only the RS codec is implemented, with a test case and a sample;
we're working on another one using the XOR code, though it may never be used
in production.
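For reference, the XOR code's calculation side is tiny, which is what makes it
a good second codec for comparison. A minimal encode sketch (the signature is
my assumption, not the patch's RawErasureCoder API): one parity chunk is the
XOR of all data chunks, so any single lost data chunk can be rebuilt the same
way.
{code:java}
class XorEncodeSketch {
  // XOR-fold every data chunk into the single parity chunk.
  static void encode(byte[][] dataChunks, byte[] parityChunk) {
    java.util.Arrays.fill(parityChunk, (byte) 0);
    for (byte[] chunk : dataChunks) {
      for (int i = 0; i < parityChunk.length; i++) {
        parityChunk[i] ^= chunk[i];
      }
    }
  }

  public static void main(String[] args) {
    byte[][] data = { {1, 2}, {3, 4}, {5, 6} };
    byte[] parity = new byte[2];
    encode(data, parity); // parity = {1^3^5, 2^4^6} = {7, 0}
  }
}
{code}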
> Configurable and pluggable Erasure Codec and schema
> ---------------------------------------------------
>
> Key: HDFS-7337
> URL: https://issues.apache.org/jira/browse/HDFS-7337
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Zhe Zhang
> Assignee: Kai Zheng
> Attachments: HDFS-7337-prototype-v1.patch,
> HDFS-7337-prototype-v2.zip, HDFS-7337-prototype-v3.zip,
> PluggableErasureCodec.pdf
>
>
> According to HDFS-7285 and the design, this proposes to support multiple
> Erasure Codecs via a pluggable approach. It allows defining and configuring
> multiple codec schemas with different coding algorithms and parameters. The
> resultant codec schemas can be utilized and specified via a command tool for
> different file folders. While designing and implementing such a pluggable
> framework, we also implement a concrete default codec (Reed-Solomon) to prove
> the framework is useful and workable. A separate JIRA could be opened for the
> RS codec implementation.
> Note HDFS-7353 will focus on the very low level codec API and implementation
> to make concrete vendor libraries transparent to the upper layer. This JIRA
> focuses on the high-level parts that interact with configuration, schema, etc.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)