[ https://issues.apache.org/jira/browse/HDFS-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359990#comment-14359990 ]

Zhe Zhang commented on HDFS-7337:
---------------------------------

Thanks Kai for the explanation! Now I have a much clearer understanding of the 
codec design.

bq. ErasureCodec would be the high level construct in the framework ...
I agree with the high-level goal. The reason I think {{ErasureCodec}} seems 
like a utility class is that (at least in the current HADOOP-11643 / 
HADOOP-11645 code) it is pretty much stateless: it creates an {{ErasureCoder}} 
and a {{BlockGrouper}} based on the given schema type. But as you said, we 
might extend its functionality in the future, so we can revisit this point 
later.
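
To make that concrete, here is a rough sketch of the shape I'm describing. 
The method names are illustrative guesses on my part, not the actual patch 
API:

{code:java}
// Rough sketch only -- names and signatures are illustrative guesses,
// not the actual HADOOP-11643 / HADOOP-11645 API.
public abstract class ErasureCodec {
  protected final ECSchema schema; // the only state: the schema itself

  protected ErasureCodec(ECSchema schema) {
    this.schema = schema;
  }

  // Pure factory behavior: everything below is derived from the schema.
  public abstract ErasureCoder createEncoder();

  public abstract ErasureCoder createDecoder();

  public abstract BlockGrouper createBlockGrouper();
}
{code}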

bq. It's a good pattern. ErasureCodec follows another good pattern, 
CompressionCodec.
My statement was a little confusing. I wasn't suggesting leveraging 
{{BlockStoragePolicySuite}} to build the {{ErasureCodec}} class. I was 
suggesting we build a similar _schema suite_ class to store all schemas.

bq. All the {{ErasureCodec}}s are loaded thru core-site configuration or 
service locators, and kept in map with codec name as the key.
Agreed. Actually, the {{ECSchemaSuite}} idea I proposed above does the same 
thing: besides a few hard-coded schemas, it can also parse the XML file and 
load more schemas into the suite. If we don't use something like a schema 
suite, where should we maintain this map? I see HADOOP-11664 loads schemas 
from XML. Is there another JIRA handling the management of loaded schemas? If 
not, maybe we can consider {{ECSchemaSuite}}? It has the simple task of 
mapping an ID (either a byte or a String, as you proposed) to the 
{{ECSchema}} object.
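
For concreteness, a minimal sketch of what I have in mind; this class does 
not exist yet, and the {{ECSchema}} constructor arguments are my assumption:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical ECSchemaSuite -- the shape follows the proposal above.
// The ECSchema constructor parameters are illustrative assumptions.
public final class ECSchemaSuite {
  private final Map<String, ECSchema> schemas =
      new HashMap<String, ECSchema>();

  public ECSchemaSuite() {
    // A few hard-coded defaults; the RS(6,3) parameters are illustrative.
    addSchema(new ECSchema("RS-6-3", "rs", 6, 3));
  }

  // Schemas parsed from the XML file (HADOOP-11664) are added the same way.
  public void addSchema(ECSchema schema) {
    schemas.put(schema.getSchemaName(), schema);
  }

  // Returns null if the name is unknown to this version.
  public ECSchema getSchema(String name) {
    return schemas.get(name);
  }
}
{code}

Loading from XML would then just be a loop over the parsed schemas calling 
{{addSchema}}.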

bq.  If we don't want to hard-code all the schemas, then we need to pass schema 
object I guess.
Agreed. Actually, even if we hard-code all the schemas, it's still dangerous 
to pass only the schema ID: the DN might be running a different version of 
Hadoop than the NN, so their hard-coded schema tables could disagree. 
However, when storing per-directory or per-file schemas, we should only store 
the IDs.
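
To illustrate the split I have in mind, a hypothetical {{SchemaResolver}} 
building on the {{ECSchemaSuite}} sketch above: only the ID is persisted in 
metadata, and the NN resolves it to a full {{ECSchema}} object, which is what 
actually travels to clients and DNs:

{code:java}
import java.io.IOException;

// Hypothetical sketch: only the schema ID is persisted in per-dir/per-file
// metadata; the NN resolves it to a full ECSchema object for the wire.
public final class SchemaResolver {
  private final ECSchemaSuite suite;

  public SchemaResolver(ECSchemaSuite suite) {
    this.suite = suite;
  }

  // Resolution happens once, on the NN. DNs receive the resolved ECSchema
  // object, so a stale hard-coded table on a DN cannot mis-map the ID.
  public ECSchema resolve(String storedSchemaId) throws IOException {
    ECSchema schema = suite.getSchema(storedSchemaId);
    if (schema == null) {
      throw new IOException("Unknown EC schema ID: " + storedSchemaId);
    }
    return schema;
  }
}
{code}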

> Configurable and pluggable Erasure Codec and schema
> ---------------------------------------------------
>
>                 Key: HDFS-7337
>                 URL: https://issues.apache.org/jira/browse/HDFS-7337
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Zhe Zhang
>            Assignee: Kai Zheng
>         Attachments: HDFS-7337-prototype-v1.patch, 
> HDFS-7337-prototype-v2.zip, HDFS-7337-prototype-v3.zip, 
> PluggableErasureCodec-v2.pdf, PluggableErasureCodec.pdf
>
>
> According to HDFS-7285 and the design, this JIRA considers supporting 
> multiple erasure codecs via a pluggable approach. It allows defining and 
> configuring multiple codec schemas with different coding algorithms and 
> parameters. The resulting codec schemas can be utilized and specified via a 
> command tool for different file folders. While designing and implementing 
> such a pluggable framework, we also implement a concrete codec by default 
> (Reed-Solomon) to prove the framework is useful and workable. A separate 
> JIRA could be opened for the RS codec implementation.
> Note HDFS-7353 will focus on the very low-level codec API and implementation 
> to make concrete vendor libraries transparent to the upper layer. This JIRA 
> focuses on the high-level pieces that interact with configuration, schemas, 
> etc.


