[ https://issues.apache.org/jira/browse/HADOOP-12041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127327#comment-15127327 ]

Kai Zheng commented on HADOOP-12041:
------------------------------------

Thanks Zhe for the careful review. The comments are very helpful.
bq. makeValidIndexes should be static and maybe moved to CodecUtil
How about adding {{CoderUtil}} for the coder implementation side? This is something I 
wanted to do but it got missed. As we're going to have various raw erasure coders, 
we'll need a general utility class to hold their shared helpers. {{CodecUtil}} can 
then focus on the caller interface side.
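Something roughly like the sketch below is what I have in mind for the split; the 
exact contents are still open, and the signature of {{makeValidIndexes}} shown here 
is only my assumption (collecting the indexes of the non-null inputs).
{code:java}
// Sketch only: a coder-side utility class shared by raw erasure coder
// implementations; CodecUtil stays focused on the caller-facing side.
public final class CoderUtil {
  private CoderUtil() {
  }

  /** Assumption: collect the indexes of the non-null (valid) inputs. */
  static int[] makeValidIndexes(byte[][] inputs) {
    int[] validIndexes = new int[inputs.length];
    int idx = 0;
    for (int i = 0; i < inputs.length; i++) {
      if (inputs[i] != null) {
        validIndexes[idx++] = i;
      }
    }
    return java.util.Arrays.copyOf(validIndexes, idx);
  }
}
{code}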
bq. The reported checkstyle issue looks valid.
Sure, I'll fix it along with addressing your other comments. :)
bq. When you say the new coder is compatible with ISA-L coder, you mean I can 
use new Java coder to decode data encoded with ISA-L, right?
Exactly. Probably that's why we're here. As you may have already noted, there 
are thin HDFS clients that aren't equipped with the native Hadoop library, so a 
pure Java solution with good performance is needed in case the ISA-L coder 
isn't available.
bq. gftbls is hard to understand by name, does it mean gfTables?
Ah, yes. Will rename it.
bq. Any reason to allocate encodeMatrix, decodeMatrix, and invertMatrix as 1D 
arrays but use them as matrices? Can we use 2D arrays?
I borrowed the approach from the ISA-L library as a whole. I believe it's done 
that way for better performance: the mentioned matrices are accessed very 
frequently during encoding/decoding, so it's better to keep them compact so they 
can stay in the cache.
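For illustration, the layout amounts to something like the sketch below (class and 
field names are illustrative, not the actual code): a rows x cols matrix kept as one 
flat, row-major byte array so the hot loops walk contiguous memory.
{code:java}
// Sketch only: a rows x cols matrix stored as one flat, row-major byte
// array, so the encode/decode inner loops touch a single contiguous
// block that stays cache-friendly.
public class FlatMatrix {
  private final byte[] data;
  private final int cols;

  FlatMatrix(int rows, int cols) {
    this.data = new byte[rows * cols];
    this.cols = cols;
  }

  byte get(int row, int col) {
    return data[row * cols + col]; // row-major indexing into the 1D array
  }

  void set(int row, int col, byte value) {
    data[row * cols + col] = value;
  }
}
{code}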
bq. The below is not easy to understand. Why don't we need to prepare 
decodeMatrix if the cached indexes haven't changed? 
I agree it's better to add a comment for easier understanding. The decodeMatrix 
depends only on the schema and the erased indexes. The schema is bound during 
the initialization phase and won't change; the erased indexes can change between 
calls, but if they haven't changed, the decodeMatrix doesn't need to be 
recomputed. As indicated on the HDFS side, we're very likely to call a decoder 
repeatedly for a corrupt block group with fixed erased indexes, so this 
optimization trick should make some sense.
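The intent is roughly the sketch below (class, field, and method names are 
illustrative, not the actual code): skip the expensive matrix derivation when the 
erased indexes match the previous call.
{code:java}
import java.util.Arrays;

// Sketch only: cache the decode matrix keyed by the erased indexes,
// since it depends only on the fixed schema and those indexes.
public class DecodeMatrixCache {
  private int[] cachedErasedIndexes;
  private byte[] decodeMatrix;

  void prepareDecoding(int[] erasedIndexes) {
    if (Arrays.equals(cachedErasedIndexes, erasedIndexes)) {
      return; // same erasure pattern as the last call, reuse decodeMatrix
    }
    cachedErasedIndexes = erasedIndexes.clone();
    decodeMatrix = computeDecodeMatrix(erasedIndexes);
  }

  // Placeholder for the expensive step (deriving the decode matrix from
  // the surviving rows of the encode matrix); elided here.
  private byte[] computeDecodeMatrix(int[] erasedIndexes) {
    return new byte[0];
  }
}
{code}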
bq. RSUtil2#genReedSolomonMatrix is unused
OK, let me remove it for now and add it back later when we support the mode of 
encode matrix generation.
bq. Patch is already very large, I think we should add package-info separately.
Yeah, sure. I think we can add them when doing the coder rename, if we can bear 
the checkstyle warnings complaining about their absence until then.
bq. So about the incompatibility between HDFS-RAID coder and the new Java 
coder: is it because they use different GF matrices?
I think so. The implementation approach is also quite different. As you can see, 
in this new Java coder (as in the ISA-L library), the encode/decode matrix is 
calculated in advance, and encoding and decoding are unified in the same way: 
both purely perform the matrix operation against the input data bytes. The 
benefit is that the decoder and encoder can be optimized together and achieve 
the same high performance.
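To illustrate the unified path, both encode and decode boil down to something like 
the sketch below: a precomputed matrix multiplied with the input units, byte by 
byte, in GF(2^8). The names and the bit-by-bit GF multiply (with polynomial 0x11D) 
are only illustrative assumptions; the real coders use table lookups instead.
{code:java}
// Sketch only: encode and decode both reduce to multiplying a
// precomputed matrix with the input units, byte by byte, in GF(2^8).
public class MatrixCodingSketch {

  // Bit-by-bit GF(2^8) multiply with polynomial 0x11D; the concrete
  // polynomial here is an assumption, and real coders use table lookups.
  static int gfMul(int a, int b) {
    int p = 0;
    while (b != 0) {
      if ((b & 1) != 0) {
        p ^= a;
      }
      a <<= 1;
      if ((a & 0x100) != 0) {
        a ^= 0x11D;
      }
      b >>= 1;
    }
    return p;
  }

  // outputs[i][pos] = XOR-sum over j of matrix[i][j] * inputs[j][pos] in GF(2^8)
  static void doCoding(byte[][] matrix, byte[][] inputs, byte[][] outputs) {
    for (int i = 0; i < outputs.length; i++) {
      for (int pos = 0; pos < inputs[0].length; pos++) {
        int acc = 0;
        for (int j = 0; j < inputs.length; j++) {
          acc ^= gfMul(matrix[i][j] & 0xFF, inputs[j][pos] & 0xFF);
        }
        outputs[i][pos] = (byte) acc;
      }
    }
  }
}
{code}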

> Implement another Reed-Solomon coder in pure Java
> -------------------------------------------------
>
>                 Key: HADOOP-12041
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12041
>             Project: Hadoop Common
>          Issue Type: Sub-task
>            Reporter: Kai Zheng
>            Assignee: Kai Zheng
>         Attachments: HADOOP-12041-v1.patch, HADOOP-12041-v2.patch, 
> HADOOP-12041-v3.patch, HADOOP-12041-v4.patch, HADOOP-12041-v5.patch, 
> HADOOP-12041-v6.patch
>
>
> Currently existing Java RS coders based on {{GaloisField}} implementation 
> have some drawbacks or limitations:
> * The decoder unnecessarily computes units that are not actually erased (HADOOP-11871);
> * The decoder requires parity units + data units order for the inputs in the 
> decode API (HADOOP-12040);
> * Need to support or align with the native erasure coders regarding the 
> concrete coding algorithms and matrices, so Java coders and native coders can 
> be easily swapped in/out, transparently to HDFS (HADOOP-12010);
> * It's unnecessarily flexible, which incurs some overhead; since HDFS erasure 
> coding is an entirely byte-based data system, we don't need to consider symbol 
> sizes other than 256.
> This calls for implementing another RS coder in pure Java, in addition to the 
> existing {{GaloisField}}-based one from HDFS-RAID. The new Java RS coder will be 
> favored and used by default to resolve the related issues. The old HDFS-RAID 
> originated coder will still be there for comparison, and for converting old 
> data from HDFS-RAID systems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
