[ 
https://issues.apache.org/jira/browse/HADOOP-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243509#comment-15243509
 ] 

Colin Patrick McCabe commented on HADOOP-13010:
-----------------------------------------------

bq. The problem is that a decoder holds expensive coding buffers and computed 
coding matrices, which ideally stay close to the CPU core in caches for 
performance. The cached data is per decoder: it is not only schema specific 
but also erasure-index specific for a decode call, so it's not good to keep 
the cache outside the decoder, yet it still makes sense to cache it because 
on the HDFS side decode is called repeatedly in a loop over a large block 
(64 KB cell size -> 256 MB block size). You might check the native code for 
the native coders regarding the expensive buffers and data cached for every 
decode call. We benchmarked the coders and this optimization showed a great 
speedup. Java InputStreams are similar, but not exactly, because an 
InputStream is purely a view and leverages OS/IO-level caches for file 
reading.
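
Roughly, the caching being described looks something like this (a minimal 
sketch with illustrative names, not the actual Hadoop classes):

{code:java}
import java.util.Arrays;

public class CachingDecoderSketch {
  // Hypothetical cached state: the decoding matrix depends on which indexes
  // are erased, so it can be reused while the same erasure pattern repeats
  // across many decode() calls over one block.
  private int[] cachedErasedIndexes;
  private byte[] cachedDecodeMatrix;

  public void decode(byte[][] inputs, int[] erasedIndexes, byte[][] outputs) {
    if (!Arrays.equals(erasedIndexes, cachedErasedIndexes)) {
      // Recompute only when the erasure pattern changes.
      cachedErasedIndexes = erasedIndexes.clone();
      cachedDecodeMatrix = computeDecodeMatrix(erasedIndexes);
    }
    applyMatrix(cachedDecodeMatrix, inputs, outputs);
  }

  private byte[] computeDecodeMatrix(int[] erasedIndexes) {
    return new byte[0]; // placeholder for the expensive matrix inversion
  }

  private void applyMatrix(byte[] matrix, byte[][] inputs, byte[][] outputs) {
    // placeholder for the GF multiply-accumulate over the cell
  }
}
{code}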

If I understand correctly, you're making the case that there is data (such as 
matrices) which should be shared between multiple concurrent encode or decode 
operations.  If that's the case, then let's make that data static and share it 
between all instances.  But I still think that Encoder/Decoder should manage 
its own buffers rather than having them passed in on every call.
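
To sketch what I mean (illustrative names only, not a proposed final API): 
the shared, immutable data becomes static, and each coder owns its working 
buffers instead of having callers pass them in:

{code:java}
public class SharedStateDecoderSketch {
  // Shared and immutable across all instances; computed once.
  private static final byte[] GF_MUL_TABLE = buildGfTable();

  // Per-instance working buffer, sized once for this coder.
  private final byte[] workBuffer;

  public SharedStateDecoderSketch(int cellSize) {
    this.workBuffer = new byte[cellSize];
  }

  public void decode(byte[][] inputs, int[] erasedIndexes, byte[][] outputs) {
    // Uses GF_MUL_TABLE and workBuffer internally; callers never supply
    // scratch buffers on each call.
  }

  private static byte[] buildGfTable() {
    return new byte[256 * 256]; // placeholder for real table generation
  }
}
{code}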

bq. Having a common base class would allow the encoder and decoder to share 
common properties: not just configuration, but also schema info and some 
options. We can also say that encoders and decoders are both coders, which 
lets us write common behavior that deals with coders generically rather than 
being encoder- or decoder-specific. I understand it could also work by 
composition, but right now I don't see much benefit in switching from one 
style to the other, or much trouble if we don't.

Hmm.  The only state in {{AbstractRawErasureCoder.java}} is configuration 
state.  I don't see why we need this class.  Everything in there could and 
should be a utility function.
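
For example, the helpers could live in a static utility class instead of a 
base class (a rough sketch; the class and method names here are just 
illustrative):

{code:java}
import java.nio.ByteBuffer;

public final class CoderUtilSketch {
  private CoderUtilSketch() {}

  /** Zero out {@code len} bytes of the buffer starting at its position. */
  public static void resetBuffer(ByteBuffer buffer, int len) {
    int pos = buffer.position();
    for (int i = pos; i < pos + len; i++) {
      buffer.put(i, (byte) 0);
    }
  }

  /** Basic argument checking shared by the encode and decode paths. */
  public static void checkParameters(byte[][] inputs, byte[][] outputs,
      int numDataUnits, int numParityUnits) {
    if (inputs.length != numDataUnits) {
      throw new IllegalArgumentException("Invalid inputs length");
    }
    if (outputs.length != numParityUnits) {
      throw new IllegalArgumentException("Invalid outputs length");
    }
  }
}
{code}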

The benefit of getting rid of this class is that with a shallower inheritance 
hierarchy, it's easier to understand what's going on.  To continue the analogy 
with Java, InputStream and OutputStream don't share a common base class.

bq. It sounds better not to have the interfaces since the benefit is obvious. 
So, in summary, how about having these classes (no interfaces) now: still 
AbstractRawErasureCoder, RawErasureEncoder/Decoder (no Abstract prefix now, 
with the original interfaces folded in), and all the concrete 
encoders/decoders inheriting from them. All client code will declare the 
RawErasureEncoder/Decoder type when creating instances.

It seems reasonable, but I don't see the need for AbstractRawErasureCoder.
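
In other words, just the two plain base classes, with no 
AbstractRawErasureCoder above them and the concrete coders extending them 
(simplified, illustrative signatures, not the final API):

{code:java}
public class RawErasureDecoderSketch {
  protected final int numDataUnits;
  protected final int numParityUnits;

  public RawErasureDecoderSketch(int numDataUnits, int numParityUnits) {
    this.numDataUnits = numDataUnits;
    this.numParityUnits = numParityUnits;
  }

  public void decode(byte[][] inputs, int[] erasedIndexes, byte[][] outputs) {
    throw new UnsupportedOperationException("Implemented by concrete coders");
  }
}

// A concrete coder simply extends the base class; a RawErasureEncoderSketch
// would look the same on the encode side.
class RSRawDecoderSketch extends RawErasureDecoderSketch {
  RSRawDecoderSketch(int numDataUnits, int numParityUnits) {
    super(numDataUnits, numParityUnits);
  }

  @Override
  public void decode(byte[][] inputs, int[] erasedIndexes, byte[][] outputs) {
    // Reed-Solomon decode would go here.
  }
}
{code}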

> Refactor raw erasure coders
> ---------------------------
>
>                 Key: HADOOP-13010
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13010
>             Project: Hadoop Common
>          Issue Type: Sub-task
>            Reporter: Kai Zheng
>            Assignee: Kai Zheng
>             Fix For: 3.0.0
>
>         Attachments: HADOOP-13010-v1.patch, HADOOP-13010-v2.patch
>
>
> This will refactor the raw erasure coders according to some comments 
> received so far.
> * As discussed in HADOOP-11540 and suggested by [~cmccabe], it is better 
> not to rely on class inheritance to reuse code; instead, the shared code 
> can be moved to a utility class.
> * As suggested by [~jingzhao] quite some time ago, it is better to have a 
> state holder that keeps some checking results for later reuse during an 
> encode/decode call.
> This will not remove any of the inheritance levels, since how to do so 
> isn't clear yet and would have a big impact. I do hope the end result of 
> this refactoring makes all the levels clearer and easier to follow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
