[ https://issues.apache.org/jira/browse/HADOOP-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243883#comment-15243883 ]

Colin Patrick McCabe commented on HADOOP-13010:
-----------------------------------------------

bq. Yes, you're right; I meant some data to be shared between multiple 
concurrent encode or decode operations. The data only makes sense for a coder 
instance (it binds a schema), so it's not suitable to be static; on the other 
hand, it's also specific to a single decode call, so it's not suitable to 
reside in the coder instance either.

Thanks for the explanation.  It sounds like {{Encoder}} and {{Decoder}} will be 
stateless once they're created.  Basically, they just reflect an algorithm and 
some data required to implement that algorithm.  That is reasonable.  We should 
document this in the JavaDoc for the classes.

I also agree that we should add a new class to represent the state which the 
Decoder / Encoder is manipulating.  The problem with calling this class 
{{DecoderState}} is that this name suggests that it is the state of the coder, 
rather than state manipulated by the coder.  Perhaps calling this 
{{DecoderData}} or {{DecoderStream}} is more appropriate.  Having this new 
class will also avoid the need to manually pass around so many arrays and 
indices.
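
For illustration, here is a minimal sketch of what such a per-call state 
holder might look like.  The class name {{DecodingState}} and its exact fields 
are hypothetical, just to make the idea concrete:

{code:java}
// A hypothetical per-call state holder (names are illustrative only).
// It bundles the buffers and indices that are currently passed around as
// separate method arguments.  A fresh instance would be created for each
// decode() call, so the Decoder itself stays stateless.
public class DecodingState {
  private final byte[][] inputs;        // input units; null entries are erased
  private final int[] erasedIndexes;    // which units need reconstruction
  private final byte[][] outputs;       // buffers to receive recovered units

  public DecodingState(byte[][] inputs, int[] erasedIndexes,
                       byte[][] outputs) {
    this.inputs = inputs;
    this.erasedIndexes = erasedIndexes;
    this.outputs = outputs;
  }

  public byte[][] getInputs() { return inputs; }
  public int[] getErasedIndexes() { return erasedIndexes; }
  public byte[][] getOutputs() { return outputs; }
}
{code}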

bq. AbstractRawErasureCoder maintains conf, schema-related info like 
numDataUnits, and coderOptions. It provides public methods (9 of them) to 
access these fields. All of these are essential to an erasure coder and common 
to both encoders and decoders. If we moved the variables and methods to a 
utility class, it wouldn't look better, and we would have to duplicate the 
methods across encoder and decoder.

These methods and fields are configuration methods and configuration fields.  
They belong in a class named something like {{ErasureEncodingConfiguration}}.  
I also feel that configuration should be immutable once it is created, since 
otherwise things get very messy.  We use this pattern in many other places in 
Hadoop: for example, in {{DfsClientConf}} and {{DNConf}}.
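
To make this concrete, here is a minimal sketch of an immutable configuration 
object in the style of {{DfsClientConf}}.  The name {{CoderConfiguration}} and 
the exact fields are hypothetical:

{code:java}
// A hypothetical immutable coder configuration (names illustrative only).
// All fields are final and set once in the constructor, so an instance can
// be shared freely between coders without locking or defensive copies.
public final class CoderConfiguration {
  private final int numDataUnits;
  private final int numParityUnits;
  private final boolean allowChangeInputs;

  public CoderConfiguration(int numDataUnits, int numParityUnits,
                            boolean allowChangeInputs) {
    this.numDataUnits = numDataUnits;
    this.numParityUnits = numParityUnits;
    this.allowChangeInputs = allowChangeInputs;
  }

  public int getNumDataUnits() { return numDataUnits; }
  public int getNumParityUnits() { return numParityUnits; }
  public boolean allowChangeInputs() { return allowChangeInputs; }

  @Override
  public String toString() {
    return "CoderConfiguration(numDataUnits=" + numDataUnits
        + ", numParityUnits=" + numParityUnits
        + ", allowChangeInputs=" + allowChangeInputs + ")";
  }
}
{code}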

Why is having the configuration in a separate object better than having the 
configuration in a base class?  A few reasons:
* It's easier to follow the flow of control.  You don't have to jump from file 
to file to figure out which method is actually getting called (any subclass 
could override the base-class methods we have now).
* It's obvious what a {{CoderConfiguration}} class does: it manages the 
configuration.  It's not obvious what the base class does without reading all 
of its source.
* The configuration class can have a way of printing itself, a natural benefit 
of object orientation (see the sketch below).

Many of these are the reasons why the Gang of Four recommended "*favor 
composition over inheritance*."
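
Putting the two sketches above together, a coder would then reuse the 
configuration by composition (owning an instance) rather than by extending a 
base class.  Again, the names are hypothetical; {{RSRawDecoder}} just stands 
in for any concrete coder:

{code:java}
// A hypothetical concrete coder that owns its configuration instead of
// inheriting accessor methods from a base class (names illustrative only).
public class RSRawDecoder {
  private final CoderConfiguration conf;   // composition, not inheritance

  public RSRawDecoder(CoderConfiguration conf) {
    this.conf = conf;
  }

  public void decode(DecodingState state) {
    // The decoding logic reads conf.getNumDataUnits() etc. directly;
    // there is no base-class method lookup to trace through.
  }

  @Override
  public String toString() {
    // The configuration prints itself, as noted in the list above.
    return "RSRawDecoder with " + conf;
  }
}
{code}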

bq. Yes, it's interesting. I just thought of an exact match for the current 
code. In the JRE SASL framework, there are interfaces SaslClient and 
SaslServer, abstract classes AbstractSaslImpl and GssKrb5Base, class 
GssKrb5Client extends GssKrb5Base implements SaslClient, and class 
GssKrb5Server extends GssKrb5Base implements SaslServer. I'm not sure we 
consciously followed that style, but I guess it could be a common pattern for 
somewhat complex situations. I thought that's why, when the code initially 
went in this way, people understood it and I heard no objections at the time.

We have to understand the reasons behind using a pattern.  The reason to 
separate interfaces from abstract classes is that some implementations of the 
interface may not want to use the code in the abstract class.  Since that is 
not the case here, it's not a good idea to copy this pattern.

bq. Generally, I have to admit that I'm more of an OOP guy and prefer to have 
clear constructs over concepts and abstractions, rather than mixed utilities. 
We can see many example utilities in the codebase that are rather lengthy and 
messy, which tends to break modularity. That's probably why I don't feel so 
comfortable getting rid of the level and replacing it with utilities here. I 
agree with you that sometimes composition is good for reusing code and 
avoiding complex inheritance relationships, but here we do have a coder 
concept, and it wouldn't be bad to have a construct for it.

Using composition doesn't mean putting everything into utilities.  Often, it 
means grouping related things into objects.  Instead of having 20 fields 
sprayed across a base class that other classes inherit from, you have a small 
number of classes such as Configuration, Log, etc. that other classes reuse by 
composition (owning an instance of them).

This also makes it easy to change the code later.  Changing inheritance 
relationships often breaks backwards compatibility.  Removing a field or adding 
a new one almost never does.

> Refactor raw erasure coders
> ---------------------------
>
>                 Key: HADOOP-13010
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13010
>             Project: Hadoop Common
>          Issue Type: Sub-task
>            Reporter: Kai Zheng
>            Assignee: Kai Zheng
>             Fix For: 3.0.0
>
>         Attachments: HADOOP-13010-v1.patch, HADOOP-13010-v2.patch
>
>
> This will refactor raw erasure coders according to some comments received so 
> far.
> * As discussed in HADOOP-11540 and suggested by [~cmccabe], better not to 
> rely on class inheritance to reuse code; instead, it can be moved to some 
> utility.
> * As suggested by [~jingzhao] quite some time ago, better to have a state 
> holder to keep some checking results for later reuse during an encode/decode 
> call.
> This would not get rid of any inheritance levels, as how to do so isn't 
> clear yet and it would also have a big impact. I do hope the end result of 
> this refactoring will make all the levels clearer and easier to follow.


