[
https://issues.apache.org/jira/browse/HADOOP-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243883#comment-15243883
]
Colin Patrick McCabe commented on HADOOP-13010:
-----------------------------------------------
bq. Yes, you're right; I meant some data to be shared between multiple concurrent
encode or decode operations. The data only makes sense for a coder instance
(it is bound to a schema), so it's not suitable to be static; on the other hand,
it's also specific to a decode call, so it's also not suitable to reside in the
coder instance.
Thanks for the explanation. It sounds like {{Encoder}} and {{Decoder}} will be
stateless once they're created. Basically, they just reflect an algorithm and
some data required to implement that algorithm. That is reasonable. We should
document this in the JavaDoc for the classes.
I also agree that we should add a new class to represent the state which the
Decoder / Encoder is manipulating. The problem with calling this class
{{DecoderState}} is that this name suggests that it is the state of the coder,
rather than state manipulated by the coder. Perhaps calling this
{{DecoderData}} or {{DecoderStream}} is more appropriate. Having this new
class will also avoid the need to manually pass around so many arrays and
indices.
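A holder class along these lines might look like the following sketch. The name {{DecoderData}} comes from the suggestion above, but the fields and accessors here are purely illustrative, not from the actual patch:

```java
// Illustrative sketch of a per-call state holder for a decode operation.
// It groups the arrays and indices that would otherwise be passed around
// as separate method arguments. All names here are hypothetical.
public class DecoderData {
  // Input chunks (data + parity); a null entry marks an erased unit.
  private final byte[][] inputs;
  // Indices of the erased units this decode call must reconstruct.
  private final int[] erasedIndexes;
  // Output buffers, one per erased unit.
  private final byte[][] outputs;

  public DecoderData(byte[][] inputs, int[] erasedIndexes, byte[][] outputs) {
    this.inputs = inputs;
    this.erasedIndexes = erasedIndexes;
    this.outputs = outputs;
  }

  public byte[][] getInputs() { return inputs; }
  public int[] getErasedIndexes() { return erasedIndexes; }
  public byte[][] getOutputs() { return outputs; }
}
```

A decode method could then take a single {{DecoderData}} argument instead of three parallel arrays, which also gives one obvious place to cache per-call validation results.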
bq. AbstractRawErasureCoder maintains conf, schema related info like
numDataUnits, and coderOptions. It provides public methods (9 of them) to access
these fields. All of these are essential to an erasure coder and common to both
encoders and decoders. If we move the variables and methods to a utility class,
it wouldn't look better, and we would have to duplicate the methods across the
encoder and decoder.
These methods and fields are configuration methods and configuration fields.
They belong in a class named something like {{ErasureEncodingConfiguration}}.
I also feel that configuration should be immutable once it is created, since
otherwise things get very messy. We use this pattern in many other cases in
Hadoop: for example in {{DfsClientConf}} and {{DNConf}}.
Why is having the configuration in a separate object better than having the
configuration in a base class? A few reasons:
* It's easier to follow the flow of control. You don't have to jump from file
to file to figure out which method is actually getting called (any subclass
could override the base class methods we have now).
* It's obvious what a CoderConfiguration class does. It manages the
configuration. It's not obvious what the base class does without reading all
of the source.
* The configuration class can have a way of printing itself (object-orientation).
Many of these are reasons why the Gang of Four recommended "*favor composition
over inheritance*."
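To sketch what the immutable configuration object described above could look like (all names and fields here are illustrative, not from the actual patch):

```java
// Illustrative sketch of an immutable configuration object, modeled loosely
// on the DfsClientConf pattern. Fields like numDataUnits mirror the schema
// info mentioned above; everything else is hypothetical.
public final class CoderConfiguration {
  private final int numDataUnits;
  private final int numParityUnits;
  private final boolean allowDirectBuffer;

  public CoderConfiguration(int numDataUnits, int numParityUnits,
                            boolean allowDirectBuffer) {
    this.numDataUnits = numDataUnits;
    this.numParityUnits = numParityUnits;
    this.allowDirectBuffer = allowDirectBuffer;
  }

  public int getNumDataUnits() { return numDataUnits; }
  public int getNumParityUnits() { return numParityUnits; }
  public boolean allowDirectBuffer() { return allowDirectBuffer; }

  // The configuration can print itself (the object-oriented point above).
  @Override
  public String toString() {
    return "CoderConfiguration(numDataUnits=" + numDataUnits
        + ", numParityUnits=" + numParityUnits
        + ", allowDirectBuffer=" + allowDirectBuffer + ")";
  }
}
```

Because every field is {{final}} and there are no setters, the object can be shared freely between encoders and decoders without synchronization concerns.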
bq. Yes, it's interesting. I just thought of an exact match for the current
code. In the JRE SASL framework, it has interfaces SaslClient and SaslServer,
abstract classes AbstractSaslImpl and GssKrb5Base, class GssKrb5Client extends
GssKrb5Base implements SaslClient, and class GssKrb5Server extends GssKrb5Base
implements SaslServer. I'm not sure we followed that style, but I guess it could
be a common pattern for a somewhat complex situation. I thought that's why, when
it initially went in this way, people understood the code and I heard no
objections then.
We have to understand the reasons behind using a pattern. The reason to
separate interfaces from abstract classes is that some implementations of the
interface may not want to use the code in the abstract class. Since that is
not the case here, it's not a good idea to copy this pattern.
bq. Generally, I have to admit that I'm more of an OOP guy and prefer to have
clear constructs for concepts and abstractions, rather than mixed utilities. We
can see many example utility classes in the codebase that are rather lengthy and
messy, which tends to break modularity. That's probably why I don't feel
comfortable getting rid of the level and replacing it with utilities here. I
agree with you that sometimes composition is good for reusing code and avoiding
complex inheritance relationships, but here we do have a coder concept, and
having a construct for it wouldn't be bad.
Using composition doesn't mean putting everything into utilities. Often, it
means grouping related things into objects. Instead of having 20 fields
sprayed into a base class that other classes inherit from, you have a small
number of utility classes such as Configuration, Log, etc. that other classes
reuse by composition (owning an instance of them).
This also makes it easy to change the code later. Changing inheritance
relationships often breaks backwards compatibility. Removing a field or adding
a new one almost never does.
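The composition approach described above can be sketched as follows; the class names here ({{RSRawEncoder}}, the nested {{Conf}}) are hypothetical stand-ins, not the actual Hadoop classes:

```java
// Sketch of composition: the encoder owns a small configuration object
// instead of inheriting its fields from an abstract base class.
// All names are hypothetical, for illustration only.
public class RSRawEncoder {
  // Minimal stand-in for the configuration class discussed above.
  public static final class Conf {
    final int numDataUnits;
    final int numParityUnits;
    public Conf(int numDataUnits, int numParityUnits) {
      this.numDataUnits = numDataUnits;
      this.numParityUnits = numParityUnits;
    }
  }

  private final Conf conf;  // reused by composition, not inheritance

  public RSRawEncoder(Conf conf) {
    this.conf = conf;
  }

  public void encode(byte[][] inputs, byte[][] outputs) {
    // Validation reads the owned configuration directly; there is no
    // base-class method that a subclass might have overridden.
    if (inputs.length != conf.numDataUnits) {
      throw new IllegalArgumentException("expected " + conf.numDataUnits
          + " data units, got " + inputs.length);
    }
    // ... the actual Reed-Solomon arithmetic is elided ...
  }
}
```

Reading {{encode}}, the flow of control is fully visible in one file, which is exactly the first bullet point above.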
> Refactor raw erasure coders
> ---------------------------
>
> Key: HADOOP-13010
> URL: https://issues.apache.org/jira/browse/HADOOP-13010
> Project: Hadoop Common
> Issue Type: Sub-task
> Reporter: Kai Zheng
> Assignee: Kai Zheng
> Fix For: 3.0.0
>
> Attachments: HADOOP-13010-v1.patch, HADOOP-13010-v2.patch
>
>
> This will refactor raw erasure coders according to some comments received so
> far.
> * As discussed in HADOOP-11540 and suggested by [~cmccabe], better not to
> rely on class inheritance to reuse code; instead, it can be moved to a
> utility class.
> * As suggested by [~jingzhao] quite some time ago, better to have a state
> holder to keep some checking results for later reuse during an
> encode/decode call.
> This will not remove any inheritance levels, since how to do so isn't clear
> yet and would have a large impact. I do hope the end result of this
> refactoring makes all the levels clearer and easier to follow.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)