[
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916076#action_12916076
]
Niels Basjes commented on MAPREDUCE-2094:
-----------------------------------------
bq. Changing the default implementation to return false would be an
incompatible change, potentially breaking existing subclasses.
If you mean with "breaking" : "Some subclasses will see an unexpected
performance degradation" . then Yes, that will most likely occur ( first one I
think of is SequenceFileInputFormat ).
I however do not expect any functional "breaking" of the output of these
subclasses.
bq. Making the method abstract would also be incompatible and break subclasses,
but in a way that they'd easily detect.
Yes.
The downside of this option is that if subclasses want to have detection
depending on the compression I expect a lot of code duplication to occur.
This code duplication is already occurring within the main code base in
KeyValueTextInputFormat , TextInputFormat, CombineFileInputFormat and their old
API counterparts (I found a total of 5 duplicates of the same isSplittable
implementation).
bq. Perhaps the javadoc should just be clarified to better document this
default?
Definitely an option. However this would not fix the effect in the existing
subclasses.
I just did a quick manual code check of the current trunk and I found that the
following classes are derived from FileInputFormat yet do not implement the
isSplittable method (and thus use "return true").
* ./src/java/org/apache/hadoop/mapreduce/lib/input/NLineInputFormat.java
*
./src/contrib/streaming/src/java/org/apache/hadoop/streaming/AutoInputFormat.java
* ./src/java/org/apache/hadoop/mapred/SequenceFileInputFormat.java
I expect that the NLineInputFormat and the AutoInputFormat will affected by
this "large gzip" bug.
So expect that simply fixing the isSplittable documentation would lead to the
need to fix *at least* these two classes.
As far as I understand the SequenceFileInputFormat can only be compressed using
a "splittable" compression, so the "return true;" from FileInputFormat will
work fine there.
Overall I still prefer the clean option of returning the correct value
depending on the compression. That would effectively leave the behavior in most
use cases unchanged. Yet in those cases where splitting is known to cause
problems it would avoid those problems. Thus avoiding major issues like the
ones we had and described in HADOOP-6901.
For the SequenceFileInputFormat it may be needed to implement the isSplittable
as "return true;"
Effectively the set of changes I propose (in both the old and new API versions
of these classes):
1) FileInputFormat . isSplittable gets the implementation as seen in
TextInputFormat
2) The isSplittable implementation is removed from KeyValueTextInputFormat ,
TextInputFormat, CombineFileInputFormat (useless code duplication)
3) The isSplittable implementation "return true" is added to
SequenceFileInputFormat. Given the fact that you cannot gzip a sequencefile I
expect this to be an optional fix.
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements
> unsafe default behaviour that is different from the documented behaviour.
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-2094
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: task
> Affects Versions: 0.20.1, 0.20.2, 0.21.0
> Reporter: Niels Basjes
>
> When implementing a custom derivative of FileInputFormat we ran into the
> effect that a large Gzipped input file would be processed several times.
> A near 1GiB file would be processed around 36 times in its entirety. Thus
> producing garbage results and taking up a lot more CPU time than needed.
> It took a while to figure out and what we found is that the default
> implementation of the isSplittable method in
> [org.apache.hadoop.mapreduce.lib.input.FileInputFormat |
> http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup
> ] is simply "return true;".
> This is a very unsafe default and is in contradiction with the JavaDoc of the
> method which states: "Is the given filename splitable? Usually, true, but if
> the file is stream compressed, it will not be. " . The actual implementation
> effectively does "Is the given filename splitable? Always true, even if the
> file is stream compressed using an unsplittable compression codec. "
> For our situation (where we always have Gzipped input) we took the easy way
> out and simply implemented an isSplittable in our class that does "return
> false; "
> Now there are essentially 3 ways I can think of for fixing this (in order of
> what I would find preferable):
> # Implement something that looks at the used compression of the file (i.e. do
> migrate the implementation from TextInputFormat to FileInputFormat). This
> would make the method do what the JavaDoc describes.
> # "Force" developers to think about it and make this method abstract.
> # Use a "safe" default (i.e. return false)
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.