[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916076#action_12916076
 ] 

Niels Basjes commented on MAPREDUCE-2094:
-----------------------------------------

bq. Changing the default implementation to return false would be an 
incompatible change, potentially breaking existing subclasses. 

If you mean with "breaking" :  "Some subclasses will see an unexpected 
performance degradation" . then Yes, that will most likely occur ( first one I 
think of is SequenceFileInputFormat ).
I however do not expect any functional "breaking" of the output of these 
subclasses.

bq. Making the method abstract would also be incompatible and break subclasses, 
but in a way that they'd easily detect. 

Yes. 
The downside of this option is that if subclasses want to have detection 
depending on the compression I expect a lot of code duplication to occur. 
This code duplication is already occurring within the main code base in 
KeyValueTextInputFormat , TextInputFormat, CombineFileInputFormat and their old 
API counterparts (I found a total of 5 duplicates of the same isSplittable 
implementation).

bq. Perhaps the javadoc should just be clarified to better document this 
default?

Definitely an option. However this would not fix the effect in the existing 
subclasses.

I just did a quick manual code check of the current trunk and I found that the 
following classes are derived from FileInputFormat yet do not  implement the 
isSplittable method (and thus use "return true").
* ./src/java/org/apache/hadoop/mapreduce/lib/input/NLineInputFormat.java
* 
./src/contrib/streaming/src/java/org/apache/hadoop/streaming/AutoInputFormat.java
* ./src/java/org/apache/hadoop/mapred/SequenceFileInputFormat.java

I expect that the NLineInputFormat and the AutoInputFormat will affected by 
this "large gzip" bug.
So expect that simply fixing the isSplittable documentation would lead to the 
need to fix *at least* these two classes.

As far as I understand the SequenceFileInputFormat can only be compressed using 
a "splittable" compression, so the "return true;" from FileInputFormat will 
work fine there.

Overall I still prefer the clean option of  returning the correct value 
depending on the compression. That would effectively leave the behavior in most 
use cases unchanged. Yet in those cases where splitting is known to cause 
problems it would avoid those problems. Thus avoiding major issues like the 
ones we had and described in HADOOP-6901.
For the SequenceFileInputFormat it may be needed to implement the isSplittable 
as "return true;"

Effectively the set of changes I propose (in both the old and new API versions 
of these classes):
1) FileInputFormat . isSplittable  gets the implementation as seen in 
TextInputFormat
2) The isSplittable implementation is removed from  KeyValueTextInputFormat , 
TextInputFormat, CombineFileInputFormat (useless code duplication)
3) The isSplittable implementation "return true" is added to 
SequenceFileInputFormat. Given the fact that you cannot gzip a sequencefile I 
expect this to be an optional fix.


> org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements 
> unsafe default behaviour that is different from the documented behaviour.
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2094
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task
>    Affects Versions: 0.20.1, 0.20.2, 0.21.0
>            Reporter: Niels Basjes
>
> When implementing a custom derivative of FileInputFormat we ran into the 
> effect that a large Gzipped input file would be processed several times. 
> A near 1GiB file would be processed around 36 times in its entirety. Thus 
> producing garbage results and taking up a lot more CPU time than needed.
> It took a while to figure out and what we found is that the default 
> implementation of the isSplittable method in 
> [org.apache.hadoop.mapreduce.lib.input.FileInputFormat | 
> http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup
>  ] is simply "return true;". 
> This is a very unsafe default and is in contradiction with the JavaDoc of the 
> method which states: "Is the given filename splitable? Usually, true, but if 
> the file is stream compressed, it will not be. " . The actual implementation 
> effectively does "Is the given filename splitable? Always true, even if the 
> file is stream compressed using an unsplittable compression codec. "
> For our situation (where we always have Gzipped input) we took the easy way 
> out and simply implemented an isSplittable in our class that does "return 
> false; "
> Now there are essentially 3 ways I can think of for fixing this (in order of 
> what I would find preferable):
> # Implement something that looks at the used compression of the file (i.e. do 
> migrate the implementation from TextInputFormat to FileInputFormat). This 
> would make the method do what the JavaDoc describes.
> # "Force" developers to think about it and make this method abstract.
> # Use a "safe" default (i.e. return false)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to