On 28 May 2014 20:50, Niels Basjes <ni...@basjes.nl> wrote:

> Hi,
>
> Last week I ran into this problem again
> https://issues.apache.org/jira/browse/MAPREDUCE-2094
>
> What happens here is that the default implementation of the isSplitable
> method in FileInputFormat is so unsafe that just about everyone who
> implements a new subclass is likely to get this wrong. The effect of
> getting this wrong is that all unit tests succeed and running it against
> 'large' input files (>>64MiB) that are compressed using a non-splittable
> compression (often Gzip) will cause the input to be fed into the mappers
> multiple time (i.e. you get garbage results without ever seeing any
> errors).
>
> Last few days I was at Berlin buzzwords talking to someone about this bug
>

that was me, I recall.


> and this resulted in the following proposal which I would like your
> feedback on.
>
> 1) This is a change that will break backwards compatibility (deliberate
> choice).
> 2) The FileInputFormat will get 3 methods (the old isSplitable with the
> typo of one 't' in the name will disappear):
>     (protected) isSplittableContainer --> true unless compressed with
> non-splittable compression.
>     (protected) isSplittableContent --> abstract, MUST be implemented by
> the subclass
>     (public)      isSplittable --> isSplittableContainer &&
> isSplittableContent
>
> The idea is that only the isSplittable is used by other classes to know if
> this is a splittable file.
> The effect I hope to get is that a developer writing their own
> fileinputformat (which I alone have done twice so far) is 'forced' and
> 'helped' getting this right.
>

I could see making the attributes more explicit would be good -but stopping
everything that exists from working isn't going to fly.

what about some subclass, AbstractSplittableFileInputFormat that implements
the container properly, requires that content one -and then calculates
IsSplitable() from the results? Existing code: no change, new formats can
descend from this (and built in ones retrofitted).



> The reason for me to propose this as an incompatible change is that this
> way I hope to eradicate some of the existing bugs in custom implementations
> 'out there'.
>
> P.S. If you agree to this change then I'm willing to put my back into it
> and submit a patch.
>
> --
> Best regards,
>
> Niels Basjes
>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Reply via email to