As someone who doesn't really contribute, just lurks, I could well be misinformed or under-informed, but I don't see why we can't deprecate a method which could cause dangerous side effects? People can still use the deprecated methods for backwards compatibility, but are discouraged by compiler warnings, and any changes they write to their code can start to use the new functionality?
*Apologies if I'm stepping into a Hadoop holy war here On Thu, May 29, 2014 at 10:47 AM, Niels Basjes <ni...@basjes.nl> wrote: > My original proposal (from about 3 years ago) was to change the isSplitable > method to return a safe default ( you can see that in the patch that is > still attached to that Jira issue). > For arguments I still do not fully understand this was rejected by Todd and > Doug. > > So that is why my new proposal is to deprecate (remove!) the old method > with the typo in Hadoop 3.0 and replace it with something correct and less > error prone. > Given the fact that this would happen in a major version jump I thought > that would be the right time to do that. > > Niels > > > On Thu, May 29, 2014 at 11:34 AM, Steve Loughran <ste...@hortonworks.com > >wrote: > > > On 28 May 2014 20:50, Niels Basjes <ni...@basjes.nl> wrote: > > > > > Hi, > > > > > > Last week I ran into this problem again > > > https://issues.apache.org/jira/browse/MAPREDUCE-2094 > > > > > > What happens here is that the default implementation of the isSplitable > > > method in FileInputFormat is so unsafe that just about everyone who > > > implements a new subclass is likely to get this wrong. The effect of > > > getting this wrong is that all unit tests succeed and running it > against > > > 'large' input files (>>64MiB) that are compressed using a > non-splittable > > > compression (often Gzip) will cause the input to be fed into the > mappers > > > multiple time (i.e. you get garbage results without ever seeing any > > > errors). > > > > > > Last few days I was at Berlin buzzwords talking to someone about this > bug > > > > > > > that was me, I recall. > > > > > > > and this resulted in the following proposal which I would like your > > > feedback on. > > > > > > 1) This is a change that will break backwards compatibility (deliberate > > > choice). > > > 2) The FileInputFormat will get 3 methods (the old isSplitable with the > > > typo of one 't' in the name will disappear): > > > (protected) isSplittableContainer --> true unless compressed with > > > non-splittable compression. > > > (protected) isSplittableContent --> abstract, MUST be implemented > by > > > the subclass > > > (public) isSplittable --> isSplittableContainer && > > > isSplittableContent > > > > > > The idea is that only the isSplittable is used by other classes to know > > if > > > this is a splittable file. > > > The effect I hope to get is that a developer writing their own > > > fileinputformat (which I alone have done twice so far) is 'forced' and > > > 'helped' getting this right. > > > > > > > I could see making the attributes more explicit would be good -but > stopping > > everything that exists from working isn't going to fly. > > > > what about some subclass, AbstractSplittableFileInputFormat that > implements > > the container properly, requires that content one -and then calculates > > IsSplitable() from the results? Existing code: no change, new formats can > > descend from this (and built in ones retrofitted). > > > > > > > > > The reason for me to propose this as an incompatible change is that > this > > > way I hope to eradicate some of the existing bugs in custom > > implementations > > > 'out there'. > > > > > > P.S. If you agree to this change then I'm willing to put my back into > it > > > and submit a patch. > > > > > > -- > > > Best regards, > > > > > > Niels Basjes > > > > > > > -- > > CONFIDENTIALITY NOTICE > > NOTICE: This message is intended for the use of the individual or entity > to > > which it is addressed and may contain information that is confidential, > > privileged and exempt from disclosure under applicable law. If the reader > > of this message is not the intended recipient, you are hereby notified > that > > any printing, copying, dissemination, distribution, disclosure or > > forwarding of this communication is strictly prohibited. If you have > > received this communication in error, please contact the sender > immediately > > and delete it from your system. Thank You. > > > > > > -- > Best regards / Met vriendelijke groeten, > > Niels Basjes > -- [image: cid:1CBF4038-3F0F-4FC2-A1FF-6DC81B8B6F94] First Option Software Ltd Signal House Jacklyns Lane Alresford SO24 9JJ Tel: +44 (0)1962 738232 Mob: +44 (0)7710 160458 Fax: +44 (0)1962 600112 Web: www.b <http://www.fosolutions.co.uk/>espokesoftware.com <http://bespokesoftware.com/> -- ____________________________________________________ This is confidential, non-binding and not company endorsed - see full terms at www.fosolutions.co.uk/emailpolicy.html First Option Software Ltd Registered No. 06340261 Signal House, Jacklyns Lane, Alresford, Hampshire, SO24 9JJ, U.K. ____________________________________________________