Over the last few days, we have had lots of discussions that have intertwined several major themes:
# When/why do we make major Hadoop releases?
# When/how do we move to major JDK versions?
# To a lesser extent, we have debated another theme: what do we do about trunk?

For now, let's park JDK & trunk and treat them in separate thread(s).

For a while now, I've had a couple of lampposts in my head which I have used for guidance. I apologize for not sharing them more broadly before this discussion; maybe putting them out here will help - I certainly hope so.

Major Releases

Hadoop continues to benefit tremendously from the investment in stability, validation etc. put in by its *anchor* users: Yahoo, Facebook, Twitter, eBay, LinkedIn etc.

A historical perspective...

In its lifetime, Apache Hadoop went from monthly to quarterly releases because, as Hadoop became more and more of a production system (starting with hadoop-0.16 and more so with hadoop-0.18), users could not absorb the torrid pace of change.

IMHO, we didn't go far enough in addressing the competing pressures of stability vs. rapid innovation. We paid for it by losing one of our anchor users - Facebook - around the time of hadoop-0.19 - they just forked.

Around the same time, Yahoo hit the same problem (I know, I lived through it painfully), got stuck with hadoop-0.20 for a *very* long time, and forked to add Security rather than deal with the next major release (hadoop-0.21). Later on, Facebook did the same, and, unfortunately for the community, is stuck - probably forever - on their fork of hadoop-0.20.

Overall, these were dark days for the community: every anchor user was on their own fork, and it took a toll on the project.

Recently, thankfully for Hadoop, we have had a period of relative stability with hadoop-1.x and hadoop-2.x. Even so, there were close shaves: Yahoo was on hadoop-0.23 for a *very* long time - in fact, they are only just now finishing their migration to hadoop-2.x.
I think the major lessons here are the obvious ones:
# Compatibility matters.
# Maintaining multiple major releases in parallel is a big problem - it leads to an unproductive, and risky, split in community investment along different lines.

Looking Ahead

Given the above, here are some thoughts for looking ahead:

# Be very conservative about major releases - a major benefit (features) is required to justify the cost. Let's not compel our anchor users like Yahoo, Twitter, eBay, and LinkedIn to invest in previous releases rather than the latest one. Let's hear more from them - and let's be very accommodating to them - for they play a key role in keeping Hadoop healthy & stable.
# Be conservative about dropping support for JDKs. In particular, let's hear from our anchor users on their plans for adopting jdk-1.8. LinkedIn has already moved to jdk-1.8, which is great for validation, but let's wait for the rest of our anchor users to move before we drop jdk-1.7. We did the same thing with jdk-1.6 - we waited for them to move before we dropped support for it.

Overall, I'd love to hear more from Twitter, Yahoo, eBay and other anchor users on their plans for jdk-1.8 specifically, and on their overall appetite for hadoop-3. Let's not finalize our plans for moving forward until this input has been considered.

Thoughts?

thanks,
Arun

Disclaimers (unfortunate that they are necessary):
# Before people point out vendor affiliations to lend unnecessary color to my opinions, let me state that hadoop-2 vs. hadoop-3 is a non-issue for us. For major HDP versions the key is just compatibility, e.g. we ship major, but compatible, community releases such as hive-0.13/hive-0.14 in HDP-2.x/HDP-2.x+1 etc.
# Also, release management is a similar non-issue - we have already had several individuals step up in the hadoop-2.x line. Expect more of the same from folks like Andrew, Karthik, Vinod, Steve etc.