Over the last few days, we have had lots of discussions that have intertwined 
several major themes:



# When/why do we make major Hadoop releases?

# When/how do we move to major JDK versions?

# To a lesser extent, we have debated another theme: what do we do about trunk?



For now, let's park JDK & trunk and treat them in separate thread(s).



For a while now, I've had a couple of lampposts in my head which I used for 
guidance - I apologize for not sharing this broadly prior to this discussion. 
Maybe putting it out here will help - I certainly hope so.





Major Releases



Hadoop continues to benefit tremendously from the investment in stability, 
validation etc. put in by its *anchor* users: Yahoo, Facebook, Twitter, eBay, 
LinkedIn etc.



A historical perspective...



In its lifetime, Apache Hadoop went from monthly to quarterly releases 
because, as Hadoop became more and more of a production system (starting with 
hadoop-0.16 and more so with hadoop-0.18), users could not absorb the torrid 
pace of change.



IMHO, we didn't go far enough in addressing the competing pressures of 
stability v/s rapid innovation.  We paid for it by losing one of our anchor 
users - Facebook - around the time of hadoop-0.19 - they just forked.



Around the same time, Yahoo hit the same problem (I know, I lived through it 
painfully) and got stuck with hadoop-0.20 for a *very* long time and forked to 
add Security rather than deal with the next major release (hadoop-0.21). Later 
on, Facebook did the same, and, unfortunately for the community, is stuck - 
probably forever - on their fork of hadoop-0.20.



Overall, these were dark days for the community: every anchor user was on their 
own fork, and it took a toll on the project.



Recently, thankfully for Hadoop, we have had a period of relative stability 
with hadoop-1.x and hadoop-2.x. Even so, there were close shaves: Yahoo was on 
hadoop-0.23 for a *very* long time - in fact, they are only just now finishing 
their migration to hadoop-2.x.



I think the major lessons here are the obvious ones:



# Compatibility matters

# Maintaining multiple major releases, in parallel, is a big problem - it 
leads to an unproductive, and risky, split in community investment along 
different lines.





Looking Ahead



Given the above, here are some thoughts for looking ahead:



# Be very conservative about major releases - the cost of a major release is 
only justified by a major benefit (features). Let's not compel our anchor users 
like Yahoo, Twitter, eBay, and LinkedIn to invest in previous releases rather 
than the latest one. Let's hear more from them - and let's be very 
accommodating to them - for they play a key role in keeping Hadoop healthy & 
stable.



# Be conservative about dropping support for JDKs. In particular, let's hear 
from our anchor users on their plans for adopting jdk-1.8. LinkedIn has already 
moved to jdk-1.8, which is great for validation, but let's wait for the 
rest of our anchor users to move before we drop jdk-1.7. We did the same thing 
with jdk-1.6 - waited for them to move before we dropped support for jdk-1.6.



Overall, I'd love to hear more from Twitter, Yahoo, eBay and other anchor users 
on their plans for jdk-1.8 specifically, and on their overall appetite for 
hadoop-3.  Let's not finalize our plans for moving forward until this input has 
been considered.



Thoughts?


thanks,
Arun



Disclaimers (unfortunate that they're necessary):

# Before people point out vendor affiliations to lend unnecessary color to my 
opinions, let me state that hadoop-2 v/s hadoop-3 is a non-issue for us. For 
major HDP versions the key is just compatibility, e.g. we ship major, but 
compatible, community releases such as hive-0.13/hive-0.14 in HDP-2.x/HDP-2.x+1 
etc.

# Also, release management is a similar non-issue - we have already had several 
individuals step up in the hadoop-2.x line. Expect more of the same from folks 
like Andrew, Karthik, Vinod, Steve etc.
