Even with the work on hadoop-0.22 (trunk) starting in earnest, it is fairly obvious, given our past history, that it will take a while for us to get it stable and deployable; for example, it took us nearly 6 months to deploy hadoop-0.20.

In the interim, I'd like to propose we push a hadoop-0.20-security release off the Yahoo! patchset (http://github.com/yahoo/hadoop-common). This will ensure the community benefits *now* from all the work done at Yahoo! for over 12 months, and that we do not have to wait until hadoop-0.22, which has all of these patches.

Some salient aspects:
a) Full-fledged security implementation deployed at scale (4000 nodes) in production.
b) Lots of work on stabilizing and optimizing the NameNode and JobTracker for over 12 months. This has been critical in deploying Hadoop at scale, i.e. clusters of 4000 nodes. For example, we have a 50% improvement in CPU utilization on the JobTracker vis-a-vis the hadoop-0.20.2 release.
c) Several new features in the scheduler (CapacityScheduler), the Map-Reduce framework, better support for multi-tenancy, etc.
d) Several performance and stability improvements to the system, e.g. iterative ls, robustness against rogue clients/jobs/users, etc.

Also, given the huge number of features and enhancements, I'd like to propose we create a new 0.20-security branch and commit the Yahoo! patchset there for the release.

This was proposed earlier by Doug and did not get far due to concerns about the effect it would have on development on trunk. However, I believe we have a case for demonstrable progress on trunk now, and it would be useful to have an interim, fully-tested Apache Hadoop release available to the community.

Conceivably, one could imagine a Hadoop Security + Append release soon after. At this point, a Hadoop Security release alone would add tremendous value for the reasons above. For now, we would like to get this release out quickly so that we can focus the majority of our efforts on trunk.

Thoughts?

Arun
