I'm going to split this out and raise it as a separate issue. On 29 August 2012 19:35, Jun Ping Du <j...@vmware.com> wrote:
> Hi Chris and all,
>
> Thanks for initiating the discussion. Can I say something in a prospective of contributor but not a committer or PMC member?
>
> First, I have a feeling that current hadoop project process is good for contributors to deliver a bug fix but not so easy to deliver a big feature. I have great experience in bug fixing work that can get quickly response from committers and checked in. However, I feel a little frustrated in delivering a feature (~5K LOC, very important for hadoop running well on virtualization infrastructure) across common, hdfs, map reduce and yarn. Firstly, you have to figure out different committers you should turn for help on each component, then convince them your ideas and work with them in reviewing and committing the code. Each committers should understand the completed story and learn the code pending on review as well as that already checked in. If some committers are super busy, then the feature looks like pending forever. Thus, due to my current experience, I may have to say this process is not so friendly to contributors who come from different organizations with different backgrounds but have the same wish to contribute more to Apache hadoop.

One of the problems here is that a 5KLOC patch is a major change, and regardless of whether you are a committer or not, you're going to hit a lot of inertia. My fairly large service lifecycle patch (https://issues.apache.org/jira/browse/HDFS-326) never survived, and I put a lot of effort in there as a committer. That was with something I was visibly doing in a branch of Apache SVN: merging and regression testing every week, syncing things, testing on my own infrastructure, etc.

Turning up with a large diff without any previous involvement in the project, or any collaborative development, is going to hit a wall in pretty much every OSS project. The big issues are not just "why" and "what does it break", but "how is a patch this big going to be maintained?" and "how is it going to be tested on anything other than the specific platform it's been worked on?"

Any test plan that requires custom hardware, infrastructure &c is tricky. It's hard enough making the jump from the normal test suite to testing with real workloads on production-scale clusters; if you start needing specific CPU designs, GPUs, or a non-standard OS/JVM, it becomes impossible to regression test these for a release.

To make things worse, Hadoop is a critical piece of so many companies' infrastructure: Yahoo!, Facebook, Twitter, LinkedIn, &c. The value of the code is not the cost of implementation, it is the value of all the data stored in HDFS. This is why the barrier to entry for code is much, much lower in contrib/ than it is in the core, and the normal way to isolate work is to design another extension point into which these things can go, where people can be confident that changes won't break things, and where someone else takes on the costs of maintaining and testing their custom extensions.

> Based on this, for spinning out hadoop sub-project to TLPs, I would glad to see we will have concisely committer list for each projects then committers can be more focus (more bandwidth may be?) and contributors can know who they should turn to get quick response and help there.
> On the other hand, I would concern it may take more complexity to dependencies for features that across sub-project today as you should figure out branches for each TLP but it is hard to estimate when code can come alive in each branch of TLP (may take the similar complexity to committers as well).
>
> I don't have many good suggestions but would be glad to see the process can be more smoothly for contributor's work no matter what decision we are making today. Just 2 cents.

I do agree we need a better way for larger activities that span more of the system to be developed and then successfully committed. Some of the what-not-to-do & what-to-do has been hinted at near the bottom of Defining Hadoop (http://wiki.apache.org/hadoop/Defining%20Hadoop), but there's no formalisation of how to do larger pieces of work within the Hadoop codebase.

Of the big changes that have worked:

1. HDFS 2's HA and its ongoing improvements: collaborative dev on the list, with incremental changes going into trunk, RTC with lots of tests. This isn't finished, and the test problem there is that functional testing of all failure modes requires software-controlled fencing devices and switches, and tests to generate the expected failure space.

2. YARN: Arun on his own branch, CTR, merge once mostly stable, and completely replacing MRv1.

How then do we get (a) more dev projects being worked on and integrated by the current committers, and (b) a process in which people who are not yet contributors/committers can develop non-trivial changes to the project in a way that is done with the knowledge, support and mentorship of the rest of the community?

This topic has arisen before, and never reached a good answer. How can we incubate new pieces of work in the project and mentor external contributions?

-steve