On 05/05/11 10:51, Tony Valderrama wrote:
Hi, I just wanted to drop in a few thoughts from a new developer
working outside of the Hadoop developer community.
On Wed, May 4, 2011 at 7:39 PM, Eric Yang wrote:
While the world demands agility, the "review then commit" process is preventing
progress. People end up having to generate multiple versions of patches to ensure
the code can be applied. The long lag between patch generation and review
is taking a significant toll on the community and on progress.
Yahoo has a great team of developers who improve Hadoop at a faster pace with
their own fork of the source code. Yahoo was able to deliver features faster
because it could use source code repository tools properly. Unfortunately
for Yahoo, their source code repository was not Apache svn trunk.
I agree that the review process is broken. However, the current
situation is exactly the result of a lack of adherence to this and
other processes. Various subgroups within the community have
(intentionally or unintentionally) hijacked the project at different
times by avoiding community processes in the interest of agility or
commercial benefit, and the result is a highly fragmented project with
no clear direction.
From the outside, Hadoop looks like a Yahoo/Cloudera project which
occasionally gets an Apache stamp. Given the lack of adherence to
processes, as a non-Yahoo/Cloudera developer I have no way of breaking
into the development community. Who's going to review or commit
patches I submit? And which of the myriad versions should I even be
trying to patch against? And given the speed with which undocumented
changes are being made, how am I supposed to figure out if my changes
are going to be relevant or viable next week? We'd love to contribute
back, but it's just not clear that we or other small players have any
place within the Hadoop developer community.
As someone who has commit rights but undercommits, here are my issues:
-I am not full time on Hadoop; I have little time to keep my own code
up to date, let alone review patches
-I am not fully up to date with all the changes or subtleties in what
is a big, complicated system
-I don't want to break the big systems (Y!, Facebook) by introducing
changes that work on my network and my (small, dynamic) clusters but
which place limitations on scale. It's why I prefer review by those
people who do work on large scale projects.
Use JIRA if there is a large feature set that requires brainstorming, and
developers should have the ability to make small incremental changes without RTC.
This will ensure developers help each other rather than policing each other.
As an outsider, JIRA is the only way I've been able to follow the
changes to Hadoop's code and guess where the project is heading.
Permitting developers to commit without review or documentation will
just further exclude anyone who can't walk down the hall and knock on
an office door to ask about a commit.
I've worked in other ASF projects (Axis) where some large dev teams
(IBM) used to make decisions in team meetings and propagate them. It's
faster, but less community-centric, and when a large dev team (IBM) gets
re-assigned internally everyone is left not just scrambling to catch up
engineering-wise, but also to make sense of big chunks of
under-documented code. The JIRA-based review process at least provides
a discussion log, and Hudson/Jenkins checks that there are tests,
no extra warnings, etc.
What could be interesting would be
-a move to Git to make it easier to pull in patches from other
branches, and for people like Tony to have their own fork under SCM.
-adoption of Gerrit for having each JIRA issue move from being a patch
to a branch (local or remote), so that people can develop the code for
an issue, others can pull it in and merge it, and so that the issue
tracks live code, not dead patches
-more testing of trunk in bigger real/virtual clusters
I don't know how we can do this; I'd love to hear about experiences
others have with such a process.