Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread bc Wong
Hi Matei,

Good call on scaling the project itself. Identifying domain experts in
different areas is a good thing. But I have some questions about the
implementation. Here's my understanding of the proposal:

(1) The PMC votes on a list of components and their maintainers. Changes to
that list require PMC approval.
(2) No committer shall commit changes to a component without a +1 from a
maintainer of that component.

I see good reasons for #1, to help people navigate the project and identify
expertise. For #2, I'd like to understand what problem it's trying to
solve. Do we have rogue committers committing to areas that they don't know
much about? If that's the case, we should address it directly, instead of
adding new processes.

To point out the obvious, this completely changes what being a committer
means in Spark. Do we have clear promotion criteria from committer to
maintainer? Is there a max number of maintainers per area? Currently, as
committers gain expertise in new areas, they can start reviewing code in
those areas and give +1. This encourages more contributions and
cross-component knowledge sharing. Under the new proposal, they now have to
be promoted to maintainers first. That reduces our review bandwidth.

Again, if there is a quality issue with code reviews, let's talk to those
committers and help them do better. There are non-process ways to solve the
problem.

So I think we shouldn't require maintainer +1. I do like the idea of
having explicit maintainers on a volunteer basis. These maintainers should
watch their JIRA and PR traffic, and be very active in design & API
discussions. That leads to better consistency and long-term design choices.

Cheers,
bc

On Wed, Nov 5, 2014 at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Hi all,

 I wanted to share a discussion we've been having on the PMC list, as well
 as call for an official vote on it on a public list. Basically, as the
 Spark project scales up, we need to define a model to make sure there is
 still great oversight of key components (in particular internal
 architecture and public APIs), and to this end I've proposed implementing a
 maintainer model for some of these components, similar to other large
 projects.

 As background on this, Spark has grown a lot since joining Apache. We've
 had over 80 contributors/month for the past 3 months, which I believe makes
 us the most active project in contributors/month at Apache, as well as over
 500 patches/month. The codebase has also grown significantly, with new
 libraries for SQL, ML, graphs and more.

 In this kind of large project, one common way to scale development is to
 assign maintainers to oversee key components, where each patch to that
 component needs to get sign-off from at least one of its maintainers. Most
 existing large projects do this -- at Apache, some large ones with this
 model are CloudStack (the second-most active project overall), Subversion,
 and Kafka, and other examples include Linux and Python. This is also
 by-and-large how Spark operates today -- most components have a de-facto
 maintainer.

 IMO, adopting this model would have two benefits:

 1) Consistent oversight of design for that component, especially regarding
 architecture and API. This process would ensure that the component's
 maintainers see all proposed changes and consider whether they fit together
 in a good way.

 2) More structure for new contributors and committers -- in particular, it
 would be easy to look up who’s responsible for each module and ask them for
 reviews, etc., rather than having patches slip through the cracks.

 We'd like to start in a light-weight manner, where the model only
 applies to certain key components (e.g. scheduler, shuffle) and user-facing
 APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand
 it if we deem it useful. The specific mechanics would be as follows:

 - Some components in Spark will have maintainers assigned to them, where
 one of the maintainers needs to sign off on each patch to the component.
 - Each component with maintainers will have at least 2 maintainers.
 - Maintainers will be assigned by the PMC from the most active and
 knowledgeable committers on that component. The PMC can vote to add / remove
 maintainers, and maintained components, through consensus.
 - Maintainers are expected to be active in responding to patches for their
 components, though they do not need to be the main reviewers for them (e.g.
 they might just sign off on architecture / API). To prevent inactive
 maintainers from blocking the project, if a maintainer isn't responding in
 a reasonable time period (say 2 weeks), other committers can merge the
 patch, and the PMC will want to discuss adding another maintainer.

 If you'd like to see examples for this model, check out the following
 projects:
 - CloudStack:
 https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
 
 

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread bc Wong
On Thu, Nov 6, 2014 at 11:25 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
<snip>

 Ultimately, the core motivation is that the project has grown to the point
 where it's hard to expect every committer to have full understanding of
 every component. Some committers know a ton about systems but little about
 machine learning, some are algorithmic whizzes but may not realize the
 implications of changing something on the Python API, etc. This is just a
 way to make sure that a domain expert has looked at the areas where it is
 most likely for something to go wrong.


Hi Matei,

I understand where you're coming from. My suggestion is to solve this
without adding a new process. In the example above, those algo-whiz
committers should realize that they're touching the Python API and loop in
some Python maintainers. Those Python maintainers would then respond and
help move the PR along. This is good hygiene and should already be
happening. For example, HDFS committers have commit rights to all of
Hadoop. But none of them would check in YARN code without getting agreement
from the YARN folks.

I think the majority of the effort here will be education and building the
convention. We have to ask committers to watch out for API changes, know
their own limits, and involve the component domain experts. We need that
anyway, which btw also seems to solve the problem. It's not clear what the
new process would add.

It'd be good to know the details, too. What are the exact criteria for a
committer to be promoted to maintainer? How often does the PMC
re-evaluate the list of maintainers? Is there an upper bound on the number
of maintainers for a component? Can we have an automatic rule for
maintainer promotion after X patches or Y lines of code in that area?

Cheers,
bc

On Nov 6, 2014, at 10:53 AM, bc Wong bcwal...@cloudera.com wrote:

 <snip>