In CloudStack, I believe one first becomes a maintainer for a subset of 
modules before becoming a proven maintainer with commit rights on the 
entire source tree.

So would it make sense to go that route, and have committers voted in as 
maintainers for certain parts of the codebase before eventually becoming 
proven maintainers? (This might have to be honor-code based, since I don't 
think git allows per-module commit rights.)
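
(As an aside, stock git indeed has no built-in per-module permissions, but 
the hosting side could approximate them with a pre-receive hook, so the 
honor code could be backed by tooling. Here is a minimal sketch of that 
idea; the MODULE_COMMITTERS map and the COMMITTER environment variable are 
hypothetical, made up purely for illustration, since how the authenticated 
user is exposed depends entirely on the git host.)

#!/usr/bin/env python
# Minimal sketch of a server-side pre-receive hook approximating
# per-module commit rights. git itself has no such feature; this
# relies on the hosting side running the hook on every push.
# MODULE_COMMITTERS and the COMMITTER env var are hypothetical.
import os
import subprocess
import sys

# Hypothetical map of path prefixes to the committers allowed there.
MODULE_COMMITTERS = {
    "mllib/": {"xiangrui", "matei"},
    "sql/": {"michael", "reynold"},
}

ZERO = "0" * 40  # all-zero SHA marks ref creation or deletion


def changed_paths(old, new):
    """Return the list of files that differ between two commits."""
    out = subprocess.check_output(["git", "diff", "--name-only", old, new])
    return out.decode().splitlines()


def main():
    user = os.environ.get("COMMITTER", "")  # assumed set by the host
    # A pre-receive hook reads "<old-sha> <new-sha> <ref>" lines on stdin.
    for line in sys.stdin:
        old, new, _ref = line.split()
        if old == ZERO or new == ZERO:
            continue  # skip ref creation/deletion in this sketch
        for path in changed_paths(old, new):
            for prefix, allowed in MODULE_COMMITTERS.items():
                if path.startswith(prefix) and user not in allowed:
                    sys.stderr.write(
                        "%s is not a maintainer of %s\n" % (user, prefix))
                    sys.exit(1)  # non-zero exit rejects the push


if __name__ == "__main__":
    main()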


Thanks,
Hari

On Thu, Nov 6, 2014 at 3:45 PM, Patrick Wendell <pwend...@gmail.com>
wrote:

> I think new committers might or might not be maintainers (it would
> depend on the PMC vote). I don't think it would affect what you could
> merge: you can merge in any part of the source tree; you just need to
> get sign-off if you want to touch a public API or make major
> architectural changes. Most projects already require code review from
> other committers before you commit something, so this is just a
> version of that where specific people are appointed to specific
> components for review.
>
> If you look, most large software projects have a maintainer model,
> both in Apache and outside of it. CloudStack is probably the best
> example in Apache, since they are (roughly) the second most active
> project after Spark. They have two levels of maintainers and much
> stronger language: "In general, maintainers only have commit rights
> on the module for which they are responsible."
>
> I'd like us to start with something simpler and more lightweight, as
> proposed here. Really, the proposal on the table is just to codify
> the current de-facto process to make sure we stick by it as we scale.
> If we want to add more formality or strictness to it, we can do it
> later.
>
> - Patrick
> On Thu, Nov 6, 2014 at 3:29 PM, Hari Shreedharan
> <hshreedha...@cloudera.com> wrote:
>> How would this model work with a new committer who gets voted in? Does it 
>> mean that a new committer would be a maintainer for at least one area? 
>> Otherwise we could end up having committers who really can't merge anything 
>> significant until they become maintainers.
>>
>>
>> Thanks,
>> Hari
>>
>> On Thu, Nov 6, 2014 at 3:00 PM, Matei Zaharia <matei.zaha...@gmail.com>
>> wrote:
>>
>>> I think you're misunderstanding the idea of "process" here. The point of 
>>> process is to make sure something happens automatically, which is useful to 
>>> ensure a certain level of quality. For example, all our patches go through 
>>> Jenkins, and nobody will make the mistake of merging them if they fail 
>>> tests, or RAT checks, or API compatibility checks. The idea is to get the 
>>> same kind of automation for design on these components. This is a very 
>>> common process for large software projects, and it's essentially what we 
>>> had already, but formalizing it will make clear that this is the process we 
>>> want. It's important to do it early in order to be able to refine the 
>>> process as the project grows.
>>>
>>> In terms of scope, again, the maintainers are *not* going to be the only 
>>> reviewers for that component, they are just a second level of sign-off 
>>> required for architecture and API. Being a maintainer is also not a 
>>> "promotion", it's a responsibility. Since we don't have much experience yet 
>>> with this model, I didn't propose automatic rules beyond that the PMC can 
>>> add / remove maintainers -- presumably the PMC is in the best position to 
>>> know what the project needs. I think automatic rules are exactly the kind 
>>> of "process" you're arguing against. The "process" here is about ensuring 
>>> certain checks are made for every code change, not about automating 
>>> personnel and development decisions.
>>>
>>> In any case, I appreciate your input on this, and we're going to evaluate 
>>> the model to see how it goes. It might be that we decide we don't want it 
>>> at all. However, from what I've seen of other projects (not Hadoop but 
>>> projects with an order of magnitude more contributors, like Python or 
>>> Linux), this is one of the best ways to have consistently great releases 
>>> with a large contributor base and little room for error. With all due 
>>> respect to what Hadoop's accomplished, I wouldn't use Hadoop as the best 
>>> example to strive for; in my experience there I've seen patches reverted 
>>> because of architectural disagreements, new APIs released and abandoned, 
>>> and generally an experience that's been painful for users. A lot of the 
>>> decisions we've made in Spark (e.g. time-based release cycle, built-in 
>>> libraries, API stability rules, etc) were based on lessons learned there, 
>>> in an attempt to define a better model.
>>>
>>> Matei
>>>> On Nov 6, 2014, at 2:18 PM, bc Wong <bcwal...@cloudera.com> wrote:
>>>>
>>>> On Thu, Nov 6, 2014 at 11:25 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>> <snip>
>>>> Ultimately, the core motivation is that the project has grown to the point 
>>>> where it's hard to expect every committer to have full understanding of 
>>>> every component. Some committers know a ton about systems but little about 
>>>> machine learning, some are algorithmic whizzes but may not realize the 
>>>> implications of changing something on the Python API, etc. This is just a 
>>>> way to make sure that a domain expert has looked at the areas where it is 
>>>> most likely for something to go wrong.
>>>>
>>>> Hi Matei,
>>>>
>>>> I understand where you're coming from. My suggestion is to solve this 
>>>> without adding a new process. In the example above, those "algo whizzes" 
>>>> committers should realize that they're touching the Python API, and loop 
>>>> in some Python maintainers. Those Python maintainers would then respond 
>>>> and help move the PR along. This is good hygiene and should already be 
>>>> happening. For example, HDFS committers have commit rights to all of 
>>>> Hadoop. But none of them would check in YARN code without getting 
>>>> agreement from the YARN folks.
>>>>
>>>> I think the majority of the effort here will be education and building 
>>>> the convention. We have to ask committers to watch out for API changes, 
>>>> know their own limits, and involve the component domain experts. We need 
>>>> that anyway, which, by the way, also seems to solve the problem. It's not 
>>>> clear what the new process would add.
>>>>
>>>> It'd be good to know the details, too. What are the exact criteria for a 
>>>> committer to get promoted to be a maintainer? How often does the PMC 
>>>> re-evaluate the list of maintainers? Is there an upper bound on the number 
>>>> of maintainers for a component? Can we have an automatic rule for a 
>>>> maintainer promotion after X patches or Y lines of code in that area?
>>>>
>>>> Cheers,
>>>> bc
>>>>
>>>>> On Nov 6, 2014, at 10:53 AM, bc Wong <bcwal...@cloudera.com> wrote:
>>>>>
>>>>> Hi Matei,
>>>>>
>>>>> Good call on scaling the project itself. Identifying domain experts in 
>>>>> different areas is a good thing. But I have some questions about the 
>>>>> implementation. Here's my understanding of the proposal:
>>>>>
>>>>> (1) The PMC votes on a list of components and their maintainers. Changes 
>>>>> to that list requires PMC approval.
>>>>> (2) No committer shall commit changes to a component without a +1 from a 
>>>>> maintainer of that component.
>>>>>
>>>>> I see good reasons for #1, to help people navigate the project and 
>>>>> identify expertise. For #2, I'd like to understand what problem it's 
>>>>> trying to solve. Do we have rogue committers committing to areas that 
>>>>> they don't know much about? If that's the case, we should address it 
>>>>> directly, instead of adding new processes.
>>>>>
>>>>> To point out the obvious, it completely changes what "committers" means 
>>>>> in Spark. Do we have clear promotion criteria from "committer" to 
>>>>> "maintainer"? Is there a max number of maintainers per area? Currently, as 
>>>>> committers gain expertise in new areas, they can start reviewing code 
>>>>> in those areas and give +1. This encourages more contributions and 
>>>>> cross-component knowledge sharing. Under the new proposal, they would now 
>>>>> have to be promoted to "maintainers" first. That reduces our review bandwidth.
>>>>>
>>>>> Again, if there is a quality issue with code reviews, let's talk to those 
>>>>> committers and help them do better. There are non-process ways to solve 
>>>>> the problem.
>>>>>
>>>>> So I think we shouldn't require "maintainer +1". I do like the idea of 
>>>>> having explicit maintainers on a volunteer basis. These maintainers 
>>>>> should watch their jira and PR traffic, and be very active in design & 
>>>>> API discussions. That leads to better consistency and long-term design 
>>>>> choices.
>>>>>
>>>>> Cheers,
>>>>> bc
>>>>>
>>>>> On Wed, Nov 5, 2014 at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>> Hi all,
>>>>>
>>>>> I wanted to share a discussion we've been having on the PMC list, as well 
>>>>> as call for an official vote on it on a public list. Basically, as the 
>>>>> Spark project scales up, we need to define a model to make sure there is 
>>>>> still great oversight of key components (in particular internal 
>>>>> architecture and public APIs), and to this end I've proposed implementing 
>>>>> a maintainer model for some of these components, similar to other large 
>>>>> projects.
>>>>>
>>>>> As background on this, Spark has grown a lot since joining Apache. We've 
>>>>> had over 80 contributors/month for the past 3 months (which I believe 
>>>>> makes us the most active project at Apache by contributors/month), as 
>>>>> well as over 500 patches/month. The codebase has also grown significantly, 
>>>>> with new libraries for SQL, ML, graphs and more.
>>>>>
>>>>> In this kind of large project, one common way to scale development is to 
>>>>> assign "maintainers" to oversee key components, where each patch to that 
>>>>> component needs to get sign-off from at least one of its maintainers. 
>>>>> Most existing large projects do this -- at Apache, some large ones with 
>>>>> this model are CloudStack (the second-most active project overall), 
>>>>> Subversion, and Kafka, and other examples include Linux and Python. This 
>>>>> is also by-and-large how Spark operates today -- most components have a 
>>>>> de-facto maintainer.
>>>>>
>>>>> IMO, adopting this model would have two benefits:
>>>>>
>>>>> 1) Consistent oversight of design for that component, especially 
>>>>> regarding architecture and API. This process would ensure that the 
>>>>> component's maintainers see all proposed changes and consider them to fit 
>>>>> together in a good way.
>>>>>
>>>>> 2) More structure for new contributors and committers -- in particular, 
>>>>> it would be easy to look up who's responsible for each module and ask 
>>>>> them for reviews, etc., rather than having patches slip through the cracks.
>>>>>
>>>>> We'd like to start in a light-weight manner, where the model only 
>>>>> applies to certain key components (e.g. scheduler, shuffle) and 
>>>>> user-facing APIs (MLlib, GraphX, etc). Over time, as the project grows, 
>>>>> we can expand it if we deem it useful. The specific mechanics would be as 
>>>>> follows:
>>>>>
>>>>> - Some components in Spark will have maintainers assigned to them, where 
>>>>> one of the maintainers needs to sign off on each patch to the component.
>>>>> - Each component with maintainers will have at least 2 maintainers.
>>>>> - Maintainers will be assigned by the PMC from the most active and 
>>>>> knowledgeable committers on that component. The PMC can vote to add / 
>>>>> remove maintainers, and maintained components, through consensus.
>>>>> - Maintainers are expected to be active in responding to patches for 
>>>>> their components, though they do not need to be the main reviewers for 
>>>>> them (e.g. they might just sign off on architecture / API). To prevent 
>>>>> inactive maintainers from blocking the project, if a maintainer isn't 
>>>>> responding in a reasonable time period (say 2 weeks), other committers 
>>>>> can merge the patch, and the PMC will want to discuss adding another 
>>>>> maintainer.
>>>>>
>>>>> If you'd like to see examples for this model, check out the following 
>>>>> projects:
>>>>> - CloudStack: 
>>>>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
>>>>> - Subversion: 
>>>>> https://subversion.apache.org/docs/community-guide/roles.html
>>>>>
>>>>> Finally, I wanted to list our current proposal for initial components and 
>>>>> maintainers. It would be good to get feedback on other components we 
>>>>> might add, but please note that personnel discussions (e.g. "I don't 
>>>>> think Matei should maintain *that* component") should only happen on the 
>>>>> private list. The initial components were chosen to include all public 
>>>>> APIs and the main core components, and the maintainers were chosen from 
>>>>> the most active contributors to those modules.
>>>>>
>>>>> - Spark core public API: Matei, Patrick, Reynold
>>>>> - Job scheduler: Matei, Kay, Patrick
>>>>> - Shuffle and network: Reynold, Aaron, Matei
>>>>> - Block manager: Reynold, Aaron
>>>>> - YARN: Tom, Andrew Or
>>>>> - Python: Josh, Matei
>>>>> - MLlib: Xiangrui, Matei
>>>>> - SQL: Michael, Reynold
>>>>> - Streaming: TD, Matei
>>>>> - GraphX: Ankur, Joey, Reynold
>>>>>
>>>>> I'd like to formally call a [VOTE] on this model, to last 72 hours. The 
>>>>> [VOTE] will end on Nov 8, 2014 at 6 PM PST.
>>>>>
>>>>> Matei