How would this model work with a new committer who gets voted in? Does it mean that a new committer would be a maintainer for at least one area — else we could end up having committers who really can't merge anything significant until they become maintainers.
Thanks,
Hari

On Thu, Nov 6, 2014 at 3:00 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> I think you're misunderstanding the idea of "process" here. The point of process is to make sure something happens automatically, which is useful to ensure a certain level of quality. For example, all our patches go through Jenkins, and nobody will make the mistake of merging them if they fail tests, or RAT checks, or API compatibility checks. The idea is to get the same kind of automation for design on these components. This is a very common process for large software projects, and it's essentially what we had already, but formalizing it will make clear that this is the process we want. It's important to do it early in order to be able to refine the process as the project grows.
>
> In terms of scope, again, the maintainers are *not* going to be the only reviewers for that component, they are just a second level of sign-off required for architecture and API. Being a maintainer is also not a "promotion", it's a responsibility. Since we don't have much experience yet with this model, I didn't propose automatic rules beyond that the PMC can add / remove maintainers -- presumably the PMC is in the best position to know what the project needs. I think automatic rules are exactly the kind of "process" you're arguing against. The "process" here is about ensuring certain checks are made for every code change, not about automating personnel and development decisions.
>
> In any case, I appreciate your input on this, and we're going to evaluate the model to see how it goes. It might be that we decide we don't want it at all. However, from what I've seen of other projects (not Hadoop but projects with an order of magnitude more contributors, like Python or Linux), this is one of the best ways to have consistently great releases with a large contributor base and little room for error.
> With all due respect to what Hadoop's accomplished, I wouldn't use Hadoop as the best example to strive for; in my experience there I've seen patches reverted because of architectural disagreements, new APIs released and abandoned, and generally an experience that's been painful for users. A lot of the decisions we've made in Spark (e.g. time-based release cycle, built-in libraries, API stability rules, etc.) were based on lessons learned there, in an attempt to define a better model.
>
> Matei
>
>> On Nov 6, 2014, at 2:18 PM, bc Wong <bcwal...@cloudera.com> wrote:
>>
>> On Thu, Nov 6, 2014 at 11:25 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>> <snip>
>> Ultimately, the core motivation is that the project has grown to the point where it's hard to expect every committer to have full understanding of every component. Some committers know a ton about systems but little about machine learning, some are algorithmic whizzes but may not realize the implications of changing something on the Python API, etc. This is just a way to make sure that a domain expert has looked at the areas where it is most likely for something to go wrong.
>>
>> Hi Matei,
>>
>> I understand where you're coming from. My suggestion is to solve this without adding a new process. In the example above, those "algo whiz" committers should realize that they're touching the Python API, and loop in some Python maintainers. Those Python maintainers would then respond and help move the PR along. This is good hygiene and should already be happening. For example, HDFS committers have commit rights to all of Hadoop, but none of them would check in YARN code without getting agreement from the YARN folks.
>>
>> I think the majority of the effort here will be education and building the convention. We have to ask committers to watch out for API changes, know their own limits, and involve the component domain experts.
>> We need that anyway, which btw also seems to solve the problem. It's not clear what the new process would add.
>>
>> It'd be good to know the details, too. What are the exact criteria for a committer to get promoted to be a maintainer? How often does the PMC re-evaluate the list of maintainers? Is there an upper bound on the number of maintainers for a component? Can we have an automatic rule for a maintainer promotion after X patches or Y lines of code in that area?
>>
>> Cheers,
>> bc
>>
>>> On Nov 6, 2014, at 10:53 AM, bc Wong <bcwal...@cloudera.com> wrote:
>>>
>>> Hi Matei,
>>>
>>> Good call on scaling the project itself. Identifying domain experts in different areas is a good thing. But I have some questions about the implementation. Here's my understanding of the proposal:
>>>
>>> (1) The PMC votes on a list of components and their maintainers. Changes to that list require PMC approval.
>>> (2) No committer shall commit changes to a component without a +1 from a maintainer of that component.
>>>
>>> I see good reasons for #1, to help people navigate the project and identify expertise. For #2, I'd like to understand what problem it's trying to solve. Do we have rogue committers committing to areas that they don't know much about? If that's the case, we should address it directly, instead of adding new processes.
>>>
>>> To point out the obvious, this completely changes what "committer" means in Spark. Do we have clear promotion criteria from "committer" to "maintainer"? Is there a max number of maintainers per area? Currently, as committers gain expertise in new areas, they can start reviewing code in those areas and give +1. This encourages more contributions and cross-component knowledge sharing. Under the new proposal, they now have to be promoted to "maintainers" first. That reduces our review bandwidth.
>>>
>>> Again, if there is a quality issue with code reviews, let's talk to those committers and help them do better. There are non-process ways to solve the problem.
>>>
>>> So I think we shouldn't require "maintainer +1". I do like the idea of having explicit maintainers on a volunteer basis. These maintainers should watch their JIRA and PR traffic, and be very active in design & API discussions. That leads to better consistency and long-term design choices.
>>>
>>> Cheers,
>>> bc
>>>
>>> On Wed, Nov 5, 2014 at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>> Hi all,
>>>
>>> I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal architecture and public APIs), and to this end I've proposed implementing a maintainer model for some of these components, similar to other large projects.
>>>
>>> As background on this, Spark has grown a lot since joining Apache. We've had over 80 contributors/month for the past 3 months, which I believe makes us the most active project in contributors/month at Apache, as well as over 500 patches/month. The codebase has also grown significantly, with new libraries for SQL, ML, graphs and more.
>>>
>>> In this kind of large project, one common way to scale development is to assign "maintainers" to oversee key components, where each patch to that component needs to get sign-off from at least one of its maintainers. Most existing large projects do this -- at Apache, some large ones with this model are CloudStack (the second-most active project overall), Subversion, and Kafka, and other examples include Linux and Python.
>>> This is also by and large how Spark operates today -- most components have a de facto maintainer.
>>>
>>> IMO, adopting this model would have two benefits:
>>>
>>> 1) Consistent oversight of design for that component, especially regarding architecture and API. This process would ensure that the component's maintainers see all proposed changes and consider whether they fit together in a good way.
>>>
>>> 2) More structure for new contributors and committers -- in particular, it would be easy to look up who's responsible for each module and ask them for reviews, etc., rather than having patches slip through the cracks.
>>>
>>> We'd like to start in a lightweight manner, where the model only applies to certain key components (e.g. scheduler, shuffle) and user-facing APIs (MLlib, GraphX, etc.). Over time, as the project grows, we can expand it if we deem it useful. The specific mechanics would be as follows:
>>>
>>> - Some components in Spark will have maintainers assigned to them, where one of the maintainers needs to sign off on each patch to the component.
>>> - Each component with maintainers will have at least 2 maintainers.
>>> - Maintainers will be assigned by the PMC from the most active and knowledgeable committers on that component. The PMC can vote to add / remove maintainers, and maintained components, through consensus.
>>> - Maintainers are expected to be active in responding to patches for their components, though they do not need to be the main reviewers for them (e.g. they might just sign off on architecture / API). To prevent inactive maintainers from blocking the project, if a maintainer isn't responding in a reasonable time period (say 2 weeks), other committers can merge the patch, and the PMC will want to discuss adding another maintainer.
>>>
>>> If you'd like to see examples of this model, check out the following projects:
>>> - CloudStack: https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
>>> - Subversion: https://subversion.apache.org/docs/community-guide/roles.html
>>>
>>> Finally, I wanted to list our current proposal for initial components and maintainers. It would be good to get feedback on other components we might add, but please note that personnel discussions (e.g. "I don't think Matei should maintain *that* component") should only happen on the private list. The initial components were chosen to include all public APIs and the main core components, and the maintainers were chosen from the most active contributors to those modules.
>>>
>>> - Spark core public API: Matei, Patrick, Reynold
>>> - Job scheduler: Matei, Kay, Patrick
>>> - Shuffle and network: Reynold, Aaron, Matei
>>> - Block manager: Reynold, Aaron
>>> - YARN: Tom, Andrew Or
>>> - Python: Josh, Matei
>>> - MLlib: Xiangrui, Matei
>>> - SQL: Michael, Reynold
>>> - Streaming: TD, Matei
>>> - GraphX: Ankur, Joey, Reynold
>>>
>>> I'd like to formally call a [VOTE] on this model, to last 72 hours. The [VOTE] will end on Nov 8, 2014 at 6 PM PST.
>>>
>>> Matei