How would this model work with a new committer who gets voted in? Does it mean that a new committer would be a maintainer for at least one area — else we could end up having committers who really can't merge anything significant until they become maintainers.
Thanks,
Hari

On Thu, Nov 6, 2014 at 3:00 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> I think you're misunderstanding the idea of "process" here. The point of process is to make sure something happens automatically, which is useful to ensure a certain level of quality. For example, all our patches go through Jenkins, and nobody will make the mistake of merging them if they fail tests, or RAT checks, or API compatibility checks. The idea is to get the same kind of automation for design on these components. This is a very common process for large software projects, and it's essentially what we had already, but formalizing it will make clear that this is the process we want. It's important to do it early in order to be able to refine the process as the project grows.
>
> In terms of scope, again, the maintainers are *not* going to be the only reviewers for that component, they are just a second level of sign-off required for architecture and API. Being a maintainer is also not a "promotion", it's a responsibility. Since we don't have much experience yet with this model, I didn't propose automatic rules beyond that the PMC can add / remove maintainers -- presumably the PMC is in the best position to know what the project needs. I think automatic rules are exactly the kind of "process" you're arguing against. The "process" here is about ensuring certain checks are made for every code change, not about automating personnel and development decisions.
>
> In any case, I appreciate your input on this, and we're going to evaluate the model to see how it goes. It might be that we decide we don't want it at all. However, from what I've seen of other projects (not Hadoop but projects with an order of magnitude more contributors, like Python or Linux), this is one of the best ways to have consistently great releases with a large contributor base and little room for error.
> With all due respect to what Hadoop's accomplished, I wouldn't use Hadoop as the best example to strive for; in my experience there I've seen patches reverted because of architectural disagreements, new APIs released and abandoned, and generally an experience that's been painful for users. A lot of the decisions we've made in Spark (e.g. time-based release cycle, built-in libraries, API stability rules, etc.) were based on lessons learned there, in an attempt to define a better model.
>
> Matei
>
>> On Nov 6, 2014, at 2:18 PM, bc Wong <bcwal...@cloudera.com> wrote:
>>
>> On Thu, Nov 6, 2014 at 11:25 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>> <snip>
>> Ultimately, the core motivation is that the project has grown to the point where it's hard to expect every committer to have full understanding of every component. Some committers know a ton about systems but little about machine learning, some are algorithmic whizzes but may not realize the implications of changing something on the Python API, etc. This is just a way to make sure that a domain expert has looked at the areas where it is most likely for something to go wrong.
>>
>> Hi Matei,
>>
>> I understand where you're coming from. My suggestion is to solve this without adding a new process. In the example above, those "algo whiz" committers should realize that they're touching the Python API, and loop in some Python maintainers. Those Python maintainers would then respond and help move the PR along. This is good hygiene and should already be happening. For example, HDFS committers have commit rights to all of Hadoop, but none of them would check in YARN code without getting agreement from the YARN folks.
>>
>> I think the majority of the effort here will be education and building the convention. We have to ask committers to watch out for API changes, know their own limits, and involve the component domain experts.
>> We need that anyway, which btw also seems to solve the problem. It's not clear what the new process would add.
>>
>> It'd be good to know the details, too. What are the exact criteria for a committer to get promoted to be a maintainer? How often does the PMC re-evaluate the list of maintainers? Is there an upper bound on the number of maintainers for a component? Can we have an automatic rule for a maintainer promotion after X patches or Y lines of code in that area?
>>
>> Cheers,
>> bc
>>
>>> On Nov 6, 2014, at 10:53 AM, bc Wong <bcwal...@cloudera.com> wrote:
>>>
>>> Hi Matei,
>>>
>>> Good call on scaling the project itself. Identifying domain experts in different areas is a good thing. But I have some questions about the implementation. Here's my understanding of the proposal:
>>>
>>> (1) The PMC votes on a list of components and their maintainers. Changes to that list require PMC approval.
>>> (2) No committer shall commit changes to a component without a +1 from a maintainer of that component.
>>>
>>> I see good reasons for #1, to help people navigate the project and identify expertise. For #2, I'd like to understand what problem it's trying to solve. Do we have rogue committers committing to areas that they don't know much about? If that's the case, we should address it directly, instead of adding new processes.
>>>
>>> To point out the obvious, this completely changes what "committer" means in Spark. Do we have clear promotion criteria from "committer" to "maintainer"? Is there a max number of maintainers per area? Currently, as committers gain expertise in new areas, they can start reviewing code in those areas and give +1. This encourages more contributions and cross-component knowledge sharing. Under the new proposal, they now have to be promoted to "maintainers" first. That reduces our review bandwidth.
>>>
>>> Again, if there is a quality issue with code reviews, let's talk to those committers and help them do better. There are non-process ways to solve the problem.
>>>
>>> So I think we shouldn't require "maintainer +1". I do like the idea of having explicit maintainers on a volunteer basis. These maintainers should watch their JIRA and PR traffic, and be very active in design & API discussions. That leads to better consistency and long-term design choices.
>>>
>>> Cheers,
>>> bc
>>>
>>> On Wed, Nov 5, 2014 at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>> Hi all,
>>>
>>> I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal architecture and public APIs), and to this end I've proposed implementing a maintainer model for some of these components, similar to other large projects.
>>>
>>> As background on this, Spark has grown a lot since joining Apache. We've had over 80 contributors/month for the past 3 months, which I believe makes us the most active project in contributors/month at Apache, as well as over 500 patches/month. The codebase has also grown significantly, with new libraries for SQL, ML, graphs and more.
>>>
>>> In this kind of large project, one common way to scale development is to assign "maintainers" to oversee key components, where each patch to that component needs to get sign-off from at least one of its maintainers. Most existing large projects do this -- at Apache, some large ones with this model are CloudStack (the second-most active project overall), Subversion, and Kafka, and other examples include Linux and Python.
>>> This is also by and large how Spark operates today -- most components have a de facto maintainer.
>>>
>>> IMO, adopting this model would have two benefits:
>>>
>>> 1) Consistent oversight of design for that component, especially regarding architecture and API. This process would ensure that the component's maintainers see all proposed changes and consider whether they fit together in a good way.
>>>
>>> 2) More structure for new contributors and committers -- in particular, it would be easy to look up who's responsible for each module and ask them for reviews, etc., rather than having patches slip through the cracks.
>>>
>>> We'd like to start in a lightweight manner, where the model only applies to certain key components (e.g. scheduler, shuffle) and user-facing APIs (MLlib, GraphX, etc.). Over time, as the project grows, we can expand it if we deem it useful. The specific mechanics would be as follows:
>>>
>>> - Some components in Spark will have maintainers assigned to them, where one of the maintainers needs to sign off on each patch to the component.
>>> - Each component with maintainers will have at least 2 maintainers.
>>> - Maintainers will be assigned by the PMC from the most active and knowledgeable committers on that component. The PMC can vote to add / remove maintainers, and maintained components, through consensus.
>>> - Maintainers are expected to be active in responding to patches for their components, though they do not need to be the main reviewers for them (e.g. they might just sign off on architecture / API). To prevent inactive maintainers from blocking the project, if a maintainer isn't responding in a reasonable time period (say 2 weeks), other committers can merge the patch, and the PMC will want to discuss adding another maintainer.
>>>
>>> If you'd like to see examples of this model, check out the following projects:
>>> - CloudStack: https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
>>> - Subversion: https://subversion.apache.org/docs/community-guide/roles.html
>>>
>>> Finally, I wanted to list our current proposal for initial components and maintainers. It would be good to get feedback on other components we might add, but please note that personnel discussions (e.g. "I don't think Matei should maintain *that* component") should only happen on the private list. The initial components were chosen to include all public APIs and the main core components, and the maintainers were chosen from the most active contributors to those modules.
>>>
>>> - Spark core public API: Matei, Patrick, Reynold
>>> - Job scheduler: Matei, Kay, Patrick
>>> - Shuffle and network: Reynold, Aaron, Matei
>>> - Block manager: Reynold, Aaron
>>> - YARN: Tom, Andrew Or
>>> - Python: Josh, Matei
>>> - MLlib: Xiangrui, Matei
>>> - SQL: Michael, Reynold
>>> - Streaming: TD, Matei
>>> - GraphX: Ankur, Joey, Reynold
>>>
>>> I'd like to formally call a [VOTE] on this model, to last 72 hours. The [VOTE] will end on Nov 8, 2014 at 6 PM PST.
>>>
>>> Matei