Bringing this to dev@, mid-thread, per Grant's suggestion. There was a brief and fruitful thread on private@ to discuss project governance, but the topic has shifted such that it's useful to just talk on dev@.

If I may paraphrase: I expressed concern about the sprawl of code and algorithms, aging of JIRA issues, and said I thought we probably had too few "zookeepers" (to borrow Benson's metaphor) for this big a zoo, and too many "breeders" adding new bags of code here and there. People seem to have interest in new non-Hadoop platforms in particular, but are not giving much attention to what's there now. I expressed concern this would lead to an increasingly difficult mess of code and project identity, even before 1.0.

I proposed narrowing the scope of the project -- while not rejecting all new code, strongly weighting towards contributions that enhance and fix existing code rather than new green-field code through the 0.6 and 1.0 releases. I also proposed a concerted effort to clean up JIRA in the short term.

Most discussion centered around the final proposal: in exchange for putting off a lot of new, different stuff, I proposed thinking of "Mahout2" as a place for those ideas -- perhaps even as not Hadoop-based. It could be a mostly green-field rewrite. Now, keep in mind this would be quite a ways away -- just a point of discussion now. But it might be a good conceptual rationalization for nearly freezing scope now and polishing -- because we'd have a future bucket to put these in and time to talk it over.

So: comments are certainly welcome on the above! And now I'd like to rejoin the thread by replying to Grant below:

My broader concern is that an unhealthy project is the bigger danger to the community. And I see some patterns that feel harmful to me.

I do see large patches sit in JIRA for 6 months and get cancelled. That's not good -- either it was a good patch and should have been picked up, or it wasn't suitable, and it should have been clearer it wasn't suitable before the contributor went to the trouble.

I see JIRAs tagged for version X, and then untouched and slipped to version X+1, X+2.
This means that the community doesn't have credible information about when an issue is going to be addressed. In fact they don't have info about *if* the issue is even good to address, and so worth working on (see point above).

Finally, I see a lot of "Someone should do X at some point" JIRAs, and Someone rarely does them. While these feel like work and progress, I think they're harmful: they show the community that to-dos aren't always done, and they condone a culture of post-it notes for future work.

I assert that even in open source (perhaps all the more?) we do need enough project coordination such that these problems don't crop up. We should be able to meet basic expectations about scope, process, and roadmap -- not nearly as much as a corporate software project, but some semblance of it.

This is a separate thing from providing space for ideas, to-dos, thoughts, bits of code, etc. And maybe we're just disagreeing about how to implement those two things. I think JIRA is for project coordination, and I think the mailing list or wiki, or GitHub if you like, are for ideas, open-ended brainstorms, to-dos. Taking something into JIRA and letting it sit there, to me, is therefore hurting and not helping (see above).

If you see JIRA as a place for ideas and loose ends -- then of course this doesn't look like any problem! But then I'd ask, where's the project plan? Because we need that.

I don't think that it's wrong to close an issue that hasn't been touched for 9 months as WontFix. I'm not being anti-community. I'm a messenger of slightly bad news, that's all. The issue already wasn't going to be fixed, I'll bet you. And there's some reason for that -- which is what I'm trying to address. My answer of course is simple: rein in scope to match effort available. It's simplistic, but it sure works.

On Sat, Oct 22, 2011 at 8:47 AM, Grant Ingersoll <[email protected]> wrote:
> Whew, lots to read and a great conversation.
> It seems the number 1 rule of
> open source is these kinds of questions happen while on vacation or at a
> conference.
>
> So, here are some random thoughts, hopefully trying to take into account this
> thread and what I feel we have learned in the past few years:
>
> 1. First off, the majority of this conversation should be happening on dev@.
> With my "Mahout marketing hat on", it probably should be a different subject
> line like "Moving towards 1.0 and beyond", but I don't care that much, just
> defaulting to looking forward instead of back. This very much needs to
> happen as there is little here that is private other than the Chair
> discussion. Sean, do you want to start the thread? Otherwise, I am happy to
> do so.
>
> 2. I have a slightly different take on JIRA state. On the one hand, it is
> bad that we are not committing issues and I totally agree with you. But I
> also sense that you are demoralized by things being left open as I know you
> do a lot of clean up work. Personally, I don't think left open is bad. In
> fact, I don't see much reason to declare something as "Won't Fix" unless it
> truly is a wrong idea/concept and we can outright reject it, which is rarely
> the case. If it is marked as "Won't Fix" simply because someone hasn't taken
> up the work to do it in a while, then I would argue that is anti-community.
> Leaving it open says to the community, "We haven't ruled this out. If you so
> desire, please come offer a fix." It encourages contribution and itch
> scratching. I've seen issues completed years later in Lucene because of this
> and I believe it is a good thing. I know that runs against common closed
> source engineering practices, but it is one I think is valuable in open
> source. So, how do we get over the clutter factor? Selecting what patches
> will be in what version and maintaining "Fix Version". JIRA has enough
> filtering capabilities that is quite simple to remove clutter by filtering it
> out.
> In other words, I would say we leave most things open and get much
> better about saying what issues need to be in what version.
>
> I also still think we need to get auto patch checking implemented. We have
> all the instructions, we just need to work through it from a Jenkins
> standpoint. I think this will help quite a bit.
>
> 3. I totally agree we should start culling some things. Again, though, that
> discussion needs to happen on dev@ and maybe even ask on user@ for input. In
> a perfect world, we could write ML code that ran, frictionless, on a layer of
> abstraction that allowed people to plug in the underlying engine (Hadoop,
> local, Spark, etc.) and it just worked. In the meantime, we should just be
> practical and cut what isn't used knowing we can resurrect it later. I think
> our modules more or less make sense and our focus on the three C's make sense
> and our overarching goal of creating "scalable machine learning algorithms"
> makes sense. Let's work from there. I'll save other thoughts on this stuff
> for the dev@ conversation.
>
> The hard part that we really need to overcome is that this ensuing
> conversation is likely to be a long one, filled with opinions. This is a
> good thing. We need to have that discussion over the course of a week or two
> and out of it needs to come a concrete proposal that we can then vote on.
> And once that vote is done, we act on it. The last point there being the one
> that matters most.
