Sorry for being late to the discussion. I think Joseph, Sean and others
have covered the issues well.

Overall I like the proposed cleaned-up roadmap & process (thanks Joseph!).
As for the actual critical roadmap items mentioned on SPARK-18813, I think
they make sense and will comment a bit further on that JIRA.

I would like to encourage voting on & watching issues to give a sense of
what the community wants (I guess a Vote is more explicit yet passive,
while actually Watching an issue is more informative, as it may indicate a
real use case that depends on the issue?).

I think if used well this is valuable information for contributors. Of
course not everything on that list can get done. But if I look through the
top votes or watch list, while not all of those are likely to go in, a
great many of the issues are fairly non-contentious in terms of being good
additions to the project.

Things like these are good examples IMO (I've just sampled a few of them,
this is not exhaustive):
- sample weights for RF / DT
- multi-model and/or parallel model selection
- make sharedParams public?
- multi-column support for various transformers
- incremental model training
- tree algorithm enhancements

Now, whether these can be prioritised in terms of bandwidth available to
reviewers and committers is a totally different thing. But as Sean mentions
there is some process there for trying to find the balance of the issue
being a "good thing to add", a shepherd with bandwidth & interest in the
issue to review, and the maintenance burden imposed.

Let's take Deep Learning / NN for example. It has a lot of votes/watchers
and, as Sean mentions, it is something that "everyone wants someone else to
implement". In this case, much of the interest may in fact be "stale" - 2
years ago it would have been very interesting to have a strong DL impl in
Spark. Now, because there are a plethora of very good DL libraries out
there, how many of those Votes would be "deleted"? Granted, few are well
integrated with Spark, but that can change and is changing (DL4J, BigDL,
the "XonSpark" flavours etc).

So this is something that I dare say will not be in Spark any time in the
foreseeable future, or perhaps ever, given the current status. Perhaps it's
worth seriously thinking about just closing these kinds of issues?



On Fri, 27 Jan 2017 at 05:53 Joseph Bradley <jos...@databricks.com> wrote:

> Sean has given a great explanation.  A few more comments:
>
> Roadmap: I have been creating roadmap JIRAs, but the goal really is to
> have all committers working on MLlib help to set that roadmap, based on
> either their knowledge of current maintenance/internal needs of the project
> or the feedback given from the rest of the community.
> @Committers - I see people actively shepherding PRs for MLlib, but I don't
> see many major initiatives linked to the roadmap.  If there are ones large
> enough to merit adding to the roadmap, please do.
>
> In general, there are many process improvements we could make.  A few in
> my mind are:
> * Visibility: Let the community know what committers are focusing on.
> This was the primary purpose of the "MLlib roadmap proposal."
> * Community initiatives: This is currently very organic.  Some of the
> organic process could be improved, such as encouraging Votes/Watchers
> (though I agree with Sean about these being one-sided metrics).  Cody's SIP
> work is a great step towards adding more clarity and structure for major
> initiatives.
> * JIRA hygiene: Always a challenge, and always requires some manual
> prodding.  But it's great to push for efforts on this.
>
>
> On Wed, Jan 25, 2017 at 3:59 AM, Sean Owen <so...@cloudera.com> wrote:
>
> On Wed, Jan 25, 2017 at 6:01 AM Ilya Matiach <il...@microsoft.com> wrote:
>
> My confusion was that the ML 2.2 roadmap critical features (
> https://issues.apache.org/jira/browse/SPARK-18813) did not line up with
> the top ML/MLLIB JIRAs by Votes
> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC>
> or Watchers
> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC>.
>
> Your explanation that they do not have to, and that there is a more
> complex process for choosing the changes that will make it into the next
> release, makes sense to me.
>
>
> For Spark ML, Joseph is the de facto leader and does publish a tentative
> roadmap. (We could also use JIRA mechanisms for this but any scheme is
> better than none.) Yes, not based on Votes -- nothing here is. Votes are a
> noisy signal because they usually measure: what would you like done if
> you didn't have to do it and there were no downsides for you?
>
>
>
> My only humble recommendation would be to clean up the top JIRAs by closing
> the ones which have spark packages for them (eg the NN one which already
> has several packages as you explained), noting or somehow marking on some
> that they will not be resolved, and changing the component on the ones not
> related to ML/MLLIB (eg https://issues.apache.org/jira/browse/SPARK-12965
> ).
>
>
> We do that. It occasionally generates protests, so, I find myself erring
> on the side of ignoring. You can comment on any JIRA you think should be
> closed. That's helpful.
>
> That particular JIRA seems potentially legitimate. I wouldn't close it. It
> also won't get fixed until someone proposes a resolution. I'd strongly
> encourage people saying "I have this problem too" to try to fix it. I tend
> to ignore these otherwise, myself, in favor of reviewing ones where someone
> has gone to the trouble of proposing a working fix.
>
>
>
> Also, I would love to do this if I had the permissions, but it would be
> great to change the JIRAs that are marked as “in progress” but where the
> corresponding pull request was closed/cancelled, for example
> https://issues.apache.org/jira/browse/SPARK-4638.  That JIRA is
>
>
> Yes, flag these. I or others can close them if appropriate. Anyone who
> consistently does this well, we could give JIRA permissions to.
>
> Opening a PR automatically makes it "In Progress" but there's no
> complementary process to un-mark it. You can ignore the Open / In Progress
> distinction really.
>
> This one is interesting because it does seem like a plausible feature to
> add. The original PR was abandoned by the author and nobody else submitted
> one -- despite the Votes. I hesitate to signal that no PRs would be
> considered, but, doesn't seem like it's in demand enough for someone to
> work on?
>
>
> I think one of my messages is that, de facto, here, like in many Apache
> projects, committers do not take requests. They pursue the work they
> believe needs doing, and shepherd work initiated by others (a clear bug
> report, a PR) to a resolution. Things get done by doing them, or by
> building influence by doing other things the project needs doing. It isn't
> a mechanical, objective process, and can't be. But it does work in a
> recognizable way.
>
>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
>
