Thanks Sean, this is a really helpful overview, and it contains good guidance for new contributors to ML/MLlib.
My confusion was that the ML 2.2 roadmap critical features (https://issues.apache.org/jira/browse/SPARK-18813) did not line up with the top ML/MLlib JIRAs by Votes <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC> or Watchers <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC>. Your explanation that they do not have to, and that there is a more complex process for choosing the changes that will make it into the next release, makes sense to me.
My only humble recommendation would be to clean up the top JIRAs by closing the ones that already have Spark packages for them (e.g. the NN one, which already has several packages as you explained), noting or somehow marking on some that they will not be resolved, and changing the component on the ones not related to ML/MLlib (e.g. https://issues.apache.org/jira/browse/SPARK-12965).
Also, I would love to do this if I had the permissions, but it would be great to update the JIRAs that are marked as “in progress” but whose corresponding pull request was closed/cancelled, for example https://issues.apache.org/jira/browse/SPARK-4638. That JIRA is actually one of the top ones by number of watchers (adding kernels like Radial Basis Function to SVM, and I can imagine why it’s one of the top ones), and seeing it marked as in progress with a pull request is somewhat confusing. I’ve seen several other JIRAs similar to this one, where the pull request was closed but the JIRA status was not updated. If the pull request was closed for a good reason, the corresponding JIRA should probably be closed as well.
Thank you, Ilya
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Tuesday, January 24, 2017 11:23 AM
To: Ilya Matiach <il...@microsoft.com>
Cc: dev@spark.apache.org
Subject: Re: Feedback on MLlib roadmap process proposal
On Tue, Jan 24, 2017 at 3:58 PM Ilya Matiach <il...@microsoft.com> wrote:
Just a few questions with regards to the MLlib process:
1. Is there a list of committers who can be or are shepherds, and what code they own? I’ve seen this page: http://spark.apache.org/committers.html but I’m not sure if it is up to date, and it doesn’t mention what code the committers own. It would be useful to know who owns ML or MLlib. From my limited personal experience this seems to be Joseph K. Bradley, Yanbo Liang and Sean Owen.
There is no such list because there's no formal notion of ownership or access to subsets of the project. Tracking an informal notion would be process mostly for its own sake, and would probably just go out of date. We sort of tried this with 'maintainers' and it didn't actually do anything. I am not active much in ML, but will occasionally help commit simple changes. What you see organically is pretty much what is, at any given time. People you see responding are the active ones, and influencers, commit bit or no.
2. Based on both user votes and watchers, the top issue currently is “SPARK-5575: Artificial neural networks for MLlib deep learning”. However, it looks like it has been open for almost 2 years and not a lot of progress is being made.
There seem to be other top issues which aren’t getting addressed as well on these pages mentioned in the roadmap: MLlib, sorted by: Votes <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC> or Watchers <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC>. Is my perception incorrect, or is there a very good reason for not addressing the top issues voted for by the community? If there is a good reason, is there a way to filter such JIRAs out from the sorted lists, to know which JIRAs really should be taken/worked on?
JIRA votes and watchers don't mean anything, formally. This isn't a product company where one group might give another group a list of top priorities to work on.
There's a general statement about this at http://spark.apache.org/contributing.html under "Code Review Criteria". In practice, it's a soft process of convincing other people that change X does more good than harm, is worth taking the burden of supporting, matters to users, etc. I ignore the 80% of issues that don't seem to fit these criteria, and choose to help with the 20% that do, which are usually simple and/or important bug fixes.
ANNs? That's a tangent, but my snap reactions are: It's something Everybody wants Somebody Else to create, which may explain the votes vs activity? There is one basic ANN implementation in Spark actually. There are others outside Spark, so it may be something people get elsewhere, like dl4j or BigDL, or by strapping TF to Spark in various ways. DL is also not an obviously-great fit for the data-parallel computation model here. It's not a goal to implement everything in Spark. It could be a good idea, but, no need to tether it to the core project, to the exclusion of "unblessed" third-party packages.
3. Also, this might be a newbie question, but for new contributors to Spark, is there a process to convince a committer to be assigned to a JIRA that we are working on? It would be useful if there was a clear threshold for whether a committer can decline to work on a JIRA ahead of time, so contributors won’t waste time working on issues that aren’t important to Spark and can focus on making progress on the issues that the Spark committers would like us to fix.
No, there's no concept of being tasked to work on something by someone else here. I can't imagine we could establish a clear objective threshold for such a subjective thing.
It's not a satisfying answer but it is the most realistic one. All of these OSS projects work on soft power, persuasion and cooperation. I think the good news is that all the intuitive ways to gain soft power do work: give time to others' problems if you want time on your own, help review, make thoughtful careful changes, etc. My general guidance is: don't bother doing significant feature work unless you have some clear buy-in from someone who can commit.
I completely agree that issues should be closed more aggressively for the reason you give. On the flip side this often ruffles feathers. We are still overrun with issues, but the culture has gotten a lot better about honestly rejecting lots of inbound stuff quickly.