Thank you Ram and Joseph. I am also hoping to contribute to MLlib once my Scala gets up to snuff; this is the guidance I needed for how to proceed when ready.
Best wishes,
Trevor

On Wed, May 20, 2015 at 1:55 PM, Joseph Bradley <jos...@databricks.com> wrote:

> Hi Trevor,
>
> I may be repeating what Ram said, but to second it, a few points:
>
> We do want MLlib to become an extensive and rich ML library; as you said, scikit-learn is a great example. To make that happen, we of course need to include important algorithms. "Important" is hazy, but roughly means being useful to a large number of users, improving a large number of use cases (above what is currently available), and being well established and tested.
>
> Others and I may not be familiar with Tarek's algorithm (since it is so new), so it will be important to discuss details on JIRA to establish the cases in which the algorithm improves over the current PCA. That may require discussion, community testing, etc. If we establish that it is a clear improvement in a large domain, then it could be valuable to have in MLlib proper. It's always going to be hard to tell where to draw the line, so less common algorithms will require more testing before we commit to including them in MLlib.
>
> I like the Spark package suggestion, since it would allow users to start using the code immediately while the discussion on JIRA happens. (Plus, if package users find it useful, they can report that on the JIRA.)
>
> Joseph
>
> On Wed, May 20, 2015 at 10:01 AM, Ram Sriharsha <sriharsha....@gmail.com> wrote:
>
>> Hi Trevor,
>>
>> I'm attaching the MLlib contribution guideline here:
>>
>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines
>>
>> It speaks to widely known and accepted algorithms, but not to whether an algorithm has to be better than another in every scenario, etc.
>>
>> I think the guideline explains what a good contribution to the core library should look like better than I initially attempted to!
>>
>> Sent from my iPhone
>>
>> On May 20, 2015, at 9:31 AM, Ram Sriharsha <sriharsha....@gmail.com> wrote:
>>
>> Hi Trevor,
>>
>> Good point, I didn't mean that some algorithm has to be clearly better than another in every scenario to be included in MLlib. However, even if someone is willing to be the maintainer of a piece of code, it does not make sense to accept every possible algorithm into the core library.
>>
>> That said, the specific algorithms should be discussed in the JIRA: as you point out, there is no clear way to decide which algorithms to include and which not to. Mature algorithms that serve a wide variety of scenarios are usually easier to argue about, but nothing prevents anyone from opening a ticket to discuss any specific machine learning algorithm.
>>
>> My suggestion was simply that for the purposes of making experimental or newer algorithms available to Spark users, it doesn't necessarily have to be in the core library. Spark packages are good enough in this respect.
>>
>> Isn't it better for newer algorithms to take this route and prove themselves before we bring them into the core library? Especially given that the barrier to using Spark packages is very low.
>>
>> Ram
>>
>> On Wed, May 20, 2015 at 9:05 AM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:
>>
>>> Hey Ram,
>>>
>>> I'm not speaking to Tarek's package specifically but to the spirit of MLlib. There are a number of methods/algorithms for PCA, and I'm not sure by what criterion the current one is considered 'standard'.
>>>
>>> It is rare to find ANY machine learning algorithm that is 'clearly better' than any other. They are all tools; they have their place and time. I agree that it makes sense to field new algorithms as packages and then integrate them into MLlib once they are 'proven' (in terms of stability/performance/anyone cares).
>>> That being said, if MLlib takes the stance that 'what we have is good enough unless something is *clearly* better', then it will never grow into a suite with the depth and richness of sklearn. From a practitioner's standpoint, it's nice to have everything I could ever want ready in an 'off-the-shelf' form.
>>>
>>> 'Better than existing for a large number of use cases' shouldn't be a criterion when selecting what to include in MLlib. The important question should be, 'Are you willing to take on responsibility for maintaining this, because you may be the only person on earth who understands the mechanics AND how to code it?'. Obviously we don't want any random junk algorithm included. But trying to say 'this way of doing PCA is better than that way in a large class of cases' is like trying to say 'geometry is more important than calculus in a large class of cases'; maybe it's true, but geometry won't help you if you are in a case where you need calculus.
>>>
>>> This all relies on the assumption that MLlib is destined to be a rich data science/machine learning package. It may be that the goal is to make the project as lightweight and parsimonious as possible; if so, excuse me for speaking out of turn.
>>>
>>> On Tue, May 19, 2015 at 10:41 AM, Ram Sriharsha <sriharsha....@gmail.com> wrote:
>>>
>>>> Hi Trevor, Tarek,
>>>>
>>>> You can make non-standard algorithms (PCA or otherwise) available to users of Spark as Spark packages:
>>>> http://spark-packages.org
>>>> https://databricks.com/blog/2014/12/22/announcing-spark-packages.html
>>>>
>>>> With the availability of Spark packages, adding powerful experimental/alternative machine learning algorithms to the pipeline has never been easier. I would suggest that route in scenarios where one machine learning algorithm is not clearly better in the common scenarios than an existing implementation in MLlib.
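[Editorial aside: the point above that there are several valid algorithms for PCA, with no single one canonically 'standard', can be made concrete with a minimal NumPy sketch. Python is used here purely for illustration rather than the Scala/Java under discussion, and this is not the implementation being proposed in the thread: it shows two textbook PCA routes, eigendecomposition of the covariance matrix and SVD of the centered data, recovering the same principal subspace.]

```python
# Illustration only (NumPy, not Spark): two standard ways to compute PCA.
# Both recover the same principal subspace, which is why no single PCA
# algorithm is unambiguously "the" method.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)                  # center the data first

# Route 1: eigendecomposition of the sample covariance matrix.
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
top2_eig = eigvecs[:, np.argsort(eigvals)[::-1][:2]]

# Route 2: singular value decomposition of the centered data matrix.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
top2_svd = Vt[:2].T

# The two bases span the same subspace (columns may differ only in sign),
# so the absolute overlap matrix is the identity.
overlap = np.abs(top2_eig.T @ top2_svd)
print(np.allclose(overlap, np.eye(2), atol=1e-6))
```

The two routes trade off differently (the covariance route needs only a d x d matrix; the SVD route is numerically gentler), which is exactly the kind of scenario-dependent distinction the thread is about.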
>>>>
>>>> If your algorithm is better than the existing PCA implementation for a large class of use cases, then we should open a JIRA and discuss the relative strengths/weaknesses (perhaps with some benchmarks) so we can better understand whether it makes sense to switch out the existing PCA implementation and make yours the default.
>>>>
>>>> Ram
>>>>
>>>> On Tue, May 19, 2015 at 6:56 AM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:
>>>>
>>>>> There are most likely advantages and disadvantages to Tarek's algorithm relative to the current implementation, and different scenarios where each is more appropriate.
>>>>>
>>>>> Would we not offer multiple PCA algorithms and let the user choose?
>>>>>
>>>>> Trevor
>>>>>
>>>>> Trevor Grant
>>>>> Data Scientist
>>>>>
>>>>> *"Fortunate is he, who is able to know the causes of things." -Virgil*
>>>>>
>>>>> On Mon, May 18, 2015 at 4:18 PM, Joseph Bradley <jos...@databricks.com> wrote:
>>>>>
>>>>>> Hi Tarek,
>>>>>>
>>>>>> Thanks for your interest & for checking the guidelines first! On 2 points:
>>>>>>
>>>>>> Algorithm: PCA is of course a critical algorithm. The main question is how your algorithm/implementation differs from the current PCA. If it's different and potentially better, I'd recommend opening a JIRA for explaining & discussing it.
>>>>>>
>>>>>> Java/Scala: We really do require that algorithms be in Scala, for the sake of maintainability. The conversion should be doable if you're willing, since Scala is a pretty friendly language. If you create the JIRA, you could also ask for help there to see if someone can collaborate with you to convert the code to Scala.
>>>>>>
>>>>>> Thanks!
>>>>>> Joseph
>>>>>>
>>>>>> On Mon, May 18, 2015 at 3:13 AM, Tarek Elgamal <tarek.elga...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I would like to contribute an algorithm to the MLlib project.
>>>>>>> I have implemented a scalable PCA algorithm on Spark. It is scalable for both tall and fat matrices, and the paper describing it has been accepted for publication at the SIGMOD 2015 conference. I looked at the guidelines at the following link:
>>>>>>>
>>>>>>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines
>>>>>>>
>>>>>>> I believe that most of the guidelines apply in my case; however, the code is written in Java, and it was not clear from the guidelines whether the MLlib project accepts Java code or not.
>>>>>>> My algorithm can be found under this repository:
>>>>>>> https://github.com/Qatar-Computing-Research-Institute/sPCA
>>>>>>>
>>>>>>> Any help on how to make it suitable for the MLlib project would be greatly appreciated.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Tarek Elgamal
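[Editorial aside: Tarek's distinction between "tall" and "fat" matrices reflects a standard observation about why tall-matrix PCA is the easy case. The NumPy sketch below is a generic illustration of that observation, not sPCA's actual algorithm: for an n x d matrix with n >> d, the d x d Gram matrix fits in memory, so the principal directions can be found locally even when the rows are distributed.]

```python
# Illustration only (NumPy, not sPCA's method): for a tall matrix, PCA
# reduces to an eigendecomposition of a small d x d Gram matrix.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(10_000, 8))         # tall: many rows, few columns
Ac = A - A.mean(axis=0)

# In a distributed setting each worker could sum outer products of its
# rows; here we simply form the Gram matrix directly.
gram = Ac.T @ Ac                          # only d x d = 8 x 8
eigvals, eigvecs = np.linalg.eigh(gram)
top_k = eigvecs[:, np.argsort(eigvals)[::-1][:3]]

# Sanity check against a direct SVD of the full centered matrix: both
# should span the same top-3 principal subspace (columns up to sign).
_, _, Vt = np.linalg.svd(Ac, full_matrices=False)
print(np.allclose(np.abs(top_k.T @ Vt[:3].T), np.eye(3), atol=1e-6))
```

A "fat" matrix (d comparable to or larger than n) breaks this shortcut, since the d x d Gram matrix no longer fits on one machine; that harder case is where specialized scalable algorithms such as the one proposed in the thread come in.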