Hi Trevor,

I'm attaching the MLlib contribution guideline here:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines
It speaks to widely known and accepted algorithms, but not to whether an
algorithm has to be better than another in every scenario, etc. I think the
guideline explains what a good contribution to the core library should look
like better than I initially attempted to!

Sent from my iPhone

> On May 20, 2015, at 9:31 AM, Ram Sriharsha <[email protected]> wrote:
>
> Hi Trevor,
>
> Good point, I didn't mean that an algorithm has to be clearly better than
> another in every scenario to be included in MLlib. However, even if someone
> is willing to be the maintainer of a piece of code, it does not make sense
> to accept every possible algorithm into the core library.
>
> That said, specific algorithms should be discussed in JIRA: as you point
> out, there is no clear way to decide which algorithms to include and which
> not to. Mature algorithms that serve a wide variety of scenarios are
> usually easier to argue for, but nothing prevents anyone from opening a
> ticket to discuss any specific machine learning algorithm.
>
> My suggestion was simply that, for the purpose of making experimental or
> newer algorithms available to Spark users, an algorithm doesn't necessarily
> have to be in the core library. Spark packages are good enough in this
> respect.
>
> Isn't it better for newer algorithms to take this route and prove
> themselves before we bring them into the core library? Especially given
> that the barrier to using Spark packages is very low.
>
> Ram
>
>
>> On Wed, May 20, 2015 at 9:05 AM, Trevor Grant <[email protected]>
>> wrote:
>>
>> Hey Ram,
>>
>> I'm not speaking to Tarek's package specifically but to the spirit of
>> MLlib. There are a number of methods/algorithms for PCA; I'm not sure by
>> what criterion the current one is considered 'standard'.
>>
>> It is rare to find ANY machine learning algorithm that is 'clearly
>> better' than any other. They are all tools; they have their place and
>> time. I agree that it makes sense to field new algorithms as packages and
>> then integrate them into MLlib once they are 'proven' (in terms of
>> stability/performance/whether anyone cares). That being said, if MLlib
>> takes the stance that 'what we have is good enough unless something is
>> clearly better', then it will never grow into a suite with the depth and
>> richness of sklearn. From a practitioner's standpoint, it's nice to have
>> everything I could ever want ready in an 'off-the-shelf' form.
>>
>> 'Better than the existing implementation for a large number of use cases'
>> shouldn't be the criterion when selecting what to include in MLlib. The
>> important question should be: 'Are you willing to take on responsibility
>> for maintaining this, because you may be the only person on earth who
>> understands the mechanics AND how to code it?' Obviously we don't want
>> any random junk algorithm included. But trying to say 'this way of doing
>> PCA is better than that way in a large class of cases' is like trying to
>> say 'geometry is more important than calculus in a large class of cases':
>> maybe it's true, but geometry won't help you if you are in a case where
>> you need calculus.
>>
>> This all relies on the assumption that MLlib is destined to be a rich
>> data science/machine learning package. It may be that the goal is to make
>> the project as lightweight and parsimonious as possible; if so, excuse me
>> for speaking out of turn.
>>
>>
>>> On Tue, May 19, 2015 at 10:41 AM, Ram Sriharsha <[email protected]>
>>> wrote:
>>>
>>> Hi Trevor, Tarek,
>>>
>>> You can make non-standard algorithms (PCA or otherwise) available to
>>> users of Spark as Spark packages.
>>>
>>> http://spark-packages.org
>>> https://databricks.com/blog/2014/12/22/announcing-spark-packages.html
>>>
>>> With the availability of Spark packages, adding powerful experimental or
>>> alternative machine learning algorithms to the pipeline has never been
>>> easier. I would suggest that route in scenarios where one machine
>>> learning algorithm is not clearly better in the common scenarios than an
>>> existing implementation in MLlib.
>>>
>>> If your algorithm is better than the existing PCA implementation for a
>>> large class of use cases, then we should open a JIRA and discuss the
>>> relative strengths and weaknesses (perhaps with some benchmarks) so we
>>> can better understand whether it makes sense to switch out the existing
>>> PCA implementation and make yours the default.
>>>
>>> Ram
>>>
>>>> On Tue, May 19, 2015 at 6:56 AM, Trevor Grant <[email protected]>
>>>> wrote:
>>>>
>>>> There are most likely advantages and disadvantages to Tarek's algorithm
>>>> relative to the current implementation, and different scenarios where
>>>> each is more appropriate.
>>>>
>>>> Would we not offer multiple PCA algorithms and let the user choose?
>>>>
>>>> Trevor
>>>>
>>>> Trevor Grant
>>>> Data Scientist
>>>>
>>>> "Fortunate is he, who is able to know the causes of things." -Virgil
>>>>
>>>>
>>>>> On Mon, May 18, 2015 at 4:18 PM, Joseph Bradley <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Hi Tarek,
>>>>>
>>>>> Thanks for your interest & for checking the guidelines first! On two
>>>>> points:
>>>>>
>>>>> Algorithm: PCA is of course a critical algorithm. The main question is
>>>>> how your algorithm/implementation differs from the current PCA. If
>>>>> it's different and potentially better, I'd recommend opening up a JIRA
>>>>> for explaining & discussing it.
>>>>>
>>>>> Java/Scala: We really do require that algorithms be in Scala, for the
>>>>> sake of maintainability. The conversion should be doable if you're
>>>>> willing, since Scala is a pretty friendly language. If you create the
>>>>> JIRA, you could also ask for help there to see if someone can
>>>>> collaborate with you to convert the code to Scala.
>>>>>
>>>>> Thanks!
>>>>> Joseph
>>>>>
>>>>>> On Mon, May 18, 2015 at 3:13 AM, Tarek Elgamal <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I would like to contribute an algorithm to the MLlib project. I have
>>>>>> implemented a scalable PCA algorithm on Spark. It is scalable for
>>>>>> both tall and fat matrices, and the paper around it has been accepted
>>>>>> for publication at the SIGMOD 2015 conference. I looked at the
>>>>>> guidelines at the following link:
>>>>>>
>>>>>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines
>>>>>>
>>>>>> I believe that most of the guidelines apply in my case; however, the
>>>>>> code is written in Java, and it was not clear from the guidelines
>>>>>> whether the MLlib project accepts Java code or not.
>>>>>> My algorithm can be found under this repository:
>>>>>> https://github.com/Qatar-Computing-Research-Institute/sPCA
>>>>>>
>>>>>> Any help on how to make it suitable for the MLlib project would be
>>>>>> greatly appreciated.
>>>>>>
>>>>>> Best Regards,
>>>>>> Tarek Elgamal
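
For readers following the thread: the "current PCA" under discussion is the
RowMatrix-based implementation in the Spark 1.x mllib API. Below is a minimal
sketch of how it is typically invoked, assuming a spark-shell session (so `sc`
already exists); the input data is made up purely for illustration.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Build a distributed row matrix from an RDD of feature vectors.
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 9.0)
    ))
    val mat = new RowMatrix(rows)

    // Compute the top k principal components; the result is a local n x k matrix.
    val k = 2
    val pc = mat.computePrincipalComponents(k)

    // Project each row into the k-dimensional principal subspace.
    val projected = mat.multiply(pc)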

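On Ram's point that the barrier to using Spark packages is very low: consuming
a package only requires passing its Maven coordinates at launch via the
--packages flag (available since Spark 1.3). The coordinates and the SPCA API
in the sketch below are hypothetical placeholders, since sPCA had not been
published as a Spark package at the time of this thread.

    // Hypothetical launch command; --packages resolves the artifact from a
    // Maven-style repository (e.g. spark-packages.org) onto the classpath:
    //
    //   spark-shell --packages com.example:spca_2.10:0.1.0
    //
    // Once resolved, the packaged algorithm is importable like any other
    // dependency. SPCA and fit() are placeholder names for this sketch only.
    import com.example.spca.SPCA

    val model = SPCA.fit(mat, k = 2)  // mat: a RowMatrix, as in the sketch above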