Thank you Ram and Joseph. I am also hoping to contribute to MLlib once my Scala gets up to snuff; this is the guidance I needed for how to proceed when ready.
Best wishes,
Trevor

On Wed, May 20, 2015 at 1:55 PM, Joseph Bradley <jos...@databricks.com> wrote:

> Hi Trevor,
>
> I may be repeating what Ram said, but to second it, a few points:
>
> We do want MLlib to become an extensive and rich ML library; as you said, scikit-learn is a great example. To make that happen, we of course need to include important algorithms. "Important" is hazy, but roughly means being useful to a large number of users, improving a large number of use cases (above what is currently available), and being well established and tested.
>
> Others and I may not be familiar with Tarek's algorithm (since it is so new), so it will be important to discuss details on JIRA to establish the cases in which the algorithm improves over the current PCA. That may require discussion, community testing, etc. If we establish that it is a clear improvement in a large domain, then it could be valuable to have in MLlib proper. It's always going to be hard to tell where to draw the line, so less common algorithms will require more testing before we commit to including them in MLlib.
>
> I like the Spark package suggestion, since it would allow users to start using the code immediately while the discussion on JIRA happens. (Plus, if package users find it useful, they can report that on the JIRA.)
>
> Joseph
>
> On Wed, May 20, 2015 at 10:01 AM, Ram Sriharsha <sriharsha....@gmail.com> wrote:
>
>> Hi Trevor,
>>
>> I'm attaching the MLlib contribution guideline here:
>>
>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines
>>
>> It speaks to widely known and accepted algorithms, but not to whether an algorithm has to be better than another in every scenario, etc.
>>
>> I think the guideline explains what a good contribution to the core library should look like better than I initially attempted to!
>>
>> Sent from my iPhone
>>
>> On May 20, 2015, at 9:31 AM, Ram Sriharsha <sriharsha....@gmail.com> wrote:
>>
>> Hi Trevor,
>>
>> Good point, I didn't mean that some algorithm has to be clearly better than another in every scenario to be included in MLlib. However, even if someone is willing to be the maintainer of a piece of code, it does not make sense to accept every possible algorithm into the core library.
>>
>> That said, the specific algorithms should be discussed in the JIRA: as you point out, there is no clear way to decide which algorithms to include and which not to. Mature algorithms that serve a wide variety of scenarios are usually easier to argue about, but nothing prevents anyone from opening a ticket to discuss any specific machine learning algorithm.
>>
>> My suggestion was simply that for the purposes of making experimental or newer algorithms available to Spark users, it doesn't necessarily have to be in the core library. Spark packages are good enough in this respect.
>>
>> Isn't it better for newer algorithms to take this route and prove themselves before we bring them into the core library? Especially given that the barrier to using Spark packages is very low.
>>
>> Ram
>>
>> On Wed, May 20, 2015 at 9:05 AM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:
>>
>>> Hey Ram,
>>>
>>> I'm not speaking to Tarek's package specifically but to the spirit of MLlib. There are a number of methods/algorithms for PCA, and I'm not sure by what criterion the current one is considered 'standard'.
>>>
>>> It is rare to find ANY machine learning algorithm that is 'clearly better' than any other. They are all tools; they have their place and time. I agree that it makes sense to field new algorithms as packages and then integrate them into MLlib once they are 'proven' (in terms of stability/performance/anyone cares).
>>> That being said, if MLlib takes the stance that 'what we have is good enough unless something is *clearly* better', then it will never grow into a suite with the depth and richness of sklearn. From a practitioner's standpoint, it's nice to have everything I could ever want ready in an 'off-the-shelf' form.
>>>
>>> 'Better than existing for a large number of use cases' shouldn't be a criterion when selecting what to include in MLlib. The important question should be, 'Are you willing to take on responsibility for maintaining this, because you may be the only person on earth who understands the mechanics AND how to code it?'. Obviously we don't want any random junk algorithm included. But trying to say 'this way of doing PCA is better than that way in a large class of cases' is like trying to say 'geometry is more important than calculus in a large class of cases'; maybe it's true, but geometry won't help you if you are in a case where you need calculus.
>>>
>>> This all relies on the assumption that MLlib is destined to be a rich data science/machine learning package. It may be that the goal is to make the project as lightweight and parsimonious as possible; if so, excuse me for speaking out of turn.
>>>
>>> On Tue, May 19, 2015 at 10:41 AM, Ram Sriharsha <sriharsha....@gmail.com> wrote:
>>>
>>>> Hi Trevor, Tarek,
>>>>
>>>> You can make non-standard algorithms (PCA or otherwise) available to users of Spark as Spark packages:
>>>> http://spark-packages.org
>>>> https://databricks.com/blog/2014/12/22/announcing-spark-packages.html
>>>>
>>>> With the availability of Spark packages, adding powerful experimental/alternative machine learning algorithms to the pipeline has never been easier. I would suggest that route in scenarios where one machine learning algorithm is not clearly better in the common scenarios than an existing implementation in MLlib.
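[Editorial aside: the point above that there are several valid algorithms for PCA, with no single one canonically 'standard', can be made concrete with a minimal NumPy sketch. Python is used here purely for illustration rather than the Scala/Java under discussion, and this is not the implementation being proposed in the thread: it shows two textbook PCA routes, eigendecomposition of the covariance matrix and SVD of the centered data, recovering the same principal subspace.]

```python
# Illustration only (NumPy, not Spark): two standard ways to compute PCA.
# Both recover the same principal subspace, which is why no single PCA
# algorithm is unambiguously "the" method.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)                  # center the data first

# Route 1: eigendecomposition of the sample covariance matrix.
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
top2_eig = eigvecs[:, np.argsort(eigvals)[::-1][:2]]

# Route 2: singular value decomposition of the centered data matrix.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
top2_svd = Vt[:2].T

# The two bases span the same subspace (columns may differ only in sign),
# so the absolute overlap matrix is the identity.
overlap = np.abs(top2_eig.T @ top2_svd)
print(np.allclose(overlap, np.eye(2), atol=1e-6))
```

The two routes trade off differently (the covariance route needs only a d x d matrix; the SVD route is numerically gentler), which is exactly the kind of scenario-dependent distinction the thread is about.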
>>>>
>>>> If your algorithm is better than the existing PCA implementation for a large class of use cases, then we should open a JIRA and discuss the relative strengths/weaknesses (perhaps with some benchmarks) so we can better understand whether it makes sense to switch out the existing PCA implementation and make yours the default.
>>>>
>>>> Ram
>>>>
>>>> On Tue, May 19, 2015 at 6:56 AM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:
>>>>
>>>>> There are most likely advantages and disadvantages to Tarek's algorithm relative to the current implementation, and different scenarios where each is more appropriate.
>>>>>
>>>>> Would we not offer multiple PCA algorithms and let the user choose?
>>>>>
>>>>> Trevor
>>>>>
>>>>> Trevor Grant
>>>>> Data Scientist
>>>>>
>>>>> *"Fortunate is he, who is able to know the causes of things." -Virgil*
>>>>>
>>>>> On Mon, May 18, 2015 at 4:18 PM, Joseph Bradley <jos...@databricks.com> wrote:
>>>>>
>>>>>> Hi Tarek,
>>>>>>
>>>>>> Thanks for your interest & for checking the guidelines first! On 2 points:
>>>>>>
>>>>>> Algorithm: PCA is of course a critical algorithm. The main question is how your algorithm/implementation differs from the current PCA. If it's different and potentially better, I'd recommend opening a JIRA for explaining & discussing it.
>>>>>>
>>>>>> Java/Scala: We really do require that algorithms be in Scala, for the sake of maintainability. The conversion should be doable if you're willing, since Scala is a pretty friendly language. If you create the JIRA, you could also ask for help there to see if someone can collaborate with you to convert the code to Scala.
>>>>>>
>>>>>> Thanks!
>>>>>> Joseph
>>>>>>
>>>>>> On Mon, May 18, 2015 at 3:13 AM, Tarek Elgamal <tarek.elga...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I would like to contribute an algorithm to the MLlib project.
>>>>>>> I have implemented a scalable PCA algorithm on Spark. It is scalable for both tall and fat matrices, and the paper describing it has been accepted for publication at the SIGMOD 2015 conference. I looked at the guidelines at the following link:
>>>>>>>
>>>>>>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines
>>>>>>>
>>>>>>> I believe that most of the guidelines apply in my case; however, the code is written in Java, and it was not clear from the guidelines whether the MLlib project accepts Java code or not.
>>>>>>> My algorithm can be found under this repository:
>>>>>>> https://github.com/Qatar-Computing-Research-Institute/sPCA
>>>>>>>
>>>>>>> Any help on how to make it suitable for the MLlib project would be greatly appreciated.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Tarek Elgamal
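[Editorial aside: Tarek's distinction between "tall" and "fat" matrices reflects a standard observation about why tall-matrix PCA is the easy case. The NumPy sketch below is a generic illustration of that observation, not sPCA's actual algorithm: for an n x d matrix with n >> d, the d x d Gram matrix fits in memory, so the principal directions can be found locally even when the rows are distributed.]

```python
# Illustration only (NumPy, not sPCA's method): for a tall matrix, PCA
# reduces to an eigendecomposition of a small d x d Gram matrix.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(10_000, 8))         # tall: many rows, few columns
Ac = A - A.mean(axis=0)

# In a distributed setting each worker could sum outer products of its
# rows; here we simply form the Gram matrix directly.
gram = Ac.T @ Ac                          # only d x d = 8 x 8
eigvals, eigvecs = np.linalg.eigh(gram)
top_k = eigvecs[:, np.argsort(eigvals)[::-1][:3]]

# Sanity check against a direct SVD of the full centered matrix: both
# should span the same top-3 principal subspace (columns up to sign).
_, _, Vt = np.linalg.svd(Ac, full_matrices=False)
print(np.allclose(np.abs(top_k.T @ Vt[:3].T), np.eye(3), atol=1e-6))
```

A "fat" matrix (d comparable to or larger than n) breaks this shortcut, since the d x d Gram matrix no longer fits on one machine; that harder case is where specialized scalable algorithms such as the one proposed in the thread come in.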