Re: Contributing GMM and Perceptron to MADLib

Roman Shaposhnik Mon, 28 Mar 2016 21:37:51 -0700

Awesome!


On Mon, Mar 28, 2016 at 9:18 PM, Frank McQuillan <[email protected]> wrote:
> Thanks Roman.  I was able to do it just now.
>
> Frank
>
> On Mon, Mar 28, 2016 at 9:12 PM, Roman Shaposhnik <[email protected]> wrote:
>>
>> I can help with that -- stay tuned.
>>
>> On Mon, Mar 28, 2016 at 8:29 PM, Frank McQuillan <[email protected]>
>> wrote:
>> > Let me figure out how to do this and add Aditya as the owner of that
>> > JIRA.
>> > My initial attempts in ASF infra-land were not quite successful.
>> >
>> > Frank
>> >
>> > On Mon, Mar 28, 2016 at 4:54 PM, Rahul Iyer <[email protected]> wrote:
>> >>
>> >> @Frank, Roman: I believe Aditya needs to be added as a developer to the
>> >> MADlib project to assign a JIRA to him? Is this only available to the
>> >> lead/owner?
>> >>
>> >> On Mon, Mar 28, 2016 at 3:49 PM, Aditya Nain <[email protected]>
>> >> wrote:
>> >>>
>> >>> Hi Rahul,
>> >>>
>> >>> I didn't have an id, so I created one now.
>> >>> My id is : Aditya Nain
>> >>>
>> >>> Thanks,
>> >>> Aditya
>> >>>
>> >>> On Mon, Mar 28, 2016 at 6:40 PM, Rahul Iyer <[email protected]> wrote:
>> >>>
>> >>> > I can assign this to you, but you need to have an account in
>> >>> > https://issues.apache.org.
>> >>> > If you already have an account, then please send your id - I wasn't
>> >>> > able to find you just using your name.
>> >>> >
>> >>> > On Mon, Mar 28, 2016 at 3:31 PM, Aditya Nain <[email protected]>
>> >>> > wrote:
>> >>> >
>> >>> > > Hi Rahul,
>> >>> > >
>> >>> > > Thanks for the reply!
>> >>> > >
>> >>> > > I am working on implementing a Gaussian Mixture Model assuming that
>> >>> > > the covariance matrix is the same for all the Gaussians.
>> >>> > > The JIRA which deals with GMM is MADLIB-410:
>> >>> > >
>> >>> > > https://issues.apache.org/jira/browse/MADLIB-410?jql=project%20%3D%20MADLIB
>> >>> > >
>> >>> > > Can this be assigned to me, or how do I get it assigned to me?
>> >>> > >
>> >>> > > Thanks,
>> >>> > > Aditya
>> >>> > >
>> >>> > > On Mon, Mar 21, 2016 at 3:41 PM, Rahul Iyer <[email protected]>
>> >>> > > wrote:
>> >>> > >
>> >>> > > > Hi Aditya,
>> >>> > > >
>> >>> > > > Welcome to the MADlib community!
>> >>> > > >
>> >>> > > > Gaussian mixture models are extremely useful and we would heartily
>> >>> > > > welcome a contribution for them. The SQLEM paper might be
>> >>> > > > oversimplifying the capabilities of the database (e.g. assuming
>> >>> > > > there is no array type is unnecessary for PostgreSQL). You could
>> >>> > > > speed things up (both dev time and execution time) by writing some
>> >>> > > > of the functions in C++. K-means is an example of how clustering is
>> >>> > > > implemented.
>> >>> > > > IMO, assuming the same covariance matrix is reasonable. We could
>> >>> > > > extend the capabilities after the initial implementation is
>> >>> > > > complete.
>> >>> > > >
>> >>> > > > There was some work started a long time ago that built perceptrons
>> >>> > > > using the convex framework (link
>> >>> > > > <https://github.com/iyerr3/madlib/tree/mlp>).
>> >>> > > > There are still some bugs in that code since the trained network
>> >>> > > > isn't converging. You could start there or build a new module;
>> >>> > > > either way, an MLP module is frequently requested by the data
>> >>> > > > science community.
>> >>> > > >
>> >>> > > > I would suggest starting with Gaussian mixtures and then moving to
>> >>> > > > perceptrons once the GMM work is completed.
>> >>> > > >
>> >>> > > > Feel free to ask questions on this forum. Looking forward to
>> >>> > > > collaborating with you.
>> >>> > > >
>> >>> > > > Best,
>> >>> > > > Rahul
>> >>> > > >
>> >>> > > > On Thu, Mar 17, 2016 at 2:08 PM, Aditya Nain
>> >>> > > > <[email protected]>
>> >>> > > > wrote:
>> >>> > > >
>> >>> > > > > Hi,
>> >>> > > > >
>> >>> > > > > My name is Aditya Nain, and I am a graduate student at the
>> >>> > > > > University of Florida.
>> >>> > > > > I have been learning MADlib for a while and want to contribute to
>> >>> > > > > MADlib.
>> >>> > > > > I went through some of the open stories in JIRA and started
>> >>> > > > > working on MADLIB-410:
>> >>> > > > >
>> >>> > > > > https://issues.apache.org/jira/browse/MADLIB-410?jql=project%20%3D%20MADLIB
>> >>> > > > >
>> >>> > > > > which is about implementing a Gaussian Mixture Model using the
>> >>> > > > > Expectation Maximization (EM) algorithm.
>> >>> > > > >
>> >>> > > > > I came across the following paper while searching for a
>> >>> > > > > distributed EM algorithm which can be implemented in MADlib.
>> >>> > > > >
>> >>> > > > > Carlos Ordonez, Paul Cereghini "SQLEM: fast clustering in SQL
>> >>> > > > > using the EM algorithm" ACM SIGMOD Record, Volume 29 Issue 2,
>> >>> > > > > June 2000 Pages 559-570.
>> >>> > > > >
>> >>> > > > > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.7564
>> >>> > > > >
>> >>> > > > > I thought of implementing the approach discussed in the paper,
>> >>> > > > > but the paper makes an assumption that the covariance matrix is
>> >>> > > > > the same for all the clusters (i.e. the covariance matrix is the
>> >>> > > > > same for all the Gaussian distributions). So, I wanted to know
>> >>> > > > > the opinion of the community on whether it's fine to go with the
>> >>> > > > > assumption made in the paper and implement it in MADlib.
>> >>> > > > >
>> >>> > > > > Also, currently MADlib doesn't have an implementation of a
>> >>> > > > > perceptron, nor did I find any open story related to it in JIRA.
>> >>> > > > > I came across the following paper, which talks about a
>> >>> > > > > distributed algorithm for the perceptron:
>> >>> > > > >
>> >>> > > > > Ryan McDonald, Keith Hall, Gideon Mann "Distributed training
>> >>> > > > > strategies for the structured perceptron"
>> >>> > > > > http://dl.acm.org/citation.cfm?id=1858068
>> >>> > > > >
>> >>> > > > > Would it be useful to have a distributed implementation of the
>> >>> > > > > perceptron in MADlib?
>> >>> > > > >
>> >>> > > > > Thanks,
>> >>> > > > > Aditya
>> >>> > > > >
>> >>> > > >
>> >>> > >
>> >>> >
>> >>
>> >>
>> >
>
>
