Re: [GSOC 2014] Uniform API for Mahout Clustering

Suneel Marthi Wed, 19 Mar 2014 00:14:41 -0700



On Wednesday, March 19, 2014 3:09 AM, Dmitriy Lyubimov <[email protected]> 
wrote:
 
On Tue, Mar 18, 2014 at 11:56 PM, chalitha udara Perera <
[email protected]> wrote:

> Hi Dmitriy,
>
> I agree with you that I need to be more specific on this matter. Here I was
> referring to some suggestions given by Suneel on Mahout 1.0 goals [1], b and
> c.
>
He mainly speaks of test coverage and REST exposure there. What you are
saying is a bit more ambitious, IMO.

This was long before the discussion of H2O and Spark had come up. In a later 
email, I had also mentioned uniform interfaces for API and porting stuff to 
Spark.
>
> For example, this is one thing I have experienced while using Mahout
> clustering. I have used both simple kmeans and spectral kmeans. For
> simple kmeans the input is a sequence file containing the TF-IDF vectors of
> the documents, while for spectral kmeans it is a csv file defining the
> similarity matrix. It would have been much easier for users if spectral
> kmeans also took the TF-IDF vectors and created the similarity matrix
> internally. I think that would improve usability.
>
I don't think clustering is tf-idf specific. I think this is a chance for
proper componentization of concerns here.

Agree with Dmitriy here.

>
> And most of these algorithms are designed to run via the command line. I
> know that currently a lot of programmers just use the run(String []) method
> for programming. I am not saying it is impossible to use Mahout clustering
> algorithms as required, but it takes some effort: most of the time you need
> to dive into the code internals to use them properly, and most people are
> not going to do that. Please provide your valuable insight on this.
>
> I am also really interested in the new direction Mahout is heading with
> Spark, given that interest in Spark will only grow in the near future. If you
> think implementing some of the clustering algorithms, for example simple
> kmeans, on Spark is more important for the next release, I would be happy to
> work on that.
>

I would be happy to see you give it a try there, too.


>
> Regards,
> Chalitha
>
> [1]
>
> http://mail-archives.apache.org/mod_mbox/mahout-dev/201402.mbox/%[email protected]%3E
>
>
>
> On Wed, Mar 19, 2014 at 11:39 AM, Dmitriy Lyubimov <[email protected]
> >wrote:
>
> > I think you need to be a little more specific about what exactly you are
> > proposing. I think "uniform clustering api" needs a bit of
> > elaboration. Generally, I cannot say that I experienced any pain calling
> > clustering algorithms in, say, R as well-documented functions. In
> > Mahout, just doing the same was primarily a pain; but assuming one can
> > call it with ease and even interactively, I can't say I experienced any
> > major inconvenience with just doing this.
> >
> > I guess one can see that one can abstract away notions of clusters and
> > clustering output, but I don't have enough experience to tell whether it
> > is a good idea to cover _any_ possible clustering methodology.
> >
> >
> > On Tue, Mar 18, 2014 at 10:50 PM, chalitha udara Perera <
> > [email protected]> wrote:
> >
> > > Hi everyone,
> > >
> > > Greatly appreciate your interest in this issue. I have gone through the
> > > document ScalaSparkBindings [1]. In this project my initial idea was to
> > > provide a high-level API for end-user programmers so that they have the
> > > flexibility of plugging in different types of algorithms without
> > > worrying about the underlying details of different types of inputs or
> > > outputs. Also, I consider providing proper test coverage for all
> > > clustering algorithms a must for the 1.0 release.
> > >
> > > I would like to get your opinion on this; a little more detail on the
> > > current requirements for clustering would help me improve the proposal.
> > >
> > > Thanks,
> > > Chalitha
> > >
> > >
> > >
> > > On Mon, Mar 17, 2014 at 11:21 PM, Dmitriy Lyubimov <[email protected]
> > > >wrote:
> > >
> > > > Yes, there's interest.
> > > > Note that we are trying to unify linear algebra primitives and
> > > > optimization on Spark as well. All new linear algebra and interaction
> > > > with the Spark context should probably go through this layer. This is
> > > > an ongoing thing, but some stuff is working [1].
> > > >
> > > > [1] MAHOUT-1346 https://issues.apache.org/jira/browse/MAHOUT-1346
> > > >
> > > >
> > > > On Mon, Mar 17, 2014 at 10:37 AM, chalitha udara Perera <
> > > > [email protected]> wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > Going through the mail thread "Mahout 1.0 goals", I found that the
> > > > > main focus of Mahout is now on code refactoring and integration with
> > > > > Spark rather than implementing new algorithms. Recently I used Mahout
> > > > > to implement a document clustering module for a Content Management
> > > > > System.
> > > > >
> > > > > To be honest, we had some problems with the lack of uniformity among
> > > > > the different clustering algorithms. For example, simple Kmeans takes
> > > > > as input a sequence file with document TF-IDF vectors, while Spectral
> > > > > Kmeans takes a csv file that defines the similarity matrix.
> > > > >
> > > > > I think if we can provide a uniform clustering API as mentioned in
> > > > > the 1.0 goals, it would be very useful for end-user developers.
> > > > >
> > > > > I would like to proceed with this idea as my GSOC 2014 project.
> > > > > Please let me know if you are interested in this project.
> > > > > --
> > > > > J.M Chalitha Udara Perera
> > > > >
> > > > > *Department of Computer Science and Engineering,*
> > > > > *University of Moratuwa,*
> > > > > *Sri Lanka*
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > J.M Chalitha Udara Perera
> > >
> > > *Department of Computer Science and Engineering,*
> > > *University of Moratuwa,*
> > > *Sri Lanka*
> > >
> >
>
>
>
> --
> J.M Chalitha Udara Perera
>
> *Department of Computer Science and Engineering,*
> *University of Moratuwa,*
> *Sri Lanka*
>