Re: [GSOC 2014] Uniform API for Mahout Clustering

chalitha udara Perera Tue, 18 Mar 2014 23:57:07 -0700

Hi Dmitriy,

I agree that I need to be more specific on this matter. I was referring to
suggestions (b) and (c) that Suneel made in the Mahout 1.0 goals thread [1].


For example, here is one thing I experienced while using Mahout clustering.
I have used both simple k-means and spectral k-means. For simple k-means
the input is a sequence file containing the TF-IDF vectors of the
documents, while for spectral k-means it is a CSV file defining the
similarity matrix. It would be much easier for users if spectral k-means
also took the TF-IDF vectors and created the similarity matrix internally.
I think that would improve usability.
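
To make the idea concrete, here is a rough sketch in plain Java (not Mahout
code) of deriving a similarity matrix from document vectors. The class and
method names are hypothetical, and cosine similarity is only my assumption
for the similarity measure; a real implementation would work on Mahout's
vector types and run distributed.

```java
public class AffinitySketch {

    // Cosine similarity between two dense vectors of equal length.
    public static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an all-zero vector is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Builds the symmetric n x n similarity matrix for n document vectors.
    public static double[][] affinity(double[][] docs) {
        int n = docs.length;
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(docs[i], docs[j]);
                sim[i][j] = s;
                sim[j][i] = s;
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        // Three toy "TF-IDF" vectors; the values are made up for illustration.
        double[][] tfidf = { {1.0, 0.0}, {0.0, 1.0}, {2.0, 0.0} };
        double[][] sim = affinity(tfidf);
        for (double[] row : sim) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

With something like this behind the API, both algorithms could accept the
same vector input and the user would never have to build the matrix by hand.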

Also, most of these algorithms are designed to be run from the command
line. I know that currently a lot of programmers just call the
run(String[]) method from their code. I am not saying it is impossible to
use the Mahout clustering algorithms as required, but it takes some effort:
most of the time you need to dive into the code internals to use them
properly, and most people are not going to do that. Please provide your
valuable insight on this.
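
As an illustration of what I mean by a friendlier entry point, below is a
minimal sketch of a typed options object that is validated up front and
then translated into the argument array a run(String[]) style driver
expects. Everything here is hypothetical: the class name, the defaults and
the flag names are placeholders for illustration and would have to be
checked against the options each driver actually accepts.

```java
import java.util.ArrayList;
import java.util.List;

public class ClusteringOptionsSketch {
    private final String input;
    private final String output;
    private int numClusters = 10;   // placeholder default
    private int maxIterations = 20; // placeholder default

    public ClusteringOptionsSketch(String input, String output) {
        if (input == null || output == null) {
            throw new IllegalArgumentException("input and output are required");
        }
        this.input = input;
        this.output = output;
    }

    public ClusteringOptionsSketch numClusters(int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("numClusters must be positive");
        }
        this.numClusters = k;
        return this;
    }

    public ClusteringOptionsSketch maxIterations(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("maxIterations must be positive");
        }
        this.maxIterations = n;
        return this;
    }

    // Translates the typed options into a command-line style argument array.
    // The flag names below are illustrative placeholders.
    public String[] toArgs() {
        List<String> args = new ArrayList<>();
        args.add("--input");
        args.add(input);
        args.add("--output");
        args.add(output);
        args.add("--numClusters");
        args.add(Integer.toString(numClusters));
        args.add("--maxIter");
        args.add(Integer.toString(maxIterations));
        return args.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] cli = new ClusteringOptionsSketch("tfidf-vectors", "clusters-out")
                .numClusters(5)
                .maxIterations(10)
                .toArgs();
        System.out.println(String.join(" ", cli));
    }
}
```

The point is only that mistakes such as a negative cluster count would be
caught by the API before a job is launched, without the caller having to
read the driver source.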

I am also really interested in the new direction Mahout is heading with
Spark, given that interest in Spark will only grow in the near future. If
you think implementing some of the clustering algorithms (for example,
simple k-means) on Spark is more important for the next release, I would be
happy to work on that.

Regards,
Chalitha

[1]
http://mail-archives.apache.org/mod_mbox/mahout-dev/201402.mbox/%[email protected]%3E



On Wed, Mar 19, 2014 at 11:39 AM, Dmitriy Lyubimov <[email protected]> wrote:

> I think you need to be a little bit more specific as to what you are
> proposing exactly.  I think "uniform clustering api" needs a bit of
> elaboration. I, generally, cannot say that I experienced any pain calling
> out clustering algorithms say in R as a well-documented function. In Mahout
> just doing the same was primarily a pain; but assuming one can call it with
> ease and even interactively, I can't say I experienced any major
> inconvenience with just doing this.
>
> I guess one can see that one can abstract away notions of clusters and
> clustering output, but I don't have enough experience to tell whether it is
> a good idea to cover _any_ possible clustering methodology.
>
>
> On Tue, Mar 18, 2014 at 10:50 PM, chalitha udara Perera <
> [email protected]> wrote:
>
> > Hi everyone,
> >
> > Greatly appreciate your interest in this issue. I have gone through the
> > ScalaSparkBindings document [1]. In this project my initial idea was to
> > provide a high-level API for end-user programmers so that they have the
> > flexibility of plugging in different types of algorithms without
> > worrying about the underlying details of different types of inputs or
> > outputs. I also consider proper test coverage for all clustering
> > algorithms a must for the 1.0 release.
> >
> > I would like to get your opinion on this; a little more detail on the
> > current requirements for clustering would help me improve the proposal.
> >
> > Thanks,
> > Chalitha
> >
> >
> >
> > On Mon, Mar 17, 2014 at 11:21 PM, Dmitriy Lyubimov <[email protected]
> > >wrote:
> >
> > > Yes, there's interest.
> > > Note that we are trying to unify linear algebra primitives and
> > > optimization on Spark as well. All new linear algebra and interaction
> > > with the Spark context should probably go through this layer. This is
> > > an ongoing thing, but some stuff is working [1].
> > >
> > > [1] MAHOUT-1346 https://issues.apache.org/jira/browse/MAHOUT-1346
> > >
> > >
> > > On Mon, Mar 17, 2014 at 10:37 AM, chalitha udara Perera <
> > > [email protected]> wrote:
> > >
> > > > Hi All,
> > > >
> > > > Going through the mail thread "Mahout 1.0 goals", I found that the
> > > > main focus of Mahout is now on code refactoring and integration with
> > > > Spark rather than on implementing new algorithms. Recently I used
> > > > Mahout to implement a document clustering module for a Content
> > > > Management System.
> > > >
> > > > To be honest, we had some problems with the lack of uniformity among
> > > > the different clustering algorithms. For example, simple k-means
> > > > takes as input a sequence file of document TF-IDF vectors, while
> > > > spectral k-means takes a CSV file that defines the similarity matrix.
> > > >
> > > > I think a uniform clustering API, as mentioned in the 1.0 goals,
> > > > would be very useful for end-user developers.
> > > >
> > > > I would like to proceed with this idea as my GSOC 2014 project.
> > > > Please let me know if you are interested in this project.
> > > > --
> > > > J.M Chalitha Udara Perera
> > > >
> > > > *Department of Computer Science and Engineering,*
> > > > *University of Moratuwa,*
> > > > *Sri Lanka*
> > > >
> > >
> >
> >
> >
> > --
> > J.M Chalitha Udara Perera
> >
> > *Department of Computer Science and Engineering,*
> > *University of Moratuwa,*
> > *Sri Lanka*
> >
>



-- 
J.M Chalitha Udara Perera

*Department of Computer Science and Engineering,*
*University of Moratuwa,*
*Sri Lanka*
