It's not about directly porting algorithms to Spark; it's about porting them to a DSL that executes on top of Spark. This page has information about it:

https://mahout.apache.org/users/sparkbindings/home.html
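
For a flavor of what that DSL looks like, here is a minimal sketch based on the
sparkbindings page above (the imports and the mahoutSparkContext/drmParallelize
helpers are taken from that documentation; treat this as illustrative, not
authoritative):

  import org.apache.mahout.math._
  import scalabindings._
  import RLikeOps._
  import drm._
  import RLikeDrmOps._
  import org.apache.mahout.sparkbindings._

  // Attach Mahout's distributed algebra to a Spark context.
  implicit val ctx = mahoutSparkContext(masterUrl = "local", appName = "dsl-example")

  // Distribute a small in-core matrix as a DRM (distributed row matrix).
  val drmA = drmParallelize(dense((1, 2), (3, 4), (5, 6)))

  // R-like algebra; the expression is planned and executed on Spark.
  val drmAtA = drmA.t %*% drmA
  val inCoreAtA = drmAtA.collect  // pull the (small) result back in-core

An algorithm "ported" to this layer is written once against the DSL rather than
against Spark's RDD API directly.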

--sebastian

On 03/19/2014 08:43 AM, chalitha udara Perera wrote:
Thanks a lot, everyone, for the valuable insights. Since the main focus now is on
porting to Spark, I would be really happy to get involved with it. Can you
give me more information on the current progress of the port, especially
regarding the clustering component?

Regards,
Chalitha


On Wed, Mar 19, 2014 at 12:43 PM, Suneel Marthi <[email protected]> wrote:

On Wednesday, March 19, 2014 3:09 AM, Dmitriy Lyubimov <[email protected]>
wrote:

On Tue, Mar 18, 2014 at 11:56 PM, chalitha udara Perera <[email protected]> wrote:

Hi Dmitriy,

I agree with you that I need to be more specific on this matter. Here I was
referring to some suggestions given by Suneel on the Mahout 1.0 goals [1],
items b and c.

He mainly speaks of test coverage there and REST exposition. What you are
saying is a bit more ambitious, IMO.

This was long before the discussion of H2O and Spark came up. In a
later email, I had also mentioned uniform interfaces for the API and porting
stuff to Spark.

For example, this is one thing I have experienced while using Mahout
clustering. I have used both simple k-means and spectral k-means: for
simple k-means the input is a sequence file containing the TF-IDF vectors of
the documents, while for spectral k-means it is a CSV file defining the
similarity matrix. It would have been much easier for users if spectral
k-means also took the TF-IDF vectors and created the similarity matrix
internally. I think that would improve usability.
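
To make the asymmetry concrete, what I have in mind is something like the
following (a purely hypothetical Scala sketch: Clusterer, SimpleKMeans,
SpectralKMeans.cluster and buildSimilarityMatrix as written here do not exist
in Mahout; they only illustrate the uniform entry point):

  import org.apache.mahout.math.drm.DrmLike

  // Hypothetical uniform API: every algorithm consumes the same TF-IDF DRM.
  trait Clusterer {
    // rows are document vectors; the result maps each row to a cluster
    def cluster(tfidf: DrmLike[Int], k: Int): DrmLike[Int]
  }

  object SimpleKMeans extends Clusterer {
    def cluster(tfidf: DrmLike[Int], k: Int): DrmLike[Int] = ???  // sketch only
  }

  object SpectralKMeans extends Clusterer {
    def cluster(tfidf: DrmLike[Int], k: Int): DrmLike[Int] = {
      // The similarity matrix is derived internally instead of being
      // supplied by the user as a CSV file.
      val similarity = buildSimilarityMatrix(tfidf)  // hypothetical helper
      ???                                            // sketch only
    }
    private def buildSimilarityMatrix(tfidf: DrmLike[Int]): DrmLike[Int] = ???
  }

Under such an API, switching algorithms becomes a one-line change for the caller.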

I don't think clustering is TF-IDF specific. I think this is a chance for
proper componentization of concerns here.

Agree with Dmitriy here.


Totally agree. I was just trying to give an example of uniformity.



And most of these algorithms are designed to run via the command line. I
know that currently a lot of programmers just use the run(String[]) method from
their code. I am not saying it is impossible to use the Mahout clustering
algorithms as required, but it takes some effort: most of the time you need to
dive into the code internals to use them properly, and most people are
not going to do that. Please provide your valuable insight on this.
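
For reference, the String[] style looks roughly like this in practice (a sketch
only: the option names below are abbreviated from memory and should be checked
against the KMeansDriver of the version in use):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.util.ToolRunner
  import org.apache.mahout.clustering.kmeans.KMeansDriver

  // Drive the job exactly as the CLI would, by handing the driver a
  // String[] of command-line flags.
  val args = Array(
    "-i", "tfidf-vectors",    // sequence file of TF-IDF document vectors
    "-c", "initial-clusters", // seed clusters
    "-o", "kmeans-output",
    "-k", "20",               // number of clusters
    "-x", "10")               // max iterations
  ToolRunner.run(new Configuration(), new KMeansDriver(), args)

A typed KMeansDriver.run(...) entry point exists as well, but as noted above,
using it usually means reading the driver internals first.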

I am also really interested in the new direction Mahout is heading with Spark,
given that interest in Spark will only grow in the near future. If you
think that implementing some of the clustering algorithms, for example simple
k-means, on Spark is more important for the next release, I would be happy to
work on that.


I would be happy to see you give it a try there, too.



Regards,
Chalitha

[1]
http://mail-archives.apache.org/mod_mbox/mahout-dev/201402.mbox/%[email protected]%3E



On Wed, Mar 19, 2014 at 11:39 AM, Dmitriy Lyubimov <[email protected]> wrote:

I think you need to be a little bit more specific as to what you are
proposing exactly. I think "uniform clustering API" needs a bit of
elaboration. Generally, I cannot say that I have experienced any pain calling
clustering algorithms, say in R, as well-documented functions. In Mahout,
just doing the same was primarily a pain; but assuming one can call an
algorithm with ease, and even interactively, I can't say I experienced any
major inconvenience with just doing this.

I guess one can see how one could abstract away the notions of clusters and
clustering output, but I don't have enough experience to tell whether it is
a good idea to try to cover _any_ possible clustering methodology.
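
For what it's worth, the abstraction question can be made concrete: different
methodologies produce structurally different outputs (illustrative Scala types
only, not Mahout APIs):

  sealed trait ClusteringResult
  // One hard label per row (k-means).
  case class HardAssignments(labels: Array[Int]) extends ClusteringResult
  // A membership distribution per row (fuzzy k-means).
  case class SoftAssignments(memberships: Array[Array[Double]]) extends ClusteringResult
  // A merge tree rather than flat labels (agglomerative methods).
  case class Dendrogram(merges: Array[(Int, Int)]) extends ClusteringResult

An interface that promises to cover all of these either degenerates to a marker
type or leaves most of the result unspecified.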


On Tue, Mar 18, 2014 at 10:50 PM, chalitha udara Perera <[email protected]> wrote:

Hi everyone,

I greatly appreciate your interest in this issue. I have gone through the
ScalaSparkBindings document [1]. In this project my initial idea was to
provide a high-level API for end-user programmers, so that they have the
flexibility to plug in different types of algorithms without worrying about
the underlying details of the different input and output formats. I also
consider proper test coverage for all clustering algorithms a must for the
1.0 release.

I would like to get your opinion regarding this; a little more detail on the
current requirements for clustering would help me improve the proposal.

Thanks,
Chalitha



On Mon, Mar 17, 2014 at 11:21 PM, Dmitriy Lyubimov <[email protected]> wrote:

Yes, there's interest.
Note that we are trying to unify linear algebra primitives and optimization
on Spark as well. All new linear algebra and interaction with the Spark
context should probably go through this layer. This is an ongoing effort,
but some of it is already working [1].

[1] MAHOUT-1346: https://issues.apache.org/jira/browse/MAHOUT-1346
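
Concretely, "going through this layer" means obtaining the context and the data
via the bindings and expressing the math in the unified algebra. A sketch using
the entry points documented for [1] follows; since the work is ongoing, names
and packages may still shift:

  import org.apache.mahout.math._
  import scalabindings._
  import RLikeOps._
  import drm._
  import RLikeDrmOps._
  import org.apache.mahout.sparkbindings._

  implicit val ctx = mahoutSparkContext(masterUrl = "local", appName = "mahout-1346")

  // Load a distributed row matrix from a sequence file on (H)DFS...
  val drmA = drmFromHDFS(path = "hdfs:///path/to/tfidf-vectors")

  // ...and write the computation against the algebra, not raw RDDs.
  val gram = (drmA.t %*% drmA).checkpoint()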


On Mon, Mar 17, 2014 at 10:37 AM, chalitha udara Perera <[email protected]> wrote:

Hi All,

Going through the mail thread on Mahout 1.0 goals, I found that the main
focus of Mahout is now on code refactoring and integration with Spark,
rather than on implementing new algorithms. Recently I used Mahout to
implement a document clustering module for a Content Management System.

To be honest, we had some problems with the lack of uniformity among the
different clustering algorithms. For example, simple k-means takes as input
a sequence file with document TF-IDF vectors, while spectral k-means takes
a CSV file that defines the similarity matrix.

I think that if we can provide a uniform clustering API, as mentioned in the
1.0 goals, it would be very useful for end-user developers.

I would like to proceed with this idea as my GSoC 2014 project. Please let
me know if you are interested in this project.
--
J.M Chalitha Udara Perera

*Department of Computer Science and Engineering,*
*University of Moratuwa,*
*Sri Lanka*




