It's not about directly porting algorithms to Spark; it's about porting them to a DSL that executes on top of Spark. This page has information about it:

https://mahout.apache.org/users/sparkbindings/home.html
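
For a flavor of what that DSL looks like, here is a minimal sketch based on the
sparkbindings page above (the imports and the mahoutSparkContext/drmParallelize
helpers are taken from that documentation; treat this as illustrative, not
authoritative):

  import org.apache.mahout.math._
  import scalabindings._
  import RLikeOps._
  import drm._
  import RLikeDrmOps._
  import org.apache.mahout.sparkbindings._

  // Attach Mahout's distributed algebra to a Spark context.
  implicit val ctx = mahoutSparkContext(masterUrl = "local", appName = "dsl-example")

  // Distribute a small in-core matrix as a DRM (distributed row matrix).
  val drmA = drmParallelize(dense((1, 2), (3, 4), (5, 6)))

  // R-like algebra; the expression is planned and executed on Spark.
  val drmAtA = drmA.t %*% drmA
  val inCoreAtA = drmAtA.collect  // pull the (small) result back in-core

An algorithm "ported" to this layer is written once against the DSL rather than
against Spark's RDD API directly.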

--sebastian

On 03/19/2014 08:43 AM, chalitha udara Perera wrote:
Thanks a lot, everyone, for the valuable insights. Since the main focus now is on
porting to Spark, I would be really happy to get involved with it. Can you
give me more information on the current progress of the port, especially
regarding the clustering component?

Regards,
Chalitha


On Wed, Mar 19, 2014 at 12:43 PM, Suneel Marthi <[email protected]> wrote:

On Wednesday, March 19, 2014 3:09 AM, Dmitriy Lyubimov <[email protected]>
wrote:

On Tue, Mar 18, 2014 at 11:56 PM, chalitha udara Perera <[email protected]> wrote:

Hi Dmitriy,

I agree with you that I need to be more specific on this matter. Here I was
referring to some suggestions given by Suneel on the Mahout 1.0 goals [1],
items b and c.

He mainly speaks of test coverage there and REST exposition. What you are
saying is a bit more ambitious, IMO.

This was long before the discussion of H2O and Spark came up. In a
later email, I had also mentioned uniform interfaces for the API and porting
stuff to Spark.

For example, this is one thing I have experienced while using Mahout
clustering. I have used both simple k-means and spectral k-means: for
simple k-means the input is a sequence file containing the TF-IDF vectors of
the documents, while for spectral k-means it is a CSV file defining the
similarity matrix. It would have been much easier for users if spectral
k-means also took the TF-IDF vectors and created the similarity matrix
internally. I think that would improve usability.
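
To make the asymmetry concrete, what I have in mind is something like the
following (a purely hypothetical Scala sketch: Clusterer, SimpleKMeans,
SpectralKMeans.cluster and buildSimilarityMatrix as written here do not exist
in Mahout; they only illustrate the uniform entry point):

  import org.apache.mahout.math.drm.DrmLike

  // Hypothetical uniform API: every algorithm consumes the same TF-IDF DRM.
  trait Clusterer {
    // rows are document vectors; the result maps each row to a cluster
    def cluster(tfidf: DrmLike[Int], k: Int): DrmLike[Int]
  }

  object SimpleKMeans extends Clusterer {
    def cluster(tfidf: DrmLike[Int], k: Int): DrmLike[Int] = ???  // sketch only
  }

  object SpectralKMeans extends Clusterer {
    def cluster(tfidf: DrmLike[Int], k: Int): DrmLike[Int] = {
      // The similarity matrix is derived internally instead of being
      // supplied by the user as a CSV file.
      val similarity = buildSimilarityMatrix(tfidf)  // hypothetical helper
      ???                                            // sketch only
    }
    private def buildSimilarityMatrix(tfidf: DrmLike[Int]): DrmLike[Int] = ???
  }

Under such an API, switching algorithms becomes a one-line change for the caller.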

I don't think clustering is TF-IDF specific. I think this is a chance for
proper componentization of concerns here.

Agree with Dmitriy here.


Totally agree. I was just trying to give an example of uniformity.



And most of these algorithms are designed to run via the command line. I
know that currently a lot of programmers just use the run(String[]) method from
their code. I am not saying it is impossible to use the Mahout clustering
algorithms as required, but it takes some effort: most of the time you need to
dive into the code internals to use them properly, and most people are
not going to do that. Please provide your valuable insight on this.
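
For reference, the String[] style looks roughly like this in practice (a sketch
only: the option names below are abbreviated from memory and should be checked
against the KMeansDriver of the version in use):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.util.ToolRunner
  import org.apache.mahout.clustering.kmeans.KMeansDriver

  // Drive the job exactly as the CLI would, by handing the driver a
  // String[] of command-line flags.
  val args = Array(
    "-i", "tfidf-vectors",    // sequence file of TF-IDF document vectors
    "-c", "initial-clusters", // seed clusters
    "-o", "kmeans-output",
    "-k", "20",               // number of clusters
    "-x", "10")               // max iterations
  ToolRunner.run(new Configuration(), new KMeansDriver(), args)

A typed KMeansDriver.run(...) entry point exists as well, but as noted above,
using it usually means reading the driver internals first.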

I am also really interested in the new direction Mahout is heading with Spark,
given that interest in Spark will only grow in the near future. If you
think that implementing some of the clustering algorithms, for example simple
k-means, on Spark is more important for the next release, I would be happy to
work on that.


I would be happy to see you give it a try there, too.



Regards,
Chalitha

[1]
http://mail-archives.apache.org/mod_mbox/mahout-dev/201402.mbox/%[email protected]%3E



On Wed, Mar 19, 2014 at 11:39 AM, Dmitriy Lyubimov <[email protected]> wrote:

I think you need to be a little bit more specific as to what you are
proposing exactly. I think "uniform clustering API" needs a bit of
elaboration. Generally, I cannot say that I have experienced any pain calling
clustering algorithms, say in R, as well-documented functions. In Mahout,
just doing the same was primarily a pain; but assuming one can call an
algorithm with ease, and even interactively, I can't say I experienced any
major inconvenience with just doing this.

I guess one can see how one could abstract away the notions of clusters and
clustering output, but I don't have enough experience to tell whether it is
a good idea to try to cover _any_ possible clustering methodology.
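
For what it's worth, the abstraction question can be made concrete: different
methodologies produce structurally different outputs (illustrative Scala types
only, not Mahout APIs):

  sealed trait ClusteringResult
  // One hard label per row (k-means).
  case class HardAssignments(labels: Array[Int]) extends ClusteringResult
  // A membership distribution per row (fuzzy k-means).
  case class SoftAssignments(memberships: Array[Array[Double]]) extends ClusteringResult
  // A merge tree rather than flat labels (agglomerative methods).
  case class Dendrogram(merges: Array[(Int, Int)]) extends ClusteringResult

An interface that promises to cover all of these either degenerates to a marker
type or leaves most of the result unspecified.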


On Tue, Mar 18, 2014 at 10:50 PM, chalitha udara Perera <[email protected]> wrote:

Hi everyone,

I greatly appreciate your interest in this issue. I have gone through the
ScalaSparkBindings document [1]. In this project my initial idea was to
provide a high-level API for end-user programmers, so that they have the
flexibility to plug in different types of algorithms without worrying about
the underlying details of the different input and output formats. I also
consider proper test coverage for all clustering algorithms a must for the
1.0 release.

I would like to get your opinion regarding this; a little more detail on the
current requirements for clustering would help me improve the proposal.

Thanks,
Chalitha



On Mon, Mar 17, 2014 at 11:21 PM, Dmitriy Lyubimov <[email protected]> wrote:

Yes, there's interest.
Note that we are trying to unify linear algebra primitives and optimization
on Spark as well. All new linear algebra and interaction with the Spark
context should probably go through this layer. This is an ongoing effort,
but some of it is already working [1].

[1] MAHOUT-1346: https://issues.apache.org/jira/browse/MAHOUT-1346
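
Concretely, "going through this layer" means obtaining the context and the data
via the bindings and expressing the math in the unified algebra. A sketch using
the entry points documented for [1] follows; since the work is ongoing, names
and packages may still shift:

  import org.apache.mahout.math._
  import scalabindings._
  import RLikeOps._
  import drm._
  import RLikeDrmOps._
  import org.apache.mahout.sparkbindings._

  implicit val ctx = mahoutSparkContext(masterUrl = "local", appName = "mahout-1346")

  // Load a distributed row matrix from a sequence file on (H)DFS...
  val drmA = drmFromHDFS(path = "hdfs:///path/to/tfidf-vectors")

  // ...and write the computation against the algebra, not raw RDDs.
  val gram = (drmA.t %*% drmA).checkpoint()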


On Mon, Mar 17, 2014 at 10:37 AM, chalitha udara Perera <[email protected]> wrote:

Hi All,

Going through the mail thread on Mahout 1.0 goals, I found that the main
focus of Mahout is now on code refactoring and integration with Spark,
rather than on implementing new algorithms. Recently I used Mahout to
implement a document clustering module for a Content Management System.

To be honest, we had some problems with the lack of uniformity among the
different clustering algorithms. For example, simple k-means takes as input
a sequence file with document TF-IDF vectors, while spectral k-means takes
a CSV file that defines the similarity matrix.

I think that if we can provide a uniform clustering API, as mentioned in the
1.0 goals, it would be very useful for end-user developers.

I would like to proceed with this idea as my GSoC 2014 project. Please let
me know if you are interested in this project.
--
J.M Chalitha Udara Perera

*Department of Computer Science and Engineering,*
*University of Moratuwa,*
*Sri Lanka*




