+1. In fact I would be very much indebted if someone (namely Dmitriy :) ) could do a Google Hangout focused on Spark where folks can ask questions and learn more. To that end I want to bring up something else: it would be great if Mahout itself, either through the Apache Software Foundation or through committer means, had a Hadoop cluster for testing algorithms. It seems like folks each have their own cluster to test on, but I think it would be a benefit to the community to have a cluster that everyone can leverage.
> Subject: Mahout on Spark
> From: [email protected]
> Date: Wed, 26 Mar 2014 09:05:02 -0700
> To: [email protected]; [email protected]
>
> New name for a new thread.
>
> A lot of the discussion on MAHOUT-1464 has been around integrating that
> feature with the Scala DSL. As Saikat says, this is of general interest, since
> people seem to agree that this is a good place to integrate efforts.
>
> I’m interested in what I think Dmitriy called data frames. Being a complete
> noob on Spark I may have gotten this wrong, but let me take a shot so he can
> correct me.
>
> There are a lot of problems that require a pipeline. The text input pipeline
> is an example, but almost any input to Mahout requires at least an id
> translation step. What I thought Dmitriy was suggesting was that by avoiding
> the disk write + read between steps we might get significant speedups. This
> has many implications, I’m sure.
>
> For one, I think it means the non-serialized objects are being used by
> multiple parts of the pipeline and so are not subject to “translation”.
>
> Dmitriy, can you explain more? You mentioned a talk you have given; do you
> have slides somewhere or a PDF?
>
>
> On Mar 26, 2014, at 7:15 AM, Ted Dunning <[email protected]> wrote:
>
> It would be great to have you.
>
> (Go ahead and start new threads when appropriate ... better than hijacking.)
>
>
> On Wed, Mar 26, 2014 at 6:00 AM, Hardik Pandya <[email protected]> wrote:
>
> > Sorry to hijack the thread,
> >
> > this seems like the first steps of getting Mahout to work on Spark.
> >
> > There are similar efforts going on with R + Spark, aka SparkR.
> >
> > Not sure if this helps, but I played with the Spark EC2 scripts and they
> > bring up a multinode cluster using Mesos, and it’s configurable. Willing
> > to contribute donations for mahout-dev.
> >
> >
> > On Sun, Mar 23, 2014 at 11:22 PM, Saikat Kanjilal (JIRA) <[email protected]> wrote:
> >
> >> [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944710#comment-13944710 ]
> >>
> >> Saikat Kanjilal commented on MAHOUT-1464:
> >> -----------------------------------------
> >>
> >> +1 on Andrew’s suggestion on using AWS to do this. Andrew, is it possible
> >> to have a shared account so Mahout contributors can use it? I’d even be
> >> willing to chip in donations :) to have a shared AWS account.
> >>
> >>> RowSimilarityJob on Spark
> >>> -------------------------
> >>>
> >>> Key: MAHOUT-1464
> >>> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> >>> Project: Mahout
> >>> Issue Type: Improvement
> >>> Components: Collaborative Filtering
> >>> Affects Versions: 0.9
> >>> Environment: hadoop, spark
> >>> Reporter: Pat Ferrel
> >>> Labels: performance
> >>> Fix For: 1.0
> >>>
> >>> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch
> >>>
> >>> Create a version of RowSimilarityJob that runs on Spark. Ssc has a
> >>> prototype here: https://gist.github.com/sscdotopen/8314254. This should
> >>> be compatible with the Mahout Spark DRM DSL so a DRM can be used as input.
> >>> Ideally this would extend to cover MAHOUT-1422, which is a feature
> >>> request for RSJ on two inputs to calculate the similarity of the rows of
> >>> one DRM with those of another. This cross-similarity has several
> >>> applications, including cross-action recommendations.
> >>
> >> --
> >> This message was sent by Atlassian JIRA
> >> (v6.2#6252)
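
To make Pat's pipeline point above concrete, here is a rough sketch in plain Spark (Scala). The stages, path, and names are all made up for illustration; the point is only that the intermediate RDDs stay in memory between steps instead of being written to and re-read from HDFS the way chained MapReduce jobs would do it:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pipeline-sketch").setMaster("local[2]"))

    // Stage 1: tokenize. Cached because two later stages reuse it; with
    // chained MapReduce this intermediate would be a sequence file on HDFS.
    val tokens = sc.textFile("hdfs:///path/to/corpus") // hypothetical path
      .flatMap(_.split("\\s+"))
      .cache()

    // Stage 2: id translation -- assign each distinct term an integer id,
    // kept in memory rather than round-tripped through the disk.
    val dictionary = tokens.distinct().zipWithIndex()

    // Stage 3: term counts, re-keyed by the translated ids. No job boundary
    // and no serialization to disk between any of these steps.
    val counts = tokens.map(w => (w, 1L)).reduceByKey(_ + _)
      .join(dictionary)                            // (term, (count, id))
      .map { case (_, (count, id)) => (id, count) }

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```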
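And on the RSJ point in the quoted JIRA issue: with the Spark DRM DSL, the unnormalized core of both jobs collapses to matrix expressions. A minimal sketch, using the packages and operators as documented for Mahout's Spark bindings (the code was still in flux when this thread was written, so treat this as the shape, not the exact API; the tiny matrices are placeholders, and a real RowSimilarityJob would also normalize and downsample):

```scala
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

// Distributed context for the DRM operations below.
implicit val ctx = mahoutSparkContext(masterUrl = "local[2]", appName = "rsj-sketch")

// Placeholder DRMs; in practice these would be loaded from real input.
val drmA = drmParallelize(dense((1, 0, 1), (0, 1, 1)), numPartitions = 2)
val drmB = drmParallelize(dense((1, 1, 0), (0, 0, 1)), numPartitions = 2)

// Row self-similarity as raw dot products: S = A %*% A'
val selfSim = (drmA %*% drmA.t).collect

// MAHOUT-1422 cross-similarity: rows of A against rows of B
val crossSim = (drmA %*% drmB.t).collect

println(selfSim)
println(crossSim)
```

Because a DRM is the input and output type throughout, the cross-similarity case really is just the second expression; that is the sense in which the DSL is the right place to integrate these efforts.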
