[
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969722#comment-13969722
]
Pat Ferrel edited comment on MAHOUT-1464 at 4/15/14 4:51 PM:
-------------------------------------------------------------
Silence could mean several things: you don't know how to? It can't be done because jars aren't created for it yet? You'd rather launch from the Scala shell? If the latter, that's fine; I just want to get IDEA out of the equation, so instructions for running in the Scala shell would be helpful.
I plan to move on to using HDFS for storage but still have a local storage
failure below.
Concentrating on local storage for now I get the following from my dev machine
launching in IDEA:
input      | output     | mahoutSparkContext(masterUrl =  | Success?
local path | local path | "local"                         | yes
local path | local path | "spark://Maclaurin:7077"        | yes
local path | local path | "spark://occam4:7077"           | no

In the failing case the computation finishes correctly, but the last stage, which dumps/writes the DRM, fails. The Spark master is a remote machine that is also the HDFS master and manages three Spark slaves; everything looks OK in the Web UI and there are no errors in the Spark logs.
For this last case I have tried various forms of the "local path" for the output and suspect that getting the correct form of the URI may be the problem, so if someone sees the mistake please let me know:
1) "tmp/co-occurrence-on-epinions/indicators-item-item/": a path relative to the IDEA working directory, which works for input.
2) "/Users/pat/hdfs-mirror/tmp/co-occurrence-on-epinions/indicators-item-item/": an absolute path, so the IDEA working directory is not involved.
3) "file:///Users/pat/hdfs-mirror/tmp/co-occurrence-on-epinions/indicators-item-item/": the URI form of the full local path.
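To see how these three forms differ before Hadoop ever touches them, here is a small standalone sketch (plain java.net.URI, no Spark or Hadoop involved; the object name PathForms is just for illustration). Forms 1 and 2 carry no scheme, so whichever default filesystem is configured would apply; form 3 explicitly names the local file: scheme.

```scala
// Illustration only: how each candidate output path parses as a URI.
// Forms 1 and 2 have no scheme (the configured default filesystem would
// be used); form 3 explicitly names file:.
import java.net.URI

object PathForms {
  def schemeOf(path: String): String =
    Option(new URI(path).getScheme).getOrElse("(none: default fs applies)")

  def main(args: Array[String]): Unit = {
    val forms = Seq(
      "tmp/co-occurrence-on-epinions/indicators-item-item/",
      "/Users/pat/hdfs-mirror/tmp/co-occurrence-on-epinions/indicators-item-item/",
      "file:///Users/pat/hdfs-mirror/tmp/co-occurrence-on-epinions/indicators-item-item/"
    )
    forms.foreach(f => println(s"${schemeOf(f)}  <-  $f"))
  }
}
```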
Code for #3 is:
RecommendationExamplesHelper.saveIndicatorMatrix(indicatorMatrices(0),
"file:///Users/pat/hdfs-mirror/tmp/co-occurrence-on-epinions/indicators-item-item/")
For #3 I get the following exception message. The _temporary dir does exist; there is just nothing in it:
14/04/15 09:07:03 INFO scheduler.DAGScheduler: Failed to run saveAsTextFile at Recommendations.scala:178
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 8.0:0 failed 4 times (most recent failure: Exception failure: java.io.IOException: The temporary job-output directory file:/Users/pat/hdfs-mirror/tmp/co-occurrence-on-epinions/indicators-item-item/_temporary doesn't exist!)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
	at akka.actor.ActorCell.invoke(ActorCell.scala:456)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Disconnected from the target VM, address: '127.0.0.1:58830', transport: 'socket'
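One possible explanation, offered as an assumption rather than a confirmed diagnosis: with a remote master, each executor resolves a file:/ path against its own local disk, so the _temporary directory created on the driver's machine does not exist on the worker nodes. Writing to the shared HDFS filesystem would sidestep that; a hypothetical variant of the call from #3 (the namenode host occam4 comes from the setup above, but the port 9000 is an assumed default, not confirmed here):

```scala
// Hypothetical sketch: target the shared HDFS filesystem instead of a
// node-local path (namenode port 9000 is an assumption).
RecommendationExamplesHelper.saveIndicatorMatrix(indicatorMatrices(0),
  "hdfs://occam4:9000/tmp/co-occurrence-on-epinions/indicators-item-item/")
```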
> Cooccurrence Analysis on Spark
> ------------------------------
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
> Issue Type: Improvement
> Components: Collaborative Filtering
> Environment: hadoop, spark
> Reporter: Pat Ferrel
> Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM
> can be used as input.
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has
> several applications including cross-action recommendations.
--
This message was sent by Atlassian JIRA
(v6.2#6252)