matchError:null in ALS.train

2014-07-03 Thread Honey Joshi
Hi All,

We are using ALS.train to generate a model for predictions. We are using a
DStream[] to collect the predicted output and then trying to dump it to a
text file using two approaches, dstream.saveAsTextFiles() and
dstream.foreachRDD(rdd => rdd.saveAsTextFile(...)). Both approaches are
giving us the following error:


Exception in thread "main" org.apache.spark.SparkException: Job aborted
due to stage failure: Task 1.0:0 failed 1 times, most recent failure:
Exception failure in TID 0 on host localhost: scala.MatchError: null
org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571)
org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:43)
MyOperator$$anonfun$7.apply(MyOperator.scala:213)
MyOperator$$anonfun$7.apply(MyOperator.scala:180)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)

We tried this with both Spark 0.9.1 and 1.0.0 (Scala 2.10.3). Can anybody
help me with this issue?
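
For context, a minimal sketch of the pattern we are describing (this is
not the actual MyOperator code; the input path, socket source and ALS
parameters here are only illustrative):

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object AlsPredictionDump {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext("local[2]", "als-predictions", Seconds(10))
    val sc  = ssc.sparkContext

    // Offline training data: "user,product,rating" lines.
    val ratings = sc.textFile("ratings.txt").map { line =>
      val Array(u, p, r) = line.split(",")
      Rating(u.toInt, p.toInt, r.toDouble)
    }
    val model = ALS.train(ratings, 10, 10, 0.01)

    // Stream of "user,product" pairs to score.
    val requests = ssc.socketTextStream("localhost", 9999).map { line =>
      val Array(u, p) = line.split(",")
      (u.toInt, p.toInt)
    }

    // Predict per batch (transform runs on the driver, so the model can be
    // used here) and dump each batch to a text file.
    val predictions = requests.transform(rdd => model.predict(rdd))
    predictions.foreachRDD(rdd =>
      rdd.saveAsTextFile("predictions-" + System.currentTimeMillis))

    ssc.start()
    ssc.awaitTermination()
  }
}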

Thank You
Regards

Honey Joshi
Ideata-Analytics


Re: Error: UnionPartition cannot be cast to org.apache.spark.rdd.HadoopPartition

2014-07-03 Thread Honey Joshi
On Wed, July 2, 2014 2:00 am, Mayur Rustagi wrote:
 Two job contexts cannot share data. Are you collecting the data to the
 master and then sending it to the other context?

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi




 On Wed, Jul 2, 2014 at 11:57 AM, Honey Joshi 
 honeyjo...@ideata-analytics.com wrote:

 On Wed, July 2, 2014 1:11 am, Mayur Rustagi wrote:

 Ideally you should be converting the RDD to a SchemaRDD?
 Are you creating a UnionRDD to join across DStream RDDs?




 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi





 On Tue, Jul 1, 2014 at 3:11 PM, Honey Joshi
 honeyjo...@ideata-analytics.com


 wrote:



 Hi,
 I am trying to run a project which takes data as a DStream and dumps
 the data into a Shark table after various operations. I am getting
 the following error:

 Exception in thread "main" org.apache.spark.SparkException: Job aborted:
 Task 0.0:0 failed 1 times (most recent failure: Exception failure:
 java.lang.ClassCastException: org.apache.spark.rdd.UnionPartition
 cannot be cast to org.apache.spark.rdd.HadoopPartition) at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
 at scala.Option.foreach(Option.scala:236) at
 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at
 akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at
 akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



 Can someone please explain the cause of this error? I am also using
 a SparkContext alongside the existing StreamingContext.





 I am using Spark 0.9.0-incubating, so it doesn't have anything to do
 with SchemaRDD. This error is probably coming from using one Spark
 context and one Shark context in the same job. Is there any way to
 incorporate two contexts in one job? Regards


 Honey Joshi
 Ideata-Analytics




Both of these contexts were executing independently, but they were still
giving us issues, mostly because of lazy evaluation in Scala. This error
was probably coming from using one Spark context and one Shark context in
the same job. We got it resolved by stopping the existing Spark context
before creating the Shark context. Thanks for your help Mayur.
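
For reference, a minimal sketch of that workaround (illustrative only, not
our actual job; the master URL, app name, batch interval and the Shark
setup step are assumptions):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// The streaming part of the job runs first on its own context.
val ssc = new StreamingContext("local[2]", "dstream-job", Seconds(10))
// ... build the DStreams and the operations that produce the output ...
ssc.start()
// ... wait for the batches of interest to complete ...

// Stop streaming before bringing up Shark, so that only one context is
// alive at a time (on these versions stop() also stops the underlying
// SparkContext).
ssc.stop()

// Only now create the Shark context and write the results into the Shark
// table (the Shark-specific setup is omitted here).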



Re: Error: UnionPartition cannot be cast to org.apache.spark.rdd.HadoopPartition

2014-07-02 Thread Honey Joshi
On Wed, July 2, 2014 1:11 am, Mayur Rustagi wrote:
 Ideally you should be converting the RDD to a SchemaRDD?
 Are you creating a UnionRDD to join across DStream RDDs?



 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi




 On Tue, Jul 1, 2014 at 3:11 PM, Honey Joshi
 honeyjo...@ideata-analytics.com

 wrote:


 Hi,
 I am trying to run a project which takes data as a DStream and dumps the
 data into a Shark table after various operations. I am getting the
 following error:

 Exception in thread "main" org.apache.spark.SparkException: Job aborted:
 Task 0.0:0 failed 1 times (most recent failure: Exception failure:
 java.lang.ClassCastException: org.apache.spark.rdd.UnionPartition cannot
 be cast to org.apache.spark.rdd.HadoopPartition) at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
 at scala.Option.foreach(Option.scala:236) at
 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at
 akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at
 akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


 Can someone please explain the cause of this error? I am also using a
 SparkContext alongside the existing StreamingContext.




I am using Spark 0.9.0-incubating, so it doesn't have anything to do with
SchemaRDD. This error is probably coming from using one Spark context and
one Shark context in the same job. Is there any way to incorporate two
contexts in one job?
Regards

Honey Joshi
Ideata-Analytics



Error: UnionPartition cannot be cast to org.apache.spark.rdd.HadoopPartition

2014-07-01 Thread Honey Joshi
Hi,
I am trying to run a project which takes data as a DStream and dumps the
data into a Shark table after various operations. I am getting the
following error:

Exception in thread "main" org.apache.spark.SparkException: Job aborted:
Task 0.0:0 failed 1 times (most recent failure: Exception failure:
java.lang.ClassCastException: org.apache.spark.rdd.UnionPartition cannot
be cast to org.apache.spark.rdd.HadoopPartition)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Can someone please explain the cause of this error? I am also using a
SparkContext alongside the existing StreamingContext.
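
For reference, a minimal sketch of using a SparkContext together with a
StreamingContext in one driver by letting them share a single underlying
context (illustrative only; the socket source, names and batch interval
are assumptions, and the Shark table write itself is omitted):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-to-shark")
val sc   = new SparkContext(conf)

// Layer the StreamingContext on the same SparkContext instead of creating
// a second, independent context in the same driver.
val ssc = new StreamingContext(sc, Seconds(10))

val lines = ssc.socketTextStream("localhost", 9999)  // illustrative source
lines.foreachRDD { rdd =>
  // Per-batch work with the shared `sc` (e.g. joining against static RDDs)
  // goes here; the write into the Shark table is omitted.
}

ssc.start()
ssc.awaitTermination()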


java.io.FileNotFoundException: http://IP/broadcast_1

2014-07-01 Thread Honey Joshi
Hi All,

We are using a Shark table to dump the data, and we are getting the
following error:

Exception in thread "main" org.apache.spark.SparkException: Job aborted:
Task 1.0:0 failed 1 times (most recent failure: Exception failure:
java.io.FileNotFoundException: http://IP/broadcast_1)

We don't know where the error is coming from; can anyone please explain
the cause of this error and how to handle it? The spark.cleaner.ttl is set
to 4600, which I guess is more than enough to run the application.
Spark Version : 0.9.0-incubating
Shark : 0.9.0-SNAPSHOT
Scala : 2.10.3
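
For reference, a minimal sketch of where spark.cleaner.ttl is being set
(illustrative only, not the actual job; the master and app name are
assumptions, 4600 is the value mentioned above):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")             // illustrative master
  .setAppName("shark-dump-job")      // illustrative app name
  .set("spark.cleaner.ttl", "4600")  // metadata older than this many seconds is cleaned up

// In long-running jobs the TTL needs to outlive anything later stages
// still fetch (broadcast files included); otherwise fetches such as
// http://<driver>/broadcast_1 can fail once the cleaner has removed them.
val sc = new SparkContext(conf)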

Thank You
Honey Joshi
Ideata Analytics