Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1
I figured out the issue: the driver memory was at 512 MB, and for our datasets the following code needed more memory...

// Materialize usersOut and productsOut.
usersOut.count()
productsOut.count()

Thanks.
Deb

On Sat, Aug 9, 2014 at 6:12 PM, Debasish Das debasish.da...@gmail.com wrote:

Actually nope, it did not work fine... With multiple ALS iterations, I am getting the same error (with or without my mllib changes):

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 206 in stage 52.0 failed 4 times, most recent failure: Lost task 206.3 in stage 52.0 (TID , tblpmidn42adv-hdp.tdc.vzwcorp.com): java.lang.ClassCastException: scala.Tuple1 cannot be cast to scala.Product2
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5$$anonfun$apply$4.apply(CoGroupedRDD.scala:159)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:129)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:126)
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:126)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1153)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1142)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1141)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1141)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:682)
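For context, a minimal sketch of what the materialization step above does, written against the public MLlib 1.0.x ALS API rather than the patched ALS.scala from this thread; the input path, CSV format, and app name are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Driver memory has to be set at launch time (e.g. spark-submit --driver-memory 4g);
// it cannot be raised after the driver JVM has already started.
val sc = new SparkContext(new SparkConf().setAppName("als-materialize-sketch"))

// Assumes one "user,product,rating" triple per line; malformed lines are not handled here.
val ratings = sc.textFile("hdfs:///tmp/ratings.csv")
  .map(_.split(',') match { case Array(u, p, r) => Rating(u.toInt, p.toInt, r.toDouble) })

val model = ALS.train(ratings, 25, 10, 1.0) // rank, iterations, lambda, mirroring the submit args below

// count() is an action, so it forces the lazily defined factor RDDs to be computed,
// which is what the usersOut.count() / productsOut.count() calls inside ALS.scala do.
println(model.userFeatures.count())
println(model.productFeatures.count())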
Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1
Hi Xiangrui,

Based on your suggestion I moved core and mllib both to 1.1.0-SNAPSHOT... I am still getting the class cast exception:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 249 in stage 52.0 failed 4 times, most recent failure: Lost task 249.3 in stage 52.0 (TID 10002, tblpmidn06adv-hdp.tdc.vzwcorp.com): java.lang.ClassCastException: scala.Tuple1 cannot be cast to scala.Product2

I am running ALS.scala merged with my changes. I will try the mllib jar without my changes next... Can this be due to the fact that my jars are compiled with Java 1.7_55 but the cluster JRE is at 1.7_45?

Thanks.
Deb

On Wed, Aug 6, 2014 at 12:01 PM, Debasish Das debasish.da...@gmail.com wrote:
On Wed, Aug 6, 2014 at 9:45 AM, DB Tsai dbt...@dbtsai.com wrote:
On Aug 6, 2014 8:29 AM, Debasish Das debasish.da...@gmail.com wrote:
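The cast failure above comes out of CoGroupedRDD, which is reached through ALS's cogroup of the user and product factor blocks. A small, hedged sketch of the element shape cogroup works with (plain Spark API; the data and the version-mixing explanation in the comment reflect the thread's hypothesis, not something this snippet can prove):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD functions must be imported explicitly on the 1.x line

val conf = new SparkConf().setAppName("cogroup-shape-sketch").setMaster("local[2]")
val sc = new SparkContext(conf)

val users    = sc.parallelize(Seq((1, "userFactor1"), (2, "userFactor2"))) // RDD[(Int, String)]
val products = sc.parallelize(Seq((1, "prodFactor1"), (3, "prodFactor3")))

// cogroup is only defined for pair RDDs; at runtime CoGroupedRDD casts each shuffled
// element to Product2. Mixing an mllib built against newer core with an older deployed
// core can leave the shuffle writer and reader disagreeing about that element shape,
// which is one way to hit "scala.Tuple1 cannot be cast to scala.Product2" even though
// the user code itself is fine.
users.cogroup(products).collect().foreach(println)

sc.stop()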
Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1
I was having this same problem earlier this week and had to include my changes in the assembly.

On Sat, Aug 9, 2014 at 9:59 AM, Debasish Das debasish.da...@gmail.com wrote:

I validated that I can reproduce this problem with master as well (without adding any of my mllib changes)...

I separated the mllib jar from the assembly, deployed the assembly, and then supplied the mllib jar through the --jars option to spark-submit... I get this error:

14/08/09 12:49:32 INFO DAGScheduler: Failed to run count at ALS.scala:299
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 238 in stage 40.0 failed 4 times, most recent failure: Lost task 238.3 in stage 40.0 (TID 10002, tblpmidn05adv-hdp.tdc.vzwcorp.com): java.lang.ClassCastException: scala.Tuple1 cannot be cast to scala.Product2
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5$$anonfun$apply$4.apply(CoGroupedRDD.scala:159)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:129)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:126)
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:126)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1153)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1142)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1141)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1141)
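For the "include my changes in the assembly" approach above, here is a hedged sketch of one way to declare the dependencies so the snapshot MLlib ships inside the application jar while the cluster keeps supplying Spark core. The thread's build appears to be Maven-based, so this build.sbt fragment and its coordinates are only an illustrative assumption:

// Hypothetical build.sbt fragment: bundle spark-mllib 1.1.0-SNAPSHOT in the application
// assembly, but mark spark-core as "provided" so the deployed cluster Spark supplies it.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.1.0-SNAPSHOT" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.1.0-SNAPSHOT"
)

// The locally built snapshot has to be resolvable, e.g. from the local Maven repository.
resolvers += Resolver.mavenLocal

Note that this is exactly the newer-component-on-older-core mix discussed later in the thread, so it only sidesteps the classpath problem, not the compatibility question.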
Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1
Hi Xiangrui,

Maintaining another file will be a pain later, so I deployed spark 1.0.1 without mllib, and my application jar bundles mllib 1.1.0-SNAPSHOT along with the code changes for quadratic optimization... Later the plan is to patch the snapshot mllib with the deployed stable mllib...

There are 5 variants that I am experimenting with on around 400M ratings (daily data; monthly data I will update in a few days):

1. LS
2. NNLS
3. Quadratic with bounds
4. Quadratic with L1
5. Quadratic with equality and positivity

Now the ALS 1.1.0 snapshot runs fine, but after completion of this step at ALS.scala:311

// Materialize usersOut and productsOut.
usersOut.count()

I am getting from one of the executors:

java.lang.ClassCastException: scala.Tuple1 cannot be cast to scala.Product2

I am debugging it further, but I was wondering if this is due to RDD compatibility between 1.0.1 and 1.1.0-SNAPSHOT? I have built the jars on my Mac which has Java 1.7.0_55, but the deployed cluster has Java 1.7.0_45. The flow runs fine on my localhost spark 1.0.1 with 1 worker. Can that Java version mismatch cause this? Stack traces are below.

Thanks.
Deb

Executor stacktrace:
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:156)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:126)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:123)
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:123)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
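As a toy illustration of the "quadratic with bounds" variant listed in the message above (this is plain Scala, not the QuadraticMinimizer from this thread; it only shows the shape of the per-block subproblem a constrained ALS solver has to handle):

// Projected gradient on min ||Ax - b||^2, with every coordinate of x kept between lo and hi.
def boundedLeastSquares(a: Array[Array[Double]], b: Array[Double],
                        lo: Double, hi: Double, iters: Int = 500): Array[Double] = {
  val m = a.length
  val n = a(0).length
  val x = Array.fill(n)(0.0)
  // Conservative step size: 0.5 / ||A||_F^2 is below 1 / sigma_max(A)^2, which keeps the
  // gradient step on the 2 * A^T (Ax - b) gradient from diverging.
  val step = 0.5 / a.flatten.map(v => v * v).sum
  for (_ <- 0 until iters) {
    val r = Array.tabulate(m)(i => (0 until n).map(j => a(i)(j) * x(j)).sum - b(i)) // residual Ax - b
    val g = Array.tabulate(n)(j => 2.0 * (0 until m).map(i => a(i)(j) * r(i)).sum)  // gradient
    for (j <- 0 until n) x(j) = math.min(hi, math.max(lo, x(j) - step * g(j)))      // step + project
  }
  x
}

// Example: a 3x2 system whose unconstrained solution has a negative coordinate;
// with lo = 0 that coordinate is clipped to the boundary.
val xHat = boundedLeastSquares(
  Array(Array(1.0, 0.0), Array(0.0, 1.0), Array(1.0, 1.0)),
  Array(1.0, -2.0, 0.5), lo = 0.0, hi = 10.0)
println(xHat.mkString(", "))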
Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1
One thing I'd like to clarify is that we do not support running a newer version of a Spark component on top of an older version of Spark core. I don't remember any code change in MLlib that requires Spark v1.1, but I might have missed some PRs. There were changes to CoGroup, which may be relevant:

https://github.com/apache/spark/commits/master/core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala

Btw, for the constrained optimization, I'm really interested in how they differ in the final recommendations. It would be great if you could test prec@k or ndcg@k metrics.

Best,
Xiangrui

On Wed, Aug 6, 2014 at 8:28 AM, Debasish Das debasish.da...@gmail.com wrote:
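Since prec@k comes up above, a small self-contained sketch of that metric (plain Scala, not an MLlib API; in this form it scores a single user, given the model's ranked item ids and the held-out relevant item ids):

// precision@k = (number of the top-k recommended items that are relevant) / k
def precisionAtK(ranked: Seq[Int], relevant: Set[Int], k: Int): Double = {
  if (k == 0) 0.0
  else ranked.take(k).count(relevant.contains).toDouble / k
}

// Example: 3 of the top-5 recommendations are in the relevant set, so prec@5 = 0.6.
println(precisionAtK(Seq(10, 4, 7, 2, 9), Set(4, 7, 9, 11), k = 5))

Averaging this over users would give one number per ALS variant to compare; ndcg@k additionally discounts relevant items that appear lower in the ranking.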
Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1
OK...let me look into it a bit more. Most likely I will deploy Spark v1.1 and then use the mllib 1.1 SNAPSHOT jar with it, so that we follow your guideline of not running a newer Spark component on an older version of Spark core... That should solve this issue unless it is related to the Java versions.

I am also keen to see the final recommendations with L1 and positivity...I will compute the metrics.

Our plan is to use scalable matrix factorization as an engine to do clustering, feature extraction, topic modeling, and auto-encoders (single layer to start with), so these algorithms are not really constrained to recommendation use-cases...

On Wed, Aug 6, 2014 at 9:09 AM, Xiangrui Meng men...@gmail.com wrote:
On Wed, Aug 6, 2014 at 8:28 AM, Debasish Das debasish.da...@gmail.com wrote:
Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1
One related question: is the mllib jar independent of the Hadoop version (i.e., it doesn't use the Hadoop API directly)? Can I use an mllib jar compiled for one version of Hadoop with another version of Hadoop?

Sent from my Google Nexus 5

On Aug 6, 2014 8:29 AM, Debasish Das debasish.da...@gmail.com wrote:
Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1
I did not play with the Hadoop settings...everything is compiled with 2.3.0-cdh5.0.2 for me...

I did try to bump the HBase version number from 0.94 to 0.96 or 0.98, but there was no profile for CDH in the pom...but that's unrelated to this!

On Wed, Aug 6, 2014 at 9:45 AM, DB Tsai dbt...@dbtsai.com wrote:
On Aug 6, 2014 8:29 AM, Debasish Das debasish.da...@gmail.com wrote:
Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1
I created the assembly file, but it still wants to pick up the mllib from the cluster:

jar tf ./target/ml-0.0.1-SNAPSHOT-jar-with-dependencies.jar | grep QuadraticMinimizer
org/apache/spark/mllib/optimization/QuadraticMinimizer$$anon$1.class

/Users/v606014/dist-1.0.1/bin/spark-submit --master spark://TUSCA09LMLVT00C.local:7077 --class ALSDriver ./target/ml-0.0.1-SNAPSHOT-jar-with-dependencies.jar inputPath outputPath

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.mllib.recommendation.ALS.setLambdaL1(D)Lorg/apache/spark/mllib/recommendation/ALS;

Now if I force it to use the jar that I gave, using spark.files.userClassPathFirst, then it fails on some serialization issues...

A simple solution is to cherry-pick the files I need from the spark branch into the application branch, but I am not sure that's the right thing to do... The way userClassPathFirst is behaving, there might be bugs in it...

Any suggestions will be appreciated.

Thanks.
Deb

On Sat, Aug 2, 2014 at 11:12 AM, Xiangrui Meng men...@gmail.com wrote:

Yes, that should work. spark-mllib-1.1.0 should be compatible with spark-core-1.0.1.

On Sat, Aug 2, 2014 at 10:54 AM, Debasish Das debasish.da...@gmail.com wrote:

Let me try it... Will this be fixed if I generate an assembly file with the mllib-1.1.0 SNAPSHOT jar and other dependencies along with the rest of the application code?

On Sat, Aug 2, 2014 at 10:46 AM, Xiangrui Meng men...@gmail.com wrote:

You can try enabling spark.files.userClassPathFirst. But I'm not sure whether it could solve your problem. -Xiangrui

On Sat, Aug 2, 2014 at 10:13 AM, Debasish Das debasish.da...@gmail.com wrote:

Hi,

I have deployed spark stable 1.0.1 on the cluster, but I have new code that I added in mllib-1.1.0-SNAPSHOT. I am trying to access the new code using spark-submit as follows:

spark-job --class com.verizon.bda.mllib.recommendation.ALSDriver --executor-memory 16g --total-executor-cores 16 --jars spark-mllib_2.10-1.1.0-SNAPSHOT.jar,scopt_2.10-3.2.0.jar sag-core-0.0.1-SNAPSHOT.jar --rank 25 --numIterations 10 --lambda 1.0 --qpProblem 2 inputPath outputPath

I can see the jars are getting added to the httpServer as expected:

14/08/02 12:50:04 INFO SparkContext: Added JAR file:/vzhome/v606014/spark-glm/spark-mllib_2.10-1.1.0-SNAPSHOT.jar at http://10.145.84.20:37798/jars/spark-mllib_2.10-1.1.0-SNAPSHOT.jar with timestamp 1406998204236
14/08/02 12:50:04 INFO SparkContext: Added JAR file:/vzhome/v606014/spark-glm/scopt_2.10-3.2.0.jar at http://10.145.84.20:37798/jars/scopt_2.10-3.2.0.jar with timestamp 1406998204237
14/08/02 12:50:04 INFO SparkContext: Added JAR file:/vzhome/v606014/spark-glm/sag-core-0.0.1-SNAPSHOT.jar at http://10.145.84.20:37798/jars/sag-core-0.0.1-SNAPSHOT.jar with timestamp 1406998204238

But the job still can't access code from the mllib-1.1.0-SNAPSHOT jar... I think it's picking up the mllib from the cluster, which is at 1.0.1...

Please help. I will ask for a PR tomorrow, but internally we want to generate results from the new code.

Thanks.
Deb
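For reference, a hedged sketch of where the spark.files.userClassPathFirst flag mentioned above is set (the flag itself is the one named in this thread; the app name is illustrative):

import org.apache.spark.SparkConf

// Experimental in the 1.0.x line: ask executors to load classes from user-added jars
// before the cluster's Spark classes. It does not change the driver's classpath, which
// is consistent with the NoSuchMethodError above being thrown on the driver side.
val conf = new SparkConf()
  .setAppName("ALSDriver")
  .set("spark.files.userClassPathFirst", "true")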
Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1
If you cannot change the Spark jar deployed on the cluster, an easy solution would be renaming ALS in your jar. If userClassPathFirst doesn't work, could you create a JIRA and attach the log? Thanks!

-Xiangrui

On Tue, Aug 5, 2014 at 9:10 AM, Debasish Das debasish.da...@gmail.com wrote:
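A hedged sketch of the renaming idea above: keep the modified solver under your own package and class name so the cluster's older spark-mllib can never shadow it. The package, class, and the setLambdaL1 method here are illustrative (the method name mirrors this thread's change); none of this is MLlib API:

package com.example.recommendation // hypothetical package, outside org.apache.spark.mllib

class ConstrainedALS extends Serializable {
  private var lambdaL1: Double = 0.0

  // Setter in the builder style used by MLlib's ALS.
  def setLambdaL1(lambdaL1: Double): this.type = {
    this.lambdaL1 = lambdaL1
    this
  }

  // The rest of the patched ALS implementation would be copied in here from the
  // application branch rather than referenced from the cluster's mllib jar.
}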
Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1
You can try enabling spark.files.userClassPathFirst. But I'm not sure whether it could solve your problem.

-Xiangrui

On Sat, Aug 2, 2014 at 10:13 AM, Debasish Das debasish.da...@gmail.com wrote:
Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1
Yes, that should work. spark-mllib-1.1.0 should be compatible with spark-core-1.0.1.

On Sat, Aug 2, 2014 at 10:54 AM, Debasish Das debasish.da...@gmail.com wrote:
On Sat, Aug 2, 2014 at 10:46 AM, Xiangrui Meng men...@gmail.com wrote:
On Sat, Aug 2, 2014 at 10:13 AM, Debasish Das debasish.da...@gmail.com wrote: