A few things that would be helpful.

1. Environment settings - you can find them on the Environment tab in the
Spark application UI.
2. Are you setting the HDFS configuration correctly in your Spark program?
For example, can you write an HDFS file from a Spark program (say
spark-shell) to your HDFS installation and read it back into Spark (i.e.,
create an RDD)? You can test this by writing an RDD out as a text file from
the shell, and then trying to read it back from another shell (see the
sketch after this list).
3. If that works, then let's try explicitly checkpointing an RDD. To do this,
you can take any RDD and do the following.

myRDD.checkpoint()
myRDD.count()

If there is some issue, then this should reproduce the above error.
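
For reference, here is a minimal sketch of both tests as you might run them
from spark-shell. The hdfs://namenode:9000 URI and the /tmp/spark-test path
are placeholders for your actual HDFS installation, and note that an RDD is
only checkpointed if a checkpoint directory has been set and an action runs
after checkpoint() is called.

// Placeholder HDFS location - substitute your own namenode address and path.
val hdfsPath = "hdfs://namenode:9000/tmp/spark-test"

// (2) Round-trip test: write an RDD out as text, then read it back in.
val outRDD = sc.parallelize(1 to 1000)
outRDD.saveAsTextFile(hdfsPath + "/roundtrip")
val inRDD = sc.textFile(hdfsPath + "/roundtrip")
println(inRDD.count())              // should print 1000

// (3) Explicit checkpoint test: the checkpoint directory must be set first.
sc.setCheckpointDir(hdfsPath + "/checkpoints")
val myRDD = sc.parallelize(1 to 1000).map(x => (x, x * x))
myRDD.checkpoint()                  // marks the RDD for checkpointing
myRDD.count()                       // the action triggers the checkpoint write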

TD


On Mon, Apr 7, 2014 at 3:48 PM, Paul Mogren <pmog...@commercehub.com> wrote:

> Hello, Spark community!  My name is Paul. I am a Spark newbie, evaluating
> version 0.9.0 without any Hadoop at all, and need some help. I run into the
> following error with the StatefulNetworkWordCount example (and similarly in
> my prototype app, when I use the updateStateByKey operation).  I get this
> when running against my small cluster, but not (so far) against local[2].
>
> 61904 [spark-akka.actor.default-dispatcher-2] ERROR
> org.apache.spark.streaming.scheduler.JobScheduler - Error running job
> streaming job 1396905956000 ms.0
> org.apache.spark.SparkException: Checkpoint RDD CheckpointRDD[310] at take
> at DStream.scala:586(0) has different number of partitions than original
> RDD MapPartitionsRDD[309] at mapPartitions at StateDStream.scala:66(2)
>         at
> org.apache.spark.rdd.RDDCheckpointData.doCheckpoint(RDDCheckpointData.scala:99)
>         at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:989)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:855)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:870)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:884)
>         at org.apache.spark.rdd.RDD.take(RDD.scala:844)
>         at
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachFunc$2$1.apply(DStream.scala:586)
>         at
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachFunc$2$1.apply(DStream.scala:585)
>         at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
>         at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>         at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>         at scala.util.Try$.apply(Try.scala:161)
>         at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
>         at
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:155)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:744)
>
>
> Please let me know what other information would be helpful; I didn't find
> any question submission guidelines.
>
> Thanks,
> Paul
>
