Either Serializable works; Scala's Serializable extends Java's (it was originally intended as a common interface for people who didn't want to run Scala on a JVM).
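To make that concrete, a rough sketch using the class from your example (the second class name is just for illustration; either form works):

import org.apache.spark.rdd.RDD

// Marking the class Serializable lets Spark ship the instance (and its
// fields, including n) to executors when a closure references them.
class TextToWordVector(csvData: RDD[Array[String]]) extends Serializable {
  val n = 1
  lazy val x = csvData.map { stringArr => stringArr(n) }.collect()
}

// Equivalent, naming the Java interface explicitly:
class TextToWordVectorExplicit(csvData: RDD[Array[String]]) extends java.io.Serializable {
  val n = 1
  lazy val x = csvData.map { stringArr => stringArr(n) }.collect()
}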
Class fields require the class to be serialized along with the object in order to access them. If you declared "val n" inside a method's scope instead, though, we wouldn't need the class. E.g.:

class TextToWordVector(csvData: RDD[Array[String]]) {
  def computeX() = {
    val n = 1
    csvData.map { stringArr => stringArr(n) }.collect()
  }
  lazy val x = computeX()
}

Note that if the class itself doesn't actually contain many (large) fields, it may not be an issue to transfer it around.

On Thu, Jul 3, 2014 at 5:21 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> Thanks, this works with both Scala and Java Serializable. Which one should
> I use?
>
> Related question: I would like only that particular val to be used instead
> of the whole class; what should I do?
> As far as I understand, the whole class is serialized and transferred
> between nodes (am I right?)
>
> Alexander
>
> -----Original Message-----
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Thursday, July 03, 2014 3:31 PM
> To: dev@spark.apache.org
> Subject: Re: Pass parameters to RDD functions
>
> Declare this class with "extends Serializable", meaning
> java.io.Serializable?
>
> On Thu, Jul 3, 2014 at 12:24 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> > Hi,
> >
> > I wonder how I can pass parameters to RDD functions with closures. If I
> > do it in the following way, Spark crashes with a NotSerializableException:
> >
> > class TextToWordVector(csvData: RDD[Array[String]]) {
> >
> >   val n = 1
> >   lazy val x = csvData.map { stringArr => stringArr(n) }.collect()
> > }
> >
> > Exception:
> > Job aborted due to stage failure: Task not serializable:
> > java.io.NotSerializableException:
> > org.apache.spark.mllib.util.TextToWordVector
> > org.apache.spark.SparkException: Job aborted due to stage failure: Task
> > not serializable: java.io.NotSerializableException:
> > org.apache.spark.mllib.util.TextToWordVector
> >   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1038)
> >
> >
> > This message proposes a workaround, but it didn't work for me:
> > http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3CCAA_qdLrxXzwXd5=6SXLOgSmTTorpOADHjnOXn=tMrOLEJM=f...@mail.gmail.com%3E
> >
> > What is the best practice?
> >
> > Best regards, Alexander
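P.S. One more idiom, not from this thread but worth noting as a rough sketch: if you want to keep n as a class field, you can copy it into a local val right before the closure, so only the value is captured rather than the enclosing instance (localN is just an illustrative name):

import org.apache.spark.rdd.RDD

class TextToWordVector(csvData: RDD[Array[String]]) {
  val n = 1

  lazy val x = {
    // The closure references only localN (an Int), not this.n, so the
    // TextToWordVector instance never has to be serialized.
    val localN = n
    csvData.map { stringArr => stringArr(localN) }.collect()
  }
}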