Experimenting with Datasets in Spark 1.6.0, I ran into a serialization error when using case classes containing a Seq member. There is no problem when using Array instead. Nor is there a problem when using an RDD or DataFrame (even if the DataFrame is later converted to a Dataset).
Here's an example you can test in the Spark shell:

    import sqlContext.implicits._

    case class SeqThing(id: String, stuff: Seq[Int])
    val seqThings = Seq(SeqThing("A", Seq()))
    val seqData = sc.parallelize(seqThings)

    case class ArrayThing(id: String, stuff: Array[Int])
    val arrayThings = Seq(ArrayThing("A", Array()))
    val arrayData = sc.parallelize(arrayThings)

    // Array works fine
    arrayData.collect()
    arrayData.toDF.as[ArrayThing]
    arrayData.toDS

    // Seq can't convert directly to DS
    seqData.collect()
    seqData.toDF.as[SeqThing]
    seqData.toDS // Serialization exception

Is this working as intended? Are there plans to support serializing arbitrary Seq values in Datasets, or must everything be converted to Array?

~Daniel Siegmann
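P.S. In the meantime, the Array conversion the question mentions can be done with a map over the RDD before calling toDS. This is only a sketch, assuming the same Spark 1.6.0 shell session as above (ArrayThing is reused from the example; the extra map and the toArray copy are the assumed cost, not a confirmed fix):

    // Workaround sketch (an assumption, not a confirmed fix): map each
    // SeqThing to an ArrayThing so the working Array encoder is used.
    val workaround = seqData.map(t => ArrayThing(t.id, t.stuff.toArray)).toDS
    workaround.collect() // no serialization exception with Array[Int]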