Hello,

I'm looking at the Dataset API in 1.6.1 and also in the upcoming 2.0. However,
it does not seem to work with Avro data types:


import com.esotericsoftware.kryo.Kryo
import com.twitter.chill.avro.AvroSerializer
import org.apache.spark.serializer.{KryoRegistrator, KryoSerializer}
import org.apache.spark.sql.{Encoders, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object Datasets extends App {
  val conf = new SparkConf()
  conf.setAppName("Dataset")
  conf.setMaster("local[2]")
  conf.setIfMissing("spark.serializer", classOf[KryoSerializer].getName)
  conf.setIfMissing("spark.kryo.registrator",
    classOf[DatasetKryoRegistrator].getName)

  val sc = new SparkContext(conf)
  val sql = new SQLContext(sc)
  import sql.implicits._

  implicit val encoder = Encoders.kryo[MyAvroType]
  val data = sql.read.parquet("path/to/data").as[MyAvroType]

  // BUG here: this mapPartitions fails at runtime
  val sizes = data.mapPartitions { iter =>
    List(iter.size).iterator
  }.collect().toList

  println(sizes)
}


class DatasetKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(
      classOf[MyAvroType],
      AvroSerializer.SpecificRecordBinarySerializer[MyAvroType])
  }
}


I'm using chill-avro's Kryo serializer for Avro types, and I've tried
`Encoders.kryo` as well as `bean` and `javaSerialization`, but none of them
works. The error seems to be that the generated code does not compile with
Janino.
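For context on what these encoders do: both `Encoders.kryo` and
`Encoders.javaSerialization` treat the object as an opaque blob and store it
in a single binary column, rather than mapping fields to columns. As a rough
illustration of the latter's mechanics (plain JDK serialization, no Spark or
Avro involved; `FakeRecord` is a hypothetical stand-in for an Avro record),
a round trip looks like this:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for an Avro record; real Avro SpecificRecords
// are also Serializable, which is why javaSerialization is sometimes
// suggested as a fallback encoder.
case class FakeRecord(id: Int, name: String)

// Serialize to bytes and back, as a java-serialization encoder would.
def roundTrip[T](value: T): T = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(value)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  in.readObject().asInstanceOf[T]
}

val rec = FakeRecord(1, "a")
println(roundTrip(rec) == rec) // prints true

This is only a sketch of the storage model; the actual failure above happens
in Catalyst's generated deserialization code, before any of this runs.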

Tested in 1.6.1 and the 2.0.0-preview. Any idea?

-- 
*JU Han*

Software Engineer @ Teads.tv

+33 0619608888
