[
https://issues.apache.org/jira/browse/KUDU-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15805543#comment-15805543
]
Dan Burkert commented on KUDU-1824:
-----------------------------------
We have an internal {{KuduRow}} class which extends Spark's {{Row}} trait,
which in turn extends {{java.io.Serializable}}. {{KuduRow}} is not actually
serializable, so we are breaking this contract. Making {{KuduRow}}
serializable is not trivial, since it's just a thin facade over the
non-serializable Kudu {{RowResult}}. We originally wrote it this way for
performance reasons: it would be easy to copy the data into a serializable
format, but we wanted to avoid the copy. I'm still not sure why the DataFrame
API doesn't hit this issue, since it's just a layer over KuduRDD.
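
A minimal sketch of the trade-off described above, using only {{java.io}} and
hypothetical stand-in classes ({{FakeRowResult}}, {{FakeRow}}, {{FacadeRow}},
{{CopiedRow}} and {{SerializationSketch}} are illustrative names, not Kudu or
Spark classes). It shows why a facade over a non-serializable object fails Java
serialization even though its parent trait promises {{Serializable}}, and how
copying the values avoids the failure at the cost of the copy:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for Kudu's RowResult: a plain class that does NOT
// implement java.io.Serializable.
class FakeRowResult(val values: Array[Any])

// Hypothetical stand-in for Spark's Row: the trait itself promises Serializable.
trait FakeRow extends Serializable {
  def get(i: Int): Any
  def length: Int
}

// Thin facade, analogous in spirit to KuduRow: no copying, but it keeps a
// reference to a non-serializable object, so serialization fails at runtime.
class FacadeRow(rowResult: FakeRowResult) extends FakeRow {
  def get(i: Int): Any = rowResult.values(i)
  def length: Int = rowResult.values.length
}

// Copying alternative: pays for one copy per row, but holds only
// serializable data.
class CopiedRow(values: Vector[Any]) extends FakeRow {
  def get(i: Int): Any = values(i)
  def length: Int = values.length
}

object SerializationSketch {
  // Copy every column value out of any row into a serializable row.
  def copyOf(row: FakeRow): CopiedRow =
    new CopiedRow(Vector.tabulate(row.length)(i => row.get(i)))

  // Returns true if Java serialization accepts the object, false if it
  // throws NotSerializableException.
  def canSerialize(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }
}
```

Under these assumptions, {{SerializationSketch.canSerialize}} rejects a
{{FacadeRow}} but accepts the result of {{copyOf}}. Whether an equivalent copy
is acceptable in the real connector depends on the performance cost mentioned
above.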
> KuduRDD.collect fails because of NotSerializableException
> ---------------------------------------------------------
>
> Key: KUDU-1824
> URL: https://issues.apache.org/jira/browse/KUDU-1824
> Project: Kudu
> Issue Type: Bug
> Components: spark
> Affects Versions: 1.2.0
> Reporter: Dan Burkert
>
> [~sarutak] reported in https://gerrit.cloudera.org/#/c/5496/ that the KuduRDD
> operations which require shuffling data fail due to serialization issues.
> Some investigation notes:
> * The serialization issues only affect KuduRDD; they do not manifest when
> working with the DataFrame API. For example:
> {code}
> scala> val df = sqlContext.read.option("kudu.master", "localhost").option("kudu.table", "t1").kudu
> df: org.apache.spark.sql.DataFrame = [a: int, b: string]
> scala> df.collect
> res0: Array[org.apache.spark.sql.Row] = Array([1,1], [2,2], [3,3], [4,4], [5,5])
> scala> val rdd = new KuduContext("localhost").kuduRDD(sc, "t1", List("a", "b"))
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = KuduRDD[3] at RDD at KuduRDD.scala:31
> scala> rdd.collect
> 17/01/06 11:28:45 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.io.NotSerializableException: org.apache.kudu.client.RowResult
> Serialization stack:
> - object not serializable (class: org.apache.kudu.client.RowResult, value: RowResult index: 4, size: 21, schema: org.apache.kudu.Schema@2b4e3275)
> - field (class: org.apache.kudu.spark.kudu.KuduRow, name: rowResult, type: class org.apache.kudu.client.RowResult)
> - object (class org.apache.kudu.spark.kudu.KuduRow, RowResult index: 4, size: 21, schema: org.apache.kudu.Schema@2b4e3275)
> - element of array (index: 0)
> - array (class [Lorg.apache.spark.sql.Row;, size 5)
> at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> {code}
> * Any number of partitions will trigger the issue (I tested with a
> single-tablet table).
> * The issue manifests even in Spark local mode (e.g. the default
> spark-shell).
> * The issue manifests even with a table that has a single int32 column (so
> it's not a binary or string issue).