[jira] [Commented] (KUDU-1824) KuduRDD.collect fails because of NoSerializableException

Dan Burkert (JIRA) Fri, 06 Jan 2017 11:54:12 -0800

    [ 
https://issues.apache.org/jira/browse/KUDU-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15805543#comment-15805543
 ]


Dan Burkert commented on KUDU-1824:
-----------------------------------

We have an internal {{KuduRow}} class which extends Spark's {{Row}} class, 
which in turn extends {{java.io.Serializable}}.  {{KuduRow}} is not 
serializable, so we are breaking this contract.  Making {{KuduRow}} be 
serializable is not trivial, since it's just a thin facade over the 
non-serializable Kudu {{RowResult}}.  We wrote it this way originally for 
performance reasons - it would be easy to copy the data into a serializable 
format, but we wanted to avoid that.  I'm still not sure how the DataFrame API 
doesn't hit this issue, since it's just a layer over KuduRDD.

> KuduRDD.collect fails because of NoSerializableException
> --------------------------------------------------------
>
>                 Key: KUDU-1824
>                 URL: https://issues.apache.org/jira/browse/KUDU-1824
>             Project: Kudu
>          Issue Type: Bug
>          Components: spark
>    Affects Versions: 1.2.0
>            Reporter: Dan Burkert
>
> [~sarutak] reported in https://gerrit.cloudera.org/#/c/5496/ that the KuduRDD 
> operations which require shuffling data fail due to serialization issues.
> Some investigation notes:
> * The serialization issues only affect KuduRDD, they do not manifest when 
> working with the DataFrame API, for example:
> {code}
> scala> val df = sqlContext.read.option("kudu.master", 
> "localhost").option("kudu.table", "t1").kudu
> df: org.apache.spark.sql.DataFrame = [a: int, b: string]
> scala> df.collect
> res0: Array[org.apache.spark.sql.Row] = Array([1,1], [2,2], [3,3], [4,4], 
> [5,5])
> scala> val rdd = new KuduContext("localhost").kuduRDD(sc, "t1", List("a", 
> "b"))
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = KuduRDD[3] at RDD 
> at KuduRDD.scala:31
> scala> rdd.collect
> 17/01/06 11:28:45 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.io.NotSerializableException: org.apache.kudu.client.RowResult
> Serialization stack:
>         - object not serializable (class: org.apache.kudu.client.RowResult, 
> value: RowResult index: 4, size: 21, schema: org.apache.kudu.Schema@2b4e3275)
>         - field (class: org.apache.kudu.spark.kudu.KuduRow, name: rowResult, 
> type: class org.apache.kudu.client.RowResult)
>         - object (class org.apache.kudu.spark.kudu.KuduRow, RowResult index: 
> 4, size: 21, schema: org.apache.kudu.Schema@2b4e3275)
>         - element of array (index: 0)
>         - array (class [Lorg.apache.spark.sql.Row;, size 5)
>         at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> {code}
> * Any number of partitions will trigger the issue (I tested with a single 
> table table).
> * The issue manifests even in spark local mode (e.g. default spark-shell)
> * The issue manifests even with a single int32 column table (so it's not a 
> binary or string issue)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KUDU-1824) KuduRDD.collect fails because of NoSerializableException

Reply via email to