Dan Burkert created KUDU-1824:
---------------------------------

             Summary: KuduRDD.collect fails because of NotSerializableException
                 Key: KUDU-1824
                 URL: https://issues.apache.org/jira/browse/KUDU-1824
             Project: Kudu
          Issue Type: Bug
          Components: spark
    Affects Versions: 1.2.0
            Reporter: Dan Burkert


[~sarutak] reported in https://gerrit.cloudera.org/#/c/5496/ that the KuduRDD 
operations which require shuffling data fail due to serialization issues.

Some investigation notes:

* The serialization issues only affect KuduRDD; they do not manifest when 
working with the DataFrame API. For example:

{code}
scala> val df = sqlContext.read.option("kudu.master", "localhost").option("kudu.table", "t1").kudu
df: org.apache.spark.sql.DataFrame = [a: int, b: string]

scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([1,1], [2,2], [3,3], [4,4], [5,5])

scala> val rdd = new KuduContext("localhost").kuduRDD(sc, "t1", List("a", "b"))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = KuduRDD[3] at RDD at KuduRDD.scala:31

scala> rdd.collect
17/01/06 11:28:45 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.io.NotSerializableException: org.apache.kudu.client.RowResult
Serialization stack:
        - object not serializable (class: org.apache.kudu.client.RowResult, value: RowResult index: 4, size: 21, schema: org.apache.kudu.Schema@2b4e3275)
        - field (class: org.apache.kudu.spark.kudu.KuduRow, name: rowResult, type: class org.apache.kudu.client.RowResult)
        - object (class org.apache.kudu.spark.kudu.KuduRow, RowResult index: 4, size: 21, schema: org.apache.kudu.Schema@2b4e3275)
        - element of array (index: 0)
        - array (class [Lorg.apache.spark.sql.Row;, size 5)
        at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
{code}

* Any number of partitions will trigger the issue (I tested with a 
single-tablet table).
* The issue manifests even in Spark local mode (e.g. the default spark-shell).
* The issue manifests even with a single int32 column table (so it's not a 
binary or string issue).
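
The serialization stack above points at a serializable wrapper (KuduRow) that holds a non-serializable field (RowResult). The following is a minimal sketch of that failure mode using plain Java serialization, with no Spark or Kudu dependency. FakeRowResult, FakeKuduRow, PlainRow and canSerialize are hypothetical stand-ins invented for illustration; they are not the Kudu classes, and copying values into a plain holder is only one possible direction, not necessarily how this issue should be fixed.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object SerializationSketch {
  // Hypothetical stand-in for org.apache.kudu.client.RowResult: NOT Serializable.
  class FakeRowResult(val a: Int, val b: String)

  // Hypothetical stand-in for KuduRow: Serializable itself, but it keeps a
  // reference to the non-serializable result object in a field.
  class FakeKuduRow(val rowResult: FakeRowResult) extends Serializable

  // Plain copy of the column values; Scala case classes are Serializable.
  case class PlainRow(a: Int, b: String)

  // Returns true if Java serialization of `value` succeeds, and false if it
  // fails with NotSerializableException.
  def canSerialize(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(value)
      true
    } catch {
      case _: NotSerializableException => false
    } finally {
      out.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Mirrors "array -> element -> field rowResult" from the stack above.
    val rows = Array(new FakeKuduRow(new FakeRowResult(1, "1")))
    println(canSerialize(rows))

    // Copying the values out of the wrapper yields serializable rows.
    val copied = rows.map(r => PlainRow(r.rowResult.a, r.rowResult.b))
    println(canSerialize(copied))
  }
}
```

Running main should report that the wrapped rows cannot be serialized while the copied rows can, which matches the shape of the failure seen when rdd.collect sends task results back to the driver.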



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
