Dan Burkert created KUDU-1824:
---------------------------------
Summary: KuduRDD.collect fails because of NotSerializableException
Key: KUDU-1824
URL: https://issues.apache.org/jira/browse/KUDU-1824
Project: Kudu
Issue Type: Bug
Components: spark
Affects Versions: 1.2.0
Reporter: Dan Burkert
[~sarutak] reported in https://gerrit.cloudera.org/#/c/5496/ that the KuduRDD
operations which require shuffling data fail due to serialization issues.
Some investigation notes:
* The serialization issues only affect KuduRDD; they do not manifest when
working with the DataFrame API. For example:
{code}
scala> val df = sqlContext.read.option("kudu.master", "localhost").option("kudu.table", "t1").kudu
df: org.apache.spark.sql.DataFrame = [a: int, b: string]
scala> df.collect
res0: Array[org.apache.spark.sql.Row] = Array([1,1], [2,2], [3,3], [4,4], [5,5])
scala> val rdd = new KuduContext("localhost").kuduRDD(sc, "t1", List("a", "b"))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = KuduRDD[3] at RDD at
KuduRDD.scala:31
scala> rdd.collect
17/01/06 11:28:45 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.io.NotSerializableException: org.apache.kudu.client.RowResult
Serialization stack:
- object not serializable (class: org.apache.kudu.client.RowResult,
value: RowResult index: 4, size: 21, schema: org.apache.kudu.Schema@2b4e3275)
- field (class: org.apache.kudu.spark.kudu.KuduRow, name: rowResult,
type: class org.apache.kudu.client.RowResult)
- object (class org.apache.kudu.spark.kudu.KuduRow, RowResult index: 4,
size: 21, schema: org.apache.kudu.Schema@2b4e3275)
- element of array (index: 0)
- array (class [Lorg.apache.spark.sql.Row;, size 5)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
{code}
* Any number of partitions will trigger the issue (I tested with a
single-tablet table).
* The issue manifests even in Spark local mode (e.g. the default spark-shell).
* The issue manifests even with a single int32 column table (so it's not a
binary or string issue)
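
The serialization stack above points at a Serializable row wrapper (KuduRow) holding a non-serializable field (RowResult). The failure mode can be reproduced without Spark or Kudu. The following is only a minimal sketch, not a proposed patch: FakeRowResult, WrappingRow, MaterializedRow and canSerialize are hypothetical names invented for illustration, and copying cell values out of the scan result is just one possible direction, assuming the values themselves are serializable.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object SerializationSketch {
  // Hypothetical stand-in for org.apache.kudu.client.RowResult: NOT Serializable.
  class FakeRowResult(val values: Array[Any])

  // Hypothetical stand-in for KuduRow: Serializable itself, but it keeps a
  // reference to the non-serializable result, so Java serialization fails
  // when it reaches that field.
  class WrappingRow(val rowResult: FakeRowResult) extends Serializable

  // Hypothetical alternative: hold only a copy of the cell values, so nothing
  // non-serializable remains reachable from the row.
  class MaterializedRow(val values: Array[Any]) extends Serializable

  // Returns true if plain Java serialization of obj succeeds.
  def canSerialize(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }
}
```

With these definitions, canSerialize is expected to return false for a WrappingRow and true for a MaterializedRow built from the same values (here an Int and a String), which matches the shape of the error reported by rdd.collect.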