[ https://issues.apache.org/jira/browse/HIVE-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086005#comment-14086005 ]
Sandy Ryza commented on HIVE-7540:
----------------------------------

Just to make sure, was Kryo serialization turned on when you ran into this exception? On closer look, it appears that the Spark code is already trying to handle this situation.

> NotSerializableException encountered when using sortByKey transformation
> ------------------------------------------------------------------------
>
>                 Key: HIVE-7540
>                 URL: https://issues.apache.org/jira/browse/HIVE-7540
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>         Environment: Spark-1.0.1
>            Reporter: Rui Li
>
> This exception is thrown when sortByKey is used as the shuffle transformation
> between MapWork and ReduceWork:
> {quote}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:772)
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:715)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:719)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:718)
>     at scala.collection.immutable.List.foreach(List.scala:318)
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:718)
>     at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:699)
> …
> {quote}
> The root cause is that the RangePartitioner used by sortByKey contains
> rangeBounds: Array[BytesWritable], which is considered not serializable in
> spark.
> A workaround to this issue is to set the number of partitions to 1 when
> calling sortByKey, in which case the rangeBounds will be just an empty array.
> NO PRECOMMIT TESTS. This is for spark branch only.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
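
For readers following the root-cause description above, here is a minimal, Spark-free sketch of why an empty rangeBounds array avoids the exception under plain Java serialization. It is an illustration only, not code from Hive or Spark: `RangeBoundsDemo`, `FakeWritable`, and `canSerialize` are hypothetical names, and `FakeWritable` merely stands in for org.apache.hadoop.io.BytesWritable on the assumption (taken from the exception above) that the real class does not implement java.io.Serializable.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

class RangeBoundsDemo {

    // Hypothetical stand-in for BytesWritable: deliberately NOT Serializable.
    static final class FakeWritable {
        final byte[] bytes;

        FakeWritable(byte[] bytes) {
            this.bytes = bytes;
        }
    }

    // Returns true if plain Java serialization accepts the given object graph,
    // false if it rejects it with NotSerializableException.
    static boolean canSerialize(Object value) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            // Any other I/O failure is unexpected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Analogue of rangeBounds with a single partition: no elements.
        FakeWritable[] noBounds = new FakeWritable[0];
        // Analogue of rangeBounds with more than one partition.
        FakeWritable[] oneBound = { new FakeWritable(new byte[] {1, 2, 3}) };

        System.out.println("empty bounds serializable: " + canSerialize(noBounds));
        System.out.println("one bound serializable:    " + canSerialize(oneBound));
    }
}
```

The array type itself is serializable; Java serialization only fails once it reaches an element whose class is not. That matches the workaround described above, where a single partition leaves rangeBounds empty. Whether Kryo changes this behaviour inside Spark depends on how the partitioner is serialized, which this sketch does not model.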