[ https://issues.apache.org/jira/browse/HIVE-7843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14112513#comment-14112513 ]
Venki Korukanti commented on HIVE-7843:
---------------------------------------

As part of {{FileSinkOperator.getDynOutPaths()}} we create {{FSPath}} objects for a given partition and bucket number. {{FSPath}} contains the output file name, a RecordWriter and a few other fields. The RecordWriter in a given {{FSPath}} is non-null only while FileSinkOperator is writing to the output file in that {{FSPath}}; once writing is complete the RecordWriter is closed and never reopened. {{valToPaths}} contains the mapping from partition/bucket value (such as "state=Ca/000000") to {{FSPath}}. For Spark we hit a condition where we try to reuse an {{FSPath}} that has already finished writing and end up with an NPE on the RecordWriter in that {{FSPath}}.

In {{FileSinkOperator.getDynOutPaths()}}, we call {{Utilities.replaceTaskIdFromFilename(taskId, buckNum)}} to get the bucket file name. This returns a bucket file name whose length equals that of the given task id (see {{Utilities.replaceTaskId}} for details) by prefixing "0"s. For example, if the task id is 000001_00 and the bucket number is 2, the returned file name is "000002_00" (a standalone sketch of this padding behavior follows the diff below). The task id given to {{replaceTaskIdFromFilename}} comes from "mapred.task.id" (see {{Utilities.getTaskId}} for details). If "mapred.task.id" is not present in the conf, a random number is returned. In the case of Spark there is no "mapred.task.id" set, so we get a random number, and the derived bucket file name is different each time for the same bucket number. This leads to creating more than one {{FSPath}} for the same bucket. When a new {{FSPath}} is created, we close the current one, and the closed {{FSPath}} can be retrieved next and cause the NPE on its RecordWriter.

Changed {{Utilities.getTaskId}} as follows to always return a random number between 1000000 - 9999999 and the test passed fine, but we need to find a proper fix:

{code}
  public static String getTaskId(Configuration hconf) {
    String taskid = (hconf == null) ? null : hconf.get("mapred.task.id");
    if ((taskid == null) || taskid.equals("")) {
-      return ("" + Math.abs(randGen.nextInt()));
+      // Generate a 6 digit random number (first digit value should be > 0)
+      // TODO: this is not efficient. Need to find if we have task.id equivalent in Spark.
+      int rand = randGen.nextInt()%100000;
+      rand += Math.abs(rand) + 1000000;
+      return ("" + rand);
    } else {
{code}
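To illustrate, here is a minimal, self-contained sketch of the padding behavior described above. It is not the actual {{Utilities.replaceTaskIdFromFilename}} implementation; the class and method names are made up for the example, and it only mimics the "prefix with 0s to the task id length" rule.

{code}
// Standalone sketch (NOT the real Utilities.replaceTaskIdFromFilename): the bucket
// number replaces the numeric part of the task id and is left-padded with '0's so
// the length of that part is preserved.
public class BucketNameSketch {

  static String bucketFileName(String taskId, int bucketNum) {
    // Split off an attempt suffix such as "_00", if present.
    int idx = taskId.indexOf('_');
    String numericPart = (idx < 0) ? taskId : taskId.substring(0, idx);
    String suffix = (idx < 0) ? "" : taskId.substring(idx);

    String bucket = String.valueOf(bucketNum);
    StringBuilder padded = new StringBuilder();
    // Left-pad the bucket number with '0's to the length of the numeric part.
    for (int i = bucket.length(); i < numericPart.length(); i++) {
      padded.append('0');
    }
    return padded.append(bucket).append(suffix).toString();
  }

  public static void main(String[] args) {
    // Stable task id: bucket 2 always maps to the same file name.
    System.out.println(bucketFileName("000001_00", 2)); // 000002_00

    // Random fallback task ids (no "mapred.task.id" in the conf) differ on every
    // call, so the name derived for the same bucket is no longer stable and
    // FileSinkOperator ends up creating more than one FSPath for that bucket.
    System.out.println(bucketFileName("1234", 2));      // 0002
    System.out.println(bucketFileName("98765432", 2));  // 00000002
  }
}
{code}

Forcing the fallback into a fixed-width range, as in the diff above, keeps the derived bucket file name stable across calls for a given bucket, which appears to be why the modified {{getTaskId}} makes the test pass.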
> orc_analyze.q fails with an assertion in FileSinkOperator [Spark Branch]
> ------------------------------------------------------------------------
>
>                 Key: HIVE-7843
>                 URL: https://issues.apache.org/jira/browse/HIVE-7843
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>    Affects Versions: spark-branch
>            Reporter: Venki Korukanti
>            Assignee: Venki Korukanti
>              Labels: Spark-M1
>             Fix For: spark-branch
>
>
> {code}
> java.lang.AssertionError: data length is different from num of DP columns
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.getDynPartDirectory(FileSinkOperator.java:809)
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.getDynOutPaths(FileSinkOperator.java:730)
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.startGroup(FileSinkOperator.java:829)
> org.apache.hadoop.hive.ql.exec.Operator.defaultStartGroup(Operator.java:502)
> org.apache.hadoop.hive.ql.exec.Operator.startGroup(Operator.java:525)
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:198)
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:47)
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:27)
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:98)
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
> scala.collection.Iterator$class.foreach(Iterator.scala:727)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)
> org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:744)
> {code}

--
This message was sent by Atlassian JIRA
(v6.2#6252)