[ https://issues.apache.org/jira/browse/HIVE-7843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14112513#comment-14112513 ]
Venki Korukanti commented on HIVE-7843:
---------------------------------------

As part of {{FileSinkOperator.getDynOutPaths()}} we create {{FSPath}} objects for a given partition and bucket number. {{FSPath}} contains the output file name, a RecordWriter and a few other fields. The RecordWriter in a given {{FSPath}} is non-null only while FileSinkOperator is writing to the output file in that {{FSPath}}; once writing is complete the RecordWriter is closed and never reopened. {{valToPaths}} contains the mapping from partition/bucket value (such as "state=Ca/000000") to {{FSPath}}. For Spark we hit a condition where we try to reuse an {{FSPath}} that has already finished writing and end up with an NPE on the RecordWriter in that {{FSPath}}.

In {{FileSinkOperator.getDynOutPaths()}}, we call {{Utilities.replaceTaskIdFromFilename(taskId, buckNum)}} to get the bucket file name. This returns a bucket file name whose length equals that of the given task id (see {{Utilities.replaceTaskId}} for details) by prefixing "0"s. For example, if the task id is 000001_00 and the bucket number is 2, the returned file name is "000002_00" (a standalone sketch of this padding behavior follows the diff below). The task id given to {{replaceTaskIdFromFilename}} comes from "mapred.task.id" (see {{Utilities.getTaskId}} for details). If "mapred.task.id" is not present in the conf, a random number is returned. In the case of Spark there is no "mapred.task.id" set, so we get a random number, and the derived bucket file name is different each time for the same bucket number. This leads to creating more than one {{FSPath}} for the same bucket. When a new {{FSPath}} is created, we close the current one, and the closed {{FSPath}} can be retrieved next and cause the NPE on its RecordWriter.

Changed {{Utilities.getTaskId}} as follows to always return a random number between 1000000 - 9999999 and the test passed fine, but we need to find a proper fix:

{code}
  public static String getTaskId(Configuration hconf) {
    String taskid = (hconf == null) ? null : hconf.get("mapred.task.id");
    if ((taskid == null) || taskid.equals("")) {
-      return ("" + Math.abs(randGen.nextInt()));
+      // Generate a 6 digit random number (first digit value should be > 0)
+      // TODO: this is not efficient. Need to find if we have task.id equivalent in Spark.
+      int rand = randGen.nextInt()%100000;
+      rand += Math.abs(rand) + 1000000;
+      return ("" + rand);
    } else {
{code}
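To illustrate, here is a minimal, self-contained sketch of the padding behavior described above. It is not the actual {{Utilities.replaceTaskIdFromFilename}} implementation; the class and method names are made up for the example, and it only mimics the "prefix with 0s to the task id length" rule.

{code}
// Standalone sketch (NOT the real Utilities.replaceTaskIdFromFilename): the bucket
// number replaces the numeric part of the task id and is left-padded with '0's so
// the length of that part is preserved.
public class BucketNameSketch {

  static String bucketFileName(String taskId, int bucketNum) {
    // Split off an attempt suffix such as "_00", if present.
    int idx = taskId.indexOf('_');
    String numericPart = (idx < 0) ? taskId : taskId.substring(0, idx);
    String suffix = (idx < 0) ? "" : taskId.substring(idx);

    String bucket = String.valueOf(bucketNum);
    StringBuilder padded = new StringBuilder();
    // Left-pad the bucket number with '0's to the length of the numeric part.
    for (int i = bucket.length(); i < numericPart.length(); i++) {
      padded.append('0');
    }
    return padded.append(bucket).append(suffix).toString();
  }

  public static void main(String[] args) {
    // Stable task id: bucket 2 always maps to the same file name.
    System.out.println(bucketFileName("000001_00", 2)); // 000002_00

    // Random fallback task ids (no "mapred.task.id" in the conf) differ on every
    // call, so the name derived for the same bucket is no longer stable and
    // FileSinkOperator ends up creating more than one FSPath for that bucket.
    System.out.println(bucketFileName("1234", 2));      // 0002
    System.out.println(bucketFileName("98765432", 2));  // 00000002
  }
}
{code}

Forcing the fallback into a fixed-width range, as in the diff above, keeps the derived bucket file name stable across calls for a given bucket, which appears to be why the modified {{getTaskId}} makes the test pass.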
> orc_analyze.q fails with an assertion in FileSinkOperator [Spark Branch]
> ------------------------------------------------------------------------
>
>                 Key: HIVE-7843
>                 URL: https://issues.apache.org/jira/browse/HIVE-7843
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>    Affects Versions: spark-branch
>            Reporter: Venki Korukanti
>            Assignee: Venki Korukanti
>              Labels: Spark-M1
>             Fix For: spark-branch
>
>
> {code}
> java.lang.AssertionError: data length is different from num of DP columns
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.getDynPartDirectory(FileSinkOperator.java:809)
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.getDynOutPaths(FileSinkOperator.java:730)
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.startGroup(FileSinkOperator.java:829)
> org.apache.hadoop.hive.ql.exec.Operator.defaultStartGroup(Operator.java:502)
> org.apache.hadoop.hive.ql.exec.Operator.startGroup(Operator.java:525)
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:198)
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:47)
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:27)
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:98)
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
> scala.collection.Iterator$class.foreach(Iterator.scala:727)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)
> org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:759)
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:744)
> {code}

--
This message was sent by Atlassian JIRA
(v6.2#6252)