Szehon Ho created HIVE-13314: -------------------------------- Summary: Hive on spark mapjoin errors if spark.master is not set Key: HIVE-13314 URL: https://issues.apache.org/jira/browse/HIVE-13314 Project: Hive Issue Type: Bug Components: Spark Reporter: Szehon Ho Assignee: Szehon Ho Priority: Minor
There are some errors that happen if spark.master is not set. This is despite the code defaulting to yarn-cluster if spark.master is not set by user or on the config files: [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveSparkClientFactory.java#L51] The funny thing is that while it works the first time due to this default, subsequent tries will fail as the hiveConf is refreshed without that default being set. [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java#L180] Exception is follows: {noformat} Job aborted due to stage failure: Task 40 in stage 1.0 failed 4 times, most recent failure: Lost task 40.3 in stage 1.0 (TID 22, d2409.halxg.cloudera.com): java.lang.RuntimeException: Error processing row: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:154) at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48) at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27) at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:95) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$15.apply(AsyncRDDActions.scala:120) at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$15.apply(AsyncRDDActions.scala:120) at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2003) at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2003) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.spark.HashTableLoader.load(HashTableLoader.java:117) at org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:197) at org.apache.hadoop.hive.ql.exec.MapJoinOperator.cleanUpInputFileChangedOp(MapJoinOperator.java:223) at org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1051) at org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1055) at org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1055) at org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1055) at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:490) at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:141) ... 16 more Caused by: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.isDedicatedCluster(SparkUtilities.java:108) at org.apache.hadoop.hive.ql.exec.spark.HashTableLoader.load(HashTableLoader.java:124) at org.apache.hadoop.hive.ql.exec.spark.HashTableLoader.load(HashTableLoader.java:114) ... 24 more Driver stacktrace: {noformat} The issue is -- This message was sent by Atlassian JIRA (v6.3.4#6332)