[jira] [Created] (HIVE-16456) Kill spark job when InterruptedException happens or driverContext.isShutdown is true.
zhihai xu created HIVE-16456: Summary: Kill spark job when InterruptedException happens or driverContext.isShutdown is true. Key: HIVE-16456 URL: https://issues.apache.org/jira/browse/HIVE-16456 Project: Hive Issue Type: Improvement Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Kill the Spark job when an InterruptedException happens or driverContext.isShutdown is true. If an InterruptedException happens in RemoteSparkJobMonitor or LocalSparkJobMonitor, it is better to kill the job. There is also a race condition between submitting the Spark job and query/operation cancellation, so it is better to check driverContext.isShutdown right after submitting the Spark job. This guarantees the job is killed no matter when shutdown is called. It is similar to HIVE-15997. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
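The check-after-submit pattern described above can be sketched as follows. This is a minimal stand-in (the class and method names are illustrative, not Hive's actual SparkTask code): re-check the shutdown flag immediately after submission, and kill the job if cancellation won the race.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: close the race between job submission and query
// cancellation by re-checking the shutdown flag right after submit.
public class SubmitRaceDemo {
    final AtomicBoolean isShutdown = new AtomicBoolean(false);
    boolean jobKilled = false;

    void submitJob() { /* pretend a remote Spark job is submitted here */ }
    void killJob() { jobKilled = true; }

    // Returns true if the job survived submission, false if it was killed.
    public boolean submit() {
        submitJob();
        // shutdown() may have run between submitJob() and this check;
        // checking here guarantees the job is killed no matter when
        // shutdown was called.
        if (isShutdown.get()) {
            killJob();
            return false;
        }
        return true;
    }

    public void shutdown() { isShutdown.set(true); }

    public static void main(String[] args) {
        SubmitRaceDemo d = new SubmitRaceDemo();
        d.shutdown();                       // cancellation wins the race
        System.out.println(d.submit());     // the job is killed right after submit
    }
}
```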
[jira] [Created] (HIVE-16433) Not nullify rj to avoid NPE due to race condition in ExecDriver.
zhihai xu created HIVE-16433: Summary: Not nullify rj to avoid NPE due to race condition in ExecDriver. Key: HIVE-16433 URL: https://issues.apache.org/jira/browse/HIVE-16433 Project: Hive Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Do not nullify rj, to avoid an NPE due to a race condition in ExecDriver. Currently {{rj}} is set to null in ExecDriver.shutdown, which is called from another thread for query cancellation and can happen at any time. There is a potential race condition in which rj is still accessed after shutdown is called, for example if the following runs right after ExecDriver.shutdown:
{code}
this.jobID = rj.getJobID();
updateStatusInQueryDisplay();
returnVal = jobExecHelper.progress(rj, jc, ctx);
{code}
Currently the purpose of nullifying rj is mainly to make sure {{rj.killJob()}} is only called once. I will add a jobKilled flag instead, to make sure {{rj.killJob()}} is called only once.
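The proposed fix can be sketched like this (a stand-in, not Hive's ExecDriver; the RunningJob here is a dummy): keep rj non-null so concurrent readers such as rj.getJobID() never NPE, and guard rj.killJob() with a jobKilled flag so it runs at most once.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the jobKilled-flag fix: rj is never nullified,
// and an atomic flag makes killJob() idempotent.
public class KillOnceDemo {
    static class RunningJob {                 // dummy stand-in for Hadoop's RunningJob
        int killCalls = 0;
        void killJob() { killCalls++; }
        String getJobID() { return "job_0001"; }
    }

    final RunningJob rj = new RunningJob();   // never set to null
    final AtomicBoolean jobKilled = new AtomicBoolean(false);

    public void shutdown() {
        // compareAndSet makes concurrent or repeated shutdowns kill only once
        if (jobKilled.compareAndSet(false, true)) {
            rj.killJob();
        }
    }

    public static void main(String[] args) {
        KillOnceDemo d = new KillOnceDemo();
        d.shutdown();
        d.shutdown();                          // second call is a no-op
        System.out.println(d.rj.killCalls);    // killed exactly once
        System.out.println(d.rj.getJobID());   // still safe: rj was never null
    }
}
```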
[jira] [Created] (HIVE-16430) Add log to show the cancelled query id when cancelOperation is called.
zhihai xu created HIVE-16430: Summary: Add log to show the cancelled query id when cancelOperation is called. Key: HIVE-16430 URL: https://issues.apache.org/jira/browse/HIVE-16430 Project: Hive Issue Type: Improvement Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Add a log message to show the cancelled query id when cancelOperation is called.
[jira] [Created] (HIVE-16429) Should call invokeFailureHooks in handleInterruption to track failed query execution due to interrupted command.
zhihai xu created HIVE-16429: Summary: Should call invokeFailureHooks in handleInterruption to track failed query execution due to interrupted command. Key: HIVE-16429 URL: https://issues.apache.org/jira/browse/HIVE-16429 Project: Hive Issue Type: Improvement Reporter: zhihai xu Assignee: zhihai xu Priority: Minor invokeFailureHooks should be called in handleInterruption to track failed query execution due to an interrupted command.
[jira] [Created] (HIVE-16422) Should kill running Spark Jobs when a query is cancelled.
zhihai xu created HIVE-16422: Summary: Should kill running Spark Jobs when a query is cancelled. Key: HIVE-16422 URL: https://issues.apache.org/jira/browse/HIVE-16422 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 2.1.0 Reporter: zhihai xu Assignee: zhihai xu Running Spark jobs should be killed when a query is cancelled. When a query is cancelled, Driver.releaseDriverContext is called by Driver.close. releaseDriverContext calls DriverContext.shutdown, which calls shutdown on all the running tasks:
{code}
public synchronized void shutdown() {
  LOG.debug("Shutting down query " + ctx.getCmd());
  shutdown = true;
  for (TaskRunner runner : running) {
    if (runner.isRunning()) {
      Task task = runner.getTask();
      LOG.warn("Shutting down task : " + task);
      try {
        task.shutdown();
      } catch (Exception e) {
        console.printError("Exception on shutting down task " + task.getId() + ": " + e);
      }
      Thread thread = runner.getRunner();
      if (thread != null) {
        thread.interrupt();
      }
    }
  }
  running.clear();
}
{code}
Since SparkTask does not implement a shutdown method that kills the running Spark job, the Spark job may still be running after the query is cancelled. It would be good to kill the Spark job in SparkTask.shutdown to save cluster resources.
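The gap described above can be illustrated with a toy model (stand-in classes, not Hive's Task hierarchy): a task whose shutdown() is a no-op leaves its remote job running after cancellation, while an override that kills the job fixes it.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: DriverContext.shutdown-style loop over running tasks,
// with a Spark-like task that overrides the no-op base shutdown().
public class TaskShutdownDemo {
    static abstract class Task {
        boolean running = true;
        void shutdown() { }                   // base default: does nothing
    }

    static class SparkTaskSketch extends Task {
        boolean sparkJobKilled = false;
        @Override void shutdown() {
            sparkJobKilled = true;            // kill the remote Spark job
            running = false;                  // instead of leaving it running
        }
    }

    // Mirrors the shutdown loop quoted above: shut down every running task.
    static void shutdownAll(List<Task> running) {
        for (Task t : running) {
            if (t.running) t.shutdown();
        }
        running.clear();
    }

    public static void main(String[] args) {
        SparkTaskSketch spark = new SparkTaskSketch();
        List<Task> running = new ArrayList<>();
        running.add(spark);
        shutdownAll(running);
        System.out.println(spark.sparkJobKilled);  // the job no longer leaks
    }
}
```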
[jira] [Created] (HIVE-16368) Unexpected java.lang.ArrayIndexOutOfBoundsException from query with LateralView Operation for hive on MR.
zhihai xu created HIVE-16368: Summary: Unexpected java.lang.ArrayIndexOutOfBoundsException from query with LateralView Operation for hive on MR. Key: HIVE-16368 URL: https://issues.apache.org/jira/browse/HIVE-16368 Project: Hive Issue Type: Bug Components: Query Planning Reporter: zhihai xu Assignee: zhihai xu Unexpected java.lang.ArrayIndexOutOfBoundsException from a query. It happened in a LateralView operation, for Hive on MR. The reason is that column pruning changes the column order in the LateralView operation. For back-to-back ReduceSink operators using the MR engine, a FileSinkOperator and a TableScanOperator are added before the second ReduceSink operator. The serialization column order used by the FileSinkOperator in LazyBinarySerDe of the previous reducer differs from the deserialization column order from the table desc used by the MapOperator/TableScanOperator in LazyBinarySerDe of the current (failed) mapper. The serialization order is decided by the outputObjInspector from LateralViewJoinOperator:
{code}
ArrayList fieldNames = conf.getOutputInternalColNames();
outputObjInspector = ObjectInspectorFactory
    .getStandardStructObjectInspector(fieldNames, ois);
{code}
So the column order for serialization is decided by getOutputInternalColNames in LateralViewJoinOperator. The deserialization is decided by the TableScanOperator, which is created in GenMapRedUtils.splitTasks:
{code}
TableDesc tt_desc = PlanUtils.getIntermediateFileTableDesc(PlanUtils
    .getFieldSchemasFromRowSchema(parent.getSchema(), "temporarycol"));

// Create the temporary file, its corresponding FileSinkOperaotr, and
// its corresponding TableScanOperator.
TableScanOperator tableScanOp =
    createTemporaryFile(parent, op, taskTmpDir, tt_desc, parseCtx);
{code}
The column order for deserialization is decided by the rowSchema of LateralViewJoinOperator. But ColumnPrunerLateralViewJoinProc changes the order of outputInternalColNames while keeping the original order of rowSchema, which causes the mismatch between serialization and deserialization for the two back-to-back MR jobs. A similar issue exists for ColumnPrunerLateralViewForwardProc, which changes the column order of its child select operator's colList but not its rowSchema.
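The root cause, writer and reader disagreeing on column order, can be reproduced in miniature without Hive (a toy schema with hypothetical column names, not the LazyBinarySerDe wire format): a writer that serializes a reordered schema produces bytes that a reader using the original schema order decodes into garbage.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Toy reproduction of the mismatch: the writer's column order is
// (count:int, name:String) but the reader still assumes the original
// order (name:String, count:int), so it decodes garbage.
public class ColumnOrderDemo {
    // Writer uses the pruned/reordered schema: count first, then name.
    static byte[] write(int count, String name) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeInt(count);
            out.writeUTF(name);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reader still assumes the old schema order: name first, then count.
    static String readAssumingOldOrder(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            String name = in.readUTF();   // misreads the int's bytes as a string length
            int count = in.readInt();     // misreads string-length + payload bytes
            return name + ":" + count;
        } catch (IOException e) {
            return "corrupt";             // short rows blow up outright
        }
    }

    public static void main(String[] args) {
        byte[] row = write(42, "rider");
        // Garbage instead of "rider:42" -- same class of failure as the
        // LazyBinaryStruct misread described above.
        System.out.println(readAssumingOldOrder(row));
    }
}
```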
[jira] [Created] (HIVE-15772) set the exception into SparkJobStatus if exception happened in RemoteSparkJobMonitor and LocalSparkJobMonitor
zhihai xu created HIVE-15772: Summary: set the exception into SparkJobStatus if exception happened in RemoteSparkJobMonitor and LocalSparkJobMonitor Key: HIVE-15772 URL: https://issues.apache.org/jira/browse/HIVE-15772 Project: Hive Issue Type: Improvement Components: Spark Affects Versions: 2.2.0 Reporter: zhihai xu Assignee: zhihai xu Set the exception into SparkJobStatus if an exception happens in RemoteSparkJobMonitor or LocalSparkJobMonitor. Add a setError function to SparkJobStatus.
[jira] [Created] (HIVE-15662) check startTime in SparkTask to make sure startTime is not less than submitTime
zhihai xu created HIVE-15662: Summary: check startTime in SparkTask to make sure startTime is not less than submitTime Key: HIVE-15662 URL: https://issues.apache.org/jira/browse/HIVE-15662 Project: Hive Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Check startTime in SparkTask to make sure startTime is not less than submitTime. We saw a corner case where a SparkTask finishes in less than one second: startTime may never be set, because RemoteSparkJobMonitor sleeps for one second before checking the state, and by the time the sleep ends the Spark job is already completed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
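The guard described above amounts to a one-line clamp. A minimal sketch (the method name is illustrative, not Hive's): if the monitor never observed the job starting, startTime may predate submitTime, so clamp it to keep reported durations non-negative.

```java
// Hypothetical sketch of the startTime guard for sub-second jobs.
public class StartTimeDemo {
    static long normalizeStartTime(long startTime, long submitTime) {
        // A startTime the 1s-poll monitor never set (or set stale) must not
        // fall before submitTime, or computed durations go negative.
        return Math.max(startTime, submitTime);
    }

    public static void main(String[] args) {
        long submitTime = 1_000_000L;
        long staleStart = 0L;   // never set: job finished inside the first poll
        System.out.println(normalizeStartTime(staleStart, submitTime));
    }
}
```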
[jira] [Created] (HIVE-15630) add operation handle before operation.run instead of after operation.run
zhihai xu created HIVE-15630: Summary: add operation handle before operation.run instead of after operation.run Key: HIVE-15630 URL: https://issues.apache.org/jira/browse/HIVE-15630 Project: Hive Issue Type: Bug Components: Hive Affects Versions: 2.2.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Add the operation handle before operation.run instead of after it, so that when the session is closed, the operations still running inside {{operation.run}} can also be closed.
[jira] [Created] (HIVE-15629) Set DDLTask’s exception with its subtask’s exception
zhihai xu created HIVE-15629: Summary: Set DDLTask’s exception with its subtask’s exception Key: HIVE-15629 URL: https://issues.apache.org/jira/browse/HIVE-15629 Project: Hive Issue Type: Improvement Components: Hive Affects Versions: 2.2.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Set the DDLTask’s exception with its subtask’s exception, so the exception from the subtask can be propagated to the TaskRunner.
[jira] [Created] (HIVE-15564) set task's jobID with hadoop map reduce job ID for PartialScanTask, MergeFileTask and ColumnTruncateTask.
zhihai xu created HIVE-15564: Summary: set task's jobID with hadoop map reduce job ID for PartialScanTask, MergeFileTask and ColumnTruncateTask. Key: HIVE-15564 URL: https://issues.apache.org/jira/browse/HIVE-15564 Project: Hive Issue Type: Improvement Components: Hive Reporter: zhihai xu Priority: Minor
[jira] [Created] (HIVE-15563) Ignore Illegal Operation state transition exception in SQLOperation.runQuery to expose real exception.
zhihai xu created HIVE-15563: Summary: Ignore Illegal Operation state transition exception in SQLOperation.runQuery to expose real exception. Key: HIVE-15563 URL: https://issues.apache.org/jira/browse/HIVE-15563 Project: Hive Issue Type: Bug Affects Versions: 2.2.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Ignore the illegal operation state transition exception in SQLOperation.runQuery to expose the real exception. setState may throw an illegal state transition exception, which can hide the real exception. We saw the following exception thrown from {{setState(OperationState.ERROR);}} in SQLOperation.runQuery:
{code}
org.apache.hive.service.cli.operation.Operation: Error running hive query:
org.apache.hive.service.cli.HiveSQLException: Illegal Operation state transition from CLOSED to ERROR
 at org.apache.hive.service.cli.OperationState.validateTransition(OperationState.java:91)
 at org.apache.hive.service.cli.OperationState.validateTransition(OperationState.java:97)
 at org.apache.hive.service.cli.operation.Operation.setState(Operation.java:154)
 at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:241)
 at org.apache.hive.service.cli.operation.SQLOperation.access$300(SQLOperation.java:82)
 at org.apache.hive.service.cli.operation.SQLOperation$3$1.run(SQLOperation.java:288)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
 at org.apache.hive.service.cli.operation.SQLOperation$3.run(SQLOperation.java:301)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
{code}
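The failure mode above, a state-transition exception masking the original error, can be modeled with a tiny state machine (a stand-in, not Hive's Operation class): when reporting a failure, a bad transition thrown by setState must not replace the real error.

```java
// Hypothetical sketch: swallow the illegal-transition exception so the
// real query error stays visible to the caller.
public class StateTransitionDemo {
    enum State { RUNNING, CLOSED, ERROR }
    State state = State.CLOSED;              // operation was already closed

    void setState(State s) {
        if (state == State.CLOSED) {         // CLOSED is terminal, like Hive's
            throw new IllegalStateException(
                "Illegal Operation state transition from CLOSED to " + s);
        }
        state = s;
    }

    // Returns the message of the exception the caller ultimately sees.
    public String reportFailure(Exception realError) {
        try {
            setState(State.ERROR);
        } catch (IllegalStateException ignored) {
            // Ignoring the transition error here is the fix: otherwise this
            // exception would be the one surfaced, hiding realError.
        }
        return realError.getMessage();
    }

    public static void main(String[] args) {
        StateTransitionDemo op = new StateTransitionDemo();
        Exception real = new RuntimeException("FAILED: semantic error at line 3");
        System.out.println(op.reportFailure(real));
    }
}
```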
[jira] [Created] (HIVE-15528) Expose Spark job error in SparkTask
zhihai xu created HIVE-15528: Summary: Expose Spark job error in SparkTask Key: HIVE-15528 URL: https://issues.apache.org/jira/browse/HIVE-15528 Project: Hive Issue Type: Improvement Components: Spark Affects Versions: 2.2.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Expose the Spark job error in SparkTask by propagating the Spark job error to the task exception.
[jira] [Created] (HIVE-15494) Create perfLogger in method execute instead of class initialization for SparkTask
zhihai xu created HIVE-15494: Summary: Create perfLogger in method execute instead of class initialization for SparkTask Key: HIVE-15494 URL: https://issues.apache.org/jira/browse/HIVE-15494 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 2.2.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Create the perfLogger in the execute method instead of at class initialization for SparkTask, so the perfLogger can be shared with SparkJobMonitor in the same thread.
[jira] [Created] (HIVE-15470) Catch Throwable instead of Exception in driver.execute.
zhihai xu created HIVE-15470: Summary: Catch Throwable instead of Exception in driver.execute. Key: HIVE-15470 URL: https://issues.apache.org/jira/browse/HIVE-15470 Project: Hive Issue Type: Improvement Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Catch Throwable instead of Exception in Driver.execute, so that a query failing with a Throwable that is not an Exception will also be logged and reported.
[jira] [Created] (HIVE-15386) Expose Spark task counts and stage Ids information in SparkTask from SparkJobMonitor
zhihai xu created HIVE-15386: Summary: Expose Spark task counts and stage Ids information in SparkTask from SparkJobMonitor Key: HIVE-15386 URL: https://issues.apache.org/jira/browse/HIVE-15386 Project: Hive Issue Type: Bug Components: Spark Affects Versions: 2.2.0 Reporter: zhihai xu Assignee: zhihai xu Expose Spark task counts and stage ID information in SparkTask from SparkJobMonitor, so this information can be used by a Hive hook to monitor Spark jobs.
[jira] [Created] (HIVE-15301) Expose SparkStatistics information in SparkTask
zhihai xu created HIVE-15301: Summary: Expose SparkStatistics information in SparkTask Key: HIVE-15301 URL: https://issues.apache.org/jira/browse/HIVE-15301 Project: Hive Issue Type: Improvement Components: Spark Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Expose SparkStatistics information in SparkTask, so we can get SparkStatistics in a hook.
[jira] [Created] (HIVE-15171) set SparkTask's jobID with application id
zhihai xu created HIVE-15171: Summary: set SparkTask's jobID with application id Key: HIVE-15171 URL: https://issues.apache.org/jira/browse/HIVE-15171 Project: Hive Issue Type: Improvement Components: Spark Affects Versions: 2.1.0 Reporter: zhihai xu Assignee: zhihai xu Set SparkTask's jobID with the application id. This information will be useful for monitoring the Spark application in a hook.
[jira] [Created] (HIVE-14564) Column Pruning generates out of order columns in SelectOperator which cause ArrayIndexOutOfBoundsException.
zhihai xu created HIVE-14564: Summary: Column Pruning generates out of order columns in SelectOperator which cause ArrayIndexOutOfBoundsException. Key: HIVE-14564 URL: https://issues.apache.org/jira/browse/HIVE-14564 Project: Hive Issue Type: Bug Components: Query Planning Affects Versions: 2.1.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Column pruning generates out-of-order columns in the SelectOperator, which causes an ArrayIndexOutOfBoundsException:
{code}
2016-07-26 21:49:24,390 FATAL [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"_col0":null,"_col1":0,"_col2":36,"_col3":"499ec44-6dd2-4709-a019-33d6d484ed90�\u0001U5�\u001c��\t\u001b�\u","_col4":"5264db53-d650-4678-9261-cdd51efab8bb","_col5":"cb5233dd-214a-4b0b-b43e-0f41befb5c5c","_col6":"","_col8":48,"_col9":null,"_col10":"1befb5c5c�\u00192016-06-09T15:31:15+00:00\u0002\u0005Rider\u0011svc-dash","_col11":64,"_col12":null,"_col13":null,"_col14":"ber.com�\u0001U5ߨP�\u0001U5ᷨider) - 1000\u0005Rider\u0011svc-d...@uber.com�\u0001U4�;x�\u0001U5\u0004��\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u\u","_col15":"","_col16":null}
 at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:507)
 at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ArrayIndexOutOfBoundsException
 at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.processOp(ReduceSinkOperator.java:397)
 at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
 at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:95)
 at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)
 at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:497)
 ... 9 more
Caused by: java.lang.ArrayIndexOutOfBoundsException
 at java.lang.System.arraycopy(Native Method)
 at org.apache.hadoop.io.Text.set(Text.java:225)
 at org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryString.init(LazyBinaryString.java:48)
 at org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryStruct.uncheckedGetField(LazyBinaryStruct.java:264)
 at org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryStruct.getField(LazyBinaryStruct.java:201)
 at org.apache.hadoop.hive.serde2.lazybinary.objectinspector.LazyBinaryStructObjectInspector.getStructFieldData(LazyBinaryStructObjectInspector.java:64)
 at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator._evaluate(ExprNodeColumnEvaluator.java:94)
 at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
 at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)
 at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.makeValueWritable(ReduceSinkOperator.java:550)
 at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.processOp(ReduceSinkOperator.java:377)
 ... 13 more
{code}
The exception occurs because serialization and deserialization do not match. The serialization by LazyBinarySerDe in the previous MapReduce job used a different order of columns. When the current MapReduce job deserialized the intermediate sequence file generated by the previous job, it read corrupted data because LazyBinaryStruct deserialized with the wrong column order.
The mismatched columns between serialization and deserialization are caused by the SelectOperator's column pruning ({{ColumnPrunerSelectProc}}).
[jira] [Created] (HIVE-14368) ThriftCLIService.GetOperationStatus should include exception's stack trace to the error message.
zhihai xu created HIVE-14368: Summary: ThriftCLIService.GetOperationStatus should include exception's stack trace to the error message. Key: HIVE-14368 URL: https://issues.apache.org/jira/browse/HIVE-14368 Project: Hive Issue Type: Improvement Components: Thrift API Reporter: zhihai xu Assignee: zhihai xu Priority: Minor ThriftCLIService.GetOperationStatus should include the exception's stack trace in the error message. The stack trace will be very helpful for clients to debug failed queries.
[jira] [Created] (HIVE-14331) Task should set exception for failed map reduce job.
zhihai xu created HIVE-14331: Summary: Task should set exception for failed map reduce job. Key: HIVE-14331 URL: https://issues.apache.org/jira/browse/HIVE-14331 Project: Hive Issue Type: Improvement Components: Hive Affects Versions: 2.1.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor The task should set the exception for a failed MapReduce job, so the exception can be seen in HookContext.
[jira] [Created] (HIVE-14303) CommonJoinOperator.checkAndGenObject should return directly in CLOSE state to avoid NPE if ExecReducer.close is called twice.
zhihai xu created HIVE-14303: Summary: CommonJoinOperator.checkAndGenObject should return directly in CLOSE state to avoid NPE if ExecReducer.close is called twice. Key: HIVE-14303 URL: https://issues.apache.org/jira/browse/HIVE-14303 Project: Hive Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.1.0 CommonJoinOperator.checkAndGenObject should return directly in CLOSE state to avoid an NPE if ExecReducer.close is called twice. ExecReducer.close implements the Closeable interface and can be called multiple times. We saw the following NPE, which hid the real exception, due to this bug:
{code}
Error: java.lang.RuntimeException: Hive Runtime Error while closing operators: null
 at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:296)
 at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:244)
 at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.NullPointerException
 at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:718)
 at org.apache.hadoop.hive.ql.exec.JoinOperator.endGroup(JoinOperator.java:256)
 at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:284)
 ... 8 more
{code}
The code from ReduceTask.runOldReducer:
{code}
  reducer.close(); //line 453
  reducer = null;
  out.close(reporter);
  out = null;
} finally {
  IOUtils.cleanup(LOG, reducer); // line 459
  closeQuietly(out, reporter);
}
{code}
Based on the above stack trace and code, reducer.close() is called twice: an exception occurred the first time reducer.close() was called at line 453, so the code exited before reducer was set to null, and the NullPointerException was triggered when reducer.close() was called a second time from IOUtils.cleanup. The NullPointerException hid the real exception from the first reducer.close() at line 453. The reason for the NPE is that the first reducer.close called CommonJoinOperator.closeOp, which clears {{storage}}:
{code}
Arrays.fill(storage, null);
{code}
The second reducer.close then hit an NPE on the {{storage[alias]}} that the first reducer.close had set to null. The following reducer log gives more proof:
{code}
2016-07-14 22:24:51,016 INFO [main] org.apache.hadoop.hive.ql.exec.JoinOperator: 0 finished. closing...
2016-07-14 22:24:51,016 INFO [main] org.apache.hadoop.hive.ql.exec.JoinOperator: 0 finished. closing...
2016-07-14 22:24:51,016 INFO [main] org.apache.hadoop.hive.ql.exec.JoinOperator: SKEWJOINFOLLOWUPJOBS:0
2016-07-14 22:24:51,016 INFO [main] org.apache.hadoop.hive.ql.exec.SelectOperator: 1 finished. closing...
2016-07-14 22:24:51,016 INFO [main] org.apache.hadoop.hive.ql.exec.SelectOperator: 2 finished. closing...
2016-07-14 22:24:51,016 INFO [main] org.apache.hadoop.hive.ql.exec.SelectOperator: 3 finished. closing...
2016-07-14 22:24:51,016 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: 4 finished. closing...
2016-07-14 22:24:51,016 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[4]: records written - 53466
2016-07-14 22:25:11,555 ERROR [main] ExecReducer: Hit error while closing operators - failing tree
2016-07-14 22:25:11,649 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: Hive Runtime Error while closing operators: null
 at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:296)
 at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:244)
 at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.NullPointerException
 at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:718)
 at org.apache.hadoop.hive.ql.exec.JoinOperator.endGroup(JoinOperator.java:256)
{code}
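The proposed "return directly in CLOSE state" fix can be sketched with a toy operator (a stand-in, not CommonJoinOperator): once the operator is closed, the method returns immediately instead of touching the cleared storage array, so a second close() from IOUtils.cleanup is a safe no-op.

```java
import java.util.Arrays;

// Hypothetical sketch of the idempotent-close pattern: a state check makes
// the second close() harmless instead of NPE-ing on cleared storage.
public class IdempotentCloseDemo {
    enum OpState { RUNNING, CLOSE }
    OpState state = OpState.RUNNING;
    Object[] storage = { new Object() };   // cleared on close, like the join's storage

    // Returns 1 if it did work, 0 if the operator was already closed.
    public int checkAndGenObject() {
        if (state == OpState.CLOSE) {
            return 0;                      // the fix: already closed, do nothing
        }
        storage[0].toString();             // would throw NPE if storage were cleared
        return 1;
    }

    public void close() {
        checkAndGenObject();
        Arrays.fill(storage, null);        // mirrors closeOp clearing storage
        state = OpState.CLOSE;
    }

    public static void main(String[] args) {
        IdempotentCloseDemo op = new IdempotentCloseDemo();
        op.close();
        op.close();   // second close is now a safe no-op instead of an NPE
        System.out.println("closed twice without NPE");
    }
}
```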
[jira] [Created] (HIVE-14258) Reduce task timed out because CommonJoinOperator.genUniqueJoinObject took too long to finish without reporting progress
zhihai xu created HIVE-14258: Summary: Reduce task timed out because CommonJoinOperator.genUniqueJoinObject took too long to finish without reporting progress Key: HIVE-14258 URL: https://issues.apache.org/jira/browse/HIVE-14258 Project: Hive Issue Type: Bug Components: Query Processor Affects Versions: 2.1.0 Reporter: zhihai xu Assignee: zhihai xu The reduce task timed out because CommonJoinOperator.genUniqueJoinObject took too long to finish without reporting progress. The timeout happened when reducer.close() was called in ReduceTask.java. CommonJoinOperator.genUniqueJoinObject(), called by reducer.close(), loops over every row in the AbstractRowContainer. This can take a long time if there are a large number of rows, and during this time it does not report progress. If it runs for longer than "mapreduce.task.timeout", the ApplicationMaster kills the task for failing to report progress. We configured "mapreduce.task.timeout" as 10 minutes. I captured stack traces in the 10 minutes before the AM killed the reduce task at 2016-07-15 07:19:11.
The following three stack traces can prove it. At 2016-07-15 07:09:42:
{code}
"main" prio=10 tid=0x7f90ec017000 nid=0xd193 runnable [0x7f90f62e5000]
 java.lang.Thread.State: RUNNABLE
 at java.io.FileInputStream.readBytes(Native Method)
 at java.io.FileInputStream.read(FileInputStream.java:272)
 at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:154)
 at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
 at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
 at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
 - locked <0x0007deecefb0> (a org.apache.hadoop.fs.BufferedFSInputStream)
 at java.io.DataInputStream.read(DataInputStream.java:149)
 at org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:436)
 at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:252)
 at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:276)
 at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:214)
 at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:232)
 at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:196)
 - locked <0x0007deecb978> (a org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker)
 at java.io.DataInputStream.readFully(DataInputStream.java:195)
 at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:70)
 at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:120)
 at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2359)
 - locked <0x0007deec8f70> (a org.apache.hadoop.io.SequenceFile$Reader)
 at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2491)
 - locked <0x0007deec8f70> (a org.apache.hadoop.io.SequenceFile$Reader)
 at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:82)
 - locked <0x0007deec82f0> (a org.apache.hadoop.mapred.SequenceFileRecordReader)
 at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.nextBlock(RowContainer.java:360)
 at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.next(RowContainer.java:267)
 at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.next(RowContainer.java:74)
 at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:644)
 at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:750)
 at org.apache.hadoop.hive.ql.exec.JoinOperator.endGroup(JoinOperator.java:256)
 at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:284)
 at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:453)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
{code}
At 2016-07-15 07:15:35:
{code}
"main" prio=10 tid=0x7f90ec017000 nid=0xd193 runnable [0x7f90f62e5000]
 java.lang.Thread.State: RUNNABLE
 at java.util.zip.CRC32.updateBytes(Native Method)
 at java.util.zip.CRC32.update(CRC32.java:65)
 at org.apache.hadoop.fs.FSInputChecker.verifySums(FSInputChecker.java:316)
 at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:279)
{code}
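The standard cure for this class of timeout is a periodic heartbeat inside the long loop. A minimal sketch (the Reporter interface here is a local stand-in for org.apache.hadoop.mapred.Reporter, and the row loop stands in for the row-container drain): calling progress() every N rows keeps the ApplicationMaster from firing mapreduce.task.timeout.

```java
// Illustrative fix pattern: report progress every N rows inside the long
// genUniqueJoinObject-style loop so the task never looks hung.
public class ProgressDemo {
    interface Reporter { void progress(); }    // stand-in for mapred's Reporter

    static long drainRows(long rowCount, Reporter reporter) {
        long processed = 0;
        for (long i = 0; i < rowCount; i++) {
            processed++;                        // stand-in for joining one row
            if (processed % 10_000 == 0) {
                reporter.progress();            // heartbeat to the ApplicationMaster
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        final long[] beats = {0};
        long n = drainRows(100_000, () -> beats[0]++);
        System.out.println(n + " rows, " + beats[0] + " heartbeats");
    }
}
```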
[jira] [Created] (HIVE-14094) Remove unused function closeFs from Warehouse.java
zhihai xu created HIVE-14094: Summary: Remove unused function closeFs from Warehouse.java Key: HIVE-14094 URL: https://issues.apache.org/jira/browse/HIVE-14094 Project: Hive Issue Type: Improvement Components: Metastore Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Remove the unused function closeFs from Warehouse.java. After HIVE-10922, no one calls Warehouse.closeFs. It would be good to delete this function to prevent people from using it; closing a FileSystem is normally not safe because the FileSystem is usually shared.
[jira] [Created] (HIVE-14067) Rename pendingCount to activeCalls in HiveSessionImpl for easier understanding.
zhihai xu created HIVE-14067: Summary: Rename pendingCount to activeCalls in HiveSessionImpl for easier understanding. Key: HIVE-14067 URL: https://issues.apache.org/jira/browse/HIVE-14067 Project: Hive Issue Type: Improvement Components: HiveServer2 Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Rename pendingCount to activeCalls in HiveSessionImpl for easier understanding.
[jira] [Created] (HIVE-13960) Session timeout may happen before HIVE_SERVER2_IDLE_SESSION_TIMEOUT for back-to-back synchronous operations.
zhihai xu created HIVE-13960: Summary: Session timeout may happen before HIVE_SERVER2_IDLE_SESSION_TIMEOUT for back-to-back synchronous operations. Key: HIVE-13960 URL: https://issues.apache.org/jira/browse/HIVE-13960 Project: Hive Issue Type: Bug Components: HiveServer2 Reporter: zhihai xu Assignee: zhihai xu Session timeout may happen before HIVE_SERVER2_IDLE_SESSION_TIMEOUT (hive.server2.idle.session.timeout) for back-to-back synchronous operations. This can happen with two operations op1 and op2, where op2 is a synchronous long-running operation that starts right after op1 is closed:
1. closeOperation(op1) is called: this sets {{lastIdleTime}} to System.currentTimeMillis(), because {{opHandleSet}} becomes empty after {{closeOperation}} removes op1 from it.
2. op2 runs for a long time via {{executeStatement}}, right after closeOperation(op1). If op2 runs for more than HIVE_SERVER2_IDLE_SESSION_TIMEOUT, the session times out even though op2 is still running.
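The scenario above can be modeled with a toy session (a stand-in, not HiveSessionImpl; the activeCalls counter is an assumed fix direction, not necessarily the committed patch): lastIdleTime is stamped when the op set becomes empty, but a synchronous call started right afterwards never clears it, so a check that ignores in-flight calls fires spuriously.

```java
// Toy model of the premature idle timeout and a fix that treats a session
// with in-flight synchronous calls as not idle.
public class IdleTimeoutDemo {
    long lastIdleTime = 0;                 // 0 means "not marked idle"
    int activeCalls = 0;

    void closeOperation(long now) { lastIdleTime = now; }   // opHandleSet empty
    void beginCall() { activeCalls++; lastIdleTime = 0; }
    void endCall(long now) { activeCalls--; if (activeCalls == 0) lastIdleTime = now; }

    // Buggy check: only looks at lastIdleTime, like the scenario above.
    boolean timedOutBuggy(long now, long timeout) {
        return lastIdleTime > 0 && now - lastIdleTime > timeout;
    }

    public static void main(String[] args) {
        IdleTimeoutDemo s = new IdleTimeoutDemo();
        s.closeOperation(0);   // op1 closed at t=0 stamps lastIdleTime
        // Without beginCall() clearing lastIdleTime, a synchronous op2
        // started here would still be "idle" under the buggy check:
        long t = 10_000, timeout = 5_000;
        System.out.println(s.timedOutBuggy(t, timeout));  // fires despite op2

        s.beginCall();          // op2 (synchronous executeStatement) starts
        System.out.println(s.timedOutBuggy(t, timeout));  // no longer fires
    }
}
```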
[jira] [Created] (HIVE-13760) Add a HIVE_QUERY_TIMEOUT configuration to kill a query if a query is running for more than the configured timeout value.
zhihai xu created HIVE-13760: Summary: Add a HIVE_QUERY_TIMEOUT configuration to kill a query if a query is running for more than the configured timeout value. Key: HIVE-13760 URL: https://issues.apache.org/jira/browse/HIVE-13760 Project: Hive Issue Type: Improvement Components: Configuration Affects Versions: 2.0.0 Reporter: zhihai xu Add a HIVE_QUERY_TIMEOUT configuration to kill a query if it runs for more than the configured timeout value. The default value will be -1, which means no timeout. This will be useful for users to manage queries with SLAs.
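One way such a timeout could be wired up is sketched below. This is purely illustrative, not Hive's implementation: the query becomes a Callable, a bounded get() enforces the timeout, and a negative value (matching the proposed -1 default) disables it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of a query-timeout mechanism: run the query on a
// worker thread, wait up to timeoutSec, and cancel it if time runs out.
public class QueryTimeoutDemo {
    // Returns true if the query finished in time, false if it was killed
    // (or failed). timeoutSec < 0 means "no timeout", like the -1 default.
    public static boolean runWithTimeout(Callable<?> query, long timeoutSec) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<?> f = pool.submit(query);
        try {
            if (timeoutSec < 0) {
                f.get();                              // no timeout configured
            } else {
                f.get(timeoutSec, TimeUnit.SECONDS);  // bounded wait
            }
            return true;
        } catch (TimeoutException e) {
            f.cancel(true);                           // kill the query
            return false;
        } catch (InterruptedException | ExecutionException e) {
            return false;                             // query failed or interrupted
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Callable<Void> slow = () -> { Thread.sleep(2000); return null; };
        System.out.println(runWithTimeout(slow, 1));   // killed after 1s
        Callable<Integer> fast = () -> 42;
        System.out.println(runWithTimeout(fast, -1));  // no timeout applied
    }
}
```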
[jira] [Created] (HIVE-13629) Expose Merge-File task and Column-Truncate task from DDLTask
zhihai xu created HIVE-13629: Summary: Expose Merge-File task and Column-Truncate task from DDLTask Key: HIVE-13629 URL: https://issues.apache.org/jira/browse/HIVE-13629 Project: Hive Issue Type: Improvement Components: Hive Affects Versions: 2.0.0 Reporter: zhihai xu Assignee: zhihai xu DDLTask creates subtasks in mergeFiles and truncateTable to support HiveOperation.TRUNCATETABLE, HiveOperation.ALTERTABLE_MERGEFILES and HiveOperation.ALTERPARTITION_MERGEFILES. It would be better to expose the tasks created in the mergeFiles and truncateTable functions of DDLTask to users.