Chris Drome created HIVE-14344:
----------------------------------

             Summary: Intermittent failures caused by leaking delegation tokens
                 Key: HIVE-14344
                 URL: https://issues.apache.org/jira/browse/HIVE-14344
             Project: Hive
          Issue Type: Bug
          Components: Tez
    Affects Versions: 2.1.0, 1.2.1
            Reporter: Chris Drome
            Assignee: Chris Drome

We have experienced random job failures caused by leaking delegation tokens. The Tez child task fails because it attempts to read from the delegation tokens directory of a different (related) task.

The failure results in the following type of stack trace:
{noformat}
2016-07-21 16:57:18,061 [FATAL] [TezChild] |tez.ReduceRecordSource|: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0)
    at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
    at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:292)
    at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:249)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:148)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:362)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1738)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: java.io.IOException: Exception reading file:/grid/4/tmp/yarn-local/usercache/.../appcache/application_1468602386465_489814/container_e02_1468602386465_489814_01_000001/container_tokens
    at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.first(RowContainer.java:237)
    at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.first(RowContainer.java:74)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:650)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:756)
    at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinObject(CommonMergeJoinOperator.java:316)
    at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:279)
    at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:272)
    at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:258)
    at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:361)
    ... 17 more
Caused by: java.lang.RuntimeException: java.io.IOException: Exception reading file:/grid/4/tmp/yarn-local/usercache/.../appcache/application_1468602386465_489814/container_e02_1468602386465_489814_01_000001/container_tokens
    at org.apache.hadoop.mapreduce.security.TokenCache.mergeBinaryTokens(TokenCache.java:141)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:119)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:206)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.first(RowContainer.java:222)
    ... 25 more
Caused by: java.io.IOException: Exception reading file:/grid/4/tmp/yarn-local/usercache/.../appcache/application_1468602386465_489814/container_e02_1468602386465_489814_01_000001/container_tokens
    at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:175)
    at org.apache.hadoop.mapreduce.security.TokenCache.mergeBinaryTokens(TokenCache.java:136)
    ... 32 more
Caused by: java.io.FileNotFoundException: File file:/grid/4/tmp/yarn-local/usercache/.../appcache/application_1468602386465_489814/container_e02_1468602386465_489814_01_000001/container_tokens does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
    at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:170)
    ... 33 more
{noformat}

The application that failed was {{application_1468602386465_489844}}, but the error refers to {{appcache/application_1468602386465_489814/container_e02_1468602386465_489814_01_000001/container_tokens}}, which belongs to a different application.

This seems to manifest only via HiveAction through Oozie.
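The mismatch described above can be checked mechanically when triaging similar failures. The following is a minimal sketch, not part of Hive or Hadoop: the function names {{app_id_in_token_path}} and {{is_foreign_token_path}} are hypothetical, and it assumes the token file path embeds the owning application ID in the {{application_<timestamp>_<sequence>}} form seen in the trace. The elided {{...}} segment of the path is kept as reported.

```python
import re
from typing import Optional

# Matches YARN application IDs such as application_1468602386465_489814.
APP_ID_RE = re.compile(r"application_\d+_\d+")


def app_id_in_token_path(path: str) -> Optional[str]:
    """Return the application ID embedded in a container_tokens path, if any.

    Hypothetical helper for log triage; it only inspects the path string.
    """
    if not path.rstrip("/").endswith("container_tokens"):
        return None
    match = APP_ID_RE.search(path)
    return match.group(0) if match else None


def is_foreign_token_path(failing_app_id: str, path: str) -> bool:
    """True when the token file belongs to an application other than the one that failed."""
    owner = app_id_in_token_path(path)
    return owner is not None and owner != failing_app_id


# Path copied from the FileNotFoundException above (the "..." is elided in the report).
TOKEN_PATH = (
    "file:/grid/4/tmp/yarn-local/usercache/.../appcache/"
    "application_1468602386465_489814/"
    "container_e02_1468602386465_489814_01_000001/container_tokens"
)

if __name__ == "__main__":
    # The failing application was application_1468602386465_489844.
    print(is_foreign_token_path("application_1468602386465_489844", TOKEN_PATH))
```

Run against the values in this report, the check flags the path as foreign, since the failing application ends in 489844 while the token file sits under 489814.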