[ https://issues.apache.org/jira/browse/MAPREDUCE-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13133426#comment-13133426 ]
Vinod Kumar Vavilapalli commented on MAPREDUCE-2708:
----------------------------------------------------

I hit some kind of a blocker here. A normally finishing job-history file for my small job (6 maps of 1 min sleep each) is about 60KB:

bq. -rw-rw---- 3 nobody rm 60207 2011-10-22 21:20 /job-history-root/history/done/2011/10/22/000000/job_1319280146725_0003-1319298340296-nobody-Sleep+job-1319298659124-6-1-SUCCEEDED.jhist

Now, if I kill the AM after a couple of tasks, the NN shows the byte count as zero:

bq. -rw-r--r-- 3 nobody supergroup 0 2011-10-22 21:15 /user/nobody/staging1234/nobody/.staging/job_1319280146725_0003_1.jhist

And whether the new-generation AM tries to read this file for recovery, or I manually try to read it via the dfs command, it errors out:

{quote}
11/10/22 21:30:31 DEBUG ipc.Client: closing ipc connection to /127.0.0.1:50020: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
	at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:535)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1152)
	at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:499)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:583)
	at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:205)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1195)
	at org.apache.hadoop.ipc.Client.call(Client.java:1065)
	at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:244)
	at $Proxy10.getReplicaVisibleLength(Unknown Source)
	at org.apache.hadoop.hdfs.protocolR23Compatible.ClientDatanodeProtocolTranslatorR23.getReplicaVisibleLength(ClientDatanodeProtocolTranslatorR23.java:121)
	at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:163)
	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:140)
	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:111)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:569)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:235)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:585)
	at org.apache.hadoop.fs.shell.Display$Cat.getInputStream(Display.java:93)
	at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:81)
	at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:300)
	at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:272)
	at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:255)
	at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:239)
	at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:185)
	at org.apache.hadoop.fs.shell.Command.run(Command.java:149)
	at org.apache.hadoop.fs.FsShell.run(FsShell.java:254)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:83)
	at org.apache.hadoop.fs.FsShell.main(FsShell.java:296)
....
11/10/22 21:30:31 ERROR ipc.RPC: Tried to call RPC.stopProxy on an object that is not a proxy.
java.lang.IllegalArgumentException: not a proxy instance
	at java.lang.reflect.Proxy.getInvocationHandler(Proxy.java:637)
	at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:479)
	at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:183)
	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:140)
	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:111)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:569)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:235)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:585)
	at org.apache.hadoop.fs.shell.Display$Cat.getInputStream(Display.java:93)
	at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:81)
	at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:300)
	at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:272)
	at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:255)
	at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:239)
	at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:185)
	at org.apache.hadoop.fs.shell.Command.run(Command.java:149)
	at org.apache.hadoop.fs.FsShell.run(FsShell.java:254)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:83)
	at org.apache.hadoop.fs.FsShell.main(FsShell.java:296)
11/10/22 21:30:31 ERROR ipc.RPC: Could not get invocation handler null for proxy class class org.apache.hadoop.hdfs.protocolR23Compatible.ClientDatanodeProtocolTranslatorR23, or invocation handler is not closeable.
cat: Cannot obtain block length for LocatedBlock[BP-995821427-127.0.0.1-1318832709756:blk_-7812123742502704244_1249; getBlockSize()=0; corrupt=false; offset=0; locs=[127.0.0.1:999]
{quote}

So it looks like we are in a fix if the job-history file fits in a single block and that block isn't complete yet.
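Whatever the eventual fix, the recovery path has to treat opening the half-written .jhist file as a fallible operation rather than assuming the block length is readable. A minimal sketch of such a guarded open, in Python for illustration only: `open_fn` is a hypothetical stand-in for the actual `FileSystem.open` call on the history file, not an API from this patch.

```python
import time


def open_with_retry(open_fn, attempts=5, delay=1.0):
    """Call open_fn (a stand-in for opening the .jhist file) until it
    succeeds or attempts are exhausted; re-raise the last IOError if
    the file never becomes readable (e.g. its block is still incomplete)."""
    last_err = None
    for _ in range(attempts):
        try:
            return open_fn()
        except IOError as err:
            last_err = err       # block length still unobtainable; wait and retry
            time.sleep(delay)
    raise last_err
```

On a bounded failure like this, the caller can then fall back to starting the job from scratch instead of crashing the new AM generation.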
I could try a small block size, say 25-30K, for the job-history file, but is that okay for running on clusters? Sharad?

> [MR-279] Design and implement MR Application Master recovery
> ------------------------------------------------------------
>
>                  Key: MAPREDUCE-2708
>                  URL: https://issues.apache.org/jira/browse/MAPREDUCE-2708
>              Project: Hadoop Map/Reduce
>           Issue Type: Sub-task
>           Components: applicationmaster, mrv2
>     Affects Versions: 0.23.0
>             Reporter: Sharad Agarwal
>             Assignee: Sharad Agarwal
>             Priority: Blocker
>              Fix For: 0.23.0
>
>          Attachments: MAPREDUCE-2708-20111021.1.txt, MAPREDUCE-2708-20111021.txt, MAPREDUCE-2708-20111022.txt, mr2708_v1.patch, mr2708_v2.patch
>
>
> Design recovery of the MR AM from crashes/node failures. The running job should recover from the state it left off at.