[ https://issues.apache.org/jira/browse/MAPREDUCE-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13133426#comment-13133426 ]
Vinod Kumar Vavilapalli commented on MAPREDUCE-2708:
----------------------------------------------------

I hit some kind of a blocker here. A normally finishing job-history file for my small job (6 maps of 1 min sleep each) is about 60KB:

bq. -rw-rw---- 3 nobody rm 60207 2011-10-22 21:20 /job-history-root/history/done/2011/10/22/000000/job_1319280146725_0003-1319298340296-nobody-Sleep+job-1319298659124-6-1-SUCCEEDED.jhist

Now, if I kill the AM after a couple of tasks, the NN shows the byte count as zero:

bq. -rw-r--r-- 3 nobody supergroup 0 2011-10-22 21:15 /user/nobody/staging1234/nobody/.staging/job_1319280146725_0003_1.jhist

And whether the new-generation AM tries to read this file for recovery, or I manually try to read it via the dfs command, it errors out:

{quote}
11/10/22 21:30:31 DEBUG ipc.Client: closing ipc connection to /127.0.0.1:50020: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
	at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:535)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1152)
	at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:499)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:583)
	at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:205)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1195)
	at org.apache.hadoop.ipc.Client.call(Client.java:1065)
	at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:244)
	at $Proxy10.getReplicaVisibleLength(Unknown Source)
	at org.apache.hadoop.hdfs.protocolR23Compatible.ClientDatanodeProtocolTranslatorR23.getReplicaVisibleLength(ClientDatanodeProtocolTranslatorR23.java:121)
	at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:163)
	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:140)
	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:111)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:569)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:235)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:585)
	at org.apache.hadoop.fs.shell.Display$Cat.getInputStream(Display.java:93)
	at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:81)
	at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:300)
	at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:272)
	at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:255)
	at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:239)
	at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:185)
	at org.apache.hadoop.fs.shell.Command.run(Command.java:149)
	at org.apache.hadoop.fs.FsShell.run(FsShell.java:254)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:83)
	at org.apache.hadoop.fs.FsShell.main(FsShell.java:296)
....
11/10/22 21:30:31 ERROR ipc.RPC: Tried to call RPC.stopProxy on an object that is not a proxy.
java.lang.IllegalArgumentException: not a proxy instance
	at java.lang.reflect.Proxy.getInvocationHandler(Proxy.java:637)
	at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:479)
	at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:183)
	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:140)
	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:111)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:569)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:235)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:585)
	at org.apache.hadoop.fs.shell.Display$Cat.getInputStream(Display.java:93)
	at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:81)
	at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:300)
	at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:272)
	at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:255)
	at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:239)
	at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:185)
	at org.apache.hadoop.fs.shell.Command.run(Command.java:149)
	at org.apache.hadoop.fs.FsShell.run(FsShell.java:254)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:83)
	at org.apache.hadoop.fs.FsShell.main(FsShell.java:296)
11/10/22 21:30:31 ERROR ipc.RPC: Could not get invocation handler null for proxy class class org.apache.hadoop.hdfs.protocolR23Compatible.ClientDatanodeProtocolTranslatorR23, or invocation handler is not closeable.
cat: Cannot obtain block length for LocatedBlock[BP-995821427-127.0.0.1-1318832709756:blk_-7812123742502704244_1249; getBlockSize()=0; corrupt=false; offset=0; locs=[127.0.0.1:999]
{quote}

So it looks like we are in a fix if the job-history file fits in a single block and that block isn't complete yet.
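Whatever the eventual fix, the recovery path has to treat opening the half-written .jhist file as a fallible operation rather than assuming the block length is readable. A minimal sketch of such a guarded open, in Python for illustration only: `open_fn` is a hypothetical stand-in for the actual `FileSystem.open` call on the history file, not an API from this patch.

```python
import time


def open_with_retry(open_fn, attempts=5, delay=1.0):
    """Call open_fn (a stand-in for opening the .jhist file) until it
    succeeds or attempts are exhausted; re-raise the last IOError if
    the file never becomes readable (e.g. its block is still incomplete)."""
    last_err = None
    for _ in range(attempts):
        try:
            return open_fn()
        except IOError as err:
            last_err = err       # block length still unobtainable; wait and retry
            time.sleep(delay)
    raise last_err
```

On a bounded failure like this, the caller can then fall back to starting the job from scratch instead of crashing the new AM generation.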
I could try a small block size, say 25-30K, for the job-history file, but is that okay for running on clusters? Sharad?

> [MR-279] Design and implement MR Application Master recovery
> ------------------------------------------------------------
>
>                  Key: MAPREDUCE-2708
>                  URL: https://issues.apache.org/jira/browse/MAPREDUCE-2708
>              Project: Hadoop Map/Reduce
>           Issue Type: Sub-task
>           Components: applicationmaster, mrv2
>     Affects Versions: 0.23.0
>             Reporter: Sharad Agarwal
>             Assignee: Sharad Agarwal
>             Priority: Blocker
>              Fix For: 0.23.0
>
>          Attachments: MAPREDUCE-2708-20111021.1.txt, MAPREDUCE-2708-20111021.txt, MAPREDUCE-2708-20111022.txt, mr2708_v1.patch, mr2708_v2.patch
>
>
> Design recovery of the MR AM from crashes/node failures. The running job should recover from the state it left off at.