I am testing MRv2 on a small real cluster, but I am seeing flaky behavior when running jobs. The exact same job with the same configuration sometimes runs successfully and sometimes fails with one of the errors below. As far as I can tell it is random: the job can fail with an error one time and then run normally the next time, and so on.
Has anyone seen this behavior before?

ERROR 1:
--------------
11/07/13 13:21:22 INFO ipc.HadoopYarnRPC: Creating a HadoopYarnProtoRpc proxy for protocol interface org.apache.hadoop.mapreduce.v2.api.MRClientProtocol
11/07/13 13:21:22 INFO mapred.ClientServiceDelegate: Connecting to 172.29.5.33:52675
11/07/13 13:21:22 INFO ipc.HadoopYarnRPC: Creating a HadoopYarnProtoRpc proxy for protocol interface org.apache.hadoop.mapreduce.v2.api.MRClientProtocol
11/07/13 13:21:23 INFO ipc.Client: Retrying connect to server: /172.29.5.33:52675. Already tried 0 time(s).
11/07/13 13:21:24 INFO ipc.Client: Retrying connect to server: /172.29.5.33:52675. Already tried 1 time(s).
11/07/13 13:21:25 INFO ipc.Client: Retrying connect to server: /172.29.5.33:52675. Already tried 2 time(s).
java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:161)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:285)
    at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:522)
    at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:540)
    at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1130)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1084)
    at org.apache.hadoop.examples.WordCount.main(WordCount.java:84)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
    at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
Caused by: com.google.protobuf.ServiceException: java.net.ConnectException: Call to /172.29.5.33:52675 failed on connection exception: java.net.ConnectException: Connection refused
    at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:96)
    at $Proxy9.getTaskAttemptCompletionEvents(Unknown Source)
    at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:154)
    ... 18 more

ERROR 2:
--------------
11/07/13 13:32:30 INFO mapred.ClientServiceDelegate: Connecting to 172.29.5.34:41667
11/07/13 13:32:30 INFO ipc.HadoopYarnRPC: Creating a HadoopYarnProtoRpc proxy for protocol interface org.apache.hadoop.mapreduce.v2.api.MRClientProtocol
11/07/13 13:32:35 INFO mapreduce.Job: Task Id : attempt_1310587965851_0005_m_000000_0, Status : FAILED
java.io.FileNotFoundException: File file:/tmp/nm-local-dir/usercache/ahmed/appcache/application_1310587965851_0005 does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:412)
    at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:109)
    at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:74)
    at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:332)
    at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:367)
    at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:551)
    at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:630)
    at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:627)
    at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2278)
    at org.apache.hadoop.fs.FileContext.create(FileContext.java:627)
    at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2097)
    at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2039)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:81)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:779)

-- Ahmed
