Hi everybody! I'm back and pushing Crunch in a new organisation. I'm having some strange non-deterministic problems with the end of my Crunch job executions in a new environment. I've got some possible ideas as to why it's happening, but no good ideas for workarounds, so I was hoping somebody might be able to help me out. Basically, this is what it looks like:

15/10/30 15:01:55 INFO jobcontrol.CrunchControlledJob: Running job "crunching.CountEventsByType: SeqFile([{REDACTED}... ID=1 (1/1)"
15/10/30 15:01:55 INFO jobcontrol.CrunchControlledJob: Job status available at: {REDACTED}/proxy/application_1443106319465_13029/
15/10/30 15:05:02 INFO ipc.Client: Retrying connect to server: {REDACTED}. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
15/10/30 15:05:03 INFO ipc.Client: Retrying connect to server: {REDACTED}. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
15/10/30 15:05:04 INFO ipc.Client: Retrying connect to server: {REDACTED}. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
15/10/30 15:05:04 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
15/10/30 15:05:04 ERROR exec.MRExecutor: Pipeline failed due to exception
java.io.IOException: java.lang.NullPointerException
    at org.apache.crunch.impl.mr.exec.CrunchJobHooks$CompletionHook.handleMultiPaths(CrunchJobHooks.java:99)
    at org.apache.crunch.impl.mr.exec.CrunchJobHooks$CompletionHook.run(CrunchJobHooks.java:86)
    at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.checkRunningState(CrunchControlledJob.java:288)
    at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.checkState(CrunchControlledJob.java:299)
    at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.checkRunningJobs(CrunchJobControl.java:201)
    at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJobStatusAndStartNewOnes(CrunchJobControl.java:321)
    at org.apache.crunch.impl.mr.exec.MRExecutor.monitorLoop(MRExecutor.java:131)
    at org.apache.crunch.impl.mr.exec.MRExecutor.access$000(MRExecutor.java:58)
    at org.apache.crunch.impl.mr.exec.MRExecutor$1.run(MRExecutor.java:90)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.mapreduce.Job$1.run(Job.java:325)
    at org.apache.hadoop.mapreduce.Job$1.run(Job.java:322)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:322)
    at org.apache.hadoop.mapreduce.Job.isSuccessful(Job.java:632)
    at org.apache.crunch.impl.mr.exec.CrunchJobHooks$CompletionHook.handleMultiPaths(CrunchJobHooks.java:91)
    ... 9 more

The corresponding line in the Hadoop source is this:

    return cluster.getClient().getJobStatus(status.getJobID());

The only NPE-generating part of this is that getClient() could return null, but I'm not exactly sure what could cause that. We have some intermittent problems with our job history server (returning "not found" for whatever job it looks up) which could well be correlated to this, but I would expect that to fail at the getJobStatus part rather than the getClient part. This would, however, agree with the fact that the job reports itself as SUCCEEDED before it fails during the handleMultiPaths section (as perhaps the request to check status there will get routed to the job history server).

This happens with any Crunch job I try to run on this cluster, but there are plenty of "plain old MapReduce" jobs running on this cluster with no issues, so I'm struggling to find reasons why Crunch would fail where the others are succeeding.

Thanks,
David