David! Welcome back!
I haven't hit that one before; if you tweak handleMultiPaths to look like
the version below, does it fix the issue?
J
private synchronized void handleMultiPaths(MRJob job) throws IOException {
  try {
    if (job.getJobState() == MRJob.State.SUCCESS) {
      if (!multiPaths.isEmpty()) {
        for (Map.Entry<Integer, PathTarget> entry : multiPaths.entrySet()) {
          entry.getValue().handleOutputs(job.getJob().getConfiguration(),
              workingPath, entry.getKey());
        }
      }
    }
  } catch (Exception ie) {
    throw new IOException(ie);
  }
}
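If I'm reading the stack trace right, the NPE is coming out of the
job.getJob().isSuccessful() call at CrunchJobHooks.java:91, which goes back
to the cluster/job history server for status; the version above checks
Crunch's own tracked state via job.getJobState() instead, and wraps the rest
in a try/catch so anything unexpected surfaces as an IOException with its
cause attached.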
On Fri, Oct 30, 2015 at 8:21 AM, David Whiting <[email protected]>
wrote:
> Hi everybody! I'm back and pushing Crunch in a new organisation.
>
> I'm having some strange non-deterministic problems with the end of my
> Crunch job executions in a new environment - I've got some possible ideas
> as to why it's happening, but no good ideas for workarounds, so I was
> hoping somebody might be able to help me out. Basically, this is what it
> looks like:
>
> 15/10/30 15:01:55 INFO jobcontrol.CrunchControlledJob: Running job
> "crunching.CountEventsByType: SeqFile([{REDACTED}... ID=1 (1/1)"
> 15/10/30 15:01:55 INFO jobcontrol.CrunchControlledJob: Job status available
> at: {REDACTED}/proxy/application_1443106319465_13029/
> 15/10/30 15:05:02 INFO ipc.Client: Retrying connect to server: {REDACTED}.
> Already tried 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000
> MILLISECONDS)
> 15/10/30 15:05:03 INFO ipc.Client: Retrying connect to server: {REDACTED}.
> Already tried 1 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000
> MILLISECONDS)
> 15/10/30 15:05:04 INFO ipc.Client: Retrying connect to server: {REDACTED}.
> Already tried 2 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000
> MILLISECONDS)
> 15/10/30 15:05:04 INFO mapred.ClientServiceDelegate: Application state is
> completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history
> server
> 15/10/30 15:05:04 ERROR exec.MRExecutor: Pipeline failed due to exception
> java.io.IOException: java.lang.NullPointerException
>         at org.apache.crunch.impl.mr.exec.CrunchJobHooks$CompletionHook.handleMultiPaths(CrunchJobHooks.java:99)
>         at org.apache.crunch.impl.mr.exec.CrunchJobHooks$CompletionHook.run(CrunchJobHooks.java:86)
>         at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.checkRunningState(CrunchControlledJob.java:288)
>         at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.checkState(CrunchControlledJob.java:299)
>         at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.checkRunningJobs(CrunchJobControl.java:201)
>         at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJobStatusAndStartNewOnes(CrunchJobControl.java:321)
>         at org.apache.crunch.impl.mr.exec.MRExecutor.monitorLoop(MRExecutor.java:131)
>         at org.apache.crunch.impl.mr.exec.MRExecutor.access$000(MRExecutor.java:58)
>         at org.apache.crunch.impl.mr.exec.MRExecutor$1.run(MRExecutor.java:90)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>         at org.apache.hadoop.mapreduce.Job$1.run(Job.java:325)
>         at org.apache.hadoop.mapreduce.Job$1.run(Job.java:322)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>         at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:322)
>         at org.apache.hadoop.mapreduce.Job.isSuccessful(Job.java:632)
>         at org.apache.crunch.impl.mr.exec.CrunchJobHooks$CompletionHook.handleMultiPaths(CrunchJobHooks.java:91)
>         ... 9 more
>
> The corresponding line in the Hadoop source is this:
>
> return cluster.getClient().getJobStatus(status.getJobID());
>
> The only part of this that can generate an NPE is getClient() returning
> null, but I'm not exactly sure what could cause that. We have some
> intermittent problems with our job history server (it returns "not found"
> for whatever job it looks up), which could well be correlated with this,
> but I would expect that to fail at the getJobStatus part rather than the
> getClient part. It would, however, be consistent with the fact that the
> job reports itself as SUCCEEDED before failing in the handleMultiPaths
> section (perhaps because the status check there gets routed to the job
> history server).
>
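> To make it concrete where I think it blows up, here is that call with a
> hypothetical null guard around it (type names from memory, so treat this
> as a sketch rather than a patch):
>
> // Sketch only: if Cluster.getClient() returns null we NPE on the next line,
> // whereas a job-history-server "not found" should surface from
> // getJobStatus() itself.
> ClientProtocol client = cluster.getClient();
> if (client == null) {
>   throw new IOException("no MR client available - is the job history server unreachable?");
> }
> return client.getJobStatus(status.getJobID());
>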
> This happens with any Crunch job I try to run on this cluster, but there
> are plenty of "plain old MapReduce" jobs running on the same cluster with
> no issues, so I'm struggling to find a reason why Crunch would fail where
> the others succeed.
>
> Thanks,
> David
>