I think that diagnosis is correct, as we are no longer seeing those kinds of errors.
We reconfigured the cluster a bit.

I am trying:

pig -Dmapred.task.timeout=0 ./myscript.pig

but it doesn't accept properties set this way: it treats the option as invalid and
spews the pig "usage" text at me. I'm using pig 0.6; is this still supported?
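
For the record, one workaround I've seen (though I haven't confirmed it on 0.6, so treat it as a guess) is to pass the property through the PIG_OPTS environment variable, which the pig wrapper script forwards to the JVM:

```
PIG_OPTS="-Dmapred.task.timeout=0" pig ./myscript.pig
```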


On May 13, 2010, at 6:39 PM, Ashutosh Chauhan wrote:

> Yes, you can set job conf properties like that, but looking at your
> original stack trace, it seems that won't solve your problem. It looks
> like your cluster is in trouble. If you can, try restarting it. If
> not, you need to dig into the tasktracker / datanode logs.
> 
> Ashutosh
> 
> On Thu, May 13, 2010 at 16:07, Corbin Hoenes <[email protected]> wrote:
>> Zaki,
>> 
>> can pig take command lines like this to set job conf properties?
>> 
>> pig -Dmapred.task.timeout=0
>> 
>> On May 13, 2010, at 4:18 PM, zaki rahaman wrote:
>> 
>>> Hi Corbin,
>>> 
>>> The timeout error you're seeing could also indicate that your reducer is
>>> trying to process a very large key/group which may be the reason for the
>>> timeout in the first place. At least this is a behavior I've seen in the
>>> past.
>>> 
>>> On Thu, May 13, 2010 at 6:10 PM, Corbin Hoenes <[email protected]> wrote:
>>> 
>>>> Okay so what is the pig way to do this?
>>>> 
>>>> I've noticed a lot of chatter about Pig UDFs that don't call progress and can
>>>> cause jobs to get killed.  I am using only builtin UDFs like COUNT and
>>>> FLATTEN; do they suffer from this same issue (no progress calls)?
>>>> 
>>>> On May 12, 2010, at 2:56 AM, Andrey Stepachev wrote:
>>>> 
>>>>> You should report progress at an interval shorter than the configured
>>>>> timeout (in your case 600 seconds).
>>>>> Add code like the snippet below to your reducer and call ping() wherever
>>>>> you process tuples.
>>>>> 
>>>>>       final TaskAttemptContext context = <init in constructor>;
>>>>>       long lastTime = System.currentTimeMillis();
>>>>> 
>>>>>       public void ping() {
>>>>>           final long currtime = System.currentTimeMillis();
>>>>>           if (currtime - lastTime > 10000) {  // report at most once per 10s
>>>>>               context.progress();
>>>>>               lastTime = currtime;
>>>>>           }
>>>>>       }
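
The snippet above, generalized into a self-contained sketch of the same throttled-progress pattern. ProgressPinger is a hypothetical helper name, and the Runnable stands in for context.progress() (or a Hadoop Reporter) so the pattern can be exercised without a live job:

```java
// Hypothetical helper: calls the supplied progress callback at most
// once per interval, no matter how often ping() is invoked.
public class ProgressPinger {
    private final Runnable progress;   // stands in for context.progress()
    private final long intervalMs;
    private long lastTime;

    public ProgressPinger(Runnable progress, long intervalMs) {
        this.progress = progress;
        this.intervalMs = intervalMs;
        this.lastTime = System.currentTimeMillis();
    }

    public void ping() {
        final long now = System.currentTimeMillis();
        if (now - lastTime > intervalMs) {
            progress.run();
            lastTime = now;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final int[] calls = {0};
        ProgressPinger p = new ProgressPinger(() -> calls[0]++, 50);
        p.ping();                        // too soon: callback not invoked
        Thread.sleep(60);
        p.ping();                        // interval elapsed: invoked once
        if (calls[0] != 1) throw new AssertionError("expected 1 call");
        System.out.println("progress calls: " + calls[0]);
    }
}
```

In a real reducer or UDF you would call ping() from the per-tuple loop, so frequent calls stay cheap while the framework still sees regular progress.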
>>>>> 
>>>>> 
>>>>> 2010/5/11 Corbin Hoenes <[email protected]>
>>>>> 
>>>>>> Not sure I am clear on how I can debug stuff on a cluster.  I currently
>>>>>> have a long-running reducer that attempts to run 4 times before finally
>>>>>> giving up.
>>>>>> 
>>>>>> I get 4 of these: Task attempt_201005101345_0052_r_000012_0 failed to
>>>>>> report status for 601 seconds. Killing!
>>>>>> 
>>>>>> before it gives up...on the last try I noticed this in the log:
>>>>>> ERROR: org.apache.hadoop.hdfs.DFSClient - Exception closing file
>>>>>> /tmp/temp1925356068/tmp1003826561/_temporary/_attempt_201005101345_0052_r_000012_4/abs/tmp/temp1925356068/tmp-197182389/part-00012
>>>>>> : org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not
>>>>>> complete write to file
>>>>>> /tmp/temp1925356068/tmp1003826561/_temporary/_attempt_201005101345_0052_r_000012_4/abs/tmp/temp1925356068/tmp-197182389/part-00012
>>>>>> by DFSClient_attempt_201005101345_0052_r_000012_4
>>>>>>      at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:497)
>>>>>>      at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
>>>>>>      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>      at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
>>>>>>      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:966)
>>>>>>      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
>>>>>>      at java.security.AccessController.doPrivileged(Native Method)
>>>>>>      at javax.security.auth.Subject.doAs(Subject.java:396)
>>>>>>      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:960)
>>>>>> 
>>>>>> How do I turn on log4j's DEBUG statements?  Hoping those will help me
>>>>>> pinpoint what is going on here--maybe it's the cluster or maybe the script.
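
On the log4j question: one common approach (exact file locations vary by install, so take this as a sketch) is to raise the logging level in Hadoop's conf/log4j.properties on the task nodes and restart the daemons, e.g.:

```
# Hypothetical example: raise the HDFS client/server classes to DEBUG.
# Adjust the logger name to whichever package you want to trace.
log4j.logger.org.apache.hadoop.hdfs=DEBUG
```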
>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Zaki Rahaman
>> 
>> 
