[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790859#action_12790859
 ] 

Allen Wittenauer commented on MAPREDUCE-1296:
---------------------------------------------

It is deterministic due to Yahoo!-specific configs, though.  For example, in the 
case of LinkedIn, we specifically put the MapReduce spill space onto a 
separate, fixed-size file system, so it would not be deterministic for us.  
There really is no other workaround, because it is nearly impossible to 
control how large and how numerous the distributed caches, tmp space, etc., of 
a job may be.  [... and thus one of my big complaints about using the negative 
math to determine the size of HDFS.  It doesn't really work in practice.  You 
need to control -both- sizes.]
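
To make the "negative math" complaint concrete, here is a minimal sketch. It 
assumes the phrase refers to sizing HDFS as "disk capacity minus a reserved 
amount" (the model behind settings such as dfs.datanode.du.reserved). The 
class name, method names, and all numbers are hypothetical and chosen only for 
illustration.

```java
public class ReservedSpaceSketch {
    // Space HDFS is allowed to use when its size is derived by subtraction.
    static long hdfsBudget(long diskCapacity, long reservedForNonDfs) {
        return Math.max(0L, diskCapacity - reservedForNonDfs);
    }

    // Bytes by which combined demand exceeds the disk; 0 if everything fits.
    static long overcommit(long diskCapacity, long hdfsUsed, long nonDfsUsed) {
        return Math.max(0L, hdfsUsed + nonDfsUsed - diskCapacity);
    }

    public static void main(String[] args) {
        long gb = 1L << 30;
        long capacity = 1000 * gb;   // hypothetical disk size
        long reserved = 100 * gb;    // hypothetical reserve for non-DFS use
        long budget = hdfsBudget(capacity, reserved);

        // Nothing caps spill, distributed cache, tmp, or logs at the reserve.
        // Assume a job pushes non-DFS usage to 150 GB while HDFS is at budget.
        long nonDfs = 150 * gb;
        System.out.println(overcommit(capacity, budget, nonDfs) / gb); // prints 50
    }
}
```

The subtraction only bounds the HDFS side. The non-DFS side has no limit of 
its own, which is why a separate fixed-size file system for spill space caps 
both.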

FWIW, this is really the same condition as "datanode fails on one bad file 
system" (whichever JIRA that is).

> Tasks fail after the first disk (/grid/0/) of all TTs reaches 100%, even 
> though other disks still have space.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1296
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1296
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>    Affects Versions: 0.20.2
>            Reporter: Iyappan Srinivasan
>
> Tasks fail after the first disk (/grid/0/) of all TTs reaches 100%, even 
> though other disks still have space.
> In a cluster, data is distributed almost uniformly.  Disk /grid/0/ reaches 
> 100% first because it also holds extra data such as logs. After it 
> reaches 100%, tasks start to fail with the error:
> java.lang.Throwable: Child Error
>       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:516)
> Caused by: java.io.IOException: Task process exit with nonzero status of 1.
>       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:503)
> This happens even though the other disks are only at 80% and could still be 
> filled further.
> Steps to reproduce:
> 1) Bring up a cluster with the Linux task controller.
> 2) Start filling DFS with data using randomwriter or teragen.
> 3) Once the first disk reaches 100%, tasks start to fail.
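
The behavior the reporter expects (keep working while any disk still has room) 
can be sketched as below. This is not Hadoop's allocator and says nothing 
about where the actual failure occurs. LocalDirPicker and pickDir are 
hypothetical names, and the sketch assumes free space is judged with 
java.io.File.getUsableSpace().

```java
import java.io.File;
import java.util.List;
import java.util.Optional;

public class LocalDirPicker {
    // Return the first writable directory with at least bytesNeeded usable
    // bytes, skipping full or missing disks instead of failing on the first.
    static Optional<File> pickDir(List<File> dirs, long bytesNeeded) {
        for (File dir : dirs) {
            if (dir.isDirectory() && dir.canWrite()
                    && dir.getUsableSpace() >= bytesNeeded) {
                return Optional.of(dir);
            }
        }
        return Optional.empty();
    }
}
```

With candidate directories such as /grid/0/ and /grid/1/, a full /grid/0/ would 
simply be skipped. Anything pinned to a single disk (logs, for example) is 
outside what a picker like this can help with.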

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
