[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790895#action_12790895
 ] 

Todd Lipcon commented on MAPREDUCE-1296:
----------------------------------------

bq. FWIW, this is really the same condition as "datanode fails on one bad file 
system" (whichever JIRA that is).

I think it's a bit different from that one - in the "out of space" case, you 
don't want to blacklist the volume: as you said, space usage fluctuates, and a 
disk that has run out of space once won't necessarily stay that way.


As for the usefulness of "reserved", would it be more useful if it were 
relative to the _remaining_ space instead of the _capacity_? That is to say, a 
DN would check the remaining space on the volume, and never write a block to 
that volume if less than _reserved_ is available. In some ways, it might make 
sense to allow for both - one is a "max amount of usage for DFS on the 
volume" and the other is "don't write to this volume if there's less than X 
free" (for the case when you've under-reserved).

Should we move this discussion to an HDFS ticket?

> Tasks fail after the first disk (/grid/0/) of all TTs reaches 100%, even 
> though other disks still have space.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1296
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1296
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>    Affects Versions: 0.20.2
>            Reporter: Iyappan Srinivasan
>
> Tasks fail after the first disk (/grid/0/) of all TTs reaches 100%, even 
> though other disks still have space.
> In a cluster, data is distributed almost uniformly.  Disk /grid/0/ reaches 
> 100% first, because it also accumulates extra data such as logs. After it 
> reaches 100%, tasks start to fail with the error: 
> java.lang.Throwable: Child Error
>       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:516)
> Caused by: java.io.IOException: Task process exit with nonzero status of 1.
>       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:503)
> This happens even though the other disks are still at 80% and could still 
> hold more data.
> Steps to reproduce:
> 1) Bring up  a cluster with Linux task controller.
> 2) Start filling the dfs up with data using randomwriter or teragen.
> 3) Once the first disk reaches 100%, tasks start to fail.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
