[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113126#comment-13113126
 ] 

Kihwal Lee commented on MAPREDUCE-2928:
---------------------------------------

bq. Sounds good. Since transient disk failures may cause a file system to 
become read-only (causing permanent failures) sometimes re-mounting is 
sufficient to recover in which case it makes sense to re-enable faulty disks 
w/o TT restart.

We need to consider the opposite case as well. The file system on a failing 
disk can be remounted rw and may work for a short time, but then fail 
again. I am sure you've seen this too, since you said "sometimes". We need to 
make sure the potential performance degradation caused by health checks 
against a bad disk is limited. The checking thread can hang for a long time 
(e.g. 300 seconds) until the kernel turns the file system back to ro.
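
To illustrate what I mean by limiting the damage, here is a rough sketch of a 
health check bounded by a timeout. This is not TaskTracker code and not what 
any of the attached patches do; BoundedDiskCheck, Result, check and 
writeProbe are hypothetical names made up for this example. The idea is that 
the probe runs on a separate daemon thread, so the caller waits at most 
timeoutMs even if the probe itself stays stuck in the kernel.

```java
import java.io.File;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs a disk probe but never waits longer than a timeout. */
public class BoundedDiskCheck {

    /** Outcome of a single probe. */
    public enum Result { HEALTHY, FAILED, TIMED_OUT }

    // Daemon threads, so a probe stuck in the kernel cannot block JVM exit.
    // Assumption: checks are infrequent, so an unbounded cached pool is acceptable here.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "disk-check");
        t.setDaemon(true);
        return t;
    });

    /** Runs the probe and waits at most timeoutMs milliseconds for its answer. */
    public static Result check(Callable<Boolean> probe, long timeoutMs) {
        Future<Boolean> future = POOL.submit(probe);
        try {
            Boolean ok = future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return Boolean.TRUE.equals(ok) ? Result.HEALTHY : Result.FAILED;
        } catch (TimeoutException e) {
            // Best effort only: a thread blocked in uninterruptible I/O may not
            // react to the interrupt, but the caller is no longer waiting on it.
            future.cancel(true);
            return Result.TIMED_OUT;
        } catch (ExecutionException e) {
            // The probe threw, e.g. an IOException from a read-only file system.
            return Result.FAILED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Result.FAILED;
        }
    }

    /** Example probe: create and delete a temp file in the given directory. */
    public static Callable<Boolean> writeProbe(File dir) {
        return () -> {
            File f = File.createTempFile("probe", ".tmp", dir);
            return f.delete();
        };
    }
}
```

A caller would do something like check(writeProbe(localDir), 10000) and treat 
TIMED_OUT the same as FAILED. A disk that keeps timing out should probably 
not be re-enabled right away, otherwise the hung probe threads pile up.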

Hey, I will also be out of town for three weeks and come back on 10/17. GMTA. :)

> MR-2413 improvements
> --------------------
>
>                 Key: MAPREDUCE-2928
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2928
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>          Components: tasktracker
>            Reporter: Eli Collins
>            Assignee: Eli Collins
>             Fix For: 0.20.205.0
>
>         Attachments: mapreduce-2928-1.patch, mapreduce-2928-2.patch, 
> mapreduce-2928-3.patch
>
>
> Tracks improvements to MR-2413. See [this 
> comment|https://issues.apache.org/jira/browse/MAPREDUCE-2413?focusedCommentId=13095073&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13095073].

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
