[ https://issues.apache.org/jira/browse/HADOOP-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646955#action_12646955 ]
Devaraj Das commented on HADOOP-2483:
-------------------------------------

Steve, this issue is more about how the various distributed parts of the system work together in the event of failures and under stress. So yes, while you are right that disk failures can be part of this test, they can probably be covered easily in unit tests as well. As part of this issue, we could inject a fault that deletes a bunch of map output files from some trackers, which would exercise the case where the corresponding maps are killed and re-executed elsewhere.

Network failures are more or less handled in my proposal, where we STOP/CONT trackers at will. STOP can be seen as faking a network failure, while CONT can be seen as a network recovery.

> Large-scale reliability tests
> -----------------------------
>
>                 Key: HADOOP-2483
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2483
>             Project: Hadoop Core
>          Issue Type: Test
>          Components: mapred
>            Reporter: Arun C Murthy
>            Assignee: Devaraj Das
>             Fix For: 0.20.0
>
>
> The fact that we do not have any large-scale reliability tests bothers me.
> I'll be first to admit that it isn't the easiest of tasks, but I'd like to
> start a discussion around this... especially given that the code-base is
> growing to an extent that interactions due to small changes are very hard to
> predict.
> One of the simple scripts I run for every patch I work on does something very
> simple: run sort500 (or greater), then it randomly picks n tasktrackers from
> ${HADOOP_CONF_DIR}/conf/slaves and then kills them; a similar script
> kills and restarts the tasktrackers.
> This helps in checking a fair number of reliability stories: lost
> tasktrackers, task-failures etc. Clearly this isn't good enough to cover
> everything, but it's a start.
> Let's discuss - What do we do for HDFS? We need more for Map-Reduce!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
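
The STOP/CONT idea from the comment above could be sketched roughly as follows. This is only an illustration, not the script proposed for this issue: the function names `pick_victims` and `suspend_and_resume` are hypothetical, and the sketch assumes the tracker process IDs are already known on the local host. Reaching trackers on remote nodes would need an extra step that is not shown here.

```python
import os
import random
import signal
import time


def pick_victims(trackers, n, rng=random):
    """Choose up to n trackers at random, without repeats."""
    trackers = list(trackers)
    return rng.sample(trackers, min(n, len(trackers)))


def suspend_and_resume(pids, n, outage_seconds,
                       send=os.kill, sleep=time.sleep, rng=random):
    """Suspend n random tracker processes, wait, then resume them.

    `send` and `sleep` are injectable so the logic can be exercised
    without touching real processes (hypothetical design choice).
    """
    victims = pick_victims(pids, n, rng)
    for pid in victims:
        # A stopped tracker sends no heartbeats: a faked network failure.
        send(pid, signal.SIGSTOP)
    sleep(outage_seconds)
    for pid in victims:
        # Resuming the process plays the role of the network recovery.
        send(pid, signal.SIGCONT)
    return victims
```

With the defaults, `suspend_and_resume(pids, 3, 60)` would signal real local processes through `os.kill`, so it should only be pointed at processes that are safe to freeze.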