[ https://issues.apache.org/jira/browse/HADOOP-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647279#action_12647279 ]

Sharad Agarwal commented on HADOOP-2483:
----------------------------------------

Perhaps we can install the error-injection code with each daemon 
(Datanode/TaskTracker). The code would be triggered by a random function 
driven by a cluster error-injection ratio, say: number of nodes to inject 
errors on / total nodes in the cluster. The default package would inject 
system-level errors; if required, each daemon could extend it to inject its 
own, more granular errors.
This way error generation would be decentralized and could be controlled via 
config params, avoiding the need to fetch the slaves list for a cluster and 
inject errors from a single client.
The question is whether we need that kind of extensibility, or whether a few 
error types are enough. Thoughts?
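
To make the idea concrete, here is a minimal sketch of such an injector. All 
names below (the class, the config param, the halt-based failure) are 
hypothetical illustrations of the proposal, not existing Hadoop code:

{code}
import java.util.Random;

/**
 * Hypothetical sketch of the decentralized error-injection idea. Each
 * daemon would run an instance alongside its normal work, e.g. calling
 * maybeInject() from a periodic heartbeat/monitor thread.
 */
public class ErrorInjector {

  // Cluster error-injection ratio: nodes to inject errors on / total nodes.
  // Would come from a (hypothetical) config param such as
  // "test.error.injection.ratio".
  private final double injectionRatio;
  private final Random random = new Random();

  public ErrorInjector(double injectionRatio) {
    this.injectionRatio = injectionRatio;
  }

  /**
   * Fires with probability injectionRatio on each call, so across the
   * cluster roughly that fraction of nodes inject an error per round.
   */
  public void maybeInject() {
    if (random.nextDouble() < injectionRatio) {
      injectError();
    }
  }

  /**
   * Default: a system-level error (here, simulating a node crash by
   * halting the JVM). A Datanode/TaskTracker could override this to
   * inject its own, more granular errors (dropped RPCs, corrupt blocks,
   * failed tasks, ...).
   */
  protected void injectError() {
    Runtime.getRuntime().halt(1);
  }
}
{code}

A TaskTracker, for instance, could subclass this and override injectError() 
to fail a random running task instead of killing the whole process.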


> Large-scale reliability tests
> -----------------------------
>
>                 Key: HADOOP-2483
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2483
>             Project: Hadoop Core
>          Issue Type: Test
>          Components: mapred
>            Reporter: Arun C Murthy
>            Assignee: Devaraj Das
>             Fix For: 0.20.0
>
>
> The fact that we do not have any large-scale reliability tests bothers me. 
> I'll be the first to admit that it isn't the easiest of tasks, but I'd like 
> to start a discussion around this... especially given that the code-base is 
> start a discussion around this... especially given that the code-base is 
> growing to an extent that interactions due to small changes are very hard to 
> predict.
> One of the simple scripts I run for every patch I work on does something 
> very simple: it runs sort500 (or greater), randomly picks n tasktrackers 
> from ${HADOOP_CONF_DIR}/conf/slaves, and kills them; a similar script 
> kills and then restarts the tasktrackers (see the sketch after this 
> description). 
> This helps in checking a fair number of reliability stories: lost 
> tasktrackers, task failures, etc. Clearly this isn't good enough to cover 
> everything, but it's a start.
> Let's discuss - what do we do for HDFS? We need more for Map-Reduce!
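
For concreteness, the kind of kill script described above could look like the 
sketch below (written in Java here rather than shell, to keep one language in 
this note; the actual scripts are not attached to this issue). The pkill 
target and the passwordless-ssh assumption are illustrative only:

{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

/**
 * Illustrative sketch, not the actual script: kill the TaskTracker on n
 * randomly chosen slaves. Assumes passwordless ssh to each slave and that
 * "TaskTracker" matches only the tasktracker process on the node.
 */
public class KillRandomTaskTrackers {
  public static void main(String[] args)
      throws IOException, InterruptedException {
    int n = Integer.parseInt(args[0]);
    String slavesFile = System.getenv("HADOOP_CONF_DIR") + "/conf/slaves";

    // Shuffle the slaves list and kill the tasktracker on the first n.
    List<String> slaves = Files.readAllLines(Paths.get(slavesFile));
    Collections.shuffle(slaves);
    for (String host : slaves.subList(0, Math.min(n, slaves.size()))) {
      new ProcessBuilder("ssh", host, "pkill", "-f", "TaskTracker")
          .inheritIO().start().waitFor();
    }
  }
}
{code}

A restart-style variant would additionally ssh back in after a delay and 
start the tasktracker again, exercising the lost-and-rejoined case as well.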

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
