[ https://issues.apache.org/jira/browse/HADOOP-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654255#action_12654255 ]

Steve Loughran commented on HADOOP-2483:
----------------------------------------

This is interesting to me; I'm effectively doing some of this in our codebase 
as we try out some of the lifecycle work, but my tests still only bring up and 
stress a functional cluster; they don't yet test how that cluster copes with 
various failure modes, such as:
* transient loss of the namenode
* loss of 10%, 20%, 30%, 50%, or 50%+ of the workers, through either outages 
or network partitioning
* DNS playing up. Because it will, you know :)
* JT and TT failures
* MR job progress when namenodes start failing
There is also performance testing to consider. (A rough sketch of the 
worker-loss case follows.)
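
For illustration only, here is a minimal sketch of the worker-loss case. It 
assumes passwordless ssh to every slave, the usual $HADOOP_HOME layout, and 
the stock bin/hadoop-daemon.sh control script; the workload, jar name and 
paths are placeholders, not anything in the current codebase:

#!/usr/bin/env python
# Illustration-only sketch of a failure-injection test: start a workload,
# kill a random fraction of the tasktrackers mid-flight, and check that the
# job still completes.  Assumes passwordless ssh to every slave and the
# stock bin/hadoop-daemon.sh script under $HADOOP_HOME on each host.
import os
import random
import subprocess
import time

HADOOP_HOME = os.environ.get("HADOOP_HOME", "/opt/hadoop")  # assumed layout
SLAVES_FILE = os.path.join(HADOOP_HOME, "conf", "slaves")
KILL_FRACTION = 0.2      # lose 20% of the workers
SETTLE_SECONDS = 60      # give the JT time to schedule tasks first

def slaves():
    with open(SLAVES_FILE) as f:
        return [line.strip() for line in f if line.strip()]

def daemon(host, action, name):
    # e.g. ssh host $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
    subprocess.check_call(["ssh", host,
                           "%s/bin/hadoop-daemon.sh" % HADOOP_HOME,
                           action, name])

def main():
    # Kick off a long-running workload; the examples jar name is a placeholder.
    job = subprocess.Popen(["%s/bin/hadoop" % HADOOP_HOME, "jar",
                            "%s/hadoop-examples.jar" % HADOOP_HOME,
                            "randomwriter", "/failure-test/input"])
    time.sleep(SETTLE_SECONDS)

    hosts = slaves()
    victims = random.sample(hosts, max(1, int(len(hosts) * KILL_FRACTION)))
    for host in victims:
        daemon(host, "stop", "tasktracker")   # simulate worker loss

    rc = job.wait()                           # the job should still succeed
    for host in victims:
        daemon(host, "start", "tasktracker")  # restore the cluster
    print("job exit code: %d (expected 0 despite %d lost trackers)"
          % (rc, len(victims)))

if __name__ == "__main__":
    main()

The same skeleton covers the namenode bullet: stop and restart the namenode 
instead of the tasktrackers and watch whether the in-flight MR job recovers.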

Paper to read: 
http://googletesting.blogspot.com/2008/05/performance-testing-of-distributed-file.html


> Large-scale reliability tests
> -----------------------------
>
>                 Key: HADOOP-2483
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2483
>             Project: Hadoop Core
>          Issue Type: Test
>          Components: mapred
>            Reporter: Arun C Murthy
>            Assignee: Devaraj Das
>             Fix For: 0.20.0
>
>
> The fact that we do not have any large-scale reliability tests bothers me. 
> I'll be the first to admit that it isn't the easiest of tasks, but I'd like 
> to start a discussion around this... especially given that the code-base is 
> growing to the point where the interactions caused by small changes are very 
> hard to predict.
> One of the simple scripts I run for every patch I work on does something very 
> simple: it runs sort500 (or greater), randomly picks n tasktrackers from 
> ${HADOOP_CONF_DIR}/conf/slaves and then kills them; a similar script 
> kills and then restarts the tasktrackers. 
> This helps check a fair number of reliability stories: lost tasktrackers, 
> task failures, etc. Clearly this isn't enough to cover everything, but it's 
> a start.
> Let's discuss: what do we do for HDFS? We need more for Map-Reduce!
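
For reference, a minimal sketch of a kill-and-restart churn script along the 
lines described in the quoted issue (not the original script; passwordless 
ssh, the $HADOOP_HOME layout and bin/hadoop-daemon.sh on each slave are 
assumptions):

#!/usr/bin/env python
# Sketch of a kill-and-restart churn script along the lines described in the
# quoted issue above (not the original script).  Assumes passwordless ssh and
# the stock bin/hadoop-daemon.sh on each slave; run it while the sort is in
# flight.
import os
import random
import subprocess
import time

HADOOP_HOME = os.environ.get("HADOOP_HOME", "/opt/hadoop")  # assumed layout
with open(os.path.join(HADOOP_HOME, "conf", "slaves")) as f:
    SLAVES = [line.strip() for line in f if line.strip()]

def bounce(host, pause=30):
    """Stop the tasktracker on `host`, wait, then bring it back."""
    base = ["ssh", host, "%s/bin/hadoop-daemon.sh" % HADOOP_HOME]
    subprocess.check_call(base + ["stop", "tasktracker"])
    time.sleep(pause)
    subprocess.check_call(base + ["start", "tasktracker"])

if __name__ == "__main__":
    n = 3             # trackers to churn each round; pick to taste
    while True:       # keep churning until the operator stops the script
        for host in random.sample(SLAVES, min(n, len(SLAVES))):
            bounce(host)
        time.sleep(120)   # let the job make progress between rounds

Running it alongside the sort, and varying n and the sleeps, exercises the 
lost-tasktracker and task-failure stories at different rates.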

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
