[ https://issues.apache.org/jira/browse/HADOOP-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646917#action_12646917 ]
Devaraj Das commented on HADOOP-2483:
-------------------------------------

I have started looking at this. Some thoughts:

1) Have a script that launches a couple of large randomwriter/sort/sortvalidator jobs (from the command line, using Java). Some of these jobs would turn speculative execution on.
1.1) Some randomwriter jobs would generate 5x the amount of data per map. The sort over such data might use a very high value for mapred.min.split.size, leading to large reduce partitions.
2) Have another script that queries the JobTracker (via another Java program) for the list of TaskTrackers. It would randomly issue SIGSTOP (via ssh) to a bunch of trackers. After a certain period, the JobTracker would mark these trackers as lost. The script would then send SIGCONT to the same processes, allowing those trackers to rejoin the cluster.
3) Have a script that gets the task reports from the JobTracker and kills/fails a bunch of random tasks.

(2) and (3) could be done multiple times. The test is to see whether the jobs launched by the first script all complete successfully in the presence of such random failures. It would also exercise the JobTracker's reliability w.r.t. handling a couple of large jobs. Similarly, the map and reduce tasks would be tested for reliability w.r.t. handling big inputs, and the shuffle would be stressed as well (especially due to 1.1).

Going one step further, a script could grep for exceptions in the log files generated (JobTracker, TTs, and tasks) and archive them on the client machine for someone to look at (some exceptions could be indicators of bugs).

These are some early thoughts I had. Please chime in with suggestions here. Rough Java sketches of (1) and of (2)/(3) are appended after the quoted issue below.

> Large-scale reliability tests
> -----------------------------
>
>                 Key: HADOOP-2483
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2483
>             Project: Hadoop Core
>          Issue Type: Test
>          Components: mapred
>            Reporter: Arun C Murthy
>             Fix For: 0.20.0
>
>
> The fact that we do not have any large-scale reliability tests bothers me.
> I'll be the first to admit that it isn't the easiest of tasks, but I'd like to
> start a discussion around this... especially given that the code-base is
> growing to an extent that interactions due to small changes are very hard to
> predict.
> One of the simple scripts I run for every patch I work on does something very
> simple: run sort500 (or greater), then randomly pick n tasktrackers from
> ${HADOOP_CONF_DIR}/conf/slaves and kill them; a similar script kills and
> restarts the tasktrackers.
> This helps in checking a fair number of reliability stories: lost
> tasktrackers, task failures, etc. Clearly this isn't enough to cover
> everything, but it's a start.
> Let's discuss - what do we do for HDFS? We need more for Map-Reduce!
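To make (1) a bit more concrete, here is a rough, untested sketch of driving randomwriter and sort from Java. The class name, output paths, and sizes are made up for illustration; it assumes the example drivers in the hadoop examples jar (org.apache.hadoop.examples.RandomWriter / Sort) and the 0.20-era property names for speculative execution and mapred.min.split.size.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.RandomWriter;
import org.apache.hadoop.examples.Sort;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver for (1): launch randomwriter followed by sort,
// with speculative execution turned on for both.
public class ReliabilityJobs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapred.map.tasks.speculative.execution", true);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);

    // Write the random data (output path is illustrative).
    int rc = ToolRunner.run(conf, new RandomWriter(),
        new String[] { "/reliability/rw-input" });
    if (rc != 0) {
      System.exit(rc);
    }

    // Per (1.1): run the sort with a very high mapred.min.split.size.
    conf.setLong("mapred.min.split.size", 2L * 1024 * 1024 * 1024);
    rc = ToolRunner.run(conf, new Sort(),
        new String[] { "/reliability/rw-input", "/reliability/sort-output" });
    System.exit(rc);
  }
}
{code}

A sortvalidator run could be chained on the end in the same way once the sort exits with 0.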
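And a rough sketch of the fault-injection side, (2) and (3). This is illustrative only: the class name, host-name parsing, victim count, and sleep interval are invented, and it assumes the 0.20-era org.apache.hadoop.mapred client API (JobClient.getClusterStatus(true) exposing active tracker names, TaskReport listing its running attempts, and RunningJob.killTask). The SIGSTOP/SIGCONT delivery assumes passwordless ssh to the slave nodes.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TaskAttemptID;
import org.apache.hadoop.mapred.TaskReport;

// Hypothetical driver for (2) and (3): suspend/resume a few TaskTrackers
// and kill/fail a few random task attempts of the running jobs.
public class FaultInjector {
  public static void main(String[] args) throws Exception {
    JobClient client = new JobClient(new JobConf());
    Random rand = new Random();

    // (2) Pick a few live trackers and suspend their TaskTracker JVMs via ssh.
    ClusterStatus status = client.getClusterStatus(true);
    List<String> trackers = new ArrayList<String>(status.getActiveTrackerNames());
    Collections.shuffle(trackers, rand);
    List<String> victims = trackers.subList(0, Math.min(3, trackers.size()));
    for (String tracker : victims) {
      signal(hostOf(tracker), "STOP");
    }
    // Wait past the tracker expiry interval so the JobTracker marks them lost,
    // then resume them so they rejoin the cluster.
    Thread.sleep(15 * 60 * 1000L);
    for (String tracker : victims) {
      signal(hostOf(tracker), "CONT");
    }

    // (3) Kill or fail roughly 10% of the map tasks of every incomplete job.
    for (JobStatus js : client.jobsToComplete()) {
      RunningJob job = client.getJob(js.getJobID());
      if (job == null) {
        continue;
      }
      for (TaskReport report : client.getMapTaskReports(js.getJobID())) {
        if (rand.nextInt(10) == 0) {
          for (TaskAttemptID attempt : report.getRunningTaskAttempts()) {
            job.killTask(attempt, rand.nextBoolean()); // true => fail, false => kill
          }
        }
      }
    }
  }

  // Tracker names typically look like "tracker_<host>:<rpc address>"; pull out the host.
  private static String hostOf(String trackerName) {
    return trackerName.split("_")[1].split(":")[0];
  }

  // Suspend or resume every TaskTracker process on the host over ssh.
  private static void signal(String host, String sig) throws Exception {
    new ProcessBuilder("ssh", host, "pkill", "-" + sig, "-f", "TaskTracker")
        .start().waitFor();
  }
}
{code}

The same loop over getReduceTaskReports() would cover the reduces, and running the whole thing repeatedly gives the multiple rounds of (2)/(3) mentioned above.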