[ https://issues.apache.org/jira/browse/TEZ-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13972317#comment-13972317 ]
Tassapol Athiapinya commented on TEZ-1060:
------------------------------------------

Attached TEZ-1060.1.patch.

One new test is added for randomized task failures. The test is ignored by default because of its impact on test run time. It uses SixLevelsFailingDAG and takes 1-2 minutes to finish. I personally feel that is slow, but it is unavoidable: the DAG used for a fault tolerance test needs to be large enough, or it would defeat the purpose of testing. I have run the test 6-7 times and it passed every time.

The algorithm for generating random failing-task parameters should be solid. First it randomly picks the number of failing vertices in the DAG, which is configurable. Then it randomly picks which vertices should fail. Once the vertices are known, it randomly picks the number of tasks to fail for each vertex. The number of tasks must be smaller than TezConfiguration.TEZ_AM_TASK_MAX_FAILED_ATTEMPTS_DEFAULT, otherwise the AM would see too many failed attempts on that particular vertex and fail the DAG. After it picks the number of failing tasks, it randomly picks the physical task indices. The algorithm can generate non-sequential task indices. Lastly, it fails as many attempts as possible while staying within the maximum number of failed attempts the AM accepts.

> Add randomness to fault tolerance tests
> ---------------------------------------
>
>                 Key: TEZ-1060
>                 URL: https://issues.apache.org/jira/browse/TEZ-1060
>             Project: Apache Tez
>          Issue Type: Sub-task
>    Affects Versions: 0.5.0
>            Reporter: Tassapol Athiapinya
>            Assignee: Tassapol Athiapinya
>         Attachments: TEZ-1060.1.patch
>
>
> We have TestFaultTolerance, a set of unit tests that check whether the AM
> correctly handles processor failures and input failures.
> TestFaultTolerance uses TestProcessor and TestInput to simulate controlled
> failure scenarios for a DAG.
> In each test, on the processor side, we select which tasks fail (do-fail),
> which physical task indexes fail (failing-task-index) and up to which
> attempt these physical tasks fail (failing-upto-task-attempt). On the input
> side, we select which tasks have failed inputs (do-fail), which physical
> task indexes fail (failing-task-index), up to which attempt these physical
> tasks have failed input (failing-task-attempt), which physical inputs fail
> (failing-input-index) and up to which version of the physical inputs the
> tasks reject (failing-upto-input-attempt). In addition to task failures and
> input failures, we also check values of specific physical tasks to see
> whether inputs of downstream vertices match outputs of upstream vertices
> (verify-value, verify-task-index). These tests were added during 0.3.0 and
> 0.4.0. With them we found several issues in the Tez AM, fixed them, and
> improved the stability of the Tez AM.
> Though the current unit tests are useful, they are limited to scenarios
> carefully chosen by individual contributors. When Tez is used under heavy
> load, more issues are likely to arise. To bring the fault tolerance tests to
> a new level, we should add tests that generate randomized failure
> scenarios. Each time a contributor runs the unit tests, a new scenario will
> be generated. This gives the community more opportunity to report and fix
> new issues.
> There are a few criteria for the new tests:
> - We want to keep the time used to run unit tests minimal. Each contributor
> runs different hardware. It is inconvenient if people with slow machines
> need to spend too much time running tests for any patch.
> - A random scenario needs to be controlled enough that the expected
> behavior is known. This means the parameters have to be validated by the
> test itself first.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
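
For readers who want to see the shape of the parameter generation described in the comment, here is a minimal sketch in plain Java. It is not the code in TEZ-1060.1.patch. The class RandomFailurePlan, the holder VertexFailure, the method generate, the vertex names and the task counts are all hypothetical. It also assumes that task attempts are numbered from 0 and that a task fails on every attempt up to and including failing-upto-task-attempt, so the largest safe value is maxFailedAttempts - 2.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;
import java.util.TreeSet;

public class RandomFailurePlan {

    /** Failure settings for one vertex (hypothetical holder, not a Tez class). */
    public static final class VertexFailure {
        public final TreeSet<Integer> failingTaskIndices;
        public final int failingUptoTaskAttempt;

        VertexFailure(TreeSet<Integer> failingTaskIndices, int failingUptoTaskAttempt) {
            this.failingTaskIndices = failingTaskIndices;
            this.failingUptoTaskAttempt = failingUptoTaskAttempt;
        }

        @Override
        public String toString() {
            return "failing-task-index=" + failingTaskIndices
                    + ", failing-upto-task-attempt=" + failingUptoTaskAttempt;
        }
    }

    public static Map<String, VertexFailure> generate(Map<String, Integer> tasksPerVertex,
            int maxFailingVertices, int maxFailedAttempts, Random rng) {
        if (tasksPerVertex.isEmpty() || maxFailingVertices < 1 || maxFailedAttempts < 2) {
            throw new IllegalArgumentException("need a vertex, a positive bound, and >= 2 attempts");
        }
        // Sort the names so that a fixed seed always reproduces the same scenario.
        List<String> vertices = new ArrayList<>(new TreeSet<>(tasksPerVertex.keySet()));

        // Pick how many vertices fail, then which ones.
        int vertexCount = 1 + rng.nextInt(Math.min(maxFailingVertices, vertices.size()));
        Collections.shuffle(vertices, rng);

        Map<String, VertexFailure> plan = new TreeMap<>();
        for (String vertex : vertices.subList(0, vertexCount)) {
            int numTasks = tasksPerVertex.get(vertex);
            if (numTasks < 1) {
                throw new IllegalArgumentException("vertex " + vertex + " has no tasks");
            }
            // Number of failing tasks, kept strictly below maxFailedAttempts.
            int taskCount = 1 + rng.nextInt(Math.min(numTasks, maxFailedAttempts - 1));

            // Shuffle all indices and keep a prefix; the result may be non-sequential.
            List<Integer> indices = new ArrayList<>();
            for (int i = 0; i < numTasks; i++) {
                indices.add(i);
            }
            Collections.shuffle(indices, rng);
            TreeSet<Integer> failing = new TreeSet<>(indices.subList(0, taskCount));

            // Assumption: attempts are 0-based and fail up to and including this value,
            // so maxFailedAttempts - 2 leaves the last allowed attempt free to succeed.
            plan.put(vertex, new VertexFailure(failing, maxFailedAttempts - 2));
        }
        return plan;
    }

    public static void main(String[] args) {
        // Hypothetical six-vertex DAG with four tasks per vertex.
        Map<String, Integer> dag = new TreeMap<>();
        for (int level = 1; level <= 6; level++) {
            dag.put("l" + level + "v1", 4);
        }
        long seed = System.nanoTime();
        System.out.println("seed=" + seed); // log the seed so a failing run can be replayed
        generate(dag, 3, 4, new Random(seed))
                .forEach((vertex, failure) -> System.out.println(vertex + ": " + failure));
    }
}
```

Passing the Random in from the caller and logging its seed is one way to meet the second criterion in the issue description: the scenario differs on each run, yet any run can be replayed and its parameters checked before the DAG is submitted.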