[ https://issues.apache.org/jira/browse/TEZ-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013100#comment-14013100 ]
Tassapol Athiapinya commented on TEZ-1060:
------------------------------------------

So the maximum of 4 failed attempts is counted independently for each task within the same vertex. That clarifies it. Thanks! [~bikassaha]

> Add randomness to fault tolerance tests
> ---------------------------------------
>
>                 Key: TEZ-1060
>                 URL: https://issues.apache.org/jira/browse/TEZ-1060
>             Project: Apache Tez
>          Issue Type: Sub-task
>    Affects Versions: 0.5.0
>            Reporter: Tassapol Athiapinya
>            Assignee: Tassapol Athiapinya
>         Attachments: TEZ-1060.1.patch, TEZ-1060.2.patch
>
>
> We have TestFaultTolerance, a set of unit tests that check whether the AM
> correctly handles processor failures and input failures.
> TestFaultTolerance uses TestProcessor and TestInput to simulate a
> controlled failure scenario for a DAG. In each test, on the processor side,
> we select which tasks fail (do-fail), which physical task indexes fail
> (failing-task-index), and up to which attempt these physical tasks fail
> (failing-upto-task-attempt). On the input side, we select which tasks have
> failed inputs (do-fail), which physical task indexes fail
> (failing-task-index), up to which attempt these physical tasks have failed
> input (failing-task-attempt), which physical inputs fail
> (failing-input-index), and up to which version of the physical inputs the
> tasks reject (failing-upto-input-attempt). In addition to task failures and
> input failures, we also check the values of specific physical tasks to see
> whether the inputs of downstream vertices match the outputs of upstream
> vertices (verify-value, verify-task-index). These tests were added during
> 0.3.0 and 0.4.0. They helped us find several issues in the Tez AM, fix them,
> and improve its stability.
> Though the current unit tests are useful, they are limited to scenarios
> carefully chosen by individual contributors. When Tez is used under heavy
> load, more issues are likely to arise. To take the fault tolerance tests to
> the next level, we should add tests that generate randomized failure
> scenarios. Each time a contributor runs the unit tests, a new scenario will
> be generated. This gives the community more opportunity to report and fix
> new issues.
> There are a few criteria for the new tests:
> - We want to keep the time needed to run the unit tests minimal. Each
> contributor runs different hardware. It is inconvenient if people with slow
> machines need to spend too much time running tests for every patch.
> - A random scenario needs to be controlled enough that its expected
> behavior is known. This means the parameters have to be validated by the
> test itself first.
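
To make the two criteria concrete, here is a rough sketch of how a seeded generator might produce and self-validate a processor-failure scenario. It is not the attached patch and not Tez code: the class RandomFailureScenario and all of its fields and methods are hypothetical names made up for illustration. It also assumes that failing-upto-task-attempt N means attempts 0 through N fail, and that a task is given up on after maxAttempts failed attempts (4 in the comment above). Logging the seed lets a contributor replay whatever scenario exposed a bug.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical sketch of a randomized, self-validating failure scenario. */
final class RandomFailureScenario {
    final long seed;                              // log this to reproduce a run
    final boolean doFail;                         // mirrors do-fail
    final SortedSet<Integer> failingTaskIndexes;  // mirrors failing-task-index
    final int failingUptoTaskAttempt;             // mirrors failing-upto-task-attempt, -1 if unused

    private RandomFailureScenario(long seed, boolean doFail,
                                  SortedSet<Integer> failingTaskIndexes,
                                  int failingUptoTaskAttempt) {
        this.seed = seed;
        this.doFail = doFail;
        this.failingTaskIndexes = failingTaskIndexes;
        this.failingUptoTaskAttempt = failingUptoTaskAttempt;
    }

    /** Builds a scenario from the seed, then validates it before returning. */
    static RandomFailureScenario generate(long seed, int numTasks, int maxAttempts) {
        if (numTasks < 1 || maxAttempts < 2) {
            throw new IllegalArgumentException("need numTasks >= 1 and maxAttempts >= 2");
        }
        Random rnd = new Random(seed);
        boolean doFail = rnd.nextBoolean();
        SortedSet<Integer> failing = new TreeSet<>();
        int upto = -1;
        if (doFail) {
            // Shuffle all task indexes and keep a random-sized, non-empty prefix.
            List<Integer> indexes = new ArrayList<>();
            for (int i = 0; i < numTasks; i++) {
                indexes.add(i);
            }
            Collections.shuffle(indexes, rnd);
            int count = 1 + rnd.nextInt(numTasks);
            failing.addAll(indexes.subList(0, count));
            // Assumption: attempts 0..upto fail, so keep upto <= maxAttempts - 2
            // to leave at least one attempt that can succeed.
            upto = rnd.nextInt(maxAttempts - 1);
        }
        RandomFailureScenario s = new RandomFailureScenario(seed, doFail, failing, upto);
        s.validate(numTasks, maxAttempts);
        return s;
    }

    /** Rejects scenarios whose expected outcome would not be "DAG succeeds". */
    void validate(int numTasks, int maxAttempts) {
        for (int idx : failingTaskIndexes) {
            if (idx < 0 || idx >= numTasks) {
                throw new IllegalStateException("task index out of range: " + idx);
            }
        }
        if (doFail && (failingTaskIndexes.isEmpty()
                || failingUptoTaskAttempt < 0
                || failingUptoTaskAttempt > maxAttempts - 2)) {
            throw new IllegalStateException("scenario would not recover, seed=" + seed);
        }
    }

    /** Failed attempts expected for each failing task, counted per task. */
    int expectedFailedAttemptsPerTask() {
        return doFail ? failingUptoTaskAttempt + 1 : 0;
    }

    public static void main(String[] args) {
        long seed = System.nanoTime();
        RandomFailureScenario s = generate(seed, 5, 4);
        System.out.println("seed=" + s.seed + " doFail=" + s.doFail
                + " failing-task-index=" + s.failingTaskIndexes
                + " failing-upto-task-attempt=" + s.failingUptoTaskAttempt);
    }
}
```

Generating only a handful of small integers keeps the added test time negligible, and validating inside generate() means every scenario handed to a test already has a known expected result.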