[ 
https://issues.apache.org/jira/browse/TEZ-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013100#comment-14013100
 ] 

Tassapol Athiapinya commented on TEZ-1060:
------------------------------------------

This means the limit of 4 max failed attempts is counted independently for 
each task within the same vertex. That clarifies it. Thanks! [~bikassaha] 

> Add randomness to fault tolerance tests
> ---------------------------------------
>
>                 Key: TEZ-1060
>                 URL: https://issues.apache.org/jira/browse/TEZ-1060
>             Project: Apache Tez
>          Issue Type: Sub-task
>    Affects Versions: 0.5.0
>            Reporter: Tassapol Athiapinya
>            Assignee: Tassapol Athiapinya
>         Attachments: TEZ-1060.1.patch, TEZ-1060.2.patch
>
>
> TestFaultTolerance contains unit tests that check whether the AM correctly 
> handles processor failures and input failures. It uses TestProcessor and 
> TestInput to simulate controlled failure scenarios for a DAG. On the 
> processor side, each test selects which tasks fail (do-fail), which physical 
> task indexes fail (failing-task-index), and up to which attempt those 
> physical tasks fail (failing-upto-task-attempt). On the input side, each 
> test selects which tasks have failed inputs (do-fail), which physical task 
> indexes fail (failing-task-index), up to which attempt those physical tasks 
> have failed inputs (failing-task-attempt), which physical inputs fail 
> (failing-input-index), and up to which version of a physical input the tasks 
> reject (failing-upto-input-attempt). In addition to task and input failures, 
> the tests check values at specific physical tasks to confirm that the inputs 
> of downstream vertices match the outputs of upstream vertices (verify-value, 
> verify-task-index). These tests were added during 0.3.0 and 0.4.0. They 
> helped us find and fix several issues in the Tez AM and improve its 
> stability. Although the current unit tests are useful, they are limited to 
> scenarios carefully chosen by individual contributors. When Tez is used 
> under heavy load, more issues are likely to arise. To take the fault 
> tolerance tests to the next level, we should add tests that generate 
> randomized failure scenarios. Each time a contributor runs the unit tests, a 
> new scenario will be generated, which gives the community more 
> opportunities to report and fix new issues.
> There are a few criteria for the new tests:
> - We want to keep the time needed to run the unit tests minimal. Contributors 
> run different hardware, and it is inconvenient if people with slow machines 
> have to spend too much time running tests for every patch.
> - Random scenarios need to be controlled enough that the expected behavior 
> is known. This means the parameters have to be validated by the test itself 
> first.
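
The second criterion above (random, but validated so the expected outcome is known) could be sketched roughly as follows. This is only an illustration, not code from the TEZ-1060 patches: the class RandomFailureScenario and its methods are hypothetical names, the fields merely mirror the roles of failing-task-index and failing-upto-task-attempt, and the sketch assumes attempts are numbered from 0 and that every attempt up to and including the chosen value fails.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper for illustration; not a Tez class and not from the TEZ-1060 patches. */
class RandomFailureScenario {
    final int numTasks;
    final List<Integer> failingTaskIndexes; // plays the role of failing-task-index
    final int failingUptoTaskAttempt;       // plays the role of failing-upto-task-attempt

    RandomFailureScenario(int numTasks, List<Integer> failingTaskIndexes,
                          int failingUptoTaskAttempt) {
        this.numTasks = numTasks;
        this.failingTaskIndexes =
            Collections.unmodifiableList(new ArrayList<>(failingTaskIndexes));
        this.failingUptoTaskAttempt = failingUptoTaskAttempt;
    }

    /** The same seed yields the same scenario, so a failing run can be replayed. */
    static RandomFailureScenario generate(long seed, int numTasks, int maxFailedAttempts) {
        if (numTasks < 1 || maxFailedAttempts < 2) {
            throw new IllegalArgumentException(
                "need numTasks >= 1 and maxFailedAttempts >= 2");
        }
        Random random = new Random(seed);
        List<Integer> failing = new ArrayList<>();
        for (int task = 0; task < numTasks; task++) {
            if (random.nextBoolean()) {
                failing.add(task);
            }
        }
        // Assumption: attempts 0..failingUpto all fail, i.e. failingUpto + 1 failures.
        // Drawing from [0, maxFailedAttempts - 2] keeps the failure count below the
        // limit, so each task still has an attempt that is allowed to succeed.
        int failingUpto = random.nextInt(maxFailedAttempts - 1);
        RandomFailureScenario scenario =
            new RandomFailureScenario(numTasks, failing, failingUpto);
        scenario.validate(maxFailedAttempts);
        return scenario;
    }

    /** Rejects parameters for which "the DAG succeeds" is not the known expected result. */
    void validate(int maxFailedAttempts) {
        for (int index : failingTaskIndexes) {
            if (index < 0 || index >= numTasks) {
                throw new IllegalStateException("task index out of range: " + index);
            }
        }
        if (failingUptoTaskAttempt < 0 || failingUptoTaskAttempt + 1 >= maxFailedAttempts) {
            throw new IllegalStateException(
                "failing-upto-task-attempt would exhaust the attempt limit");
        }
    }
}
```

A test could call RandomFailureScenario.generate(seed, 5, 4), log the seed, and map the fields onto the test configuration. Keeping the task count small is one way to respect the first criterion about test run time.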



--
This message was sent by Atlassian JIRA
(v6.2#6252)
