[
https://issues.apache.org/jira/browse/TEZ-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233713#comment-14233713
]
Tassapol Athiapinya commented on TEZ-1060:
------------------------------------------
Thanks Bikas for review. Committed to master.
{code}
commit ac26ade401d49a5c772e9f85312f1c5b5f952e8d
Author: Tassapol Athiapinya <[email protected]>
Date: Wed Dec 3 16:35:20 2014 -0800
TEZ-1060 Add randomness to fault tolerance tests
{code}
> Add randomness to fault tolerance tests
> ---------------------------------------
>
> Key: TEZ-1060
> URL: https://issues.apache.org/jira/browse/TEZ-1060
> Project: Apache Tez
> Issue Type: Sub-task
> Affects Versions: 0.5.0
> Reporter: Tassapol Athiapinya
> Assignee: Tassapol Athiapinya
> Attachments: TEZ-1060.1.patch, TEZ-1060.2.patch, TEZ-1060.3.patch,
> TEZ-1060.4.patch, TEZ-1060.5.patch
>
>
> TestFaultTolerance contains unit tests that check whether the AM correctly
> handles processor failures and input failures. It uses TestProcessor and
> TestInput to simulate a controlled failure scenario for a DAG. In each test,
> on the processor side, we select which tasks fail (do-fail), which physical
> task indexes fail (failing-task-index) and up to which attempt these physical
> tasks fail (failing-upto-task-attempt). On the input side, we select which
> tasks have failed inputs (do-fail), which physical task indexes fail
> (failing-task-index), up to which attempt these physical tasks have failed
> inputs (failing-task-attempt), which physical inputs fail
> (failing-input-index) and up to which version of the physical inputs the
> tasks reject (failing-upto-input-attempt). In addition to task failures and
> input failures, we also check the values of specific physical tasks to see
> whether the inputs of downstream vertices match the outputs of upstream
> vertices (verify-value, verify-task-index). These tests were added during
> 0.3.0 and 0.4.0. With them we found several issues in the Tez AM, fixed them
> and improved its stability.
> Although the current unit tests are useful, they are limited to scenarios
> carefully chosen by individual contributors. When Tez is used under heavy
> load, more issues are likely to arise. To bring the fault tolerance tests to
> a new level, we should add tests that generate randomized failure scenarios.
> Each time a contributor runs the unit tests, a new scenario will be
> generated, which gives the community more opportunity to report and fix new
> issues.
> There are a few criteria for the new tests:
> - We want to keep the time needed to run the unit tests minimal. Each
> contributor runs different hardware, and it is inconvenient if people with
> slow machines need to spend too much time running the tests for every patch.
> - A random scenario needs to be controlled enough that its expected behavior
> is known. This means the parameters have to be validated by the test itself
> first.
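
The two criteria above (bounded run time, parameters validated before the expected behavior is relied on) could be sketched in Java roughly as follows. This is an illustrative sketch only, not the code committed for TEZ-1060: the class `RandomFailureScenario`, its fields and the `generate`/`validate` methods are hypothetical names, and it assumes that attempts 0 through failing-upto-task-attempt fail and that the caller knows how many attempts a task is allowed.

```java
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not the TEZ-1060 patch: a randomized processor-failure
// scenario for one vertex, validated before a test would rely on it.
final class RandomFailureScenario {
    final long seed;                        // report this so a failing run can be replayed
    final Set<Integer> failingTaskIndexes;  // plays the role of failing-task-index
    final int failingUptoTaskAttempt;       // plays the role of failing-upto-task-attempt

    private RandomFailureScenario(long seed, Set<Integer> indexes, int upto) {
        this.seed = seed;
        this.failingTaskIndexes = indexes;
        this.failingUptoTaskAttempt = upto;
    }

    // numTasks and maxTaskAttempts are assumed inputs; keeping numTasks small
    // keeps the test fast on slow machines.
    static RandomFailureScenario generate(long seed, int numTasks, int maxTaskAttempts) {
        if (numTasks < 1 || maxTaskAttempts < 2) {
            throw new IllegalArgumentException("need at least 1 task and 2 attempts");
        }
        Random rng = new Random(seed);
        Set<Integer> failing = new TreeSet<>();
        for (int task = 0; task < numTasks; task++) {
            if (rng.nextBoolean()) {
                failing.add(task);
            }
        }
        // Assumption: attempts 0..upto fail. Capping upto at maxTaskAttempts - 2
        // leaves at least one attempt that can succeed.
        int upto = rng.nextInt(maxTaskAttempts - 1);
        RandomFailureScenario scenario = new RandomFailureScenario(seed, failing, upto);
        scenario.validate(numTasks, maxTaskAttempts);
        return scenario;
    }

    // Rejects scenarios whose expected outcome (the DAG succeeds) would not be known.
    void validate(int numTasks, int maxTaskAttempts) {
        for (int task : failingTaskIndexes) {
            if (task < 0 || task >= numTasks) {
                throw new IllegalStateException("task index out of range: " + task);
            }
        }
        if (failingUptoTaskAttempt < 0 || failingUptoTaskAttempt + 1 >= maxTaskAttempts) {
            throw new IllegalStateException("no attempt left to succeed");
        }
    }
}
```

A test built on such a sketch would presumably pick a fresh seed per run and print it on failure, so that a scenario that exposes an AM bug can be regenerated with the same seed when the issue is reported.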