[jira] [Commented] (TEZ-1060) Add randomness to fault tolerance tests

Tassapol Athiapinya (JIRA) Wed, 28 May 2014 19:46:28 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011991#comment-14011991
 ]


Tassapol Athiapinya commented on TEZ-1060:
------------------------------------------

[~bikassaha] Since you have worked on TestProcessor before, could you please 
guide me on this improvement?
My current idea is to move the randomization into TestProcessor, which already 
supports failure configuration, instead of randomizing the configs externally 
as I did in the cancelled patch. I can't figure out the best way to proceed. A 
big part of randomizing is keeping the number of failures per vertex under 4 
(TezConfiguration.TEZ_AM_TASK_MAX_FAILED_ATTEMPTS_DEFAULT) so that the job can 
still succeed. How should we enforce this? There are 2 options for building the 
random logic.
# In TestProcessor.run(..), it is easy to fail randomly (maybe with a 
configurable probability). The problem is that the physical tasks of a vertex 
seem to be independent. If task 0 attempt 0 has already failed, how does 
another attempt of the same task, or another task index, know about that 
failure?
# In TestProcessor.initialize(..), this option is to randomly build 
TEZ_FAILING_PROCESSOR_FAILING_TASK_INDEX and 
TEZ_FAILING_PROCESSOR_FAILING_UPTO_TASK_ATTEMPT values that fall within 4 
attempts. Still, there must be a way to pass the number of task indexes into 
TestProcessor. I can add an additional parameter to getProcDesc(..). 
Is there any other option?
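
One possible answer to the question raised in option 1, offered only as a sketch: if every attempt derives its fail/succeed decision from a shared seed plus its own vertex name, task index and attempt number, then no attempt needs to know what happened to any other attempt. The class name RandomFailureDecider, its parameters, and the idea of passing the seed through the processor's configuration are all assumptions for illustration and are not part of Tez or TestProcessor. The sketch bounds failures per task (the last allowed attempt never fails); whether that is the right bound for this test is also an assumption.

```java
import java.util.Random;

/**
 * Hypothetical helper, not part of Tez. Decides whether a task attempt should
 * fail using only values the attempt can compute by itself.
 */
final class RandomFailureDecider {
    private final long seed;              // assumed to be shared by all tasks of the DAG
    private final double failProbability; // assumed configurable, in [0.0, 1.0]
    private final int maxFailedAttempts;  // e.g. 4, the default mentioned above

    RandomFailureDecider(long seed, double failProbability, int maxFailedAttempts) {
        if (failProbability < 0.0 || failProbability > 1.0) {
            throw new IllegalArgumentException("failProbability must be in [0, 1]");
        }
        if (maxFailedAttempts < 1) {
            throw new IllegalArgumentException("maxFailedAttempts must be at least 1");
        }
        this.seed = seed;
        this.failProbability = failProbability;
        this.maxFailedAttempts = maxFailedAttempts;
    }

    /** The same inputs always give the same answer, so no shared state is needed. */
    boolean shouldFail(String vertexName, int taskIndex, int attemptNumber) {
        // Attempt numbers start at 0. Never fail the last allowed attempt, so a
        // task fails at most maxFailedAttempts - 1 times.
        if (attemptNumber >= maxFailedAttempts - 1) {
            return false;
        }
        // Mix the identifying values into one seed for a throwaway generator.
        long mixed = seed;
        mixed = 31 * mixed + vertexName.hashCode();
        mixed = 31 * mixed + taskIndex;
        mixed = 31 * mixed + attemptNumber;
        return new Random(mixed).nextDouble() < failProbability;
    }
}
```

With this shape, a test would only need to log the seed it picked in order to replay a failing run, and run(..) could call shouldFail(..) with the values of the current attempt.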

> Add randomness to fault tolerance tests
> ---------------------------------------
>
>                 Key: TEZ-1060
>                 URL: https://issues.apache.org/jira/browse/TEZ-1060
>             Project: Apache Tez
>          Issue Type: Sub-task
>    Affects Versions: 0.5.0
>            Reporter: Tassapol Athiapinya
>            Assignee: Tassapol Athiapinya
>         Attachments: TEZ-1060.1.patch, TEZ-1060.2.patch
>
>
> We have TestFaultTolerance for unit tests that check whether the AM 
> correctly handles cases where there are processor failures and input 
> failures. TestFaultTolerance uses TestProcessor and TestInput to simulate 
> controlled failure scenarios for a DAG. In each test, on the processor side, 
> we select which tasks fail (do-fail), which physical task indexes fail 
> (failing-task-index) and up to which attempt these physical tasks fail 
> (failing-upto-task-attempt). On the input side, we select which tasks have 
> failed inputs (do-fail), which physical task indexes fail 
> (failing-task-index), up to which attempt these physical tasks have failed 
> inputs (failing-task-attempt), which physical inputs fail 
> (failing-input-index) and up to which version of the physical inputs the 
> tasks reject (failing-upto-input-attempt). In addition to task failures and 
> input failures, we also check the values of specific physical tasks to see 
> whether the inputs of downstream vertices match the outputs of upstream 
> vertices (verify-value, verify-task-index). These tests were added during 
> 0.3.0 and 0.4.0. With them we found several issues in the Tez AM, fixed them 
> and improved the stability of the Tez AM. 
> Though the current unit tests are useful, they are limited to scenarios 
> carefully chosen by individual contributors. When Tez is used under heavy 
> load, more issues are likely to arise. To bring the fault tolerance tests to 
> a new level, we should add tests that generate randomized failure scenarios. 
> Each time a contributor runs the unit tests, a new scenario will be 
> generated. This gives the community more opportunity to report and fix new 
> issues.
> There are a few criteria for the new tests:
> - We want to keep the time needed to run the unit tests minimal. Each 
> contributor runs different hardware. It is inconvenient if people with slow 
> machines need to spend too much time running tests for every patch.
> - A random scenario needs to be controlled enough that its expected behavior 
> is known. This means the parameters have to be validated by the test itself 
> first.
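
The second criterion above (a random scenario whose parameters are validated by the test itself) could be sketched as follows. This is only an illustration: RandomFailureScenario and its fields are hypothetical names that mirror the failing-task-index and failing-upto-task-attempt options described in the issue, and the sketch assumes that attempts 0 through the "upto" value fail, so the number of failures per task is upto + 1.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical holder for one randomized processor-failure scenario. */
final class RandomFailureScenario {
    final long seed;                        // log this to reproduce a run
    final List<Integer> failingTaskIndexes; // mirrors failing-task-index
    final int failingUptoTaskAttempt;       // mirrors failing-upto-task-attempt

    private RandomFailureScenario(long seed, List<Integer> indexes, int upto) {
        this.seed = seed;
        this.failingTaskIndexes = Collections.unmodifiableList(indexes);
        this.failingUptoTaskAttempt = upto;
    }

    /** Builds a scenario from a seed, then validates it before returning it. */
    static RandomFailureScenario generate(long seed, int numTasks, int maxFailedAttempts) {
        if (numTasks < 1 || maxFailedAttempts < 2) {
            throw new IllegalArgumentException("need numTasks >= 1 and maxFailedAttempts >= 2");
        }
        Random random = new Random(seed);
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            all.add(i);
        }
        Collections.shuffle(all, random);
        int failingCount = 1 + random.nextInt(numTasks); // 1..numTasks
        List<Integer> failing = new ArrayList<>(all.subList(0, failingCount));
        Collections.sort(failing);
        int upto = random.nextInt(maxFailedAttempts - 1); // 0..maxFailedAttempts-2
        RandomFailureScenario scenario = new RandomFailureScenario(seed, failing, upto);
        scenario.validate(numTasks, maxFailedAttempts);
        return scenario;
    }

    /** Rejects any scenario in which the job could not be expected to succeed. */
    void validate(int numTasks, int maxFailedAttempts) {
        // Assumption: attempts 0..upto fail, which gives upto + 1 failures per task.
        if (failingUptoTaskAttempt < 0 || failingUptoTaskAttempt + 1 >= maxFailedAttempts) {
            throw new IllegalStateException("too many failures per task: " + describe());
        }
        for (int index : failingTaskIndexes) {
            if (index < 0 || index >= numTasks) {
                throw new IllegalStateException("task index out of range: " + describe());
            }
        }
    }

    String describe() {
        return "seed=" + seed + ", failingTaskIndexes=" + failingTaskIndexes
                + ", failingUptoTaskAttempt=" + failingUptoTaskAttempt;
    }
}
```

Generation is cheap and deterministic per seed, so it adds little to test time, and printing describe() when a test fails would give a contributor enough to reproduce the scenario.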



--
This message was sent by Atlassian JIRA
(v6.2#6252)
