[ 
https://issues.apache.org/jira/browse/TEZ-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232617#comment-14232617
 ] 

Tassapol Athiapinya commented on TEZ-1060:
------------------------------------------

Uploaded TEZ-1060.3.patch.
The new implementation is based on the randomness suggested earlier. 
TestProcessor uses getTaskAttemptNumber() to determine whether the maximum 
number of failed attempts has been reached. TestInput uses getVersion() to 
make the same determination.

There are 4 new properties, two each for TestProcessor and TestInput:
- whether to fail randomly
- the probability that an attempt fails
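
As a rough illustration of how a "fail randomly" flag, a failure probability and an attempt cap could interact, here is a minimal sketch. It is not the code in the patch: the class name, method name and parameters are hypothetical, and it assumes that an attempt whose number has reached the cap is never failed, so that a task can eventually succeed. In the patch the attempt number would come from getTaskAttemptNumber() (or getVersion() on the input side); here it is passed in as a plain int.

```java
import java.util.Random;

// Hypothetical sketch, not the actual patch code.
class RandomFailureSketch {

    /**
     * Decides whether the current attempt should be failed.
     *
     * @param doRandomFail     hypothetical flag: is random failing enabled?
     * @param failProbability  hypothetical probability in [0, 1] that an attempt fails
     * @param attemptNumber    zero-based attempt number of the running task
     * @param maxFailedAttempt assumed cap: attempts at or beyond it never fail
     * @param random           source of randomness, injected so a test can seed it
     */
    static boolean shouldFail(boolean doRandomFail, double failProbability,
                              int attemptNumber, int maxFailedAttempt, Random random) {
        if (!doRandomFail) {
            return false;
        }
        if (attemptNumber >= maxFailedAttempt) {
            // Cap reached: let this attempt run normally.
            return false;
        }
        // nextDouble() is uniform in [0, 1), so this is true with probability failProbability.
        return random.nextDouble() < failProbability;
    }

    public static void main(String[] args) {
        Random random = new Random(42L);
        for (int attempt = 0; attempt < 4; attempt++) {
            System.out.println("attempt " + attempt + " fails: "
                + shouldFail(true, 0.5, attempt, 3, random));
        }
    }
}
```

Passing the Random in, rather than creating it inside the method, keeps the decision reproducible when a test fixes the seed.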

There are 2 new tests:
- test random failing tasks
- test random failing inputs
We can combine these two later.

There is no regression in the TestFaultTolerance unit tests, but the new test 
(testRandomFailingInputs) can throw an NPE. I am still checking whether it is 
a test case error or some other issue.
{code}
2014-12-02 21:09:48,068 INFO [IPC Server handler 4 on 53302] app.TaskAttemptListenerImpTezDag: Container with id: container_1417583363837_0001_01_000004 given task: attempt_1417583363837_0001_1_10_000001_0
2014-12-02 21:09:48,069 FATAL [AsyncDispatcher event handler] event.AsyncDispatcher: Error in dispatcher thread
java.lang.NullPointerException
        at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:1723)
        at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:1708)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
{code}


> Add randomness to fault tolerance tests
> ---------------------------------------
>
>                 Key: TEZ-1060
>                 URL: https://issues.apache.org/jira/browse/TEZ-1060
>             Project: Apache Tez
>          Issue Type: Sub-task
>    Affects Versions: 0.5.0
>            Reporter: Tassapol Athiapinya
>            Assignee: Tassapol Athiapinya
>         Attachments: TEZ-1060.1.patch, TEZ-1060.2.patch, TEZ-1060.3.patch
>
>
> We have TestFaultTolerance unit tests that check whether the AM correctly 
> handles processor failures and input failures. TestFaultTolerance uses 
> TestProcessor and TestInput to simulate a controlled failure scenario for a 
> DAG. In each test, on the processor side, we select which tasks fail 
> (do-fail), which physical task indexes fail (failing-task-index) and up to 
> which attempt these physical tasks fail (failing-upto-task-attempt). On the 
> input side, we select which tasks have failed inputs (do-fail), which 
> physical task indexes fail (failing-task-index), up to which attempt these 
> physical tasks have failed input (failing-task-attempt), which physical 
> inputs fail (failing-input-index) and up to which version of the physical 
> inputs the tasks reject (failing-upto-input-attempt). In addition to task 
> and input failures, we also check the values of specific physical tasks to 
> see whether the inputs of downstream vertices match the outputs of upstream 
> vertices (verify-value, verify-task-index). These tests were added during 
> 0.3.0 and 0.4.0. With them we found several issues in the Tez AM, fixed 
> them and improved its stability. 
> Though the current unit tests are useful, they are limited to scenarios 
> carefully chosen by individual contributors. When Tez is used under heavy 
> load, more issues are likely to arise. To bring the fault tolerance tests 
> to a new level, we should add tests that generate randomized failure 
> scenarios. Each time a contributor runs the unit tests, a new scenario will 
> be generated. This gives the community more opportunities to report and fix 
> new issues.
> There are a few criteria for the new tests:
> - We want to keep the time needed to run the unit tests minimal. Each 
> contributor runs different hardware. It is inconvenient if people with slow 
> machines need to spend too much time running tests for any patch.
> - A random scenario needs to be controlled enough that its expected behavior 
> is known. This means the parameters have to be validated by the test itself 
> first.
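
One possible way to keep a random scenario "controlled enough" is to derive it from a logged seed and to validate its parameters before running anything. The sketch below is only an illustration of that idea, not part of any attached patch: the class, the method names and the rule that each task fails fewer times than the allowed number of attempts are all assumptions.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of a seeded, self-validating random failure scenario.
class RandomScenarioSketch {

    /**
     * Generates, for each task, how many of its first attempts should fail.
     * Every value is below maxAttempts, so each task has at least one attempt
     * left to succeed and the expected outcome of the scenario is success.
     */
    static int[] generateFailuresPerTask(long seed, int numTasks, int maxAttempts) {
        // Validate the parameters up front so the expected behavior is known.
        if (numTasks <= 0) {
            throw new IllegalArgumentException("numTasks must be positive: " + numTasks);
        }
        if (maxAttempts <= 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 2: " + maxAttempts);
        }
        Random random = new Random(seed);
        int[] failures = new int[numTasks];
        for (int i = 0; i < numTasks; i++) {
            // nextInt(bound) returns a value in [0, bound).
            failures[i] = random.nextInt(maxAttempts);
        }
        return failures;
    }

    /** Returns true if no task is asked to fail on all of its allowed attempts. */
    static boolean expectSuccess(int[] failuresPerTask, int maxAttempts) {
        for (int f : failuresPerTask) {
            if (f >= maxAttempts) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Log the seed so a failing run can be replayed with the same scenario.
        long seed = System.currentTimeMillis();
        int[] failures = generateFailuresPerTask(seed, 5, 4);
        System.out.println("seed=" + seed + " failuresPerTask=" + Arrays.toString(failures)
            + " expectSuccess=" + expectSuccess(failures, 4));
    }
}
```

Because the scenario is fully determined by the seed, a contributor who hits a failure can report the seed and others can reproduce the same scenario.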



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
