[
https://issues.apache.org/jira/browse/TEZ-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232617#comment-14232617
]
Tassapol Athiapinya commented on TEZ-1060:
------------------------------------------
Uploaded TEZ-1060.3.patch.
The new implementation is based on the randomness suggested earlier. TezProcessor
uses getTaskAttemptNumber() to determine whether the maximum number of failed
attempts has been reached. TezInput uses getVersion() to make the same kind of
determination.
There are 4 new properties, two each for TezProcessor and TezInput:
- whether to fail randomly or not
- the probability of failing an attempt
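As a rough illustration of how the two properties listed above could drive a failure decision, here is a minimal sketch. It is not the code from the patch: the class name, method name, and parameters (including maxFailedAttempts and the zero-based attempt number) are all assumptions made for illustration.

```java
import java.util.Random;

/** Hypothetical helper; the names here are illustrative and not taken from the patch. */
public class RandomFailureSketch {

    /**
     * Decides whether the given attempt should be failed on purpose.
     *
     * @param doRandomFail      whether random failing is enabled (assumed property)
     * @param failProbability   probability in [0, 1] of failing an attempt (assumed property)
     * @param attemptNumber     zero-based attempt number of the task (assumption)
     * @param maxFailedAttempts attempts at or beyond this number never fail (assumption)
     * @param random            source of randomness, injectable so a seed can be fixed
     */
    public static boolean shouldFail(boolean doRandomFail, double failProbability,
                                     int attemptNumber, int maxFailedAttempts,
                                     Random random) {
        if (!doRandomFail) {
            return false; // random failing is switched off
        }
        if (attemptNumber >= maxFailedAttempts) {
            return false; // let this attempt succeed so the task can finish
        }
        // nextDouble() returns a value in [0.0, 1.0)
        return random.nextDouble() < failProbability;
    }

    public static void main(String[] args) {
        Random random = new Random(42L); // fixed seed makes a run repeatable
        for (int attempt = 0; attempt < 4; attempt++) {
            System.out.println("attempt " + attempt + " fails: "
                    + shouldFail(true, 0.5, attempt, 3, random));
        }
    }
}
```

Passing the Random in from outside is a design choice for the sketch: a test could log its seed and replay a failing scenario later.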
There are 2 new tests:
- test random failing tasks
- test random failing inputs
We can combine these two later.
There is no regression in the TestFaultTolerance unit tests, but the new test
(testRandomFailingInputs) can throw an NPE. I am still checking whether this is
a test case error or some other issue.
{code}
2014-12-02 21:09:48,068 INFO [IPC Server handler 4 on 53302] app.TaskAttemptListenerImpTezDag: Container with id: container_1417583363837_0001_01_000004 given task: attempt_1417583363837_0001_1_10_000001_0
2014-12-02 21:09:48,069 FATAL [AsyncDispatcher event handler] event.AsyncDispatcher: Error in dispatcher thread
java.lang.NullPointerException
	at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:1723)
	at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:1708)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
{code}
> Add randomness to fault tolerance tests
> ---------------------------------------
>
> Key: TEZ-1060
> URL: https://issues.apache.org/jira/browse/TEZ-1060
> Project: Apache Tez
> Issue Type: Sub-task
> Affects Versions: 0.5.0
> Reporter: Tassapol Athiapinya
> Assignee: Tassapol Athiapinya
> Attachments: TEZ-1060.1.patch, TEZ-1060.2.patch, TEZ-1060.3.patch
>
>
> We have TestFaultTolerance unit tests that check whether the AM correctly
> handles processor failures and input failures. TestFaultTolerance uses
> TestProcessor and TestInput to simulate controlled failure scenarios for a
> DAG. In each test, on the processor side, we select which tasks fail
> (do-fail), which physical task indexes fail (failing-task-index) and up to
> which attempt these physical tasks fail (failing-upto-task-attempt). On the
> input side, we select which tasks have failed inputs (do-fail), which
> physical task indexes fail (failing-task-index), up to which attempt these
> physical tasks have failed inputs (failing-task-attempt), which physical
> inputs fail (failing-input-index) and up to which version of the physical
> inputs the tasks reject (failing-upto-input-attempt). In addition to task
> failures and input failures, we also check the values of specific physical
> tasks to see whether the inputs of downstream vertices match the outputs of
> upstream vertices (verify-value, verify-task-index). These tests were added
> during 0.3.0 and 0.4.0. With them, we found several issues in the Tez AM,
> fixed them, and improved its stability.
> Though the current unit tests are useful, they are limited to scenarios
> carefully chosen by individual contributors. When Tez is used under heavy
> load, more issues are likely to arise. To bring the fault tolerance tests to
> a new level, we should add tests that generate randomized failure scenarios.
> Each time a contributor runs the unit tests, a new scenario will be
> generated. This gives the community more opportunity to report and fix new
> issues.
> There are a few criteria for the new tests:
> - We want to keep the time needed to run the unit tests minimal. Each
> contributor runs different hardware. It is inconvenient if people with slow
> machines need to spend too much time running tests for every patch.
> - Random scenarios need to be controlled enough that the expected behavior
> is known. This means the parameters have to be validated by the test itself
> first.
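
One possible reading of the second criterion, sketched under assumptions: before a randomized scenario runs, the test checks that its parameters still lead to a known outcome. The class, the method, and the rule that the number of failing attempts must stay below the allowed task attempts are hypothetical and are not taken from the issue or from any patch.

```java
/** Hypothetical validator; names and rules are assumptions for illustration only. */
public class RandomScenarioValidator {

    /**
     * Rejects parameter combinations whose expected outcome would be unclear.
     *
     * @param failProbability   probability of failing an attempt, expected in [0, 1]
     * @param maxFailedAttempts number of attempts that may be failed on purpose
     * @param maxTaskAttempts   number of attempts the framework allows per task (assumed input)
     */
    public static void validate(double failProbability, int maxFailedAttempts,
                                int maxTaskAttempts) {
        if (Double.isNaN(failProbability) || failProbability < 0.0 || failProbability > 1.0) {
            throw new IllegalArgumentException(
                    "failProbability must be in [0, 1]: " + failProbability);
        }
        // If every allowed attempt could fail, the DAG outcome is no longer predictable.
        if (maxFailedAttempts < 0 || maxFailedAttempts >= maxTaskAttempts) {
            throw new IllegalArgumentException(
                    "maxFailedAttempts must be in [0, " + maxTaskAttempts + "): "
                            + maxFailedAttempts);
        }
    }

    public static void main(String[] args) {
        validate(0.5, 3, 4); // accepted: at least one attempt is guaranteed to succeed
        System.out.println("scenario parameters accepted");
    }
}
```

With a check like this, a generated scenario that passes validation can still be asserted to end in success, which keeps the random tests deterministic in their expected result.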