[ 
https://issues.apache.org/jira/browse/TEZ-4149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103867#comment-17103867
 ] 

László Bodor edited comment on TEZ-4149 at 5/10/20, 4:31 PM:
-------------------------------------------------------------

Thanks [~jeagles], I'm waiting for the starter patch.
In the meantime, I experimented with parallelism and found that switching to 
the fair scheduler improves the overall testing time. Please let me know if 
this makes sense.

I modified testRecovery_OrderedWordCount to run all the cases (except the flaky 
one from TEZ-4173) and got the following results, grouped by scheduler and 
thread pool size:
{code}

CAPACITY SCHEDULER:

thread pool 10:
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 400.297 s - in org.apache.tez.test.TestRecovery

thread pool 5:
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 401.501 s - in org.apache.tez.test.TestRecovery

thread pool 3:
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 410.556 s - in org.apache.tez.test.TestRecovery

thread pool 1:
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 385.171 s - in org.apache.tez.test.TestRecovery


FAIR SCHEDULER:

thread pool 15:
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 332.778 s - in org.apache.tez.test.TestRecovery

thread pool 10:
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 320.876 s - in org.apache.tez.test.TestRecovery

thread pool 5:
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 281.534 s - in org.apache.tez.test.TestRecovery

thread pool 3:
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 269.571 s - in org.apache.tez.test.TestRecovery

thread pool 2:
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 253.058 s - in org.apache.tez.test.TestRecovery

thread pool 1:
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 388.268 s - in org.apache.tez.test.TestRecovery
{code}
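
To put the numbers above in perspective, here is a small sketch that computes the relative runtime reduction of each fair scheduler run against the single-threaded capacity scheduler run. The class and method names (RuntimeComparison, reductionPercent) are made up for illustration and are not part of Tez; only the elapsed times are taken from the results above.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tez: compares the elapsed times reported above.
class RuntimeComparison {

    // Percentage by which the candidate run is shorter than the baseline run.
    static double reductionPercent(double baselineSeconds, double candidateSeconds) {
        return (baselineSeconds - candidateSeconds) / baselineSeconds * 100.0;
    }

    public static void main(String[] args) {
        // Baseline: capacity scheduler with a thread pool of 1.
        double baseline = 385.171;

        // Fair scheduler results, keyed by thread pool size.
        Map<Integer, Double> fair = new LinkedHashMap<>();
        fair.put(15, 332.778);
        fair.put(10, 320.876);
        fair.put(5, 281.534);
        fair.put(3, 269.571);
        fair.put(2, 253.058);
        fair.put(1, 388.268);

        for (Map.Entry<Integer, Double> run : fair.entrySet()) {
            System.out.println(String.format(Locale.ROOT,
                "fair, thread pool %d: %.1f%% vs. baseline",
                run.getKey(), reductionPercent(baseline, run.getValue())));
        }
    }
}
```

With these inputs, the 2-thread fair scheduler run comes out roughly a third shorter than the baseline, while the single-threaded fair scheduler run is marginally slower.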

The tests use the capacity scheduler by default. After switching to the fair 
scheduler, I got a runtime minimum at 2-3 threads. It's important to add that 
this is with absolutely zero additional configuration; I've only changed:
{code}
conf.set(YarnConfiguration.RM_SCHEDULER,
    "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
{code}

I think the recovery tests remain valid when the AMs run in parallel on the 
same minicluster.
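
For reference, the general shape of running independent cases on a fixed-size thread pool could look like the sketch below. This is not the actual TestRecovery code: ParallelCaseRunner, runAll, and the sleeping placeholder cases are hypothetical, and a real case would submit a DAG to the minicluster instead of sleeping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, not the actual TestRecovery implementation.
class ParallelCaseRunner {

    // Runs every case on a fixed-size pool and returns how many cases passed.
    static int runAll(List<Callable<Boolean>> cases, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // invokeAll blocks until all submitted cases have completed.
            List<Future<Boolean>> results = pool.invokeAll(cases);
            int passed = 0;
            for (Future<Boolean> result : results) {
                // get() rethrows a case's exception wrapped in ExecutionException.
                if (result.get()) {
                    passed++;
                }
            }
            return passed;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Callable<Boolean>> cases = new ArrayList<>();
        for (int i = 0; i < 6; i++) {
            // Placeholder case: stands in for one recovery scenario.
            cases.add(() -> {
                Thread.sleep(50);
                return true;
            });
        }
        System.out.println("passed: " + runAll(cases, 3));
    }
}
```

The pool size is the knob varied in the measurements above, so it is kept as a plain parameter here.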


> Speed up TezRecovery tests
> --------------------------
>
>                 Key: TEZ-4149
>                 URL: https://issues.apache.org/jira/browse/TEZ-4149
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Jonathan Turner Eagles
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: org.apache.tez.test.TestRecovery-output.txt
>
>
> Currently, approximately 50% of the test cases are chosen to run, as there 
> are many failure points on which to test recovery.
> This can lead to bugs being introduced into the code, as not all test cases 
> are run for every Tez QA run.
> In addition, this can be a real development bottleneck, as tests take around 
> 20 minutes per cycle if all tests are run (10 minutes if 50% of the tests are 
> run as usual).


