[ https://issues.apache.org/jira/browse/IMPALA-14784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061402#comment-18061402 ]
ASF subversion and git services commented on IMPALA-14784:
----------------------------------------------------------
Commit 0adb0775368dca39aa253c0e11583df315354e15 in impala's branch
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=0adb07753 ]
IMPALA-14784: Upgrade to python-xdist==3.5.0 and use --dist=worksteal
On exhaustive jobs, the end-to-end parallel tests show enormous
skew. The last 1-2% of tests take hours, and logs indicate that
the last 1257 tests execute on a single worker.
pytest-xdist introduced a 'worksteal' algorithm in 3.2.0 that
can rebalance the work. Exhaustive end-to-end parallel tests
take about 5h20m, while the same tests run in about 2h40m with
the worksteal policy. The improvement on core exhaustive
tests is much smaller, because they don't suffer the same
level of skew.
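To illustrate why stealing helps with skew, here is a toy simulation
(not pytest-xdist's actual scheduler). It assumes each worker starts
with a contiguous chunk of tests and that an idle worker takes one
test from the tail of the longest remaining queue. The function names
and the durations are made up for illustration.

```python
import heapq
from collections import deque


def split_contiguous(durations, workers):
    """Split the test list into one contiguous chunk per worker."""
    size, extra = divmod(len(durations), workers)
    chunks, start = [], 0
    for w in range(workers):
        end = start + size + (1 if w < extra else 0)
        chunks.append(deque(durations[start:end]))
        start = end
    return chunks


def simulate_makespan(durations, workers, steal):
    """Return the time at which the last test finishes."""
    queues = split_contiguous(durations, workers)
    # Heap of (time_when_free, worker_id); every worker is free at t=0.
    free_at = [(0.0, w) for w in range(workers)]
    heapq.heapify(free_at)
    finish = 0.0
    while free_at:
        now, w = heapq.heappop(free_at)
        if queues[w]:
            d = queues[w].popleft()       # run the next test from own queue
        elif steal:
            victim = max(range(workers), key=lambda v: len(queues[v]))
            if not queues[victim]:
                continue                  # no work left anywhere: retire
            d = queues[victim].pop()      # steal from the tail of the victim
        else:
            continue                      # own queue empty: sit idle
        finish = max(finish, now + d)
        heapq.heappush(free_at, (now + d, w))
    return finish


# Skewed workload: the slow tests all land in the last chunks.
durations = [1] * 8 + [10] * 4
print(simulate_makespan(durations, 4, steal=False))
print(simulate_makespan(durations, 4, steal=True))
```

In this toy workload the last worker holds three of the four slow
tests, so without stealing it finishes long after the others go idle.
In a real run the policy is selected with pytest's --dist=worksteal
option alongside the usual -n worker count.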
pytest-xdist changed the way it assigns tests to workers,
which exposed an issue with TestAcid::test_lock_timings().
The test sets the query option lock_max_wait_time_s on the
session but never unsets it. When multiple copies of
the test run on a single worker, the test case that expects a
timeout of 300 seconds with lock_max_wait_time_s unset actually
runs with lock_max_wait_time_s=5. This change reworks the
test to set lock_max_wait_time_s via execute_query()'s
query_options argument rather than on the session itself.
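The leak can be sketched with a stand-in client. FakeSession,
set_option(), leaky_case() and fixed_case() below are hypothetical
and are not Impala's test framework; only the option name, the
300-second default, the value 5, and the idea of a per-query
query_options argument come from the description above. The sketch
assumes per-query options apply to that one query only.

```python
class FakeSession:
    """Hypothetical stand-in for a client session shared by tests on one worker."""

    DEFAULT_LOCK_WAIT_S = 300  # assumed effective timeout when the option is unset

    def __init__(self):
        self.session_options = {}

    def set_option(self, name, value):
        # Session-level options persist until explicitly unset.
        self.session_options[name] = value

    def execute_query(self, query, query_options=None):
        # Per-query options override session options for this call only.
        effective = dict(self.session_options)
        effective.update(query_options or {})
        return effective.get("lock_max_wait_time_s", self.DEFAULT_LOCK_WAIT_S)


def leaky_case(session, wait_s):
    """Old pattern: sets the option on the session and never unsets it."""
    if wait_s is not None:
        session.set_option("lock_max_wait_time_s", wait_s)
    return session.execute_query("select 1")


def fixed_case(session, wait_s):
    """New pattern: passes the option with the query, leaving the session clean."""
    opts = {} if wait_s is None else {"lock_max_wait_time_s": wait_s}
    return session.execute_query("select 1", query_options=opts)


shared = FakeSession()
print(leaky_case(shared, 5), leaky_case(shared, None))

clean = FakeSession()
print(fixed_case(clean, 5), fixed_case(clean, None))
```

With the old pattern, the second call inherits the value set by the
first, which is the behavior that only shows up once two copies of
the test share a worker.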
Testing:
- Ran end-to-end exhaustive tests
- Ran a core job
- Verified that TestAcid::test_lock_timings() can run multiple
times with a single worker without failing
Change-Id: I6916bbef94b380a516356763dfabb3777c682637
Reviewed-on: http://gerrit.cloudera.org:8080/24035
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Michael Smith <[email protected]>
> Switch end to end parallel tests to python-xdist's --dist=worksteal
> -------------------------------------------------------------------
>
> Key: IMPALA-14784
> URL: https://issues.apache.org/jira/browse/IMPALA-14784
> Project: IMPALA
> Issue Type: Task
> Components: Infrastructure
> Affects Versions: Impala 5.0.0
> Reporter: Joe McDonnell
> Assignee: Joe McDonnell
> Priority: Major
> Fix For: Impala 5.0.0
>
>
> pytest-xdist added a "worksteal" distribution mode that can rebalance work
> between the workers when a worker goes idle. This is mainly useful towards
> the end of a parallel test run to reduce skew. On exhaustive end-to-end
> parallel tests, the last 2% of tests take hours:
> {noformat}
> 02:26:10 ..............................xxx....................................... [ 98%]
> 03:38:48 ........................................................................ [ 98%]
> 04:42:39 ........................................................................ [ 99%]
> 05:08:04 ........................................................................ [ 99%]
> 05:08:21 ....ss [100%]
> {noformat}
> Looking at the report log, there is massive skew. At the end, a single worker
> is working alone on its remaining tests while other workers are idle. Work
> stealing would make a big difference here.