[ 
https://issues.apache.org/jira/browse/IMPALA-6070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459778#comment-16459778
 ] 

ASF subversion and git services commented on IMPALA-6070:
---------------------------------------------------------

Commit d733ea68ca144798ff67054faf577dfdae0f201e in impala's branch 
refs/heads/2.x from [~philip]
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=d733ea6 ]

IMPALA-6070: Further improvements to test-with-docker.

This commit tackles a few additions and improvements to
test-with-docker. In general, I'm adding workloads (e.g., exhaustive,
rat-check), tuning memory settings and parallelism, and trying to speed
things up.

Bug fixes:

* Embarrassingly, I was still skipping thrift-server-test in the backend
  tests. This was a mistake in handling feedback from my last review.

* I made the timeline a little bit taller to clip less.

Adding workloads:

* I added the RAT licensing check.

* I added exhaustive runs. This led me to model the suites a little
  bit more in Python, with a class representing a suite with a
  bunch of data about the suite. It's not perfect and still
  coupled with the entrypoint.sh shell script, but it feels
  workable. As part of adding exhaustive tests, I had
  to re-work the timeout handling, since now different
  suites meaningfully have different timeouts.
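
The idea of a suite class with per-suite timeouts could be sketched
roughly as follows. This is an illustration only, not the code from
this commit: the class name, field names, suite names, timeout values,
and environment variable names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suite:
    """Hypothetical description of one test suite and its settings."""
    name: str
    timeout_minutes: int
    exhaustive: bool = False

    def exhaustive_variant(self, timeout_minutes):
        """Return a copy of this suite configured for an exhaustive run."""
        return Suite(self.name + "_EXHAUSTIVE", timeout_minutes, exhaustive=True)

    def env(self):
        """Environment a shell entrypoint could read (variable names are assumptions)."""
        return {
            "SUITE_NAME": self.name,
            "EXPLORATION_STRATEGY": "exhaustive" if self.exhaustive else "core",
            "TIMEOUT_MINUTES": str(self.timeout_minutes),
        }


# Illustrative values only: exhaustive runs get a longer timeout.
core = Suite("EE_TEST", timeout_minutes=120)
exhaustive = core.exhaustive_variant(timeout_minutes=480)
```

Keeping the timeout on the suite object means the driver never needs a
single global timeout, which is what different suites with different
run times call for.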

Speed ups:

* To speed up test runs, I added a mechanism to split py.test suites into
  multiple shards with a py.test argument. This involved a little bit of work in
  conftest.py, and exposing $RUN_CUSTOM_CLUSTER_TESTS_ARGS in run-all-tests.sh.

  Furthermore, I moved a bit more logic about managing the
  list of suites into Python.

* I now do the full build with "-notests" and only build the backend
  tests in the target that needs them. This speeds up "docker commit"
  significantly by removing about 20GB from the container. To make this
  work, I had to declare that expr-codegen-test depends on
  expr-codegen-test-ir, a dependency that was missing.

* I sped up copying the Kudu data: previously I did
  both a move and a copy; now I'm doing a move followed by a move. One
  of the moves is cross-filesystem so is slow, but this does half the
  amount of copying.
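
To illustrate the sharding idea from the first bullet above, here is a
minimal sketch of splitting a list of test ids into shards. The function
names and the hash-based assignment are assumptions made for
illustration; the actual py.test argument and the assignment scheme in
conftest.py are not shown in this message and may differ.

```python
import zlib


def shard_of(test_id, num_shards):
    """Map a test id to a shard index in [0, num_shards).

    Uses crc32 rather than the built-in hash(), because hash() of a
    string is randomized per process and would not be stable across the
    separate processes that each run one shard.
    """
    return zlib.crc32(test_id.encode("utf-8")) % num_shards


def select_shard(test_ids, shard_index, num_shards):
    """Return only the test ids that belong to the given shard."""
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must be in [0, num_shards)")
    return [t for t in test_ids if shard_of(t, num_shards) == shard_index]


# Hypothetical test ids, for illustration.
all_tests = ["test_a.py::test_%d" % i for i in range(20)]
shards = [select_shard(all_tests, i, 3) for i in range(3)]
```

Every test lands in exactly one shard, so running all shards (in
parallel or not) covers the whole suite once. In a conftest.py, a
filter like this would typically be applied from pytest's
pytest_collection_modifyitems hook, with the shard index supplied by an
option registered via pytest_addoption.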

Memory usage:

* I tweaked the memlimit_gb settings to have a higher default. I've been
  fighting empirically to have the tests run well on c4.8xlarge and
  m4.10xlarge.
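
As a back-of-the-envelope sketch of how a per-suite memory limit bounds
parallelism, consider the calculation below. The function, its
parameters other than memlimit_gb, and all the numbers are hypothetical
and are not measurements from this commit.

```python
def max_parallel_suites(total_memory_gb, memlimit_gb, cpu_count, cpus_per_suite=4):
    """Estimate how many suites can run side by side on one machine.

    Each suite is assumed to need memlimit_gb of memory and
    cpus_per_suite cores; whichever resource runs out first sets the
    limit. At least one suite is always allowed.
    """
    by_memory = total_memory_gb // memlimit_gb
    by_cpu = cpu_count // cpus_per_suite
    return max(1, min(by_memory, by_cpu))


# Illustrative machine sizes: the first is memory-bound, the second CPU-bound.
small = max_parallel_suites(total_memory_gb=60, memlimit_gb=7, cpu_count=36)
large = max_parallel_suites(total_memory_gb=160, memlimit_gb=7, cpu_count=40)
```

Raising memlimit_gb makes individual suites more robust but lowers the
memory-bound figure, which is the trade-off being tuned here.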

The more memory a minicluster and test suite run uses, the fewer parallel
suites we can run. By observing the peak processes at the tail of a run (with a
new "memory_usage" function that uses a ps/sort/awk trick) and by observing
peak container total_rss, I found that we had several JVMs that
didn't have Xmx settings set. I added Xms/Xmx settings in a few
places:

 * The non-first Impalad does very little JVM work, so having
   an Xmx keeps it small, even in the parallel tests.
 * Datanodes do work, but they essentially were never garbage
   collecting, because JVM defaults let them use up to 1/4th
   the machine memory. (I observed this based on RSS at the
   end of the run; nothing fancier.) Adding Xms/Xmx settings
   helped.
 * Similarly, I piped the settings through to HBase.

A few daemons still run without resource limitations, but they don't
seem to be a problem.
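
The ps/sort/awk approach to finding the largest processes can be
approximated in Python as below. This is a sketch under assumptions: it
is not the "memory_usage" function from this commit, the function names
are hypothetical, and it assumes ps reports RSS in KiB, as procps does
on Linux.

```python
import subprocess


def parse_ps_rss(ps_output):
    """Parse 'RSS COMMAND' lines into (rss_kib, command) tuples."""
    rows = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip headers and malformed lines
        rows.append((int(parts[0]), parts[1].strip()))
    return rows


def top_by_rss(ps_output, limit=5):
    """Return the largest processes as (rss_mib, command), biggest first."""
    rows = sorted(parse_ps_rss(ps_output), reverse=True)
    return [(rss_kib / 1024.0, command) for rss_kib, command in rows[:limit]]


def current_ps_output():
    """Capture RSS and command line for every process, without a header."""
    return subprocess.check_output(
        ["ps", "-e", "-o", "rss=", "-o", "args="], universal_newlines=True)


# Canned input so the example does not depend on the local machine.
sample = "  2048 java -Xmx1g\n   512 impalad\nRSS COMMAND\n  4096 hbase\n"
largest = top_by_rss(sample, limit=2)
```

Calling top_by_rss(current_ps_output()) at the end of a run gives a
quick view of which daemons hold the most memory, which is how JVMs
without an Xmx setting would stand out.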

Change-Id: I43fe124f00340afa21ad1eeb6432d6d50151ca7c
Reviewed-on: http://gerrit.cloudera.org:8080/10123
Reviewed-by: Joe McDonnell <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-on: http://gerrit.cloudera.org:8080/10248
Reviewed-by: Philip Zeyliger <[email protected]>


> Speed up test execution
> -----------------------
>
>                 Key: IMPALA-6070
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6070
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Philip Zeyliger
>            Assignee: Philip Zeyliger
>            Priority: Major
>         Attachments: screenshot-1.png
>
>
> Our tests (e.g., 
> https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/buildTimeTrend) tend 
> to take about 4 hours. This can be improved.
> I'm opening this JIRA to track those changes. I'm currently looking at:
> * Parallelizing multiple data-load steps: TPC-DS, TPC-H, and Functional take 
> ~65 minutes when serialized. They take 35 minutes if running in parallel.
> * Parallelizing compute stats: this takes ~10 minutes; probably can be faster.
> The trickier thing is parallelizing fe tests, ee tests, and custom cluster 
> tests. The approach I'm taking is to create a docker container with 
> everything in it (including data load), and then running tests in parallel. 
> This is a bit messier, but I think it has some legs when it comes to using 
> machines with many cores.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
