Philip Zeyliger has posted comments on this change. (
http://gerrit.cloudera.org:8080/9085 )
Change subject: IMPALA-6070: Expose using Docker to run tests faster.
......................................................................
Patch Set 3:
(37 comments)
Thanks again for the review.
I think I addressed all of your comments; I went through comments both on PS1
and PS3 and hopefully didn't miss anything.
I've been using this pretty successfully in testing some of my other changes,
which is a good sign.
I ran autopep8 and took its suggestions. I found that you have to break lines
manually and run it again, so that it avoids stuff like:
long_function_name(arg1,
arg2, longarg3,
thisisgoingtogetusoutofthelinelimit)
Anyway, if you've got a magical way to avoid that, let me know.
http://gerrit.cloudera.org:8080/#/c/9085/1//COMMIT_MSG
Commit Message:
http://gerrit.cloudera.org:8080/#/c/9085/1//COMMIT_MSG@28
PS1, Line 28: Besides the obvious win of getting test results more quickly, this
> For my information: why does that change things?
Here are a few pointers.
First, a simple reproduction:
$docker rm test; docker rmi -f test:a; docker run -it --name test
python:2-wheezy bash -c "mkdir /xxx; mkdir /xxx/subdir"; docker commit test
test:a; docker run -ti test:a /usr/bin/python
test
Untagged: test:a
Deleted: sha256:11e8a0db4d31a2da59fb41133c3b1ce3ebc59aad7485bbfd2076e181cd52896c
sha256:e44a5d328bcab3fbf59d6960dbab0f589ea893617ae03a36411de907a806aa52
Python 2.7.3 (default, Nov 19 2017, 01:35:09)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.rename("/xxx", "/yyy")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OSError: [Errno 18] Invalid cross-device link
What's going on is somewhat explained in
https://issues.apache.org/jira/browse/KUDU-1419. If a directory we want to
atomically rename has a subdirectory, aufs refuses to handle it. Kudu expects
these semantics. "mv" works around them.
quote:
To rename(2) directory may return EXDEV even if both of src and tgt are on the
same aufs. When the rename-src dir exists on multiple branches and the lower
dir has child(ren), aufs has to copyup all his children. It can be recursive
copyup. Current aufs does not support such huge copyup operation at one time in
kernel space, instead produces a warning and returns EXDEV. Generally, mv(1)
detects this error and tries mkdir(2) and rename(2) or copy/unlink recursively.
So the result is harmless. If your application which issues rename(2) for a
directory does not support EXDEV, it will not work on aufs. Also this
specification is applied to the case when the src directroy exists on the lower
readonly branch and it has child(ren).
Let me know if that doesn't make sense. I've not studied aufs internals to
understand why it refuses. (And it's possible that other people use other
default Docker storage drivers; that world is... messy.)
http://gerrit.cloudera.org:8080/#/c/9085/3//COMMIT_MSG
Commit Message:
http://gerrit.cloudera.org:8080/#/c/9085/3//COMMIT_MSG@41
PS3, Line 41: as
> nit: "and"?
Done.
http://gerrit.cloudera.org:8080/#/c/9085/3//COMMIT_MSG@48
PS3, Line 48: IMPALA-6494
> I think this is probably a blocker for me.
I agree that this is a blocker for saying turning off GVOs as they are today
and switching to this. However, I tried pretty hard to be minimally invasive to
other things in this change, which will allow some parallel development. I'd
very much like to get this in so that I can start reasonably asking other folks
to try it.
I very much want to avoid making this one commit even bigger than it already
is. For example, it's obvious that we want to run the exhaustive tests in the
same way, or the ASAN ones. I've been avoiding exposing that.
Thoughts?
http://gerrit.cloudera.org:8080/#/c/9085/1/bin/bootstrap_system.sh
File bin/bootstrap_system.sh:
http://gerrit.cloudera.org:8080/#/c/9085/1/bin/bootstrap_system.sh@219
PS1, Line 219: out
> nit: "out" twice
Done
http://gerrit.cloudera.org:8080/#/c/9085/3/bin/start-impala-cluster.py
File bin/start-impala-cluster.py:
http://gerrit.cloudera.org:8080/#/c/9085/3/bin/start-impala-cluster.py@221
PS3, Line 221: system_memory = psutil.virtual_memory().total
> What's the rationale for this change from ps1?
Gerrit is showing me PS1 as the same here; maybe I'm looking at the wrong thing.
Regardless, I've entirely undone this change. There was an intermediate state
where I tried to control mem_limit here by an environment variable. It turns
out that TEST_START_CLUSTER_ARGS could control that just as well with a similar
amount of pain, so I opted for that route (as opposed to adding another
variable). When I fixed that up, it looks like I left in the extra variable.
If you're asking why I need to customize mem_limit, the issue is that we run
multiple clusters simultaneously, and each of them adding up to 80% of RAM
leads to overcommitment.
http://gerrit.cloudera.org:8080/#/c/9085/1/docker/annotate.pl
File docker/annotate.pl:
http://gerrit.cloudera.org:8080/#/c/9085/1/docker/annotate.pl@26
PS1, Line 26: }
> Yes, I'm not thrilled by the introduction of another PL.
The way I worked my way out of this was to move "annotate" into the host. i.e.,
the annotation is now happening on the host rather than inside the container.
I'm still using python, but that's not a net new language.
The nice thing is that this simplified entrypoint.sh nicely. That code now
looks more boring (which is good for shell). I also already needed to know what
the log paths were on the host anyway to parse them for the timeline, but this
removes the need to think about them inside the container.
http://gerrit.cloudera.org:8080/#/c/9085/1/docker/entrypoint.sh
File docker/entrypoint.sh:
http://gerrit.cloudera.org:8080/#/c/9085/1/docker/entrypoint.sh@20
PS1, Line 20: # Entrypoint code for test-with-docker.py containers.
test-with-docker.py
> Still, when I cd into a directory, it's nice to know which file to look at
Good idea; added a pointer to the README.md
http://gerrit.cloudera.org:8080/#/c/9085/1/docker/entrypoint.sh@150
PS1, Line 150: if sudo -u postgres psql -c "select 1"; then
> This variant of exec always returns true. The other version of exec can bui
I switched to -execdir. I created a million files and checked that find is
smart enough to respect ARG_MAX and created 68 echo commands on my behalf.
I don't think the return matters here. I think the return that the man page is
talking about is whether the file will get included in the find output, no?
Either way, it's all academic.
$seq 1000000 | xargs -P16 touch
$ls | wc -l
1000000
$find . -type f -execdir echo '{}' '+' | wc -l
68
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/entrypoint.sh
File docker/entrypoint.sh:
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/entrypoint.sh@33
PS3, Line 33: <name>
> nits: what is "<name>"? How is it used - just an empty directory?
I amended the docs here, and for --name of test-with-docker.py.
Because the log files are overlapping (e.g., impalad logs), I created a
directory for every container. For reasons that are possibly less good, I also
create a test dir for every run. This was super handy for me to compare
multiple runs during development, because I didn't keep losing the logs of
previous runs accidentally.
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/entrypoint.sh@75
PS3, Line 75: busily
> nit: Can you leave a note explaining why we want to sleep busily, rather th
Done
# We sleep busily so that CPU metrics will show usage, to
# better exercise the timeline code.
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/entrypoint.sh@178
PS3, Line 178: cd /home/impdev/Impala
> pushd, the popd trap on function return or exit?
I just moved the line 174 popd to the bottom of this function.
Given that this isn't being sourced and this being largely under "set -e", I'm
not bothering with clever traps; let me know if you think I should be more
clever there.
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/entrypoint.sh@245
PS3, Line 245: IMPALAD_MEM_LIMIT_BYTES
> Can you add a comment about where this variable comes from?
I amended the comment in lines 235-239 to call this variable out explicitly.
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/entrypoint.sh@249
PS3, Line 249: # TODO: session-expiry-test is inconsistent when there isn't
> Can you file a ticket along with the error message you get?
IMPALA-6704. I just looked at it again with fresh eyes and I think the patch is
simple, so I removed this and I'll push that patch along as a pre-requisite.
I'll restore this if people disagree with that patch for that one.
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/monitor.py
File docker/monitor.py:
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/monitor.py@20
PS3, Line 20: # Monitors Docker containers for CPU and memory usage.
> Can you add some example invocations?
Done
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/monitor.py@21
PS3, Line 21: # Prepares a timeline based on said monitoring.
> What format is that timeline in? What file is it in?
Done
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/monitor.py@61
PS3, Line 61: if total_gb and available_gb:
> What if available_gb is 0.0? Can that happen on a machine with swap?
I switched these to be explicitly "is not None".
And then I replaced this with your implementation using free, so it's moot.
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/monitor.py@96
PS3, Line 96: t = line[:ts_length]
> nit: this and the next section can be de-duped somewhat.
Done
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/monitor.py@122
PS3, Line 122: usercpu
> What is usercpu? I mean, what are the units?
Done
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/monitor.py@123
PS3, Line 123: memory.stat
> Where is memory.stat found?
Done
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/monitor.py@131
PS3, Line 131: frequency_seconds is how often metrics are gathered
> nit: all on one line
Done
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/monitor.py@150
PS3, Line 150: which an
> nit: I think you accidentally a word
Done
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/monitor.py@154
PS3, Line 154: def _metrics_from_stat_file(root, container, stat):
> Generally, it's helpful to know the type (and format, for strings) of the p
Done
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/monitor.py@156
PS3, Line 156: os.path.join(root, "docker", container.id_)
> nit: repeated below; factor out
Done
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/monitor.py@163
PS3, Line 163: disappear
> That should throw a specific exception, yes? Can you catch just that?
Done
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/monitor.py@170
PS3, Line 170: a log file
> in particular, the one set in the constructor.
Done
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/monitor.py@176
PS3, Line 176: ["findmnt", "-n", "-o", "TARGET", "-t", "cgroup", "--source",
"cgroup"]
> nit: "findmnt -n -o TARGET -t cgroup --source cgroup".split()
Done
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/monitor.py@221
PS3, Line 221: Returns a tuple of (name, timestamp, line) for interesting
lines in the file
> nit: this and a couple other doc strings don't quite meet our usual pattern
Done. I'm a little bit confused about why the autopep8 thing you mentioned to
be before didn't work in some cases. Very likely user error on my part.
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/monitor.py@226
PS3, Line 226: parsed_interesting_lines
> nit: why name it -- just return it.
Done
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/monitor.py@230
PS3, Line 230: self
> self unused; Can be static method?
Done
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/monitor.py@270
PS3, Line 270: logfile_timeline
> This is called only once, it seems. Can we call it with just C and have log
Done
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/monitor.py@292
PS3, Line 292: mterics_by_container
> nit: spelling
Done
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/monitor.py@316
PS3, Line 316: o.write(
> nit: is there a cleaner way to copy file contents?
Done
http://gerrit.cloudera.org:8080/#/c/9085/1/docker/test-with-docker.py
File docker/test-with-docker.py:
http://gerrit.cloudera.org:8080/#/c/9085/1/docker/test-with-docker.py@116
PS1, Line 116: import multiprocessing
> free_lines = subprocess.check_output(["free", "-b", "-w"]).split('\n')
Ok, I moved over to this implementation.
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/timeline.html.template
File docker/timeline.html.template:
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/timeline.html.template@19
PS3, Line 19: Runs tests inside of docker containers, parallelizing different
types of
: tests
> That's not what this file actually does, right?
Done
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/timeline.html.template@53
PS3, Line 53: dataTable.addRow([ row[0], row[1], new Date(0, 0, 0, 0, 0, 0,
1000 * row[2]), new Date(0, 0, 0, 0, 0, 0, 1000*row[3]) ]);
> nit: long line
Done
http://gerrit.cloudera.org:8080/#/c/9085/3/docker/timeline.html.template@72
PS3, Line 72: // width: "100px"
> Elide
Done
http://gerrit.cloudera.org:8080/#/c/9085/1/tests/run-tests.py
File tests/run-tests.py:
http://gerrit.cloudera.org:8080/#/c/9085/1/tests/run-tests.py@216
PS1, Line 216: # Note that most stress tests are already being skipped
because of IMPALA-2605
> Do you want to allow --skip-stress, also?
Done
--
To view, visit http://gerrit.cloudera.org:8080/9085
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I82052ef31979564968effef13a3c6af0d5c62767
Gerrit-Change-Number: 9085
Gerrit-PatchSet: 3
Gerrit-Owner: Philip Zeyliger <[email protected]>
Gerrit-Reviewer: Jim Apple <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Philip Zeyliger <[email protected]>
Gerrit-Comment-Date: Thu, 22 Mar 2018 03:35:20 +0000
Gerrit-HasComments: Yes