Domino Valdano created MADLIB-1379:
--------------------------------------
Summary: Fix madpack slowness by using a single session
Key: MADLIB-1379
URL: https://issues.apache.org/jira/browse/MADLIB-1379
Project: Apache MADlib
Issue Type: Improvement
Components: Madpack
Reporter: Domino Valdano
In the current implementation of madpack, the Python function for sending
queries to the database, `run_query`, is causing extreme slowness on some
platforms, especially during startup and shutdown.
For example, on a MacBookPro with OSX 16.7.0 and a minimal 3-segment Greenplum
demo cluster:
```
> time src/bin/madpack -p greenplum -c /madlib install-check -t nosuchmodule
madpack.py: INFO : Detected Greenplum DB version 5.10.0.
real 0m46.365s
user 0m0.207s
sys 0m0.136s
```
It takes madpack 46 seconds just to figure out that there is no module named
`nosuchmodule` in madlib, so it can't run install-check on it.
It also takes about 46 seconds to run a single install-check SQL file that
*does* exist but contains only the line `SELECT 1`. The general pattern is
that `madpack` will hang for a long time, then run `install-check` or
`dev-check`, report PASSes or FAILs, hang for a long time again, and then
finally exit.
This makes it almost unusable for rapid iterative development, for example when
writing test files and checking whether they still PASS when minor changes are
made.
I'm not sure how many platforms are affected by this slowdown, but I have
noticed it happens on my MacBook Pros with greenplum but not with postgres.
The main reason for the slowness is that a series of setup SQL commands
(`DROP USER`, `CREATE USER`, `GRANT USAGE`, `GRANT ALL`, etc.) is issued one
by one to `run_query`. Instead of using a single connection to the db, each
call to `run_query` opens a new connection and then closes it. In particular,
it spawns an entirely new child process, redirects the input and output
streams, and constructs a `psql` command filled with all the right
command-line options, just to execute a single line of SQL.
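For illustration, the per-query pattern amounts to something like the
following (a simplified sketch, not the actual madpack source; the connection
parameters are placeholders):
```
# Simplified sketch of the current pattern: every call pays the full
# connection cost (placeholders, not the actual madpack source).
import subprocess

def run_query(sql, host, port, user, dbname):
    """Spawn a brand-new psql child process to run one statement."""
    cmd = ['psql', '-h', host, '-p', str(port), '-U', user,
           '-d', dbname, '-tA', '-c', sql]
    # A new child process, a new client connection, and (on Greenplum)
    # new segment backends are created and torn down for this single
    # statement.
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout
```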
On `postgres` this may be a little inefficient, but it doesn't matter too much
because there is just a single instance running on a single machine. It
probably slows things down a little, but hardly noticeably.
However, on `greenplum` the situation is far worse. Each time `run_query`
creates a new `psql` client session, it first connects to the master postgres
instance. The master spawns a new child postgres process to handle that
particular client. It then signals all of the segments to create new child
postgres processes on all the remote hosts to also handle that specific client.
This involves sending encrypted network traffic back and forth between every
host in the cluster over an ssh connection, followed by each of those hosts
spawning as many new postgres processes as there are segments per host.
The 46-second delay I've seen is for only a single machine with 1 master and 3
segments, so I think that means we could expect delays of several minutes or
more on multi-host clusters.
I have verified that with greenplum on OSX, each time I open `psql` there is a
3-second delay during the first SQL command that involves the segments, but
no delay for any commands after that, since all of the extra postgres
processes have already been created. Because of how `run_query` is written,
that means a 3-second delay for every SQL command issued, and madpack issues
at least a dozen or so, only one of them being the actual execution of the
test script.
An easy solution to this is to rewrite `run_query` so that it uses a single
client connection instead of spawning a new client in a new child process for
each query. This is only a few lines using `psycopg`. But if we don't want to
add that as a dependency, we could start `psql` once on the first query and
save its input and output file descriptors, so that subsequent queries can
send data back and forth to the same `psql` process.
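A minimal sketch of the single-session approach, assuming `psycopg2` (the
`con_args` dict and the module-level caching here are illustrative, not
madpack's actual signature):
```
import psycopg2

_conn = None

def _get_connection(con_args):
    """Open the shared session on first use; reuse it afterwards."""
    global _conn
    if _conn is None:
        # con_args is assumed to hold dbname/user/host/port keywords.
        _conn = psycopg2.connect(**con_args)
        _conn.autocommit = True  # mimic psql's statement-at-a-time behavior
    return _conn

def run_query(sql, con_args):
    """Run one SQL statement on the shared session instead of a new psql."""
    cur = _get_connection(con_args).cursor()
    try:
        cur.execute(sql)
        # Only SELECT-like statements produce a result set.
        return cur.fetchall() if cur.description is not None else []
    finally:
        cur.close()
```
With something like this, the 3-second segment-spawning cost would be paid
once per madpack invocation instead of once per statement.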