Good morning all,

We have a problem with performance after upgrading from 9.3 to 9.6 where 
certain queries take 9 times longer to run.  On our initial attempt to upgrade, 
we noticed the system as a whole was taking longer to run through normal daily 
processes.  The query with the largest run time was picked to act as a 
measuring stick.

We are using the staging server to test the upgrade.  It has two 1.3TB 
partitions, each of which holds a copy of the near 1TB database.  The active 
staging 9.3 database is on one of the partitions while a copy was made onto the 
other.  A second instance of 9.3 was set up to verify the copy was successful 
and some performace tests were done, then upgraded to 9.6 via pg_upgrade.  The 
same performance tests were done and this is where the 9 time slower number 
comes from.

OS Ubuntu 14.04.4
PG9.3 is 9.3.19-1.pgdg14.04+1 from http://apt.postgresql.org/pub/repos/apt/
PG9.6 is 9.6.5-1.pgdg14.04+2 from same.
The server has 24 cores and 64GB RAM.  Data drives are spinning platter in 
raid10 - not ssd.

upgrade steps:
* Ran rsync (excluding the xlog folder) from the active 9.3 partition to the 
unused partition while 9.3 was still running.
* Once initial rsync completed, shut down 9.3 and reran the rsync command 
without the exclude.
* Once second rsync completed, restarted 9.3 and left it alone.
* Copied the active 9.3 configuration files into a new /etc/postgresql/9.3/ 
folder so we could test before/after numbers.  Changed the config to point to 
the appropriate data/log/etc folders/files/port.
* Started second 9.3 instance.
* Altered the few foreign servers to account for the second instance's port.
* Ran some arbitrary queries to check performance.
* Installed 9.6 via apt
* Created a 9.6 instance with its data directory on the same partition as the 
newely cloned 9.3 instance.
* Ran pg_upgrade with --link option (includes running --check and 
compiling/installing postgis)
* Copied the config from 9.3 and made minimal changes.  Renamed log files, 
changed folders, removed checkpoint_segments is about it.
* Started the 9.6 instance and was able to do arbitrary queries.
* Ran the upgrade-suggested vacuumdb command on all databases to generate 
statistics

At that point, the database should be ready to use.  Running the same set of 
arbitrary queries that were run on 9.3 yielded much worse performance, though.
The data is broken out by dealers containing customers owning vehicles.  The 
arbitrary queries pull data from 8 tables (really 6 large[millions of records] 
and 2 smaller[hundreds] tables) to populate a new table via "select [fields] 
into [new table]".  Twenty different dealers were used for testing meaning 
twenty of these queries.  The system which runs these queries has 12 workers 
meaning up to 12 of these queries can be running concurrently.  While the 
workers were offline, all twenty were queued up and then the workers activated. 
 For each test, the order of the dealers was the same.  That order was a mix of 
small/large dealers mixed - not exactly high,low,high; more like a few large 
then a few small.  The run time for 9.3 was 21m9s and 9.6 was 3h18m25s.

Each test was done while the other database was idle - htop showed little to no 
activity before each test started.
perf reports (converted to flamegraph via 
https://github.com/brendangregg/FlameGraph) for the 9.6 test show about a 1/3 
of the processor usage similar to that of graph for 9.3.  The other 2/3 is 
still within the postgres process but starts with '[unknown]' and has 
'connect', 'socket', and 'close' as the next call in the chain.  I have not 
been able to figure out what postgres is doing to make these calls.

Changing the configuration based on pgtune (command line version 0.9.3-2) did 
not make much change.  The online pgtune at http://pgtune.leopard.in.ua/ had 
just a couple differences in settings I have yet to test.

Main question is what the connect/socket/close calls in the perf output are and 
how to make them go away as they appear to be what is using up the added time.  
I'm hoping there is just a setting I've missed.

Query plan for a small dealer on 9.6 run without anything else running on the 
server
https://explain.depesz.com/s/z71u
Planning time: 8.218 ms
Execution time: 639319.525 ms

Same query as run on 9.3
https://explain.depesz.com/s/gjN3
Total runtime: 272897.150 ms


--
Thanks,
Adam Torres


Reply via email to