Sure, I think what would help here is a test case (in the form of a script
file or some other form, to test this behaviour of the patch) which you
could write and post here, so that others can use it to get the data and
share it.

Sure... note that I already did that on this thread, without any echo... but I can do it again...

Tests should be run on a dedicated host. If it has n cores, I suggest sharing them between the postgres checkpointer & workers and the pgbench threads, so as to avoid competition for cores. With 8 cores I used up to 2 threads & 4 clients, so that 2 cores are left for the checkpointer and other stuff (e.g. I also run iotop & htop in parallel...). Although this may seem conservative, the point of the test is to exercise checkpoints, not the process scheduler of the OS.

Here are the latest versions of my test scripts:

 (1) cp_test.sh <name> <test>

Runs "test" with setup "name". Currently it runs 4000-second pgbench runs with the 4 possible on/off combinations of sorting & flushing, after some warmup. The 4000-second duration is chosen so that there are a few checkpoint cycles per run. For larger checkpoint times, I suggest extending the run time so as to see at least 3 checkpoints during a run.
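For reference, the loop over settings can be sketched in Python as follows (a minimal illustration only: the setting names and the printed command are assumptions, the real work is done by the shell script):

```python
# Sketch of the 4 on/off sorting & flushing combinations that the test
# iterates over; the actual pgbench invocation lives in cp_test.sh.
import itertools

def sort_flush_combinations():
    # (sorting, flushing) pairs: off/off, off/on, on/off, on/on
    return list(itertools.product(["off", "on"], repeat=2))

for sort, flush in sort_flush_combinations():
    print("4000s pgbench run with sorting=%s flushing=%s" % (sort, flush))
```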

More test settings can be added to the two "case"s. Postgres settings,
especially shared_buffers, should be set to a value appropriate for the memory of the test host.

The tests run with the postgres version found in the PATH, so make sure that the right version is found!

 (2) cp_test_count.py one-test-output.log

For rate-limited runs, look at the final figures and compute the percentage of late & skipped transactions. This can also be done by hand.
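By hand, the computation is simply (skipped + late) / (processed + skipped); a minimal sketch with made-up counts:

```python
# Percentage of late + skipped transactions, as cp_test_counts.py reports it.
# The counts below are illustrative, not from an actual run.
processed, skipped, late = 90000, 5000, 3000
percent = 100.0 * (skipped + late) / (processed + skipped)
print("%.2f%% late or skipped" % percent)  # -> 8.42% late or skipped
```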

 (3) avg.py

For full-speed runs, compute stats about the per-second tps:

  sh> grep 'progress:' one-test-output.log | cut -d' ' -f4 | \
        ./avg.py --limit=10 --length=4000
  warning: 633 missing data, extending with zeros
avg over 4000: 199.290575 ± 512.114070 [0.000000, 0.000000, 4.000000, 5.000000, 2280.900000]
  percent of values below 10.0: 82.5%

The figures I reported are 199 (average tps), 512 (standard deviation of the per-second figures), and 82.5% (percent of time spent below 10 tps, i.e. postgres is basically unresponsive). In brackets are the min, q1, median, q3 and max tps seen during the run.
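The same statistics can be recomputed from any per-second tps series; a minimal sketch with made-up values (avg.py uses a shifted running sum instead, but the resulting population stddev is the same):

```python
# Recompute avg, stddev and percent-below-limit from a per-second tps series.
# The tps values are illustrative only, not from an actual run.
import statistics

tps = [0.0, 0.0, 0.0, 3.0, 4.0, 5.0, 12.0, 2280.9]
avg = statistics.mean(tps)
stddev = statistics.pstdev(tps)  # population stddev, as avg.py computes
below = 100.0 * sum(1 for v in tps if v < 10.0) / len(tps)
print("avg over %d: %.1f ± %.1f, %.1f%% below 10 tps"
      % (len(tps), avg, stddev, below))
```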

Of course, that is not mandatory to proceed with this patch, but it can
still help you prove your point, as you might not have access to different
kinds of systems to run the tests.

I agree that more tests would be useful to decide which default value for the flushing option is better. For Linux, all tests so far suggest that "on" is the best choice; for other systems, which rely on posix_fadvise, it is really an open question.

Another option would be to give me temporary access to some available host; I'm used to running these tests...

--
Fabien.

Attachment: cp_test.sh
Description: Bourne shell script

#! /usr/bin/env python
#
# $Id: cp_test_counts.py 316 2015-05-31 20:29:44Z fabien $
#
# show the % of skipped and over-the-limit transactions from pgbench output.
#

import re
import fileinput
d = {}

for line in fileinput.input():
	if line.startswith("number of transactions "):
		for kw in ['processed', 'skipped', 'limit']:
			if line.find(kw + ": ") != -1:
				# raw string avoids invalid-escape warnings on "\d"
				d[kw] = int(re.search(kw + r": (\d+)", line).group(1))
		if len(d) == 3:
			print("%f" % (100.0 * (d['skipped'] + d['limit']) / (d['processed'] + d['skipped'])))
			d = {}
#! /usr/bin/env python
# -*- coding: utf-8 -*-
#
# $Id: avg.py 1226 2015-08-15 19:14:57Z coelho $
#

import argparse
ap = argparse.ArgumentParser(description='show stats about data: count average stddev [min q1 median q3 max]...')
ap.add_argument('--median', default=True, action='store_true',
				help='compute median and quartile values')
ap.add_argument('--no-median', dest='median', default=True,
				action='store_false',
				help='do not compute median and quartile values')
ap.add_argument('--more', default=False, action='store_true',
				help='show some more stats')
ap.add_argument('--limit', type=float, default=None,
				help='set limit for counting below limit values')
ap.add_argument('--length', type=int, default=None,
				help='set expected length, assume 0 if beyond')
ap.add_argument('file', nargs='*', help='list of files to process')
opt = ap.parse_args()

# option consistency
if opt.limit:
	opt.more = True
if opt.more:
	opt.median = True

# reset arguments for fileinput
import sys
sys.argv[1:] = opt.file

import fileinput

n, skipped, vals = 0, 0, []
k, vmin, vmax = None, None, None
sum1, sum2 = 0.0, 0.0

for line in fileinput.input():
	try:
		v = float(line)
		if opt.median: # keep track only if needed
			vals.append(v)
		if k is None: # first time
			k, vmin, vmax = v, v, v
		else: # next time
			vmin = min(vmin, v)
			vmax = max(vmax, v)
		n += 1
		vmk = v - k
		sum1 += vmk
		sum2 += vmk * vmk
	except ValueError: # float conversion failed
		skipped += 1

if opt.length:
	assert n > 0, "some data seen"
	missing = int(opt.length) - len(vals)
	assert missing >= 0, "positive number of missing data"
	if missing > 0:
		print("warning: %d missing data, extending with zeros" % missing)
		if opt.median:
			vals += [ 0.0 ] * missing
		vmin = min(vmin, 0.0)
		sum1 += - k * missing
		sum2 += k * k * missing
		n += missing
		assert len(vals) == int(opt.length)

if opt.median:
	assert len(vals) == n, "consistent length"

# five numbers...
# numpy.percentile requires numpy at least 1.9 to use 'midpoint'
# statistics.median requires python 3.4 (?)
def median(vals, start, length):
	m, odd = divmod(length, 2)
	#return 0.5 * (vals[start + m + odd - 1] + vals[start + m])
	return  vals[start + m] if odd else \
		0.5 * (vals[start + m-1] + vals[start + m])

# return ratio of below limit values
def below(vals, limit):
	return float(len([v for v in vals if v < limit ])) / len(vals)

if skipped:
	print("warning: %d lines skipped" % skipped)

if n > 0:
	# show result (hmmm, precision is truncated...)
	from math import sqrt
	avg, stddev = k + sum1 / n, sqrt((sum2 - (sum1 * sum1) / n) / n)
	if opt.median:
		vals.sort()
		med = median(vals, 0, len(vals))
		# not sure about odd/even issues here...
		q1 = median(vals, 0, len(vals) // 2)
		q3 = median(vals, (len(vals)+1) // 2, len(vals) // 2)
		print("avg over %d: %f ± %f [%f, %f, %f, %f, %f]" %
			  (n, avg, stddev, vmin, q1, med, q3, vmax))
		if opt.more:
			limit = opt.limit if opt.limit else 0.1 * med
			print("percent of values below %.1f: %.1f%%" %
				  (limit, 100.0 * below(vals, limit)))
	else:
		print("avg over %d: %f ± %f [%f, %f]" %
			  (n, avg, stddev, vmin, vmax))
else:
	print("no data seen.")
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
