Thank you for your comments and for reviewing my patch!

(2013/06/16 23:27), Heikki Linnakangas wrote:
On 10.06.2013 13:51, KONDO Mitsumasa wrote:
I have created a patch that improves the checkpoint IO scheduler for more
stable transaction response times.

* Problem with checkpoint IO scheduling under heavy transaction load
When the database is under heavy transaction load, I think the PostgreSQL
checkpoint scheduler has two problems, one at the start and one at the end of
a checkpoint. The first problem is an IO burst when a checkpoint round
starts. It is caused by full-page writes: right after the checkpoint begins
writing pages, page modifications generate full-page images, which produce
heavy WAL IO. The WAL-based checkpoint scheduler then wrongly judges that the
checkpoint is behind schedule because of this extra full-page-write WAL
volume, even though it is not actually late, and it hurries its writes. This
degrades transaction response times. I think WAL volume is not an appropriate
progress measure at the start of a checkpoint.
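
To illustrate the judgment, the WAL-based part of the scheduling check works
roughly like this (a simplified sketch of IsCheckpointOnSchedule(), not the
actual source; the helper below is hypothetical):

#include <stdbool.h>

/* Real server GUCs; declared extern here only to keep the sketch self-contained */
extern double CheckPointCompletionTarget;
extern int    CheckPointSegments;

/* Hypothetical helper: WAL segments written since this checkpoint started */
extern double WALSegmentsConsumedSinceCheckpointStart(void);

/*
 * 'progress' is the fraction of dirty buffers written so far (0.0 - 1.0).
 * Returning false means "behind schedule": the checkpointer then skips its
 * usual delay and writes at full speed.
 */
static bool
IsCheckpointOnSchedule(double progress)
{
    double elapsed_xlogs;

    progress *= CheckPointCompletionTarget;

    /*
     * Compare buffer-write progress against WAL consumed since the
     * checkpoint started.  Full-page images count toward this WAL volume,
     * so the burst of them right after the checkpoint begins makes the
     * checkpoint look late even when it is not.
     */
    elapsed_xlogs = WALSegmentsConsumedSinceCheckpointStart() /
                    (double) CheckPointSegments;
    if (progress < elapsed_xlogs)
        return false;

    /* (the real function also makes a similar comparison against elapsed time) */
    return true;
}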

Yeah, the checkpoint scheduling logic doesn't take into account the heavy WAL
activity caused by full page images. That's an interesting phenomenon, but did
you actually see that causing a problem in your tests?  I couldn't tell from the
results you posted what the impact of that was. Could you repeat the tests
separately with the two separate patches you posted later in this thread?
OK, I will test with the two separate patches. The results I posted previously
indicate high WAL throughput (write_size_per_sec) and a high transaction rate
during checkpoints. Please see the following HTML reports, where I have set
anchor links and added a 'checkpoint highlight switch' button.

* With my patched PG
http://pgstatsinfo.projects.pgfoundry.org/dbt2_result/report/patchedPG-report.html#transaction_statistics
http://pgstatsinfo.projects.pgfoundry.org/dbt2_result/report/patchedPG-report.html#wal_statistics

* Plain PG
http://pgstatsinfo.projects.pgfoundry.org/dbt2_result/report/plainPG-report.html#transaction_statistics
http://pgstatsinfo.projects.pgfoundry.org/dbt2_result/report/plainPG-report.html#wal_statistics

In the WAL statistics, I think the high WAL throughput at the start of a checkpoint indicates that checkpoint IO does not disturb the IO of concurrently executing transactions.

Rationalizing a bit, I could even argue to myself that it's a *good* thing. At
the beginning of a checkpoint, the OS write cache should be relatively empty, as
the checkpointer hasn't done any writes yet. So it might make sense to write a
burst of pages at the beginning, to partially fill the write cache first, before
starting to throttle. But this is just handwaving - I have no idea what the
effect is in real life.
Yes, I think so. If we want to change the IO throttling, we have to change OS parameters such as '/proc/sys/vm/dirty_background_ratio' or '/proc/sys/vm/dirty_ratio'. But these parameters affect every application on the host, so they are difficult to change and hard to set intuitively. I also think database tuning should be done with database parameters rather than OS parameters; that makes tuning a server much clearer.

Another thought is that rather than trying to compensate for that effect in the
checkpoint scheduler, could we avoid the sudden rush of full-page images in the
first place? The current rule for when to write a full page image is
conservative: you don't actually need to write a full page image when you modify
a buffer that's sitting in the buffer cache, if that buffer hasn't been flushed
to disk by the checkpointer yet, because the checkpointer will write and fsync
it later. I'm not sure how much it would smoothen WAL write I/O, but it would be
interesting to try.
That would be the ideal approach. But I don't have a concrete idea of how to implement it; it seems very difficult...
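
If I understand the idea correctly, it could be sketched like this (rough
pseudocode; the types and the helper are hypothetical, and the real rule in
xlog.c is more subtle):

#include <stdbool.h>

typedef unsigned long long XLogRecPtr;  /* simplified stand-in for the real type */

extern XLogRecPtr RedoRecPtr;  /* redo pointer of the current checkpoint */

/* Hypothetical helper: has the checkpointer already flushed this buffer
 * during the current checkpoint? */
extern bool BufferAlreadyFlushedByCheckpointer(void *buffer);

/* Current conservative rule: take a full-page image whenever the page was
 * last WAL-logged before the current checkpoint's redo point. */
static bool
NeedsFullPageImageCurrent(XLogRecPtr page_lsn)
{
    return page_lsn <= RedoRecPtr;
}

/* Proposed relaxation: if the checkpointer has not flushed the buffer yet,
 * it will still write and fsync the whole page later, so a torn page cannot
 * survive the checkpoint and the full-page image can be skipped. */
static bool
NeedsFullPageImageRelaxed(XLogRecPtr page_lsn, void *buffer)
{
    return page_lsn <= RedoRecPtr &&
           BufferAlreadyFlushedByCheckpointer(buffer);
}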


The second problem is an fsync freeze at the end of the checkpoint.
Normally, checkpoint writes are flushed in the background by the OS's IO
scheduler. But when that does not work well, the fsyncs at the end of the
checkpoint cause an IO freeze and slow transactions. Unexpectedly slow
transactions can trigger monitoring errors in an HA cluster and degrade
the user experience of the application. This is an especially serious
problem on cloud and virtual server database systems, which have poor
IO performance. However, postgresql.conf offers very few parameters to
address it. We prefer fast transaction responses over a short checkpoint;
in fact the checkpoint time is already short, and making it a little
longer is not a problem. You may think that checkpoint_segments and
checkpoint_timeout can simply be set to larger values, but a large
checkpoint_segments wastes file cache on WAL that is never read, and a
large checkpoint_timeout causes a long crash recovery.
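
The direction my patch takes is to spread out the fsync phase instead of
issuing all fsyncs back-to-back. In rough pseudocode (a simplified sketch
with hypothetical names, not the actual patch; the sleep length is
illustrative):

#include <unistd.h>

/* Hypothetical stand-in for the checkpointer's pending-fsync bookkeeping */
typedef struct PendingFsync
{
    int                  fd;
    struct PendingFsync *next;
} PendingFsync;

extern PendingFsync *pending_fsyncs;

/* Sleep a little after each fsync so the device queue can drain, instead
 * of freezing IO for every backend at the end of the checkpoint. */
static void
SpreadCheckpointFsyncs(void)
{
    PendingFsync *p;

    for (p = pending_fsyncs; p != NULL; p = p->next)
    {
        (void) fsync(p->fd);   /* error handling omitted for brevity */
        usleep(100 * 1000);    /* 100ms pause between fsyncs; illustrative */
    }
}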

A long time ago, Itagaki wrote a patch to sort the checkpoint writes:
www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp.
He posted very promising performance numbers, but it was dropped because Tom
couldn't reproduce the numbers, and because sorting requires allocating a large
array, which has the risk of running out of memory, which would be bad when
you're trying to checkpoint.
Yes, we tested Itagaki's patch last year, but our test results were not good. I think our test server's RAID controller, with a 1GB cache and 8 disks, was too good to show any benefit: write IO may already be reordered and optimized inside a RAID controller with a large cache.
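
For reference, my understanding of the sorting approach is something like this
(a rough sketch with hypothetical names, not Itagaki's actual patch):

#include <stdlib.h>

/* Hypothetical flattened buffer tag, used only for sorting */
typedef struct BufWrite
{
    unsigned int tblspc;    /* tablespace */
    unsigned int relnode;   /* relation file node */
    unsigned int forknum;   /* fork within the relation */
    unsigned int block;     /* block number within the fork */
    int          buf_id;    /* which shared buffer to write */
} BufWrite;

/* Order writes by file, then by block within the file, so the kernel
 * sees mostly sequential IO per file. */
static int
bufwrite_cmp(const void *a, const void *b)
{
    const BufWrite *x = (const BufWrite *) a;
    const BufWrite *y = (const BufWrite *) b;

    if (x->tblspc != y->tblspc)
        return x->tblspc < y->tblspc ? -1 : 1;
    if (x->relnode != y->relnode)
        return x->relnode < y->relnode ? -1 : 1;
    if (x->forknum != y->forknum)
        return x->forknum < y->forknum ? -1 : 1;
    if (x->block != y->block)
        return x->block < y->block ? -1 : 1;
    return 0;
}

/* The checkpointer would allocate one entry per dirty buffer (the large
 * array Heikki mentions) and sort it before writing. */
static void
sort_checkpoint_writes(BufWrite *writes, size_t n)
{
    qsort(writes, n, sizeof(BufWrite), bufwrite_cmp);
}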

Apart from the direct performance impact of that patch, sorting the writes would
allow us to interleave the fsyncs with the writes. You would write out all
buffers for relation A, then fsync it, then all buffers for relation B, then
fsync it, and so forth. That would naturally spread out the fsyncs.

If we don't mind scanning the buffer cache several times, we don't necessarily
even need to sort the writes for that. Just scan the buffer cache for all
buffers belonging to relation A, then fsync it. Then scan the buffer cache
again, for all buffers belonging to relation B, then fsync that, and so forth.
Yes. But I don't think it needs an *exact* buffer sort. A rough sort is enough for interleaving the fsyncs with the writes: it reduces the computational complexity that Tom was concerned about, and the OS IO scheduler will optimize the writes just as it would with an exact sort. My image of a rough buffer sort is something like k-means clustering; if we know the distribution of buffers in advance, we can realize a rough sort with less computational complexity.
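
To make the interleaving concrete, the write-and-fsync loop could look roughly
like this (hypothetical sketch, reusing the BufWrite array from the sorting
sketch above; write_buffer() and fsync_file() are assumed helpers):

#include <stddef.h>

/* Two buffers belong to the same file if everything but the block matches */
static int
same_file(const BufWrite *a, const BufWrite *b)
{
    return a->tblspc == b->tblspc &&
           a->relnode == b->relnode &&
           a->forknum == b->forknum;
}

extern void write_buffer(int buf_id);         /* hypothetical */
extern void fsync_file(const BufWrite *w);    /* hypothetical */

/* Write each file's buffers, then fsync that file before moving on,
 * instead of fsyncing everything in one burst at the end. */
static void
write_and_fsync_interleaved(BufWrite *writes, size_t n)
{
    size_t i = 0;

    while (i < n)
    {
        size_t start = i;

        while (i < n && same_file(&writes[start], &writes[i]))
            write_buffer(writes[i++].buf_id);

        fsync_file(&writes[start]);

        /* With only a *rough* sort, a short sleep here would smooth IO
         * further, like the fsync spreading sketched earlier. */
    }
}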


The downside of my patch is a longer checkpoint: checkpoint time increased by
about 10% - 20%. But checkpoints still complete on schedule within
checkpoint_timeout. Please see the checkpoint results (http://goo.gl/NsbC6).

For a fair comparison, you should increase the checkpoint_completion_target of
the unpatched test, so that the checkpoints run for roughly the same amount of
time with and without the patch. Otherwise the benefit you're seeing could be
just because of a more lazy checkpoint.
To be convincing to other contributors, I need a fairer comparison and an objective analysis. Thanks for your advice; I will try it!
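
For example, something like this in postgresql.conf for the unpatched run
(values are illustrative only; the point is to match the checkpoint duration
of the patched run):

checkpoint_completion_target = 0.9   # default 0.5; stretch the writes out
checkpoint_timeout = 5min            # keep identical in both runs
checkpoint_segments = 64             # keep identical in both runs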

Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

