Re: [HACKERS] Spread checkpoint sync
Greg Smith wrote: I think a helpful next step here would be to put Robert's fsync compaction patch into here and see if that helps. There are enough backend syncs showing up in the difficult workloads (scale=1000, clients >= 32) that its impact should be obvious.

Initial tests show everything expected from this change and more. This took me a while to isolate because of issues where the filesystem involved degraded over time, giving a heavy bias toward a faster first test run, before anything was fragmented. I just had to do a whole new mkfs on the database/xlog disks when switching between test sets in order to eliminate that.

At a scale of 500, I see the following average behavior:

Clients  TPS  backend-fsync
16       557  155
32       587  572
64       628  843
128      621  1442
256      632  2504

On one run through with the fsync compaction patch applied, this turned into:

Clients  TPS  backend-fsync
16       637  0
32       621  0
64       721  0
128      716  0
256      841  0

So not only are all the backend fsyncs gone, there is a very clear TPS improvement too. The changes in results at >= 64 clients are well above the usual noise threshold in these tests.

The problem where individual fsync calls during checkpoints can take a long time is not appreciably better. But I think this will greatly reduce the odds of running into the truly dysfunctional breakdown, where checkpoint and backend fsync calls compete with one another, that caused the worst-case situation kicking off this whole line of research here.

--
Greg Smith  2ndQuadrant US  g...@2ndquadrant.com  Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
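[The compaction being discussed works by deduplicating the pending fsync requests when the queue fills, instead of forcing backends to fall back to doing their own fsync calls. A minimal sketch of the idea, with illustrative names and an O(n^2) scan for brevity — the actual patch operates on the background writer's shared-memory request array and uses a hash table to stay O(n):]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified fsync request: identified here by a relation
 * file node and a segment number.  The real request struct in md.c also
 * carries a fork number; these names are illustrative only. */
typedef struct FsyncRequest
{
	unsigned	rnode;			/* which relation file */
	unsigned	segno;			/* which 1GB segment of it */
} FsyncRequest;

/* Compact a full request queue in place by dropping duplicate
 * (rnode, segno) pairs, keeping the first occurrence of each.
 * Returns the new queue length. */
static size_t
compact_fsync_queue(FsyncRequest *queue, size_t n)
{
	size_t		out = 0;

	for (size_t i = 0; i < n; i++)
	{
		int			dup = 0;

		for (size_t j = 0; j < out; j++)
		{
			if (queue[j].rnode == queue[i].rnode &&
				queue[j].segno == queue[i].segno)
			{
				dup = 1;
				break;
			}
		}
		if (!dup)
			queue[out++] = queue[i];	/* keep first occurrence */
	}
	return out;
}
```

[Since checkpoints keep re-dirtying the same handful of segments, a queue of tens of thousands of entries routinely collapses to a few, which is why the 32768-entry queue mentioned downthread compacts so dramatically.]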
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: Based on what I saw looking at this, I'm thinking that the backend fsyncs probably happen in clusters - IOW, it's not 2504 backend fsyncs spread uniformly throughout the test, but clusters of 100 or more that happen in very quick succession, followed by relief when the background writer gets around to emptying the queue.

That's exactly the case. You'll be running along fine, the queue will fill, and then hundreds of them can pile up in seconds. Since the worst of that seemed to be during the sync phase of the checkpoint, adding additional queue management logic there is where we started. I thought this compaction idea would be more difficult to implement than your patch proved to be though, so doing this first is working out quite well instead.

This is what all the log messages from the patch look like here, at scale=500 and shared_buffers=256MB:

DEBUG:  compacted fsync request queue from 32768 entries to 11 entries

That's an 8GB database, and from looking at the relative sizes I'm guessing 7 entries refer to the 1GB segments of the accounts table, 2 to its main index, and the other 2 are likely branches/tellers data. Since I know the production system I ran into this on has about 400 file segments on it regularly dirtied, and a higher shared_buffers than that, I expect this will demolish this class of problem on it, too.

I'll have all the TPS over time graphs available to publish by the end of my day here, including tests at a scale of 1000 as well. Those should give a little more insight into how the patch is actually impacting high-level performance. I don't dare disturb the ongoing tests by copying all that data out of there until they're finished; it will be a few hours yet.

My only potential concern over committing this is that I haven't done a sanity check over whether it impacts the fsync mechanics in a way that might cause an issue.
Your assumptions there are documented and look reasonable on quick review; I just haven't had much time yet to look for flaws in them.
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: During each cluster, the system probably slows way down, and then recovers when the queue is emptied. So the TPS improvement isn't at all a uniform speedup, but simply relief from the stall that would otherwise result from a full queue.

That does seem to be the case here. http://www.2ndquadrant.us/pgbench-results/index.htm now has results from a long test series, at two database scales that caused many backend fsyncs during earlier tests. Set #5 is the existing server code, #6 is with the patch applied. There are zero backend fsync calls with the patch applied, which isn't surprising given how simple the schema is on this test case. An average TPS gain of 14% appears at a scale of 500 and 8% at 1000; the attached CSV file summarizes the average figures for the archives.

The gains do appear to come from smoothing out the dead periods that normally occur during the sync phase of the checkpoint. For example, here are the fastest runs at scale=1000/clients=256 with and without the patch:

http://www.2ndquadrant.us/pgbench-results/436/index.html (tps=361)
http://www.2ndquadrant.us/pgbench-results/486/index.html (tps=380)

Here the difference in how much less of a slowdown there is around the checkpoint end points is really obvious, and obviously an improvement. You can see the same thing to a lesser extent at the other end of the scale; here are the fastest runs at scale=500/clients=16:

http://www.2ndquadrant.us/pgbench-results/402/index.html (tps=590)
http://www.2ndquadrant.us/pgbench-results/462/index.html (tps=643)

While there are still very ugly maximum latency figures here in every case, these periods just aren't as wide with the patch in place.

I'm moving on to some brief testing of the newer kernel behavior here, then returning to testing the other checkpoint spreading ideas on top of this compaction patch, presuming something like it will end up being committed first.
I think it's safe to say I can throw away the changes that tried to alter the fsync absorption code present in what I submitted before, as this scheme does a much better job of avoiding that problem than those earlier queue alteration ideas. I'm glad Robert grabbed the right one from the pile of ideas I threw out for what else might help here.

P.S. Yes, I know I have other review work to do as well. Starting on the rest of that tomorrow.

,,Unmodified,,Compacted Fsync,,,
scale,clients,tps,max_latency,tps,max_latency,TPS Gain,% Gain
500,16,557,17963.41,631,17116.31,74,13.3%
500,32,587,25838.8,655,24311.54,68,11.6%
500,64,628,35198.39,727,38040.39,99,15.8%
500,128,621,41001.91,687,48195.77,66,10.6%
500,256,632,49610.39,747,46799.48,115,18.2%
,,,
1000,16,306,39298.95,321,40826.58,15,4.9%
1000,32,314,40120.35,345,27910.51,31,9.9%
1000,64,334,46244.86,358,45138.1,24,7.2%
1000,128,343,72501.57,372,47125.46,29,8.5%
1000,256,321,80588.63,350,83232.14,29,9.0%
Re: [HACKERS] Use of O_DIRECT only for open_* sync options
Bruce Momjian wrote: xlogdefs.h says:

/*
 * Because O_DIRECT bypasses the kernel buffers, and because we never
 * read those buffers except during crash recovery, it is a win to use
 * it in all cases where we sync on each write().  We could allow O_DIRECT
 * with fsync(), but because skipping the kernel buffer forces writes out
 * quickly, it seems best just to use it for O_SYNC.  It is hard to imagine
 * how fsync() could be a win for O_DIRECT compared to O_SYNC and O_DIRECT.
 * Also, O_DIRECT is never enough to force data to the drives, it merely
 * tries to bypass the kernel cache, so we still need O_SYNC or fsync().
 */

This seems wrong because fsync() can win if there are two writes before the sync call. Can kernels not issue fsync() if the write was O_DIRECT? If that is the cause, we should document it.

The comment does look busted, because you did imagine exactly a case where they might be combined. The only incompatibility that I'm aware of is that O_DIRECT requires reads and writes to be aligned properly, so you can't use it in random application code unless it's aware of that. O_DIRECT and fsync are compatible; for example, MySQL allows combining the two: http://dev.mysql.com/doc/refman/5.1/en/innodb-parameters.html (That whole bit of documentation around innodb_flush_method includes some very interesting observations about O_DIRECT actually.)

I'm starting to consider the idea that much of the performance gain seen on earlier systems with O_DIRECT came from decreased CPU usage shuffling things into the OS cache, rather than from its impact on avoiding pollution of said cache. On Linux for example, its main accomplishment is described like this: "File I/O is done directly to/from user space buffers."
http://www.kernel.org/doc/man-pages/online/pages/man2/open.2.html

The earliest paper on the implementation suggests a big decrease in CPU overhead from that: http://www.ukuug.org/events/linux2001/papers/html/AArcangeli-o_direct.html

Impossible to guess whether that's more true (CPU cache pollution is a bigger problem now) or less true (drives are much slower relative to CPUs now) today. I'm trying to remain agnostic and let the benchmarks offer an opinion instead.
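[To make the compatibility point concrete, here is a hedged sketch of combining O_DIRECT with fsync() in application code. The posix_memalign() alignment handling is the part that trips up unaware code; the buffered fallback (for filesystems such as tmpfs that reject O_DIRECT) and the fixed block size are illustrative choices, not anything from the PostgreSQL source:]

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0				/* flag unavailable on this platform */
#endif

/* Write one aligned block synchronously: O_DIRECT to bypass the kernel
 * cache where supported, then fsync() to force it to the drive, since
 * O_DIRECT alone does not guarantee durability.  Falls back to a
 * buffered write on filesystems that reject O_DIRECT.  Returns 0 on
 * success, -1 on error.  A portability sketch, not PostgreSQL code. */
static int
direct_sync_write(const char *path, size_t blocksize)
{
	void	   *buf;
	int			fd;
	int			ok = 0;

	/* O_DIRECT demands the user buffer, file offset, and transfer size
	 * all be aligned, typically to the device's logical block size. */
	if (posix_memalign(&buf, blocksize, blocksize) != 0)
		return -1;
	memset(buf, 'x', blocksize);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0600);
	if (fd >= 0 && write(fd, buf, blocksize) == (ssize_t) blocksize)
		ok = 1;
	if (!ok)
	{
		if (fd >= 0)
			close(fd);
		/* Buffered fallback: same write, still followed by fsync() */
		fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
		if (fd >= 0 && write(fd, buf, blocksize) == (ssize_t) blocksize)
			ok = 1;
	}
	if (ok && fsync(fd) != 0)
		ok = 0;

	if (fd >= 0)
		close(fd);
	free(buf);
	return ok ? 0 : -1;
}
```

[Note the fsync() happens regardless of whether the direct or buffered path was taken, matching the comment's closing point that O_DIRECT by itself never forces data to the drives.]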
Re: [HACKERS] Spread checkpoint sync
+ old kernel crowd.
Re: [HACKERS] auto-sizing wal_buffers
Fujii Masao wrote:

+/* Minimum setting used for a lower bound on wal_buffers */
+#define XLOG_BUFFER_MIN		4

Why didn't you use XLOG_BUFFER_MIN instead of XLOGbuffersMin? XLOG_BUFFER_MIN is not used anywhere for now.

That's a typo; will fix.

+	if (XLOGbuffers < (XLOGbuffersMin * 2))
+		XLOGbuffers = XLOGbuffersMin * 2;
+	}

Why is the minimum value 64kB only when wal_buffers is set to -1? This seems confusing for users.

That's because the current default on older versions is 64kB. Since the automatic selection is going to be the new default, I hope, I don't want it to be possible for it to pick a number smaller than the default of older versions. So the automatic lower limit is 64kB, while the actual manually set lower limit remains 32kB, as before.
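[As a sanity check on that reasoning, here is a sketch of the resulting selection logic, in units of 8kB XLOG pages, including the one-WAL-segment cap the related auto-tuning discussion settled on. The function and constant values are illustrative, not the actual GUC assignment code:]

```c
#include <assert.h>

/* Sketch of the auto-tuning rule under discussion: wal_buffers = -1
 * means size it as 1/32nd of shared_buffers, clamped to a 64kB floor
 * (twice the 32kB manual minimum, so automatic selection never picks
 * less than the old 64kB default) and capped at one 16MB WAL segment.
 * All values are counts of 8kB XLOG pages. */
#define XLOG_BUFFER_MIN		4		/* 32kB: manual lower limit */
#define XLOG_SEG_PAGES	 2048		/* 16MB: one WAL segment */

static int
auto_wal_buffers(int wal_buffers, int shared_buffers)
{
	if (wal_buffers != -1)
		return wal_buffers;					/* explicit setting wins */

	wal_buffers = shared_buffers / 32;
	if (wal_buffers < XLOG_BUFFER_MIN * 2)
		wal_buffers = XLOG_BUFFER_MIN * 2;	/* 64kB automatic floor */
	if (wal_buffers > XLOG_SEG_PAGES)
		wal_buffers = XLOG_SEG_PAGES;		/* 16MB ceiling */
	return wal_buffers;
}
```

[So a 256MB shared_buffers (32768 pages) would auto-select 8MB of WAL buffers, a tiny 128kB setting would still get the 64kB floor, and anything above 512MB hits the segment-sized ceiling.]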
Re: [HACKERS] auto-sizing wal_buffers
Josh Berkus wrote: I think we can be more specific on that last sentence; is there even any *theoretical* benefit to settings above 16MB, the size of a WAL segment? Certainly there have been no test results to show any.

There was the set Marti just reminded us about. The old wording suggested that big enough to fit a single transaction was big enough, and that led to many people being confused and setting this parameter way too low. Since it's possible I'm wrong about 16MB being the upper limit, I didn't want the wording to specifically rule out people testing that size to see what happens. We believe there's never any advantage above that, due to the forced WAL segment switch, but having test results to the contrary floating around keeps me from being too aggressive in how the wording there goes.
Re: [HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
David Fetter wrote: I think I haven't communicated clearly what I'm suggesting, which is that we ship with both an UPSERT and a MERGE, the former being ugly, crude and simple, and the latter festooned with dire warnings about isolation levels and locking.

I don't know that I completely agree with the idea that the UPSERT version should always be crude and the MERGE one necessarily heavy from a performance perspective. However, I do feel you raise a legitimate point that once the hard stuff is done, and the locking issues around proper SQL MERGE sorted out, having an UPSERT/REPLACE synonym that maps to it under the hood may be a useful way to provide a better user experience. The way I see this, the right goal is to first provide the full spec behavior with good performance, and get all that plumbing right. There's nothing that says we can't provide an easier syntax than the admittedly complicated MERGE one to the users though.

So, as for this patch... there's about half a dozen significant open issues with the current code, along with a smattering of smaller ones. The problems remaining are deep enough that I think it would be challenging to work through them for this CF under the best of conditions. And given that we haven't heard from Boxuan since early December, we're definitely understaffed to tackle major revisions. My hope was that we'd get an updated patch from him before the CF deadline. Even without it, maybe get some more internal help here. Given my focus on checkpoint issues, Simon on Sync Rep, and Dimitri still on his Extensions push, that second part isn't going to happen.

I am marking MERGE officially Returned With Feedback for 9.1. Lots of progress made, and a much better community understanding of the unresolved issues now than when we started, but not in shape to go into this release.
Since we still have some significant interest here in getting this finished, I'll see if I can put together a better game plan for how to get this actually done for 9.2, in time to discuss at release planning time. The main thing that's become much more obvious to me just recently is how the remaining issues left here relate to the true serialization work, so worrying about that first is probably the right order anyway.
Re: [HACKERS] [PERFORM] pgbench to the MAXINT
Euler Taveira de Oliveira wrote:

(i) If we want to support a scale factor greater than 21474 we have to convert some columns to bigint; it will change the test. From the portability point of view it is a pity, but as we have never supported it I'm not too worried about it. Why? Because it will use bigint columns only if the scale factor is greater than 21474. Is it a problem? I don't think so, because generally people compare tests with the same scale factor.

(ii) From the performance perspective, we need to test that the modifications don't impact performance. I didn't create another code path for 64-bit modifications (it is too ugly) and I'm afraid some modifications affect the 32-bit performance. I'm not in a position to test it though because I don't have a big machine ATM. Greg, could you lead these tests?

(iii) I decided to copy scanint8() (called strtoint64 there) from the backend (Robert's suggestion [1]) because Tom pointed out that strtoll() has portability issues. I replaced atoi() with strtoint64() but didn't do any performance tests.

(i): Completely agreed.

(ii): There is no such thing as a big machine that is 32 bits now; anything that's 32 bits is a tiny system here in 2011. What I can do is check for degradation on the only 32-bit system I have left here, my laptop. I'll pick a sensitive test case and take a look.

(iii): This is an important thing to test, particularly given it has the potential to impact 64-bit results too.

Thanks for picking this up again and finishing the thing off. I'll add this into my queue of performance tests to run and we can see if this is worth applying. It will probably take a little longer than the usual CF review time. But as this doesn't interfere with other code people are working on and is sort of a bug fix, I don't think it will be a problem if it takes a little longer to get this done.
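[For reference, the shape of the substitution in (iii) looks something like the following: a scanint8()-style decimal parser that avoids depending on strtoll(), whose availability was the portability concern cited. This is a simplified sketch without the real backend routine's overflow checking:]

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/* Convert a decimal string to int64 without strtoll()/atoi().
 * Returns 1 on success, storing the value in *result; returns 0 on
 * empty input or trailing junk.  Unlike the backend's scanint8(),
 * this sketch does not detect overflow. */
static int
str_to_int64(const char *s, int64_t *result)
{
	int64_t		val = 0;
	int			neg = 0;

	while (isspace((unsigned char) *s))
		s++;
	if (*s == '-')
	{
		neg = 1;
		s++;
	}
	else if (*s == '+')
		s++;

	if (!isdigit((unsigned char) *s))
		return 0;				/* no digits at all */
	while (isdigit((unsigned char) *s))
		val = val * 10 + (*s++ - '0');
	if (*s != '\0')
		return 0;				/* trailing junk */

	*result = neg ? -val : val;
	return 1;
}
```

[The point of going 64-bit is visible in the last case below: a value past INT32_MAX, which is exactly where atoi()-based scale handling breaks down.]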
Re: [HACKERS] Spread checkpoint sync
Jeff Janes wrote: Have you ever tested Robert's other idea of having a metronome process do a periodic fsync on a dummy file which is located on the same ext3fs as the table files? I think that that would be interesting to see.

To be frank, I really don't care about fixing this behavior on ext3, especially in the context of that sort of hack. That filesystem is not the future, it's not possible to ever really make it work right, and every minute spent on pandering to its limitations would be better spent elsewhere IMHO. I'm starting with the ext3 benchmarks just to provide some proper context for the worst-case behavior people can see right now, and to make sure refactoring here doesn't make things worse on it. My target is same or slightly better on ext3, much better on XFS and ext4.
Re: [HACKERS] Spread checkpoint sync
Jim Nasby wrote: Wow, that's the kind of thing that would be incredibly difficult to figure out, especially while your production system is in flames... Can we change the ereport that happens in that case from DEBUG1 to WARNING? Or provide some other means to track it?

That's why we already added pg_stat_bgwriter.buffers_backend_fsync, to track the problem before trying to improve it. It was driving me crazy on a production server not having any visibility into when it happened. I haven't seen that we need anything beyond that so far. In the context of this new patch for example, if you get to where a backend does its own sync, you'll know it did a compaction as part of that. The existing statistic would tell you enough.

There's now enough data in test set 3 at http://www.2ndquadrant.us/pgbench-results/index.htm to start to see how this breaks down on a moderately big system (well, by most people's standards, but not Jim's, for whom this is still a toy). Note the backend_sync column on the right, at the very end of the page; that's the relevant counter I'm commenting on:

scale=175: Some backend fsync with 64 clients, 2/3 runs.
scale=250: Significant backend fsync with 32 and 64 clients, every run.
scale=500: Moderate to large backend fsync at any client count >= 16. This seems to be the worst spot of those mapped. Above here, I would guess the TPS numbers start slowing enough that the fsync request queue activity drops, too.
scale=1000: Backend fsync starting at 8 clients.
scale=2000: Backend fsync starting at 16 clients. By here I think the TPS volumes are getting low enough that clients are stuck significantly more often waiting for seeks rather than fsync.

Looks like the most effective spot for me to focus testing on with this server is scales of 500 and 1000, with 16 to 64 clients. Now that I've got the scale fine-tuned better, I may crank up the client counts too and see what that does.
I'm glad these are appearing in reasonable volume here though; I was starting to get nervous about only having NDA-restricted results to work against. Some days you just have to cough up for your own hardware.

I just tagged pgbench-tools-0.6.0 and pushed to GitHub/git.postgresql.org with the changes that track and report on buffers_backend_fsync, if anyone else wants to try this out. It includes those numbers if you have a 9.1 with them, otherwise it just reports 0 for them all the time; detection of the feature wasn't hard to add. The end portion of a config file for the program (the first part specifies host/username info and the like) that would replicate the third test set here is:

MAX_WORKERS=4
SCRIPT=tpc-b.sql
SCALES="1 10 100 175 250 500 1000 2000"
SETCLIENTS="4 8 16 32 64"
SETTIMES=3
RUNTIME=600
TOTTRANS=
Re: [HACKERS] Review: compact fsync request queue on overflow
Chris Browne wrote: It was a little troublesome inducing it. I did so by cutting shared memory to minimum (128kB)... With higher shared memory, I couldn't readily induce compaction, which is probably a concurrency matter of not having enough volume of concurrent work going on.

Quite. It's taken me 12 days of machine time running pgbench to find the spots where this problem occurs on a system with a reasonably sized shared_buffers (I'm testing against 256MB). It's one of those things that's hard to reproduce with test data.

Thanks for the thorough code review. I've got a clear test plan I'm progressing through this week to beat on the performance measurement aspects of the patch.
Re: [HACKERS] test_fsync open_sync test
Bruce Momjian wrote: Is there a value to this test_fsync test?

Compare open_sync with different sizes:
(This is designed to compare the cost of one large sync'ed write and two smaller sync'ed writes.)
        open_sync 16k write     242.563 ops/sec
        2 open_sync 8k writes   752.752 ops/sec

It compares the cost of doing one larger open_sync write vs. two smaller ones. Might be some value there for determining things like the optimal WAL block size to use. All these tests are still kind of hard to use effectively; I'm not sure it's time to start trimming tests yet, until we've made more progress on interpreting results first.
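[What that comparison measures can be approximated portably as follows, using write()+fsync() per operation instead of test_fsync's O_SYNC writes. The loop counts and temp file path are arbitrary choices for illustration:]

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Time 'ops' operations, where each operation rewrites the start of the
 * file as 'chunks_per_op' writes of 'chunk' bytes followed by a single
 * fsync().  Returns elapsed wall-clock seconds, or -1.0 if setup fails.
 * A sketch of the one-large vs. two-small comparison, not test_fsync. */
static double
timed_sync_writes(const char *path, size_t chunk, int chunks_per_op, int ops)
{
	char	   *buf = calloc(1, chunk);
	int			fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	struct timeval start,
				end;

	if (fd < 0 || buf == NULL)
	{
		if (fd >= 0)
			close(fd);
		free(buf);
		return -1.0;
	}

	gettimeofday(&start, NULL);
	for (int i = 0; i < ops; i++)
	{
		lseek(fd, 0, SEEK_SET);
		for (int j = 0; j < chunks_per_op; j++)
		{
			ssize_t		w = write(fd, buf, chunk);

			(void) w;			/* short writes ignored in this sketch */
		}
		fsync(fd);				/* one sync per operation */
	}
	gettimeofday(&end, NULL);

	close(fd);
	free(buf);
	return (end.tv_sec - start.tv_sec) +
		(end.tv_usec - start.tv_usec) / 1e6;
}
```

[Comparing timed_sync_writes(path, 16384, 1, n) against timed_sync_writes(path, 8192, 2, n) reproduces the shape of the test above: same total bytes and syncs, different write sizes.]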
Re: [HACKERS] Moving test_fsync to /contrib?
Alvaro Herrera wrote: I don't understand why it would be overkill. Are you saying people would complain because you installed a 25 kB executable that they might not want to use? Just for fun I checked /usr/bin and noticed that I have a pandoc executable, weighing 17 MB, that I have never used and have no idea what it is for.

It's for converting between the various types of text-like markup, i.e. reST, LaTeX, Markdown, etc. Anyway, just because the rest of the world has no standards anymore doesn't mean we shouldn't.

The changes Bruce has made recently have gotten this tool to where its output is starting to be readable and reliable. The sort of people who want to run this will certainly be fine with installing contrib to do it, because they may want to have things like pgbench too. There's really not enough demand for this to pollute the default server install footprint with any overhead from this tool, either in bytes or increased tool name squatting. And the fact that it's still a little rough around the edges nudges it away from the standard server package too. Install it in contrib as pg_test_fsync and I think you'll reach the optimal subset of people who can be made happy here.

Not having it packaged at all before wasn't a big deal, because it was so hard to collect good data from that only developer-level people were doing it anyway. Now that it is starting to be more useful in that role for less experienced users, we need to make it easier for more people to run it, to collect feedback toward further improving its quality.
Re: [HACKERS] Spread checkpoint sync
Bruce Momjian wrote: Should we be writing until 2:30, then sleep 30 seconds and fsync at 3:00?

The idea of having a dead period doing no work at all between the write phase and the sync phase may have some merit. I don't have enough test data yet on some more fundamental issues in this area to comment on whether that smaller optimization would be valuable. It may be a worthwhile concept to throw into the sequencing.
Re: [HACKERS] We need to log aborted autovacuums
Robert Haas wrote: On Sat, Jan 15, 2011 at 11:14 AM, Tom Lane t...@sss.pgh.pa.us wrote: Greg Smith g...@2ndquadrant.com writes: Does try_relation_open need to have a lock acquisition timeout when AV is calling it? Hmm. I think when looking at the AV code, I've always subconsciously assumed that try_relation_open would fail immediately if it couldn't get the lock. That certainly seems like it would be a more appropriate way to behave than delaying indefinitely. I'm confused how that's not happening already. What does "try" mean, otherwise?

Apparently "try" means acquire the requested lock on the OID of the relation, waiting for any amount of time for that part, and then check if the relation really exists or not once it's got it. In this context, it means it will try to open the relation, but might fail if it doesn't actually exist anymore. The relation not existing when it makes the check done after the lock is acquired is the only way it will return a failure.

try_relation_open calls LockRelationOid, which blocks. There is also a ConditionalLockRelationOid, which does the same basic thing except that it exits immediately, with a false return code, if it can't acquire the lock. I think we just need to nail down in which existing cases it's acceptable to have try_relation_open use ConditionalLockRelationOid instead, which would make it do what all of us reading the code thought it did all along.
There are four callers of try_relation_open to be concerned about here:

src/backend/commands/vacuum.c, vacuum_rel:
    onerel = try_relation_open(relid, lmode);

src/backend/commands/analyze.c, analyze_rel:
    onerel = try_relation_open(relid, ShareUpdateExclusiveLock);

src/backend/commands/cluster.c, cluster_rel:
    OldHeap = try_relation_open(tableOid, AccessExclusiveLock);

src/backend/commands/lockcmds.c, LockTableRecurse ("Apply LOCK TABLE recursively over an inheritance tree"):
    rel = try_relation_open(reloid, NoLock);

I think that both the vacuum_rel and analyze_rel cases are capable of figuring out if they are the autovacuum process, and if so calling the fast non-blocking version of this. I wouldn't want to mess with the other two, which rely upon the current behavior as far as I can see.

Probably took me longer to write this e-mail than the patch will take. Since I've got trivial patch fever this weekend and already have the test case, I'll do this one next.
Re: [HACKERS] We need to log aborted autovacuums
Simon Riggs wrote: I'm fairly confused by this thread.

That's because you think it has something to do with cancellation, which it doesn't. The original report here noted a real problem but got the theorized cause wrong. It turns out the code that acquires a lock when autovacuum decides it is going to process something will wait forever for that lock to be obtained. It cannot be the case that other locks on the table are causing it to cancel, or as you say it would be visible in the logs. Instead the AV worker will just queue up and wait for its turn as long as it takes.

That does mean there are all sorts of ways that your AV workers can all get stuck waiting for a table, and there's no way to know when that's happening from the logs either; you'll only see it in ps or pg_stat_activity. Given that I think it's actually a mild denial of service attack vector, this really needs an improvement.
Re: [HACKERS] We need to log aborted autovacuums
Tom Lane wrote: No, I don't believe we should be messing with the semantics of try_relation_open. It is what it is.

With only four pretty simple callers to the thing, and two of them needing the alternate behavior, it seemed a reasonable place to modify to me. I thought the nowait boolean idea was in enough places that it was reasonable to attach to try_relation_open.

The attached patch solves the "wait for lock forever" problem, and introduces a new log message when AV or auto-analyze fail to obtain a lock on something that needs to be cleaned up:

DEBUG:  autovacuum: processing database "gsmith"
INFO:  autovacuum skipping relation 65563 --- cannot open or obtain lock
INFO:  autoanalyze skipping relation 65563 --- cannot open or obtain lock

My main concern is that this may cause AV to constantly fail to get access to a busy table, where in the current code it would queue up and eventually get the lock needed. A secondary issue is that while the autovacuum messages only show up if you have log_autovacuum_min_duration set to something other than -1, the autoanalyze ones can't be stopped.

If you don't like the way I structured the code, you can certainly do it some other way instead. I thought this approach was really simple and not unlike similar code elsewhere. Here's the test case that worked for me here again:

psql
SHOW log_autovacuum_min_duration;
DROP TABLE t;
CREATE TABLE t(s serial, i integer);
INSERT INTO t(i) SELECT generate_series(1,10);
SELECT relname, last_autovacuum, autovacuum_count FROM pg_stat_user_tables WHERE relname='t';
DELETE FROM t WHERE s < 5;
\q

psql
BEGIN;
LOCK t;

Leave that open, then go to another session with a tail -f on the logs to wait for the errors to show up.
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 25d9fde..4193fff 100644
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
*** relation_open(Oid relationId, LOCKMODE l
*** 918,936 ****
  *
  * Same as relation_open, except return NULL instead of failing
  * if the relation does not exist.
  *
  */
  Relation
! try_relation_open(Oid relationId, LOCKMODE lockmode)
  {
  	Relation	r;
!
  	Assert(lockmode >= NoLock && lockmode < MAX_LOCKMODES);

  	/* Get the lock first */
  	if (lockmode != NoLock)
! 		LockRelationOid(relationId, lockmode);
!
  	/*
  	 * Now that we have the lock, probe to see if the relation really exists
  	 * or not.
--- 918,951 ----
  *
  * Same as relation_open, except return NULL instead of failing
  * if the relation does not exist.
+ *
+ * If called with nowait enabled, this will immediately exit
+ * if a lock is requested and it can't be acquired.  The
+ * return code in this case doesn't distinguish between this
+ * situation and the one where the relation was locked, but
+ * doesn't exist.  Callers using nowait must not care that
+ * they won't be able to tell the difference.
  *
  */
  Relation
! try_relation_open(Oid relationId, LOCKMODE lockmode, bool nowait)
  {
  	Relation	r;
! 	bool		locked;

  	Assert(lockmode >= NoLock && lockmode < MAX_LOCKMODES);

  	/* Get the lock first */
  	if (lockmode != NoLock)
! 	{
! 		if (nowait)
! 		{
! 			locked = ConditionalLockRelationOid(relationId, lockmode);
! 			if (!locked)
! 				return NULL;
! 		}
! 		else
! 			LockRelationOid(relationId, lockmode);
! 	}

  	/*
  	 * Now that we have the lock, probe to see if the relation really exists
  	 * or not.
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7bc5f11..24bfb16 100644
*** a/src/backend/commands/analyze.c
--- b/src/backend/commands/analyze.c
*** analyze_rel(Oid relid, VacuumStmt *vacst
*** 147,156 ****
  	 * concurrent VACUUM, which doesn't matter much at the moment but might
  	 * matter if we ever try to accumulate stats on dead tuples.)  If the rel
  	 * has been dropped since we last saw it, we don't need to process it.
  	 */
! 	onerel = try_relation_open(relid, ShareUpdateExclusiveLock);
  	if (!onerel)
! 		return;

  	/*
  	 * Check permissions --- this should match vacuum's check!
--- 147,168 ----
  	 * concurrent VACUUM, which doesn't matter much at the moment but might
  	 * matter if we ever try to accumulate stats on dead tuples.)  If the rel
  	 * has been dropped since we last saw it, we don't need to process it.
+ 	 *
+ 	 * If this is a manual analyze, opening will wait forever to acquire
+ 	 * the requested lock on the relation.  Autovacuum will just give up
+ 	 * immediately if it can't get the lock.  This prevents a series of locked
+ 	 * relations from potentially hanging all of the AV workers waiting
+ 	 * for locks.
  	 */
! 	onerel = try_relation_open(relid
Re: [HACKERS] Spread checkpoint sync
I have finished a first run of benchmarking the current 9.1 code at various sizes. See http://www.2ndquadrant.us/pgbench-results/index.htm for many details. The interesting stuff is in Test Set 3, near the bottom. That's the first one that includes buffer_backend_fsync data. This is all on ext3 so far, but is using a newer 2.6.32 kernel, the one from Ubuntu 10.04. The results are classic Linux in 2010: latency pauses from checkpoint sync will easily leave the system at a dead halt for a minute, with the worst one observed this time standing still for 108 seconds. That one is weird, but these two are completely average cases: http://www.2ndquadrant.us/pgbench-results/210/index.html http://www.2ndquadrant.us/pgbench-results/215/index.html I think a helpful next step here would be to put Robert's fsync compaction patch into here and see if that helps. There are enough backend syncs showing up in the difficult workloads (scale=1000, clients >= 32) that its impact should be obvious. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] We need to log aborted autovacuums
Josh Berkus wrote: The lack of vacuum could be occurring for any of 4 reasons:

1) Locking
2) You have a lot of tables and not enough autovac_workers / too much sleep time
3) You need to autovac this particular table more frequently, since it gets dirtied really fast
4) The table has been set with special autovac settings which keep it from being autovac'd

We can currently distinguish between cases 2, 3, and 4 based on existing available facts. However, distinguishing case 1 from 2 or 3, in particular, isn't currently possible except by methods which require collecting a lot of ad-hoc monitoring data over a period of time. This makes the effort required for the diagnosis completely out of proportion with the magnitude of the problem. Since no one coded up something more exciting, I just did a round of self-review of the little logging patch I sent upthread to see if it was a worthy CF candidate. I agree that *something* should be done, and I'd rather have a logging-based solution than nothing at all here. So that patch is, to use a polite UK term I've adopted recently, rubbish. But unless I'm confused (it is late), I think there's a problem here that's much worse than what you described--in the case where someone has grabbed a really large lock on a table that needs cleanup, at least. I started with the following test case:

SHOW log_autovacuum_min_duration;
DROP TABLE t;
CREATE TABLE t(s serial,i integer);
INSERT INTO t(i) SELECT generate_series(1,10);
SELECT relname,last_autovacuum,autovacuum_count FROM pg_stat_user_tables WHERE relname='t';
DELETE FROM t WHERE s<5;
\q

Which, so long as the min duration is 0, does mostly what I expected. I followed immediately by starting a new session and doing:

BEGIN;
LOCK t;

before autovacuum got to the table. I left this session open, stuck there idle in a transaction holding a lock.
I threw some extra logging in the code path leading up to where we'd both like the lack of lock acquisition to be logged, and it shows the following (at DEBUG3):

DEBUG: t: vac: 4 (threshold 50), anl: 14 (threshold 50)
DEBUG: autovacuum: t: vac needed
INFO: vacuum_rel: processing 16528
INFO: vacuum_rel: trying to open relation 16528

All but the first one of those lines are not in the current code, and as noted already that one existing line is at DEBUG3. This was trying to simulate the case you were complaining about: autovacuum has been triggered, it knows there is work to be done, and it can't do said work. It hasn't triggered the error message I tried to add yet though, because it's not hitting that spot. Look where it's actually blocked at:

$ ps -eaf | grep postgres
gsmith 5490 4880 0 04:09 ? 00:00:00 postgres: autovacuum worker process gsmith waiting
$ psql -c "select procpid, now() - query_start as runtime, current_query from pg_stat_activity where current_query like 'autovacuum%'"
 procpid |     runtime     |            current_query
---------+-----------------+--------------------------------------
    5490 | 00:11:35.813727 | autovacuum: VACUUM ANALYZE public.t

I haven't just failed to vacuum the table that needs it without any entry in the logs saying as much. I've actually tied down an autovacuum worker and kept it from doing anything else until that lock is released! When vacuum_rel inside of vacuum.c does this on a relation it wants to acquire a lock on:

onerel = try_relation_open(relid, lmode);

it just hangs out there forever, if the only issue is not being able to get the lock it wants. The abort code path I tried to insert logging into only occurs if the relation was deleted. Watch what's happened while I've been typing:

 procpid |     runtime     |            current_query
---------+-----------------+--------------------------------------
    5490 | 00:36:24.426528 | autovacuum: VACUUM ANALYZE public.t

I've now hit the AV worker with a denial of service attack, essentially, for over 35 minutes.
I'd bet that if I was a malicious user, I could make AV come to a halt altogether for the whole server with this approach, even with relatively limited rights on the database. Does try_relation_open need to have a lock acquisition timeout when AV is calling it? I don't see any other way to get a log message noting when this problem passed by in there; you can only see it live in the process view. And the current behavior doesn't feel right to me. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith g...@2ndquadrant.com wrote: One of the ideas Simon and I had been considering at one point was adding some better de-duplication logic to the fsync absorb code, which I'm reminded by the pattern here might be helpful independently of other improvements. Hopefully I'm not stepping on any toes here, but I thought this was an awfully good idea and had a chance to take a look at how hard it would be today while en route from point A to point B. The answer turned out to be not very, so PFA a patch that seems to work. I tested it by attaching gdb to the background writer while running pgbench, and it eliminated the backend fsyncs without even breaking a sweat. No toe damage, this is great; I hadn't gotten to coding for this angle yet at all. Suffering from an overload of ideas and (mostly wasted) test data, so thanks for exploring this concept and proving it works. I'm not sure what to do with the rest of the work I've been doing in this area here, so I'm tempted to just combine this new bit from you with the older patch I submitted, streamline, and see if that makes sense. My "expected to be there already, then how about spending 5 minutes first checking out that autovacuum lock patch again" turned out to be a wild underestimate. Part of the problem is that it's become obvious to me over the last month that right now is a terrible time to be doing Linux benchmarks that impact filesystem sync behavior. The recent kernel changes that are showing up in the next rev of the enterprise distributions--like RHEL6 and Debian Squeeze both working to get a stable 2.6.32--have made testing a nightmare. I don't want to dump a lot of time into optimizing for 2.6.32 if this problem changes its form in newer kernels, but the distributions built around newer kernels are just not fully baked enough yet to tell. For example, the pre-release Squeeze numbers we're seeing are awful so far, but it's not really done yet either.
I expect that 3-6 months from today, all will have settled down enough that I can make some sense of it. Lately my work with the latest distributions has just been burning time installing stuff that doesn't work quite right yet. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: Idea #2: At the beginning of a checkpoint when we scan all the buffers, count the number of buffers that need to be synced for each relation. Use the same hashtable that we use for tracking pending fsync requests. Then, interleave the writes and the fsyncs... Idea #3: Stick with the idea of a fixed delay between fsyncs, but compute how many fsyncs you think you're ultimately going to need at the start of the checkpoint, and back up the target completion time by 3 s per fsync from the get-go, so that the checkpoint still finishes on schedule. What I've been working on is something halfway between these two ideas. I have a patch, and it doesn't work right yet because I just broke it, but since I have some faint hope this will all come together any minute now I'm going to share it before someone announces a deadline has passed or something. (whistling). I'm going to add this messy thing and the patch you submitted upthread to the CF list; I'll review yours, I'll either fix the remaining problem in this one myself or rewrite to one of your ideas, and then it's onto a round of benchmarking. Once upon a time we got a patch from Itagaki Takahiro whose purpose was to sort writes before sending them out: http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php This didn't really work repeatedly for everyone because of the now well understood ext3 issues--I never replicated that speedup at the time for example. And this was before the spread checkpoint code was in 8.3. The hope was that it wasn't really going to be necessary after that anyway. Back to today...instead of something complicated, it struck me that if I just had a count of exactly how many files were involved in each checkpoint, that would be helpful. I could keep the idea of a fixed delay between fsyncs, but just auto-tune that delay amount based on the count. And how do you count the number of unique things in a list? Well, you can always sort them. 
I thought that if the sorted writes patch got back to functional again, it could serve two purposes. It would group all of the writes for a file together, and if you did the syncs in the same sorted order they would have the maximum odds of discovering the data was already written. So rather than this possible order:

table block
a 1
b 1
c 1
c 2
b 2
a 2
sync a
sync b
sync c

Which has very low odds of the sync on "a" finishing quickly, we'd get this one:

table block
a 1
a 2
b 1
b 2
c 1
c 2
sync a
sync b
sync c

Which sure seems like a reasonable way to improve the odds data has been written before the associated sync comes along. Also, I could just traverse the sorted list with some simple logic to count the number of unique files, and then set the delay between fsync writes based on it. In the above, once the list was sorted, it's easy to just see how many times the table name changes on a linear scan of the sorted data: 3 files, so if the checkpoint target gives me, say, a minute of time to sync them, I can delay 20 seconds between each. Simple math, and exactly the sort I used to get reasonable behavior on the busy production system this all started on. There's some unresolved trickiness in the segment-driven checkpoint case, but one thing at a time.

So I fixed the bitrot on the old sorted patch, which was fun as it came from before the 8.3 changes. It seemed to work. I then moved the structure it uses to hold the list of buffers to write, the thing that's sorted, into shared memory. It's got a predictable maximum size, relying on palloc in the middle of the checkpoint code seems bad, and there's some potential gain from not reallocating it every time through.
Somewhere along the way, it started doing this instead of what I wanted:

BadArgument(!(((header->context) != ((void *) 0) && (((Node*)((header->context)))->type == T_AllocSetContext))), File: "mcxt.c", Line: 589)

(that's from initdb, not a good sign) And it's left me wondering whether this whole idea is a dead end I used up my window of time wandering down. There are good bits in the patch I submitted for the last CF and in the patch you wrote earlier this week. This unfinished patch may be a valuable idea to fit in there too once I fix it, or maybe it's fundamentally flawed and one of the other ideas you suggested (or I have sitting on the potential design list) will work better. There's a patch integration problem that needs to be solved here, but I think almost all the individual pieces are available. I'd hate to see this fail to get integrated now just for lack of time, considering the problem is so serious when you run into it. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index dadb49d
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: I'll believe it when I see it. How about this:

a 1
a 2
sync a
b 1
b 2
sync b
c 1
c 2
sync c

Or maybe some variant, where we become willing to fsync a file a certain number of seconds after writing the last block, or when all the writes are done, whichever comes first.

That's going to give worse performance than the current code in some cases. The goal of what's in there now is that you get a sequence like this:

a 1
b 1
a 2 [Filesystem writes a1]
b 2 [Filesystem writes b1]
sync a [Only has to write a2]
sync b [Only has to write b2]

This idea works until you get to where the filesystem write cache is so large that it becomes lazier about writing things. The fundamental idea--push writes out some time before the sync, in hopes the filesystem will get to them before that--is not unsound. On some systems, doing the sync more aggressively than that will be a regression. This approach just breaks down in some cases, and those cases are happening more now because their likelihood scales with total RAM. I don't want to screw the people with smaller systems, who may be getting considerable benefit from the existing sequence. Today's little systems--which are very similar to the high-end ones the spread checkpoint stuff was developed on during 8.3--do get some benefit from it as far as I know. Anyway, now that the ability to get logging on all this stuff went in during the last CF, it's way easier to just set up a random system to run tests in this area than it used to be. Whatever testing does happen should include, say, a 2GB laptop with a single hard drive in it. I think that's the bottom of what is reasonable to consider a target for tweaking write performance on, given the hardware 9.1 is likely to be deployed on. How does the checkpoint target give you any time to sync them? Unless you squeeze the writes together more tightly, but that seems sketchy. Obviously the checkpoint target idea needs to be shuffled around some too.
I was thinking of making the new default 0.8, and having it split the time in half for write and sync. That will make the write phase close to the speed people are seeing now, at the default of 0.5, while giving some window for spread sync too. The exact way to redistribute that around I'm not so concerned about yet. When I get to where that's the most uncertain thing left I'll benchmark the TPS vs. latency trade-off and see what happens. If the rest of the code is good enough but this just needs to be tweaked, that's a perfect thing to get beta feedback to finalize. Well you don't have to put it in shared memory on account of any of that. You can just hang it on a global variable. Hmm. Because it's so similar to other things being allocated in shared memory, I just automatically pushed it over to there. But you're right; it doesn't need to be that complicated. Nobody is touching it but the background writer. If we can find something that's a modest improvement on the status quo and we can be confident in quickly, good, but I'd rather have 9.1 go out the door on time without fully fixing this than delay the release. I'm not somebody who needs to be convinced of that. There are two near-commit-quality pieces of this out there now:

1) Keep some BGW cleaning and fsync absorption going while sync is happening, rather than starting it and ignoring everything else until it's done.
2) Compact fsync requests when the queue fills.

If that's all we can get for 9.1, it will still be a major improvement. I realize I only have a very short period of time to complete a major integration breakthrough on the pieces floating around before the goal here has to drop to something less ambitious. I head to the West Coast for a week on the 23rd; I'll be forced to throw in the towel at that point if I can't get the better ideas we have in pieces here all assembled well by then.
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] auto-sizing wal_buffers
Fujii Masao wrote: +int XLOGbuffersMin = 8; XLOGbuffersMin is a fixed value. I think that defining it as a macro rather than a variable seems better. + if (XLOGbuffers > 2048) + XLOGbuffers = 2048; Using XLOG_SEG_SIZE/XLOG_BLCKSZ rather than 2048 seems better. +#wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers Typo: s/32kB/64kB Thanks, I've fixed all these issues and attached a new full patch, pushed to github, etc. Tests give the same results back, and it's nice that it scales to reasonable behavior if someone changes their XLOG segment size. It should be possible to set the value back to the older minimum value of 32kB too. That doesn't actually seem to work though; when I try it I get:

$ psql -c "SELECT name,unit,boot_val,setting,current_setting(name) FROM pg_settings WHERE name IN ('wal_buffers','shared_buffers')"
      name      | unit | boot_val | setting | current_setting
----------------+------+----------+---------+-----------------
 shared_buffers | 8kB  | 1024     | 131072  | 1GB
 wal_buffers    | 8kB  | -1       | 8       | 64kB

Where I was expecting that setting to be 4 instead for 32kB. So there's probably some minor bug left in where I inserted this into the initialization sequence. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8e2a2c5..1c13f20 100644
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*** SET ENABLE_SEQSCAN TO OFF;
*** 1638,1649 ****
      </indexterm>
      <listitem>
       <para>
!       The amount of memory used in shared memory for WAL data.  The
!       default is 64 kilobytes (<literal>64kB</literal>).  The setting need only
!       be large enough to hold the amount of WAL data generated by one
!       typical transaction, since the data is written out to disk at
!       every transaction commit.  This parameter can only be set at server
!       start.
      </para>

      <para>
--- 1638,1659 ----
      </indexterm>
      <listitem>
      <para>
!       The amount of shared memory used for storing WAL data.
The
!       default setting of -1 adjusts this automatically based on the size
!       of <varname>shared_buffers</varname>, making it 1/32 (about 3%) of
!       the size of that normally larger shared memory block.  Automatically
!       set values are limited to a maximum sufficient to hold one WAL
!       segment worth of data, normally 16 megabytes (<literal>16MB</literal>).
!       The smallest allowable setting is 32 kilobytes (<literal>32kB</literal>).
!      </para>
!
!      <para>
!       Since the data is written out to disk at every transaction commit,
!       the setting may only need to be large enough to hold the amount
!       of WAL data generated by one typical transaction.  Larger values,
!       typically at least a few megabytes, can improve write performance
!       on a busy server where many clients are committing at once.
!       Extremely large settings are unlikely to provide additional benefit.
      </para>

      <para>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5b6a230..d057773 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 69,75 ****
  /* User-settable parameters */
  int		CheckPointSegments = 3;
  int		wal_keep_segments = 0;
! int		XLOGbuffers = 8;
  int		XLogArchiveTimeout = 0;
  bool		XLogArchiveMode = false;
  char	   *XLogArchiveCommand = NULL;
--- 69,76 ----
  /* User-settable parameters */
  int		CheckPointSegments = 3;
  int		wal_keep_segments = 0;
! int		XLOGbuffers = -1;
! int		XLOGbuffersMin = 8;
  int		XLogArchiveTimeout = 0;
  bool		XLogArchiveMode = false;
  char	   *XLogArchiveCommand = NULL;
*** GetSystemIdentifier(void)
*** 4780,4790 ****
--- 4781,4819 ----
  /*
   * Initialization of shared memory for XLOG
   */
+
+ void XLOGTuneNumBuffers(void)
+ {
+ 	int		one_segment = XLOG_SEG_SIZE / XLOG_BLCKSZ;
+
+ 	/*
+ 	 * If automatic setting was requested, use about 3% as much memory as
+ 	 * requested for the buffer cache.  Clamp the automatic maximum to the
+ 	 * size of one XLOG segment, while still allowing a larger manual
+ 	 * setting.
Don't go below the default setting in earlier versions:
+ 	 * twice the size of the minimum, which at default build options is 64kB.
+ 	 */
+ 	if (XLOGbuffers == -1)		/* set automatically */
+ 	{
+ 		XLOGbuffers = NBuffers / 32;
+ 		if (XLOGbuffers > one_segment)
+ 			XLOGbuffers = one_segment;
+ 		if (XLOGbuffers < (XLOGbuffersMin * 2))
+ 			XLOGbuffers = XLOGbuffersMin * 2;
+ 	}
+
+ 	/* Enforce a 32KB minimum in every case */
+ 	if (XLOGbuffers < XLOGbuffersMin)
+ 		XLOGbuffers
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: That seems like a bad idea - don't we routinely recommend that people crank this up to 0.9? You'd be effectively bounding the upper range of this setting to a value less than the lowest value we recommend anyone use today. I was just giving an example of how I might do an initial split. There's a checkpoint happening now at time T; we have a rough idea that it needs to be finished before some upcoming time T+D. Currently with default parameters this becomes:

Write: 0.5 * D; Sync: 0

Even though Sync obviously doesn't take zero. The slop here is enough that it usually works anyway. I was suggesting that a quick reshuffling to:

Write: 0.4 * D; Sync: 0.4 * D

might be a good first candidate for how to split the time up better. The fact that this gives less writing time than the current biggest spread possible:

Write: 0.9 * D; Sync: 0

is true. It's also true that in the case where sync time really is zero, this new default would spread writes less than the current default. I think that's optimistic, but it could happen if checkpoints are small and you have a good write cache. Step back from that for a second though. Ultimately, the person who is getting checkpoints at a 5-minute interval, and is being nailed by spikes, should have the option of just increasing the parameters to make that interval bigger. First you increase the measly default segments to a reasonable range, then checkpoint_completion_target is the second one you can try. But from there, you quickly move onto making checkpoint_timeout longer. At some point, there is no option but to give up on checkpoints every 5 minutes as being practical, and make the average interval longer. Whether or not a refactoring here makes things slightly worse for cases closer to the default doesn't bother me too much. What bothers me is the way trying to stretch checkpoints out further fails to deliver as well as it should.
I'd be OK with saying to get the exact same spread situation as in older versions, you may need to retarget for checkpoints every 6 minutes *if* in the process I get a much better sync latency distribution in most cases. Here's an interesting data point from the customer site this all started at, one I don't think they'll mind sharing since it helps make the situation more clear to the community. After applying this code to spread sync out, in order to get their server back to functional we had to move all the parameters involved up to where checkpoints were spaced 35 minutes apart. It just wasn't possible to write any faster than that without disrupting foreground activity. The whole current model where people think of this stuff in terms of segments and completion targets is a UI disaster. The direction I want to go in is where users can say make sure checkpoints happen every N minutes, and something reasonable happens without additional parameter fiddling. And if the resulting checkpoint I/O spike is too big, they just increase the timeout to N+1 or N*2 to spread the checkpoint further. Getting too bogged down thinking in terms of the current, really terrible interface is something I'm trying to break myself of. Long-term, I want there to be checkpoint_timeout, and all the other parameters are gone, replaced by an internal implementation of the best practices proven to work even on busy systems. I don't have as much clarity on exactly what that best practice is the way that, say, I just suggested exactly how to eliminate wal_buffers as an important thing to manually set. But that's the dream UI: you shoot for a checkpoint interval, and something reasonable happens; if that's too intense, you just increase the interval to spread further. There probably will be small performance regression possible vs. the current code with parameter combination that happen to work well on it. 
Preserving every one of those is something that's not as important to me as making the tuning interface simple and clear. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: What is the basis for thinking that the sync should get the same amount of time as the writes? That seems pretty arbitrary. Right now, you're allowing 3 seconds per fsync, which could be a lot more or a lot less than 40% of the total checkpoint time... Just that it's where I ended up when fighting with this for a month on the system I've seen the most problems at. The 3 second number was reverse-engineered from a computation that said aim for an interval of X minutes; we have Y relations on average involved in the checkpoint. The direction my latest patch is struggling to go is computing a reasonable time automatically in the same way--count the relations, do a time estimate, add enough delay so the sync calls should be spread linearly over the given time range. the checkpoint activity is always going to be spikey if it does anything at all, so spacing it out *more* isn't obviously useful. One of the components to the write queue is some notion that writes that have been waiting longest should eventually be flushed out. Linux has this number called dirty_expire_centisecs which suggests it enforces just that, set to a default of 30 seconds. This is why some 5-minute interval checkpoints with default parameters, effectively spreading the checkpoint over 2.5 minutes, can work under the current design. Anything you wrote at T+0 to T+2:00 *should* have been written out already when you reach T+2:30 and sync. Unfortunately, when the system gets busy, there is this congestion control logic that basically throws out any guarantee of writes starting shortly after the expiration time. It turns out that the only things that really work are the tunables that block new writes from happening once the queue is full, but they can't be set low enough to work well in earlier kernels when combined with lots of RAM.
Using the terminology of http://www.mjmwired.net/kernel/Documentation/sysctl/vm.txt at some point you hit a threshold where a process generating disk writes will itself start writeback. This is analogous to the PostgreSQL situation where backends do their own fsync calls. The kernel will eventually move to where those trying to write new data are instead recruited into being additional sources of write flushing. That's the part you just can't make aggressive enough on older kernels; dirty writers can always win. Ideally, the system never digs itself into a hole larger than you can afford to wait to write out. It's a transaction speed vs. latency thing though, and the older kernels just don't consider the latency side well enough. There is a new mechanism in the latest kernels to control this much better: dirty_bytes and dirty_background_bytes are the tunables. I haven't had a chance to test yet. As mentioned upthread, some of the bleeding edge kernels that have this feature available are showing such large general performance regressions in our tests, compared to the boring old RHEL5 kernel, that whether this feature works or not is irrelevant. I haven't tracked down which new kernel distributions work well performance-wise and which don't yet for PostgreSQL. I'm hoping that when I get there, I'll see results like http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-pages , where the ideal setting for dirty_bytes to keep latency under control with BBWC was 15MB. To put that into perspective, the lowest useful setting you can set dirty_ratio to is 5% of RAM. That's 410MB on my measly 8GB desktop, and 3.3GB on the 64GB production server I've been trying to tune.
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] auto-sizing wal_buffers
Tom Lane wrote: I think we need to keep the override capability until the autotune algorithm has proven itself in the field for a couple of years. I agree with Josh that a negative value should be used to select the autotune method. Agreed on both fronts. Attached patch does the magic. Also available in branch walbuffers from git://github.com/greg2ndQuadrant/postgres.git By changing only shared_buffers I get the following quite reasonable automatic behavior:

$ psql -c "SELECT name,unit,boot_val,setting,current_setting(name) FROM pg_settings WHERE name IN ('wal_buffers','shared_buffers')"

      name      | unit | boot_val | setting | current_setting
----------------+------+----------+---------+-----------------
 shared_buffers | 8kB  |     1024 |    3072 | 24MB
 wal_buffers    | 8kB  |       -1 |      96 | 768kB
 shared_buffers | 8kB  |     1024 |    4096 | 32MB
 wal_buffers    | 8kB  |       -1 |     128 | 1MB
 shared_buffers | 8kB  |     1024 |   16384 | 128MB
 wal_buffers    | 8kB  |       -1 |     512 | 4MB
 shared_buffers | 8kB  |     1024 |  131072 | 1GB
 wal_buffers    | 8kB  |       -1 |    2048 | 16MB
 shared_buffers | 8kB  |     1024 |  262144 | 2GB
 wal_buffers    | 8kB  |       -1 |    2048 | 16MB

If you've set it to the auto-tuning behavior, you don't see that setting of -1 in the SHOW output; you see the value it's actually been set to. The only way to know it was set automatically is to look at boot_val as I've shown here. I consider this what admins would prefer, as the easy way to expose the value that was used. I would understand if people considered it a little odd though. Since you can't change it without a postgresql.conf edit and a server start anyway, and it's tersely documented in the sample postgresql.conf what -1 does, I don't see this being a problem for anyone in the field. To try and clear up some of the confusion around how the earlier documentation suggests larger values of this aren't needed, I added the following updated description of how this has been observed to work for admins in practice: ! Since the data is written out to disk at every transaction commit, ! 
the setting may only need to be large enough to hold the amount ! of WAL data generated by one typical transaction. Larger values, ! typically at least a few megabytes, can improve write performance ! on a busy server where many clients are committing at once. ! Extremely large settings are unlikely to provide additional benefit. And to make this as easy as possible to apply if I got this right, here's some proposed commit text: Automatically set wal_buffers to be proportional to the size of shared_buffers. Make it 1/32 as large when the auto-tuned behavior, which is the default and set with a value of -1, is used. The previous default of 64kB is still enforced as a minimum value. The maximum automatic value is limited to 16MB. (Note that this is not exactly what I put in my own commit message if you grab from my repo; that had a typo) -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 8e2a2c5..c3f5632 100644 *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** SET ENABLE_SEQSCAN TO OFF; *** 1638,1649 </indexterm> <listitem> <para> ! The amount of memory used in shared memory for WAL data. The ! default is 64 kilobytes (<literal>64kB</literal>). The setting need only ! be large enough to hold the amount of WAL data generated by one ! typical transaction, since the data is written out to disk at ! every transaction commit. This parameter can only be set at server ! start. </para> <para> --- 1638,1659 </indexterm> <listitem> <para> ! The amount of shared memory used for storing WAL data. The ! default setting of -1 adjusts this automatically based on the size ! of <varname>shared_buffers</varname>, making it 1/32 (about 3%) of ! the size of that normally larger shared memory block. Automatically ! set values are limited to a maximum of 16 megabytes ! 
(<literal>16MB</literal>), sufficient to hold one WAL segment worth of data. ! The smallest allowable setting is 64 kilobytes (<literal>64kB</literal>). ! </para> ! ! <para> ! Since the data is written out to disk at every transaction commit, ! the setting may only need to be large enough to hold the amount ! of WAL data generated by one typical transaction. Larger values, ! typically at least a few megabytes, can improve write performance ! on a busy server where many clients are committing at once. ! Extremely large settings
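The auto-tuning rule in the proposed commit text--1/32 of shared_buffers, clamped between 64kB and 16MB--can be sketched as a model of the behavior shown in the pg_settings output above (not the actual C code):

```python
def auto_wal_buffers(shared_buffers_8k_pages):
    """Model of the -1 (auto) wal_buffers setting: 1/32 of
    shared_buffers, clamped to [64kB, 16MB], in 8kB pages."""
    min_pages = 64 * 1024 // 8192          # 64kB minimum  -> 8 pages
    max_pages = 16 * 1024 * 1024 // 8192   # 16MB maximum  -> 2048 pages
    return max(min_pages, min(shared_buffers_8k_pages // 32, max_pages))

# Reproduces the table: 24MB -> 768kB, and 1GB/2GB both cap at 16MB.
print(auto_wal_buffers(3072))    # 96 pages   = 768kB
print(auto_wal_buffers(131072))  # 2048 pages = 16MB
print(auto_wal_buffers(262144))  # 2048 pages = 16MB
```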
Re: [HACKERS] auto-sizing wal_buffers
Kevin Grittner wrote: I guess a manual override doesn't bother me too much, but I am a bit dubious of its value, and there is value in keeping the GUC count down... It's a risk-reward thing really. The reward for removing it is that a few lines of code and a small section of the documentation go away. It's not very big. The risk seems low, but it's not zero. Let's say this goes in, we get to 9.2 or later, and a survey suggests that no one has ever needed to set wal_buffers when deploying 9.1. At that point I think everyone would feel much better about nuking it altogether. I just looked at the code again when developing the patch, and there's really not enough benefit to removing it to be worth taking any risk right now. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] We need to log aborted autovacuums
Josh Berkus wrote: It occurs to me that another way of diagnosis would simply be a way to cause the autovac daemon to spit out output we could camp on, *without* requiring the huge volumes of output also required for DEBUG3. This brings us back to the logging idea again. Right. I don't know how much you looked at the little patch I threw out there yet, but I think it's a reasonable 90% solution to the problem of "how do I distinguish between a locking block and other reasons for AV not running?" without generating a huge amount of log data. Since there's no real overhead or code complexity added by it, it's a relatively easy thing to throw in the codebase without annoying anyone. I really don't think that argument applies to either patch--last_autovac_attempt *or* the last_stats_reset time--since neither event is expected to occur frequently. If you have last_autovac_attempt (for example) being updated frequently, you clearly had a db problem bigger than the size of the stats table. It just returns to the usual "why make everyone pay overhead for something that only happens to a small percentage of users?" argument.[1] I personally have all sorts of additional data I'd like to instrument in myriad obtrusive spots of code. All I'm saying is that this one in particular I don't quite feel strongly enough about to make it the spot where I try to establish that foothold. If a stats tracking patch for this appeared that seemed reasonably well written (i.e. it tries to keep the added value NULL whenever it's not relevant), I'd be in favor of it going in. I'm just not so excited about the idea that I'm going to write it myself. I'm lukewarm on a per-table last_stats_reset time just because there are so few places I'd actually use it in practice, relative to the guaranteed overhead it adds. [1] Silly aside: I was thinking today that I should draw a chart of all the common objections to code that show up here, looking like those maps you see when walking into a mall. 
Then we can give a copy to new submitters with a big "you are here" X marking where they have inadvertently wandered. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] We need to log aborted autovacuums
Robert Treat wrote: This is a great use case for user level tracing support. Add a probe around these bits, and you can capture the information when you need it. Sure. I would also like a pony. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] We need to log aborted autovacuums
Josh Berkus wrote: Or should it perhaps be a per-table counter in pg_stat_user_tables, given your statement above? Or even a timestamp: last_autovacuum_attempt, which would record the last time autovacuum was tried. If that's fairly recent and you have a large number of dead rows, you know what kind of problem you have and can turn on debug. These are both reasonable ideas. But there was just some kickback on Tomas's "keeping a timestamp of the last stats reset" patch recently, from the perspective of trying to limit per-table stats bloat. I think it's relatively easy to make a case that this situation is difficult enough to diagnose that a little bit of extra low-level logging is worthwhile. That Josh and I have both been bitten by it enough to be thinking about patches to make it easier to diagnose suggests it's obviously too hard to nail down. But is this so common and difficult to recognize that it's worth making all the table stats bigger? That's a harder call. It's already possible to detect the main symptom--dead row percentage is much higher than the autovacuum threshold, but there's been no recent autovacuum. That makes me less enthusiastic that there's a genuine enough need to justify the overhead of storing more table stats just to detect the same thing a little more easily. I've been playing with the Munin PG plug-in more recently, and I was just thinking of adding a dead row trend graph/threshold to it to address this general area instead. We could argue both sides of the trade-off of tracking this directly in stats for some time, and I'd never expect there to be a clear victory for either perspective. I've run into this vacuum problem a few times, but certainly less often than I've run into "why is the stats table so huge?" -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
Heikki Linnakangas wrote: You can of course LOCK TABLE as a work-around, if that's what you want. What I was trying to suggest upthread is that while there are other possible ways around this problem, the only one that has any hope of shipping with 9.1 is to do just that. So from my perspective, the rest of the discussion about the right way to proceed is moot for now. For some reason it didn't hit me until you said this that I could do the locking manually in my test case, without even touching the server-side code yet. Attached are a new pair of scripts where each pgbench UPDATE statement executes an explicit LOCK TABLE. Here's the result of a sample run here:

$ pgbench -f update-merge.sql -T 60 -c 16 -j 4 -s 2 pgbench
starting vacuum...end.
transaction type: Custom query
scaling factor: 2
query mode: simple
number of clients: 16
number of threads: 4
duration: 60 s
number of transactions actually processed: 84375
tps = 1405.953672 (including connections establishing)
tps = 1406.137456 (excluding connections establishing)

$ psql -c 'select count(*) as updated FROM pgbench_accounts WHERE NOT abalance=0' -d pgbench
 updated
---------
   68897
(1 row)

$ psql -c 'select count(*) as inserted FROM pgbench_accounts WHERE aid > 10' -d pgbench
 inserted
----------
    34497
(1 row)

No assertion crashes, no duplicate key failures. All the weird stuff I was running into is gone, so that's decent evidence the worst of the problems were all because the heavy lock I was expecting just wasn't integrated into the patch. Congratulations to Boxuan: for the first time this is starting to act like a viable feature addition to me, just one with a moderately long list of limitations and performance issues. 1400 TPS worth of UPSERT on my modest 8-core desktop (single drive with cheating fsync) isn't uselessly slow. If I add SET TRANSACTION ISOLATION LEVEL SERIALIZABLE; just after the BEGIN;, I don't see any serialization errors, and performance is exactly the same. 
Run a straight UPDATE over only the existing range of keys, and I get 7000 TPS instead. So the locking etc. is reducing performance to 20% of its normal rate, on this assertion+debug build. I can run this tomorrow (err, later today I guess, looking at the time) on a proper system with BBWC and without assertions to see if the magnitude of the difference changes, but I don't think that's the main issue here. Presuming the code quality issues and other little quirks I've documented (and new ones yet to be discovered) can get resolved here--and that's a sizeable open question--I could see shipping this with the automatic heavy LOCK TABLE in there. Then simple UPSERT could work out of the box via a straightforward MERGE. We'd need a big warning disclaiming that concurrent performance is very limited in this first release of the feature, but I don't know that this is at an unacceptable level of slow for smaller web apps and such. Until proper fine-grained concurrency is implemented, I think it would be PR suicide to release a version of this without a full table lock happening automatically though. The idea Robert advocated well--that it would be possible for advanced users to use even this rough feature in a smarter way to avoid conflicts and not suffer the full performance penalty--is true. But if you consider the main purpose here to be making it easier to get smaller MySQL apps and the like ported to PostgreSQL (which is what I see as goal #1), putting that burden on the user is just going to reinforce the old "PostgreSQL is so much harder than MySQL" stereotype. I'd much prefer to see everyone have a slow but simple to use UPSERT via MERGE available initially, rather than to worry about optimizing for the advanced user in a way that makes life harder for the newbies. The sort of people who must have optimal performance already have trigger functions available to them, which they can write and tweak for best performance. 
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

test-merge.sh Description: Bourne shell script

\set nbranches :scale
\set ntellers 10 * :scale
\set naccounts 10 * :scale
\setrandom aid 1 :naccounts
\setrandom bid 1 :nbranches
\setrandom tid 1 :ntellers
\setrandom delta -5000 5000
BEGIN;
-- Optional mode change
-- SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
LOCK TABLE pgbench_accounts;
MERGE INTO pgbench_accounts t
  USING (SELECT :aid,1+(:aid / 100)::integer,:delta,'') AS s(aid,bid,balance,filler)
  ON s.aid=t.aid
  WHEN MATCHED THEN UPDATE SET abalance=abalance + s.balance
  WHEN NOT MATCHED THEN INSERT VALUES(s.aid,s.bid,s.balance,s.filler);
COMMIT;
-- This syntax worked with MERGE v203 patch, but isn't compatible with v204
--MERGE INTO pgbench_accounts t USING (VALUES (:aid,1+(:aid / 100)::integer,:delta,'')) AS s
Re: [HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
Robert Haas wrote: And even if it isn't, the MERGE syntax is insane if what you really want to do is insert or update ONE record. If all we have is MERGE, people will keep doing it with a PL/pgsql stored procedure or some crummy application logic just so that they don't have to spend several days trying to understand the syntax. Heck, I understand the syntax (or I think I do) and I still think it's more trouble than it's worth. I hoped that the manual would have a clear example of "this is how you do UPSERT with MERGE", preferably cross-linked to the existing Example 39-2 ("Exceptions with UPDATE/INSERT") trigger implementation that's been the reference implementation for this for a long time, so people can see both alternatives. New users will cut and paste that example into their code, and in the beginning neither know nor care how MERGE actually works, so long as the example does what it claims. I would wager the majority of PL/pgsql implementations of this requirement start the exact same way. I don't think the learning curve there is really smaller; it's just that you've already been through it. I've been purposefully ignoring the larger applications of MERGE in the interest of keeping focus on a manageable subset. But the more general feature set is in fact enormously useful for some types of data warehouse applications. Build REPLACE, and you've built REPLACE. Build a MERGE that is REPLACE now and eventually full high-performance MERGE, and you've done something with a much brighter future. I don't think the concurrency hurdles here are unique to this feature either, as shown by the regular overlap noted with the other serialization work. 
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)
t...@fuzzy.cz wrote: - I really am not sure about the changes made in pg_proc.h. I'm not sure how to assign OIDs to the new functions (I've simply chosen values that were not used in this file), and I'm not sure about the other columns (I've copied and modified another function with the same parameter/return types) The description of the columns is at the beginning of pg_proc.h, the part that begins with CATALOG(pg_proc,1255)... The descriptions of some of the first 11 fields are mostly straightforward. The first fun part is that how many times the information expected in the second VARIABLE LENGTH FIELDS section repeats varies based on the parameters listed. The other thing that's usually confusing is that the types for the values are all expressed as type OID numbers. For example, if you see 25, that's the OID of the text type. You can see the whole list with: select oid,typname from pg_type; And if you go back to the file with that list in hand, a lot more of it should make sense. If you see a multiple parameter type list like "23 21", that's a function whose input values are of types (int4,int2). As for getting a new OID, if you go into src/include/catalog/ and run the unused_oids script, it will give you some help in figuring out which have been used and which haven't. It's not worth getting too stressed about the number you choose in the patch submission, because commits between when you got your free number and when your patch is considered for commit can make your choice worthless anyway. There's a process referred to as a "catversion bump", where the database catalog version gets updated to reflect things like new pg_proc information, that committers take care of as one of the last adjustments before final commit. Doing a final correction to the OID choice is a part every committer knows to look at. 
I wrote a talk that covered some additional trivia in this area, as well as other things people tend to get confused about in the source code, that you can find at http://www.pgcon.org/2010/schedule/attachments/142_HackingWithUDFs.pdf ; that might be helpful for some other things you might wonder about eventually. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)
Tomas Vondra wrote: OK, so here goes the simplified patch - it tracks one reset timestamp for a background writer and for each database. Adding timestamps like this was something I wanted to do after adding pg_stat_reset_shared to 9.0, so since you've beaten me to it I'll review your patch and make sure it all works the way I was hoping instead. The whole per-table idea was never going to fly given how few people touch this area at all, but the way you're tracking things now seems reasonable. When you post an updated version of a patch that's already being tracked on the CommitFest app, please try to remember to add that update to the tracker there. I just did that for this 12/23 update so that's covered already. Next problem is that the preferred method for submitted patches uses context diffs. See http://wiki.postgresql.org/wiki/Working_with_Git for some information about the somewhat annoying way you have to set up git to generate those. Don't worry about that for this round though. I don't care about the diff formatting given the code involved, but it's something you should sort out if you do another update. PS: I've noticed Magnus posted a patch to track recovery conflicts, adding a new view pg_stat_database_conflicts. I have not checked it yet but it should not influence this patch. I need to do some testing of that anyway, so I'll take a look at any potential clash as part of my review. I want to check how this interacts with track_functions resets too. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] pg_dump --split patch
Joel Jacobson wrote: To understand a change to my database functions, I would start by looking at the top-level, only focusing on the names of the functions modified/added/removed. At this stage, you want as little information as possible about each change, such as only the names of the functions. To do this--get a list of changed functions--you cannot compare two full schema plain text dumps using diff, as it would only reveal the lines changed, not the names of the functions, unless you are lucky enough to get the name of the function within the (by default) 3 lines of copied context. While you could increase the number of copied lines of context to a value which would ensure you would see the name of the function in the diff, that is not feasible if you want to quickly get a picture of the code areas modified, since you would then need to read through even more lines of diff output. I can agree with some use cases you've outlined, where there's merit to the general idea of your patch. But as an aside, you really should launch an investigation into some better diff tools if this is how you're doing this type of work. Last week I reviewed 3K lines worth of changes from two versions of a 12K line schema dump I'd never seen before in a couple of hours using kdiff3. I'd have killed myself before finishing if I had to do the same job with traditional diff as you're describing it here. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] Sync Rep Design
Simon Riggs wrote: Based upon that series of conversations, I've reworked the design so that there is (currently) only a single standby offering sync rep at any one time. Other standbys can request to be sync standbys but they only become the sync standby if the first one fails. Which was simple to do and bridges the challenges of an exactly identified sync standby and the fragility of too closely specifying the config. That seems like a good enough starting point to cover a lot of cases. Presuming the "two servers each at two sites" config that shows up in a lot of these discussions, people in the "I need sync to a remote spot" camp can get that, and if that site is unavailable for long enough to be kicked out they'll smoothly degrade to sync on a secondary local copy. And those who want high performance local sync and best-effort for the remote site can configure for that too; if the local secondary dies they'll just degrade to the slow remote commits. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
Kevin Grittner wrote: Greg Smith wrote: I could see shipping this with the automatic heavy LOCK TABLE in there. How would you handle or document behavior in REPEATABLE READ isolation? The lock doesn't do much good unless you acquire it before you get your snapshot, right? Hand-wave and hope you offer a suggested implementation? I haven't gotten to thinking about this part just yet--am still assimilating toward a next move after the pleasant surprise that this is actually working to some degree now. You're right that the high-level idea of "just lock the table" actually has to be mapped into exact snapshot mechanics and pitfalls before moving in that direction will get very far. I'm probably not the right person to answer just how feasible that is this week though. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
David Fetter wrote: How about implementing an UPSERT command as "take the lock, do the merge"? That way, we'd have both the simplicity for the simpler cases and a way to relax consistency guarantees for those who would like to do so. The main argument against is that that path leads to a permanent non-standard wart to support forever, just to work around what should be a short-term problem. And I'm not sure that reducing the goals to only this actually improves the ability to ship something in the near term much, either. Many of the hard problems people are bothered by don't go away; it just makes deciding which side of the speed/complexity trade-off you're more interested in more obvious. What I've been advocating is making that decision go away altogether by only worrying about the simple-to-use and slow path for now, but that's a highly debatable viewpoint, and I appreciate the resistance to it--if it's possible to do at all. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] We need to log aborted autovacuums
Josh Berkus wrote: I've been trying to diagnose in a production database why certain tables never get autovacuumed despite having a substantial % of updates. The obvious reason is locks blocking autovacuum from vacuuming the table ... Missed this discussion when it popped up, but have plenty to say about it now. What I do here is look for such anomalies using pg_stat_user_tables: check whether the dead rows number has exceeded the vacuum threshold. That comparison is logged in the code at DEBUG3:

elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), anl: %.0f (threshold %.0f)",
     NameStr(classForm->relname),
     vactuples, vacthresh, anltuples, anlthresh);

But a rough computation isn't too hard to derive in a report, so long as you haven't customized per-table parameters. As you suggested in your message, the really bad cases here stick out a whole lot. If you pay the slightest amount of attention to the dead row percentages they jump right out at you. This all works easily on any version back to 8.3. Not having as much relevant data stored in pg_stat_user_tables makes the problem cases less obvious to spot in older versions. If I start seeing these badly maintained tables and suspect locking is getting in the way, I then dump traces from pg_locks+pg_stat_activity often enough that I can estimate how often someone has an interfering lock and what they're doing. Should the log level on this message go up from DEBUG3? I could see rewriting it so that it logs at DEBUG1 instead when Log_autovacuum_min_duration is set *and* when the trigger threshold is crossed, and at DEBUG3 the rest of the time. Given you can derive this with a bit of work in userland, I don't see this even being justified as an INFO or LOG level message. Anytime I can script a SQL-level monitor for something that's easy to tie into Nagios or something, I greatly prefer that to log file scraping for it anyway. 
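The "rough computation" being referred to is the standard autovacuum trigger formula: base threshold plus scale factor times the table's row count, compared against dead tuples. A sketch of it, assuming the stock default GUC values (per-table storage parameters would override them):

```python
def autovacuum_vacuum_threshold(reltuples,
                                base_threshold=50,     # autovacuum_vacuum_threshold default
                                scale_factor=0.2):     # autovacuum_vacuum_scale_factor default
    """Dead-tuple count above which autovacuum should trigger a
    vacuum, mirroring the vacthresh value logged at DEBUG3."""
    return base_threshold + scale_factor * reltuples

def needs_vacuum(dead_tuples, reltuples):
    """True when a table looks overdue: dead rows past the threshold."""
    return dead_tuples > autovacuum_vacuum_threshold(reltuples)

# A 1M-row table with default settings needs ~200,050 dead rows:
print(autovacuum_vacuum_threshold(1_000_000))  # 200050.0
print(needs_vacuum(250_000, 1_000_000))        # True
```

Tables where needs_vacuum stays True while last_autovacuum in pg_stat_user_tables never advances are the anomaly cases being described.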
What I'd like to do is add some logging code to autovacuum.c so that if log_autovacuum is any value other than -1, failure to vacuum due to locks gets logged. Does this make sense? The general idea is interesting, and probably more productive for the situation you theorize is happening than messing with the logging discussed above. But that's not where the code needs to go--the lock isn't taken until much further down the function call stack. The attached quickie patch (only tested for compilation) probably does what you want here. Since this would eliminate the messy follow-up step I sometimes have gone through--dumping pg_locks data to confirm or rule out locking issues messing with AV processing--I can see some potential that it would have simplified situations I've run into in the past. And it's not out of line with the logging verbosity of similar failure mode tests that follow it. Right now failure to acquire a lock is just not considered a log-worthy issue, and I agree that it's worth considering whether it should be. If you could gather more info on whether this logging catches the problem cases you're seeing, that would really be the right test for the patch's usefulness. I'd give you solid 50/50 odds that you've correctly diagnosed the issue, and knowing for sure would make advocating for this logging a pretty easy sell to me at least. 
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 2f68df4..2124e25 100644
*** a/src/backend/commands/vacuum.c
--- b/src/backend/commands/vacuum.c
*** vacuum_rel(Oid relid, VacuumStmt *vacstm
*** 851,856 ****
--- 851,864 ----
  	{
  		PopActiveSnapshot();
  		CommitTransactionCommand();
+
+ 		if (IsAutoVacuumWorkerProcess() && Log_autovacuum_min_duration >= 0)
+ 		{
+ 			ereport(INFO,
+ 					(errmsg("skipping \"%s\" --- cannot open or obtain lock",
+ 							RelationGetRelationName(onerel))));
+ 		}
+
  		return;
  	}
Re: [HACKERS] management of large patches
Robert Haas wrote: - MERGE - checkpoint improvements As far as these two go, the state of MERGE is still rougher than I would like. The code itself isn't too hard to read, and the fact that the errors popping up tend to be caught by assertions (rather than just being mysterious crashes) makes me feel a little better that there's some defensive coding in there. It's still a 3648-line patch that touches grammar, planner, and executor bits though, and I've been doing mainly functional and coding style review so far. I'm afraid there are not many committers in a good position to actually consume the whole scope of this thing for a commit-level review. And the way larger patches tend to work here, I'd be surprised to find it passes through such a review without some as yet unidentified major beef appearing. Will see what we can do to help move this forward more before the CF start. The checkpoint changes I'm reworking are not really large from a code complexity or size perspective--I estimate around 350 lines of diff, with the rough version I submitted to CF2010-11 at 258. I suspect it will actually be the least complicated patch to consume from that list, from a committer perspective. The complexity there is mainly in the performance testing. I've been gearing up infrastructure the last couple of weeks to automate and easily publish all the results I collect there. The main part that hasn't gone through any serious testing yet, auto-tuning the spread interval, will also be really easy to revert if a problem is found there. With Simon and I both reviewing each other's work on this already, I hope we can keep this one from clogging the committer critical path you're worried about here. 
Re: [HACKERS] Recovery conflict monitoring
Couple of doc suggestions:

--- doc/src/sgml/high-availability.sgml
+     The number of query cancels and the reason for them can be viewed using
+     the <structname>pg_stat_database_conflicts</> system view on the slave
+     server.

For completeness' sake, this should also mention the per-database summary, even though I'm not sure how valuable that view is. Also, this should say "on a standby server" instead of "on the slave server". "slave" is mentioned once as a synonym in high-availability.sgml, but that's it, and there can be more than one standby you want to pull these stats from.

*** doc/src/sgml/monitoring.sgml
!       number of rows returned, fetched, inserted, updated and deleted, and
!       total number of queries cancelled due to conflict with recovery.

This would be clearer if it said you're talking about standby recovery here, and possibly that this info is only available on the standby. I could see someone reading this and thinking it's possible for general database crash recovery to produce cancelled queries, instead of the way connections are actually blocked until that's done.

!      <entry><structname>pg_stat_database_conflicts</></entry>
!      <entry>One row per database, showing database OID, database name and
!      the number of queries that have been cancelled in this database due to
!      dropped tablespaces, lock timeouts, old snapshots, pinned buffers and
!      deadlocks.

A clarification that you're talking about standby query cancellation here might be helpful too. I don't think that's necessary for all of the detailed pg_stat_get_* functions that regular users are less likely to care about, just these higher level ones.
Re: [HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
Marko Tiikkaja wrote: I'm confused. Are you saying that the patch is supposed to lock the table against concurrent INSERT/UPDATE/DELETE/MERGE? Because I don't see it in the patch, and the symptoms you're having are a clear indication of the fact that it's not happening. I also seem to recall that people thought locking the table would be excessive. That's exactly what it should be doing. I thought I'd seen just that in one of the versions of this patch, but maybe that's a mistaken memory on my part. In advance of the planned but not yet available ability to lock individual index key values, locking the whole table is the only implementation I'm aware of that can work correctly here. In earlier versions, I think this code was running into issues before it even got to there. If you're right that things like the duplicate key error in the current version are caused exclusively by not locking enough, adding that lock may be the next necessary step here.
Re: [HACKERS] Fwd: new patch of MERGE (merge_204) a question about duplicated ctid
Erik Rijkers wrote: I get some whitespace warnings, followed by an error:

$ git apply /home/rijkers/download/pgpatches/0091/merge/20101206/merge_204_2010DEC06.patch
/home/rijkers/download/pgpatches/0091/merge/20101206/merge_204_2010DEC06.patch:481: trailing whitespace.
/home/rijkers/download/pgpatches/0091/merge/20101206/merge_204_2010DEC06.patch:482: trailing whitespace.
        if (IsA(plan, ModifyTable)
/home/rijkers/download/pgpatches/0091/merge/20101206/merge_204_2010DEC06.patch:550: trailing whitespace.
        /*print the action qual*/
/home/rijkers/download/pgpatches/0091/merge/20101206/merge_204_2010DEC06.patch:556: trailing whitespace.
        (act_plan->operation == CMD_INSERT ||
/home/rijkers/download/pgpatches/0091/merge/20101206/merge_204_2010DEC06.patch:560: trailing whitespace.
error: patch failed: src/backend/optimizer/plan/planner.c:739
error: src/backend/optimizer/plan/planner.c: patch does not apply

Maybe I'm doing something wrong, but I've never had good luck with git apply. I took this patch and applied it to the 12/15 copy of HEAD I had checked out (trying to minimize drift in there since the patch was created) using:

patch -p 1 < merge_204_2010DEC06.patch

There was one trivial conflict it produced src/backend/optimizer/plan/planner.c.rej for, and that fix was straightforward to apply by hand. The result is now sitting as the merge204 branch in my github repo: https://github.com/greg2ndQuadrant/postgres/tree/merge204 if you did want to try this out.
[HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
I did some basic testing of the latest update here, but quickly hit a problem that wasn't in the previous version. Attached is the standalone test script that used to work, but now fails like this:

psql:simple.sql:12: ERROR: the vars in merge action tlist of qual should only belongs to the source table or target table

This test case is intended to implement the common UPSERT situation that is one of the main requests that MERGE is intended to satisfy, using this syntax:

MERGE INTO Stock t
 USING (VALUES(10,100)) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;

If you can suggest an alternate way to express this that works with the new patch, I might switch to that and retry. I was never 100% sure this was the right way to write this, and I don't have another database with MERGE support here to try against. (Aside: if someone else does, I'd be really curious to see if the attached test case works or not on another database system. I think we need to include compatibility testing with other MERGE implementations into the test mix here soon.) Regardless, this failure suggests that you need to add this sort of test to the regression test set. We need to have an example of an UPSERT using constant data in there to make sure this continues to work in the future. This is a good week for me in terms of having time for PostgreSQL hacking, so if you can suggest something here or update the patch I'll try it soon afterwards. 
DROP TABLE Stock;
CREATE TABLE Stock(item_id int UNIQUE, balance int);
INSERT INTO Stock VALUES (10, 2200);
INSERT INTO Stock VALUES (20, 1900);
SELECT * FROM Stock ORDER BY item_id;

MERGE INTO Stock t
 USING (VALUES(10,100)) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;
SELECT * FROM Stock ORDER BY item_id;

MERGE INTO Stock t
 USING (VALUES(30,2000)) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;
SELECT * FROM Stock ORDER BY item_id;
Re: [HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
Marko Tiikkaja wrote: As far as I can tell, this should work. I played around with the patch and the problem seems to be the VALUES:

MERGE INTO Stock t
 USING (SELECT 30, 2000) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;
MERGE 1

Good catch...while I think the VALUES syntax should work, that is a useful workaround so I could keep testing. I rewrote like this (original syntax commented out):

MERGE INTO Stock t
-- USING (VALUES(10,100)) AS s(item_id,balance)
 USING (SELECT 10,100) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;

And that got me back again to concurrent testing. Moving onto the next two problems...the basic MERGE feature seems to have stepped backwards a bit too. I'm now seeing these quite often:

ERROR: duplicate key value violates unique constraint "pgbench_accounts_pkey"
DETAIL: Key (aid)=(176641) already exists.
STATEMENT: MERGE INTO pgbench_accounts t USING (SELECT 176641,1+(176641 / 100)::integer,168,'') AS s(aid,bid,balance,filler) ON s.aid=t.aid WHEN MATCHED THEN UPDATE SET abalance=abalance + s.balance WHEN NOT MATCHED THEN INSERT VALUES(s.aid,s.bid,s.balance,s.filler);

On my concurrent pgbench test, which had been working before. 
Possibly causing that, the following assertion is tripping:

TRAP: FailedAssertion(!(epqstate->origslot != ((void *) 0)), File: "execMain.c", Line: 1762)

That's coming from the following code:

void
EvalPlanQualFetchRowMarks(EPQState *epqstate)
{
	ListCell   *l;

	Assert(epqstate->origslot != NULL);

	foreach(l, epqstate->rowMarks)

Stepping back to summarize...here's a list of issues I know about with the current v204 code:

1) VALUES syntax doesn't work anymore
2) Assertion failure in EvalPlanQualFetchRowMarks
3) Duplicate key bug (possibly a direct result of #2)
4) Attempts to use MERGE in a function spit back ERROR: table is not a known fuction
5) The ctid junk attr handling needs to be reviewed more carefully, based on author request.

I've attached the current revisions of all my testing code in hopes that Boxuan might try and replicate these (the attached script makes it simple to replicate #1 through #3), and therefore confirm whether changes made do better.

DROP TABLE Stock;
CREATE TABLE Stock(item_id int UNIQUE, balance int);
INSERT INTO Stock VALUES (10, 2200);
INSERT INTO Stock VALUES (20, 1900);
SELECT * FROM Stock ORDER BY item_id;

MERGE INTO Stock t
-- USING (VALUES(10,100)) AS s(item_id,balance)
 USING (SELECT 10,100) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;
SELECT * FROM Stock ORDER BY item_id;

MERGE INTO Stock t
-- USING (VALUES(30,2000)) AS s(item_id,balance)
 USING (SELECT 30,2000) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;
SELECT * FROM Stock ORDER BY item_id;

\set nbranches :scale
\set ntellers 10 * :scale
\set naccounts 10 * :scale
\setrandom aid 1 :naccounts
\setrandom bid 1 :nbranches
\setrandom tid 1 :ntellers
\setrandom delta -5000 5000
MERGE INTO pgbench_accounts t USING (SELECT :aid,1+(:aid / 100)::integer,:delta,'') AS s(aid,bid,balance,filler) ON s.aid=t.aid WHEN MATCHED THEN UPDATE SET abalance=abalance + s.balance WHEN NOT MATCHED THEN INSERT VALUES(s.aid,s.bid,s.balance,s.filler);
-- This syntax worked with MERGE v203 patch, but isn't compatible with v204
--MERGE INTO pgbench_accounts t USING (VALUES (:aid,1+(:aid / 100)::integer,:delta,'')) AS s(aid,bid,balance,filler) ON s.aid=t.aid WHEN MATCHED THEN UPDATE SET abalance=abalance + s.balance WHEN NOT MATCHED THEN INSERT VALUES(s.aid,s.bid,s.balance,s.filler);

[Attachment: test-merge.sh -- Bourne shell script]
Re: [HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
Marko Tiikkaja wrote: I have no idea why it worked in the past, but the patch was never designed to work for UPSERT. This has been discussed in the past and some people thought that that's not a huge deal. It takes an excessively large lock when doing UPSERT, which means its performance under a heavy concurrent load can't be good. The idea is that if the syntax and general implementation issues can get sorted out, fixing the locking can be a separate performance improvement to be implemented later. Using MERGE for UPSERT is the #1 use case for this feature by a gigantic margin. If that doesn't do what's expected, the whole implementation doesn't provide the community anything really worth talking about. That's why I keep hammering on this particular area in all my testing. One of the reflexive "I can't switch to PostgreSQL easily" stopping points for MySQL users is "I can't convert my ON DUPLICATE KEY UPDATE code". Every other use for MERGE is a helpful side-effect of adding the implementation in my mind, but not the primary driver of why this is important. My hints in this direction before didn't get adopted, so I'm saying it outright now: this patch must have an UPSERT implementation in its regression tests. And the first thing I'm going to do every time a new rev comes in is try and break it with the pgbench test I attached. If Boxuan can start doing that as part of his own testing, I think development here might start moving forward faster. I don't care so much about the rate at which concurrent UPSERT-style MERGE happens, so long as it doesn't crash. But that's where this has been stuck for a while now.
Re: [HACKERS] Unnecessary limit on max_standby_streaming_delay
Magnus Hagander wrote: The limit on max_standby_streaming_delay is currently 35 minutes (around) - or you have to set it to unlimited. This is because the GUC is limited to MAX_INT/1000, unit milliseconds. Is there a reason for the /1000, or is it just an oversight thinking the unit was in seconds? If we can get rid of the /1000, it would make the limit over three weeks, which seems much more reasonable. Or if we made it a 64-bit counter it would go away completely.

The code for this uses TimestampTzPlusMilliseconds to determine its deadline:

return TimestampTzPlusMilliseconds(rtime, max_standby_streaming_delay);

Which is implemented like this:

#define TimestampTzPlusMilliseconds(tz,ms) ((tz) + ((ms) * (int64) 1000))

My guess is that the range was limited at some point to avoid concerns about integer overflow in that multiplication, which I don't think actually is a risk due to the int64 cast there. I confirmed the upper limit on this thing is the roughly 36 minutes that Magnus suggests. This is a terrible limitation to have slipped through. For the pg_dump from the standby use case, the appropriate setting for most people is in the multiple hours range, but not unlimited (lest an out of control backup linger to overlap with what happens the next day). Whichever approach is taken to make this better, I think it deserves a backport too.
Re: [HACKERS] directory archive format for pg_dump
Moving onto the directory archive part of this patch, the feature seems to work as advertised; here's a quick test case:

createdb pgbench
pgbench -i -s 1 pgbench
pg_dump -F d -f test
pg_restore -k test
pg_restore -l test
createdb copy
pg_restore -d copy test

The copy made that way looked good. There's a good chunk of code in the patch that revolves around BLOB support. We need to get someone who is more familiar with those than me to suggest some tests for that part before this gets committed. If you could suggest how to test that code, that would be helpful. There are a number of small things that I'd like to see improved in a new rev of this code:

pg_dump: the help message for --file needs to mention that this is overloaded to also specify the output directory too.

pg_dump: the documentation for --file should say the directory is created, and must not exist when you start. The code catches this well, but that expectation is not clear until you try it.

pg_restore: the help message "check the directory archive" would be clearer as "check an archive in directory format".

There are some tab vs. space whitespace inconsistencies in the documentation added.

The comments at the beginning of functions could be more consistent. Early parts of the code have a header for each function that's extensive--maybe even a bit more than needed. I'm not sure why it's important to document here which of these functions is optional/mandatory, for example, and getting rid of just those would trim a decent number of lines out of the patch. But then at the end, all of the new functions added aren't documented at all. Some of those are near trivial, but it would be better to have at least a small descriptive header for them.

The comment header at the beginning of pg_backup_directory is a bit weird. I guess Philip Warner should still be credited as the author of the code this was based on, but it's weird seeing a new file attributed solely to him. 
Also, there's an XXX in the identification field there that should be filled in with the file name. There's your feedback for this round. I hope we'll see an updated patch from you as part of the next CommitFest.
Re: [HACKERS] Per-column collation
Robert Haas wrote: I don't really have a position on whether or not this patch is ready to commit... but I do think that this is the sort of patch that is very likely to have some bugs almost no matter when we commit it. I just updated the CF app to track Peter's latest update, which remains untested by anyone else as to whether it fixes all the issues brought up. It would be nice to get a re-review to confirm things are still working in advance of CF 2011-01. Given that the reviewer here is also a committer, it's possible for this to go into the tree even outside of the regular CF schedule, if everything checks out. In the interest of closing out this CF, I'm updating this one as returned for now though. That doesn't prevent it from going in anyway once it's confirmed ready, and I agree the sooner the better to help find breakage. But I don't think that's so important that it should block the critical path for the next alpha.
Re: [HACKERS] Per-column collation
Itagaki Takahiro wrote: We might need previous reviewers and active reviewers in commit-fest app. Or, should non-active reviewers delete their names? This is only really an issue with patches that get moved from one CF to the next, which doesn't happen that often. Patches that are marked Returned With Feedback instead get a new entry in the next CF, which avoids this problem.
Re: [HACKERS] Instrument checkpoint sync calls
Robert Haas wrote: On Wed, Dec 15, 2010 at 9:22 AM, Greg Smith g...@2ndquadrant.com wrote: Doesn't seem worth going through the trouble of committing that minor rework on its own, I'll slip it into the next useful thing that touches this area I do. Thanks for the hint, this would work better than what I did. Well, if I'm the one committing it, I'll pull that part out again and commit it separately. Not sure if that affects your calculus, but I much prefer patches that don't try to do ancillary things along the way. I meant that I'd bundle it into the block of time I spend on that, and likely submit it with something else that touches the same area. Obviously the correction patch would be better on its own when being handed over to a committer.
Re: [HACKERS] [PATCH] V3: Idle in transaction cancellation
Andres Freund wrote: On Thursday 02 December 2010 22:21:37 Alvaro Herrera wrote: Excerpts from Andres Freund's message of sáb oct 30 05:49:21 -0300 2010: I'll set this up for the next commitfest, I don't think I can do much more without further input. Are you reserving about 20 bits for levels, and 12 for flags? Given the relatively scarce growth of levels, I think we could live with about 6 or 7 bits for level, rest for flags. The number I picked was absolutely arbitrary I admit. Neither did I think it likely we would see more levels, nor did I foresee many flags, so I just chose some number I liked in that certain moment ;-)

This bit of detail seems to have died down without being resolved; bumping the thread to add a reminder about that. I count four issues of various sizes left with this patch right now:

1) This levels bit
2) Can the approach used be simplified or the code made cleaner?
3) What is the interaction with Hot Standby error handling?
4) The usual code formatting nitpicking; Kevin mentioned braces being an issue

Robert is already thinking about (2); I'll keep an eye out for someone who can test (3) now that it's been identified as a concern; and the other two are pretty small details once those are crossed. I don't see this as being ready to commit just yet though, so I don't see this going anywhere other than returned for now; will mark it as such. Hopefully this will gather enough additional review to continue moving forward now that the main issues are identified.
Re: [HACKERS] Tab completion for view triggers in psql
David Fetter wrote: That we're in the position of having prevN_wd for N = 1..5 as the current code exists is a sign that we need to refactor the whole thing, as you've suggested before. I'll work up a design and prototype for this by this weekend. Great. I don't think issues around tab completion are enough to block the next alpha though, and it sounds like the next stage of this needs to gel a bit more before it will be ready to commit anyway. I'm going to mark the remaining bits here as returned for now, and trust that you'll continue chugging away on this so we can get it into the next CF early.
Re: [HACKERS] Extensions, patch v17
Dimitri tells me there's a V18 of this patch due real soon now. That may very well be ready for a committer, but even if that's the case it's going to take them some time to consume what was at last count an almost 10K line long context diff. In the interest of closing this CF out without further schedule overrun, I'm going to mark this as returned for now. Extensions are such a major improvement for PostgreSQL that I certainly hope review and eventual commit for 9.1 continues during or before the next CF. Both the multiple dependency patches that have been committed in the last month and the steady flow of updates to this one are good signs of progress.
Re: [HACKERS] CommitFest wrap-up
Tom Lane wrote: - Writeable CTEs - I think we need Tom to pick this one up. - Fix snapshot taking inconsistencies - Ready for committer. Can any committer pick this up? Will take a look at these two also. I marked you down as the listed committer for them both. That leaves serializable lock consistency as the only patch left in Ready for Committer state; everything else seemed to me to still have some kinks left to work out, and I marked them returned. Given that patch has been in that state since 9/24, and Florian even took care of the bit rot yesterday, it's certainly been queued patiently enough waiting for attention. Obviously you, Robert, and other committers can work on one of the patches I bounced instead if you think they're close enough to slip in. But this serializable lock one is the only submission that I think hasn't gotten a fair share of attention yet.
Re: [HACKERS] Re: Proposed Windows-specific change: Enable crash dumps (like core files)
I've updated this entry in the CommitFest app to note that Craig had some implementation questions attached to his patch submission that I haven't seen anyone address yet, and to include a reference to Tom's latest question--which may make those questions moot, not sure. This pretty clearly needs to sit on the stove a little longer before it's done regardless. I'm marking this one Returned With Feedback, and hopefully Craig will continue hammering on this to clean up the remaining details and resubmit in the next month.
Re: [HACKERS] Instrument checkpoint sync calls
Alvaro Herrera wrote: I gave this patch a look and it seems pretty good to me, except that I'm uncomfortable with the idea of mdsync filling in the details for CheckpointStats fields directly. Would it work to pass a struct (say SmgrSyncStats) from CheckPointBuffers to smgrsync and from there to mdsync, have this function fill it, and return it back so that CheckPointBuffers copies the data from this struct into CheckpointStats? That was originally how I planned to write this bit of code. When I realized that the CheckpointStats structure was already visible there and stuffed with details that ultimately go into the same output line at the end, it just didn't seem worth the extra code complexity. The abstraction layer around md.c was not exactly airtight before I poked that extra little hole in there, and I was operating on the principle that a smaller diff is usually the better patch. If you feel strongly that the result led to a bad abstraction violation, I'll submit a patch to refactor it to pass a structure instead before the next CF. I appreciate your concern, I'm just not sure it's worth spending time on. What I'd really like to do is refactor out major parts of the leaky md/smgr layers altogether instead, but that's obviously a bigger project. Another minor nitpick: inside the block when you call FileSync, why check for log_checkpoints at all? Seems to me that just checking sync_start for zero should be enough. Alternatively, it seems simpler to just have a local var with the value of log_checkpoints at the start of mdsync and use that throughout the function. (Surely if someone turns off log_checkpoints in the middle of a checkpoint, it's not really a problem that we collect and report stats during that checkpoint.) And now you're just getting picky! This is a useful observation though, and I'll try to include that fix along with the next general checkpoint overhaul patch I submit. 
Doesn't seem worth going through the trouble of committing that minor rework on its own, I'll slip it into the next useful thing that touches this area I do. Thanks for the hint, this would work better than what I did.
Re: [HACKERS] Instrument checkpoint sync calls
Robert Haas wrote: I took a look at this and it looks generally good, but I'm wondering why md.c is converting the results from an exact value to a floating point, only to have xlog.c turn around and convert back to an integer. I think it could just return milliseconds directly, or if you're worried about a checkpoint that takes more than 24 days to complete, seconds and microseconds. Attached patch now does something like this, except without the roll-over concern. INSTR_TIME_GET_MICROSEC returns a uint64 value. I just made that the storage format for all these values until they're converted for display. Test output: LOG: checkpoint starting: xlog DEBUG: checkpoint sync: number=1 file=base/16385/16480 time=10422.859 msec DEBUG: checkpoint sync: number=2 file=base/16385/16475_vm time=2896.614 msec DEBUG: checkpoint sync: number=3 file=base/16385/16475.1 time=57.836 msec DEBUG: checkpoint sync: number=4 file=base/16385/16466 time=20.080 msec DEBUG: checkpoint sync: number=5 file=base/16385/16463 time=74.926 msec DEBUG: checkpoint sync: number=6 file=base/16385/16482 time=74.263 msec DEBUG: checkpoint sync: number=7 file=base/16385/16475_fsm time=7.062 msec DEBUG: checkpoint sync: number=8 file=base/16385/16475 time=35.164 msec LOG: checkpoint complete: wrote 2143 buffers (52.3%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=1.213 s, sync=13.589 s, total=24.744 s; sync files=8, longest=10.422 s, average=1.698 s This shows the hard truncation used, so 10422.859 msec becomes 10.422 s. I don't think allowing up to 0.999ms of error there is a problem given the expected scale. But since none of the precision is lost until the end, that could be changed with only touching the final display formatting conversion of the value. Following your general idea further, why throw away any resolution inside of md.c; let xlog.c decide how to show it. Note that I also fixed the DEBUG level lines to only show their actual precision. 
Before that was printing 6 digits to the right of the decimal point each time, the last three of which were always 0. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 1ed9687..c9778df 100644 *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *** LogCheckpointEnd(bool restartpoint) *** 6955,6964 { long write_secs, sync_secs, ! total_secs; int write_usecs, sync_usecs, ! total_usecs; CheckpointStats.ckpt_end_t = GetCurrentTimestamp(); --- 6955,6969 { long write_secs, sync_secs, ! total_secs, ! longest_secs, ! average_secs; int write_usecs, sync_usecs, ! total_usecs, ! longest_usecs, ! average_usecs; ! uint64 average_sync_time; CheckpointStats.ckpt_end_t = GetCurrentTimestamp(); *** LogCheckpointEnd(bool restartpoint) *** 6974,6991 CheckpointStats.ckpt_sync_end_t, sync_secs, sync_usecs); if (restartpoint) elog(LOG, restartpoint complete: wrote %d buffers (%.1f%%); ! write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s, CheckpointStats.ckpt_bufs_written, (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers, write_secs, write_usecs / 1000, sync_secs, sync_usecs / 1000, ! total_secs, total_usecs / 1000); else elog(LOG, checkpoint complete: wrote %d buffers (%.1f%%); %d transaction log file(s) added, %d removed, %d recycled; ! write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s, CheckpointStats.ckpt_bufs_written, (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers, CheckpointStats.ckpt_segs_added, --- 6979,7017 CheckpointStats.ckpt_sync_end_t, sync_secs, sync_usecs); + /* + * Timing values returned from CheckpointStats are in microseconds. + * Convert to the second plus microsecond form that TimestampDifference + * returns for homogeneous printing. 
+ */ + longest_secs = (long) (CheckpointStats.ckpt_longest_sync / 1000000); + longest_usecs = CheckpointStats.ckpt_longest_sync - + (uint64) longest_secs * 1000000; + + average_sync_time = 0; + if (CheckpointStats.ckpt_sync_rels > 0) + average_sync_time = CheckpointStats.ckpt_agg_sync_time / + CheckpointStats.ckpt_sync_rels; + average_secs = (long) (average_sync_time / 1000000); + average_usecs = average_sync_time - (uint64) average_secs * 1000000; + if (restartpoint) elog(LOG, restartpoint complete: wrote %d buffers (%.1f%%); ! write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; ! sync files=%d, longest=%ld.%03d s, average=%ld.%03d s, CheckpointStats.ckpt_bufs_written, (double
Re: [HACKERS] Instrument checkpoint sync calls
Robert Haas wrote: I'm wondering why md.c is converting the results from an exact value to a floating point, only to have xlog.c turn around and convert back to an integer. I think it could just return milliseconds directly, or if you're worried about a checkpoint that takes more than 24 days to complete, seconds and microseconds. md.c is printing the value as a float, so I converted early to a double and then percolated it upward from there. More an artifact of how the code grew from its initial form than an intentional decision. I see your point that making elapsed, total_elapsed, ckpt_agg_sync_time, and ckpt_longest_sync all the same type of integer that INSTR_TIME_GET_MICROSEC returns would reduce the potential for rounding aberrations. I could do another rev of the patch tomorrow with that change if you'd prefer it that way. I don't have a strong opinion about that implementation detail; given that xlog.c is printing a less fine-grained time anyway (seconds with 3 digits vs. msec with 3 digits) it seems unlikely to run into a real problem here. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CommitFest wrap-up
Robert Haas wrote: - instrument checkpoint sync calls - I plan to commit this in the next 48 hours. (Hopefully Greg Smith will do the cleanup I suggested, if not I'll do it.) Yes, doing that tonight so you can have a simple (hopefully) bit of work to commit tomorrow. - MERGE command - Returned with Feedback? Not sure where we stand with this. That's waiting for me to do another round of review. I'm getting to that soon I hope, maybe tomorrow. - SQL/MED - file_fdw - Discussion is ongoing, but I see no harm in marking this Returned with Feedback for now in anticipation of a new version before CF 2011-01. - SQL/MED - postgresql_fdw - Hasn't received as much review, I think, so should probably be moved to next CF as-is. I was thinking of just moving both of those into the next CF without adding any clear resolution code--then they can get worked on as feasible after the core goes in. All in all it's disappointing to have so many major patches outstanding at this point in the CommitFest, but I think we're just going to have to make the best of it. I've done a pretty lame job of pushing things forward here, but I don't think things have progressed that badly. The community produced several large and/or multi-part patches that the CF could have choked on, and instead they've been broken into digestible chunks and we've kept chewing through them. I'm just glad that's happening in this CF, rather than a pile like this showing up for the last one. Thanks for the wrap-up summary, I was going to do something like that myself tonight but you beat me to it. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final(?) proposal for wal_sync_method changes
Since any Windows refactoring has been postponed for now (I'll get back to performance checks on that platform later), during my testing time this week instead I did a round of pre-release review of the change Tom has now committed. All looks good to me, including the docs changes. I confirmed that: -Ubuntu system with an older kernel still has the same wal_sync_method (fdatasync) and performance after pulling the update -RHEL6 system changes as planned from using open_datasync to fdatasync once I updated to a HEAD after the commit On the RHEL6 system, I also checked the commit rate using pgbench with the attached INSERT only script, rather than relying on test_fsync. This is a 7200 RPM drive, so theoretical max of 120 commits/second, on ext4; this is the same test setup I described in more detail back in http://archives.postgresql.org/message-id/4ce2ebf8.4040...@2ndquadrant.com $ psql -c "show wal_sync_method" wal_sync_method - fdatasync (1 row) $ pgbench -i -s 10 pgbench [gsm...@meddle ~]$ pgbench -s 10 -f insert.sql -c 1 -T 60 pgbench starting vacuum...end. transaction type: Custom query scaling factor: 10 query mode: simple number of clients: 1 number of threads: 1 duration: 60 s number of transactions actually processed: 6733 tps = 112.208795 (including connections establishing) tps = 112.216904 (excluding connections establishing) And then manually switched over to test performance of the troublesome old default: [gsm...@meddle ~]$ psql -c "show wal_sync_method" wal_sync_method - open_datasync [gsm...@meddle ~]$ pgbench -s 10 -f insert.sql -c 1 -T 60 pgbench starting vacuum...end. 
transaction type: Custom query scaling factor: 10 query mode: simple number of clients: 1 number of threads: 1 duration: 60 s number of transactions actually processed: 6672 tps = 111.185802 (including connections establishing) tps = 111.195089 (excluding connections establishing) This is interesting, because test_fsync consistently reported a rate of about half this when using open_datasync instead of the equal performance I'm getting from the database. I'll see if I can reproduce that further, but it's no reason to be concerned about the change that's been made I think. Just more evidence that test_fsync has quirks left to be sorted out. But that's not backbranch material, it should be part of 9.1-only refactoring, already in progress via the patch Josh submitted. There's a bit of time left to get that done. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books \set nbranches :scale \set ntellers 10 * :scale \set naccounts 100000 * :scale \setrandom aid 1 :naccounts \setrandom bid 1 :nbranches \setrandom tid 1 :ntellers \setrandom delta -5000 5000 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Instrument checkpoint sync calls
Jeff Janes wrote: In my test cases, the syncs that the backends were doing were almost always to the same file that the checkpoint writer was already choking on (so they are entangled simply by virtue of that). So very quickly all the backends hit the same wall and thunked to a halt. This is probably a feature of trying to use pgbench as the basis to get a very artificial model. Yes--pgbench has some problems like you describe, ones that are a bit different than the way I've seen fsync writes get in each other's way in the production systems I've looked at. That's good if you really want to provoke this behavior, which is one reason why I've used it as an example for my patches so far (the other being that it's already available in everyone's installation). But it's tough to get it to act more like real-world systems, which don't have quite so many localized updates, without cranking the scale way up. And that then tends to aggravate other problems too. The 8.3 checkpoint spreading work also got some useful results using the dbt-2 benchmark. I'm at the point where I think I need to return to that test program for what I'm doing now. I'd encourage you to try that out too if you get a chance. Thanks for the feedback and the review. I hope you appreciate now why I suggested you wait for the stuff I was submitting before getting back into the sorted checkpoint topic again. That should be a lot easier to make sense of with this instrumentation in place. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final(?) proposal for wal_sync_method changes
Josh Berkus wrote: Did you rerun test_sync with O_DIRECT entabled, using my patch? The figures you had from test_fsync earlier were without O_DIRECT. No--I was just focused on testing the changes that had already been committed. The use of O_DIRECT in the server but not test_fsync could very well be the reason for the difference; don't know yet. I'm trying to get through this CF before I start getting distracted by newer patches, I'll get to yours soon I hope. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Optimize commit_siblings in two ways to improve group commit.
Tom Lane wrote: http://archives.postgresql.org/pgsql-performance/2010-12/msg00073.php Possibly it should have been posted to -hackers instead, but surely you read -performance? Trying to figure out what exactly commit_delay and commit_siblings did under the hood was actually the motivation behind my first foray into reading the PostgreSQL source code. Ever since, I've been annoyed that the behavior didn't really help the way it's intended, but was not sure what would be better. The additional input from Jignesh this week on the performance list suddenly made it crystal clear what would preserve the good behavior he had seen, even improving things for his case, while also helping a little the majority who won't benefit from the commit_delay behavior at all. I immediately wrote the patch and breathed a sigh of relief that it was finally going to get better. I then posted the patch and added it to the January CF. Unbeknownst to me until today, Simon had the same multi-year "this itches and I can't make it stop" feeling toward these parameters, and that's how it jumped the standard process. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Optimize commit_siblings in two ways to improve group commit.
Tom Lane wrote: I'm not entirely convinced that zero commit_siblings is a better default than small positive values, but it's certainly plausible. Not being allowed to set it to zero was certainly a limitation worth abolishing though; that has been the case before now, for those who didn't see the thread on the performance list. I think that on the sort of high throughput system likely to benefit from this behavior, whether commit_siblings is zero or five doesn't matter very much--those people should cross the siblings threshold very quickly regardless. The main arguments in favor of making the default lower aren't as exciting now that it jumps out of the loop early once finding the requisite number. I like keeping the default at 5 though. It keeps the person who experiments with increasing commit_delay from suffering when there are in reality not a lot of active connections. There are essentially two foot-guns you have to aim before you run into the worst case here, which is making every single commit wait for the delay when there's really only one active process committing. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Spread checkpoint sync
Alvaro Herrera wrote: Why would multiple bgwriter processes worry you? Of course, it wouldn't work to have multiple processes trying to execute a checkpoint simultaneously, but what if we separated the tasks so that one process is in charge of checkpoints, and another one is in charge of the LRU scan? I was commenting more in the context of development resource allocation. Moving toward that design would be helpful, but it alone isn't enough to improve the checkpoint sync issues. My concern is that putting work into that area will be a distraction from making progress on those. If individual syncs take so long that the background writer gets lost for a while executing them, and therefore doesn't do LRU cleanup, you've got a problem that LRU-related improvements probably aren't enough to solve. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final(?) proposal for wal_sync_method changes
Tom Lane wrote: So actually, there is no difference between selecting fsync and fsync_writethrough on Windows, this comment and the SGML documentation to the contrary. Both settings result in invoking _commit() and presumably are safe. One wonders why we bothered to invent a separate fsync_writethrough setting on Windows. Quite; I documented some of the details about mapping to _commit() long ago at http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm but forgot to suggest fixing the mistakes in the docs afterwards (Windows is not exactly my favorite platform). http://archives.postgresql.org/pgsql-hackers/2005-08/msg00227.php explains some of the history I think you're looking for here. Would someone verify via pgbench or similar test (*not* test_fsync) that on Windows, wal_sync_method = fsync or fsync_writethrough perform the same (ie tps ~= disk rotation rate) while open_datasync is too fast to be real? I'm losing confidence that I've found all the spaghetti ends here, and I don't have a Windows setup to try it myself. I can look into this tomorrow. The laptop I posted Ubuntu/RHEL6 test_fsync numbers from before also boots into Vista, so I can compare all those platforms on the same hardware. I just need to be aware of the slightly different sequential speeds on each partition of the drive. As far as your major battle plan goes, I think what we should do is find the simplest possible patch that just fixes the newer Linux kernel problem, preferably without changing any other platform, then commit that to HEAD and appropriate backports. Then the larger O_DIRECT remapping can proceed forward after that, along with cleanup to the writethrough choices and unifying test_fsync against the server. I wrote a patch that shuffled around a lot of this code last night, but the first thing I coded was junk because of some mistaken assumptions. I have been coming to the same realizations you have about how messy this really is. 
-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Tom Lane wrote: The various testing that's been reported so far is all for Linux and thus doesn't directly address the question of whether other kernels will have similar performance properties. Survey of some popular platforms: Linux: don't want O_DIRECT by default for reliability reasons, and there's no clear performance win in the default config with small wal_buffers Solaris: O_DIRECT doesn't work, there's another API that support has never been added for; see http://blogs.sun.com/jkshah/entry/postgresql_wal_sync_method_and Windows: Small reported gains for O_DIRECT, i.e. 10% at http://archives.postgresql.org/pgsql-hackers/2007-03/msg01615.php FreeBSD: It probably works there, but I've never seen good performance tests of it on this platform. Mac OS X: Like Solaris, there's a similar mechanism but it's not O_DIRECT; see http://stackoverflow.com/questions/2299402/how-does-one-do-raw-io-on-mac-os-x-ie-equivalent-to-linuxs-o-direct-flag for notes about the F_NOCACHE feature used. Same basic situation as Solaris; there's an API, but PostgreSQL doesn't use it yet. So my guess is that some small percentage of Windows users might notice a change here, and some testing on FreeBSD would be useful too. That's about it for platforms that I think anybody needs to worry about. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] WIP patch for parallel pg_dump
Joachim Wieland wrote: Regarding snapshot cloning and dump consistency, I brought this up already several months ago and asked if the feature is considered useful even without snapshot cloning. In addition, Joachim submitted a synchronized snapshot patch that looks to me like it slipped through the cracks without being fully explored. Since it's split in the official archives the easiest way to read the thread is at http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg143866.html Or you can use these two: http://archives.postgresql.org/pgsql-hackers/2010-01/msg00916.php http://archives.postgresql.org/pgsql-hackers/2010-02/msg00363.php That never made it into a CommitFest proper that I can see, it just picked up review mainly from Markus. The way I read that thread, there were two objections: 1) This mechanism isn't general enough for all use-cases outside of pg_dump, which doesn't make it wrong when the question is how to get parallel pg_dump running 2) Running as superuser is excessive. Running as the database owner was suggested as likely to be good enough for pg_dump purposes. Ultimately I think that stalled because without a client that needed it the code wasn't so interesting yet. But now there is one; should that get revived again? It seems like all of the pieces needed to build what's really desired here are available, it's just the always non-trivial task of integrating them together the right way that's needed. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] libpq changes for synchronous replication
The one time this year top-posting seems appropriate...this patch seems stalled waiting for some sort of response to the concerns Alvaro raised here. Alvaro Herrera wrote: Excerpts from Fujii Masao's message of jue nov 25 10:47:12 -0300 2010: The attached patch does s/CopyXLog/CopyBoth/g and adds the description about CopyBoth into the COPY section. I gave this a look. It seems good, but I'm not sure about this bit: + case 'W': /* Start Copy Both */ + /* +* We don't need to use getCopyStart here since CopyBothResponse +* specifies neither the copy format nor the number of columns in +* the Copy data. They should be always zero. +*/ + conn->result = PQmakeEmptyPGresult(conn, PGRES_COPY_OUT); + if (!conn->result) + return; + conn->asyncStatus = PGASYNC_COPY_BOTH; + conn->copy_already_done = 0; + break; I guess this was OK when this was conceived as CopyXlog, but since it's now a generic mechanism, this seems a bit unwise. Should this be reconsidered so that it's possible to change the format or number of columns? (The paragraph added to the docs is also a bit too specific about this being used exclusively in streaming replication, ISTM) While modifying the code, it occurred to me that we might have to add a new ExecStatusType like PGRES_COPY_BOTH and use that for CopyBoth mode, for the sake of consistency. But since it would be just an alias of PGRES_COPY_OUT for now, i.e., there is no specific behavior for that ExecStatusType, I don't think that it's worth adding that yet. I'm not so sure about this. If we think that it's worth adding a new possible state, we should do so now; we will not be able to change this behavior later.
[HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
Boxuan Zhai wrote: I have updated the MERGE patch for two main problems. The patch inside the .tar.gz file you attached isn't right; that extracts to a tiny file of junk characters. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] wCTE behaviour
Marko Tiikkaja wrote: This is almost exactly the patch from 2010-02 without CommandCounterIncrement()s. It's still a bit rough around the edges and needs some more comments, but I'm posting it here anyway. This patch passes all regression tests, but feel free to try to break it, there are probably ways to do that. This one also has the "always run DMLs to completion, and exactly once" behaviour. So this patch was marked Ready for Committer, but a) no committer has picked it up yet and b) Marko has made changes here that nobody else has tested out yet that I've seen on the list. Accordingly, that classification may have been optimistic. It seems to me that another testing run-through from someone like David might be appropriate to build some confidence that this latest patch should be a commit candidate. If there is a committer intending to work on this as-is, they haven't identified themselves. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Instrument checkpoint sync calls
Jeff Janes wrote: I've attached a tiny patch to apply over yours, to deal with this and with the case where no files are synced. Thanks for that. That obvious error eluded me because in most of the patch update testing I was doing (on ext3), the longest sync was always about the same length as the total sync time. Attached patch (in correct diff form this time!) collects up all changes. That includes elimination of a potential race condition if someone changes log_checkpoints while a long sync phase is executing. I don't know whether that can happen, and it obviously won't give accurate stats going back to the beginning of the checkpoint in that case, but it tries to defend against producing complete garbage if that value changes out from under it. This is the first version of this patch I feel fairly good about; no open concerns left on my side. Jeff, since you're now the de-facto credited reviewer of this one by virtue of suggesting bug fixes, could you take this update out for a spin too? Combining this instrumentation patch with the backend sync one already committed, the net result under log_min_messages=debug1 is somewhat undesirable in that I can now see the individual sync times for the syncs done by the checkpoint writer, but I do not get times for the syncs done by the backend (I only get informed of their existence). On a properly working system, backend syncs shouldn't be happening. So if you see them, I think the question shouldn't be "how long are they taking?", it's "how do I get rid of them?" From that perspective, knowing of their existence is sufficient to suggest the necessary tuning changes, such as dropping bgwriter_delay. When you get into a situation where they are showing up, a whole lot of them can happen in a very brief period; enough that I'd actually be concerned about the added timing overhead, something I normally don't care about very much. 
That's the main reason I didn't bother instrumenting those too--the situations where they happen are ones already running low on resources. Big writes for things that can only be written out at checkpoint time, on the other hand, are unavoidable in the current design. Providing detail on them is going to be relevant unless there's a major refactoring of how those happen. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index ae9ed8f..6adc139 100644 *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *** LogCheckpointEnd(bool restartpoint) *** 6955,6964 { long write_secs, sync_secs, ! total_secs; int write_usecs, sync_usecs, ! total_usecs; CheckpointStats.ckpt_end_t = GetCurrentTimestamp(); --- 6955,6969 { long write_secs, sync_secs, ! total_secs, ! longest_secs, ! average_secs; int write_usecs, sync_usecs, ! total_usecs, ! longest_usecs, ! average_usecs; ! double average_sync_time; CheckpointStats.ckpt_end_t = GetCurrentTimestamp(); *** LogCheckpointEnd(bool restartpoint) *** 6974,6991 CheckpointStats.ckpt_sync_end_t, sync_secs, sync_usecs); if (restartpoint) elog(LOG, restartpoint complete: wrote %d buffers (%.1f%%); ! write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s, CheckpointStats.ckpt_bufs_written, (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers, write_secs, write_usecs / 1000, sync_secs, sync_usecs / 1000, ! total_secs, total_usecs / 1000); else elog(LOG, checkpoint complete: wrote %d buffers (%.1f%%); %d transaction log file(s) added, %d removed, %d recycled; ! 
write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s, CheckpointStats.ckpt_bufs_written, (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers, CheckpointStats.ckpt_segs_added, --- 6979,7017 CheckpointStats.ckpt_sync_end_t, sync_secs, sync_usecs); + /* + * Timing values returned from CheckpointStats are in milliseconds. + * Convert to the second plus microsecond form that TimestampDifference + * returns for homogeneous printing. + */ + longest_secs = (long) (CheckpointStats.ckpt_longest_sync / 1000); + longest_usecs = 1000 * (CheckpointStats.ckpt_longest_sync - + longest_secs * 1000); + + average_sync_time = 0; + if (CheckpointStats.ckpt_sync_rels > 0) + average_sync_time = CheckpointStats.ckpt_agg_sync_time / + CheckpointStats.ckpt_sync_rels; + average_secs = (long) (average_sync_time / 1000); + average_usecs = 1000 * (average_sync_time - average_secs * 1000); + if (restartpoint) elog(LOG, restartpoint complete: wrote %d buffers (%.1f%%); ! write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; ! sync files=%d
Re: [HACKERS] Spread checkpoint sync
Heikki Linnakangas wrote: If you fsync() a file with one dirty page in it, it's going to return very quickly, but a 1GB file will take a while. That could be problematic if you have a thousand small files and a couple of big ones, as you would want to reserve more time for the big ones. I'm not sure what to do about it, maybe it's not a problem in practice. It's a problem in practice all right, with the bulk-loading situation being the main place you'll hit it. If somebody is running a giant COPY to populate a table at the time the checkpoint starts, there's probably a 1GB file of dirty data that's unsynced around there somewhere. I think doing anything about that situation requires an additional leap in thinking about buffer cache eviction and fsync absorption though. Ultimately I think we'll end up doing sync calls for relations that have gone cold for a while all the time as part of BGW activity, not just at checkpoint time, to try and avoid this whole area better. That's a lot more than I'm trying to do in my first pass of improvements though. In the interest of cutting the number of messy items left in the official CommitFest, I'm going to mark my patch here Returned with Feedback and continue working in the general direction I was already going. Concept shared, underlying patches continue to advance, good discussion around it; those were my goals for this CF and I think we're there. I have a good idea how to autotune the sync spread that's hardcoded in the current patch. I'll work on finishing that up and organizing some more extensive performance tests. Right now I'm more concerned about finishing the tests around the wal_sync_method issues, which are related to this and need to get sorted out a bit more urgently. 
-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+
Josh Berkus wrote: "I modified test_fsync in two ways to run this; first, to make it support O_DIRECT, and second to make it run in the *current* directory." Patch please? I agree with the latter change; what test_fsync does now is surprising.

I suggested a while ago that we refactor test_fsync to share a common set of source code with the database itself for detecting things related to wal_sync_method, perhaps by extracting that whole set of #define macro logic to somewhere else. That suggestion came at a bad time in the development cycle (right before a freeze), and nobody ever got back to the idea afterwards. If this code is getting touched, and it's clear it is in some direction, I'd like to see things change so it's not possible for the two to diverge again afterwards.
[HACKERS] CommitFest 2010-11: Status report at 2/3 of scheduled time
After an unfortunate two weeks where I was lost in Nyquil-land and not paying as much attention as I should have, I just finished a pass checking over all 26 open CommitFest items. Just under 50% of the patches that were open at the start of the CF are still hanging around, with 6 in Ready for Committer and the rest split evenly between waiting for reviewers and waiting for the author. A few larger patches or patch sets deserve some commentary.

The multi-part Extensions patch continues to chug along. A committer needs to pick up pg_execute_from_file() next to keep progress on that series going, as the other two remaining patches depend on it (and may need to be rebased after it goes in). Help is also needed on coding review of the main Extensions patch. The as yet untouched ALTER EXTENSION ... SET SCHEMA ... is the last patch in the set to apply, and while it needs a reviewer eventually I don't think that's the blocking point on the general extension work yet.

As for the also large SQL/MED set, I'm not sure whether it's possible to get review work started on its two dependent components--file_fdw and postgresql_fdw--until the main patch is finished off. Given the size of the code involved, it wouldn't surprise me if everyone is still chewing on that series into the next CF. Glad to see it all submitted for this one though.

Most of the remaining patches in Needs Review have already gone through one round of review. There are two notable exceptions. While some libpq work has been getting finished for Sync Rep, neither of the two main patches for that have received any love for almost two months now. This is the spot where I start to get worried about seeing this finish for 9.1. I'm now trying to find time to work on those; anybody else need this feature who can help?
Re: [HACKERS] Spread checkpoint sync
Rob Wultsch wrote: Forgive me, but is all of this a step on the slippery slope to direct I/O? And is this a bad thing?

I don't really think so. There's an important difference in my head between direct I/O, where the kernel is told "write this immediately!", and what I'm trying to achieve. I want to give the kernel an opportunity to write blocks out in an efficient way, so that it can take advantage of elevator sorting, write combining, and similar tricks. But, eventually, those writes have to make it out to disk. Linux claims to have concepts like a deadline for I/O to happen, but they turn out to not be so effective once the system gets backed up with enough writes. Since fsync time is the only effective deadline, I'm progressing from the standpoint that adjusting when it happens relative to the write will help, while still allowing the kernel an opportunity to get the writes out on its own schedule.

What ends up happening if you push toward fully sync I/O is the design you see in some other databases, where you need multiple writer processes. Then requests for new pages can continue to allocate as needed, while keeping any one write from blocking things. That's one way to simulate asynchronous I/O, and you can substitute true async I/O in many of those implementations. We didn't have much luck with portability on async I/O when that was last experimented with, and having multiple background writer processes seems like overkill; that whole direction worries me.
Re: [HACKERS] Re: Proposed Windows-specific change: Enable crash dumps (like core files)
Craig Ringer wrote: On 11/24/2010 05:18 AM, Magnus Hagander wrote: "Or you set the handler always, and have the handler only actually create the dump if the directory exists. That way you can add the directory and still get a dump from both existing backends and the postmaster itself without a restart." That's way smarter. No extra filesystem access during startup, even if it is cheap.

I added a comment referencing this bit to the CF entry so it doesn't get forgotten. Magnus raised a few other issues in his earlier review too. Discussion of this patch seems to have jumped the gun a bit into talking about backpatching before the code for HEAD was completely done, then stalled there. Are we going to see an updated patch that addresses the submitted feedback in this cycle?
Re: [HACKERS] Spread checkpoint sync
Greg Stark wrote: Using sync_file_range you can specify the set of blocks to sync and then block on them only after some time has passed. But there's no documentation on how this relates to the I/O scheduler so it's not clear it would have any effect on the problem.

I believe this is the exact spot we're stalled at in regards to getting this improved on the Linux side, as I understand it at least. *The* answer for this class of problem on Linux is to use sync_file_range, and I don't think we'll ever get any sympathy from those kernel developers until we do. But that's a Linux-specific call, so using it is going to add a write-path fork with platform-specific code into the database. If I thought sync_file_range was a silver bullet guaranteed to make this better, maybe I'd go for that. I think there's some relatively low-hanging fruit on the database side that would do better before going to that extreme though, thus the patch.

Greg Stark wrote: We might still have to delay the beginning of the sync to allow the dirty blocks to be synced naturally and then when we issue it still end up catching a lot of other I/O as well. Whether it's lots or not is really workload dependent.

I work from the assumption that the blocks being written out by the checkpoint are the most popular ones in the database, the ones that accumulate a high usage count and stay there. If that's true, my guess is that the writes being done while the checkpoint is executing are a bit less likely to be touching the same files. You raise a valid concern; I just haven't seen it actually happen in practice yet.
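For readers following along, the call under discussion splits queueing writeback from waiting on it. A minimal standalone sketch of that two-phase pattern (Linux-only; hypothetical helper names, not code from any patch in this thread):

```c
#define _GNU_SOURCE             /* sync_file_range() is a GNU extension */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Phase one: ask the kernel to start writeback of every dirty page in
 * the file without blocking.  offset=0 with nbytes=0 means "through end
 * of file".  Returns 0 on success, -1 with errno set on failure.
 */
static int
start_writeback(int fd)
{
	return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
}

/*
 * Phase two, issued later: wait for the previously queued writeback to
 * finish (and catch any pages dirtied in between).
 */
static int
wait_for_writeback(int fd)
{
	return sync_file_range(fd, 0, 0,
						   SYNC_FILE_RANGE_WAIT_BEFORE |
						   SYNC_FILE_RANGE_WRITE |
						   SYNC_FILE_RANGE_WAIT_AFTER);
}
```

The idea floated here is that a checkpoint writer could issue the first call as each file's writes complete and the second during the sync phase, giving the kernel time to drain writeback in between; a non-Linux build would fall back to plain fsync().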
Re: [HACKERS] Spread checkpoint sync
Heikki Linnakangas wrote: Do you have any idea how to autotune the delay between fsyncs?

I'm thinking to start by counting the number of relations that need them at the beginning of the checkpoint. Then use the same basic math that drives the spread writes, where you assess whether you're on schedule or not based on segment/time progress relative to how many have been synced out of that total. At a high level I think that idea translates over almost directly into the existing write spread code. I was hoping for a sanity check from you in particular about whether that seems reasonable before diving into the coding.
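The progress check described can be sketched in miniature; this is an illustration under assumed names, not code from any posted patch:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of applying the checkpoint write-spreading math to the sync
 * phase: count the relations needing fsync at checkpoint start, then
 * judge progress against the fraction of the sync time budget consumed
 * so far.  Returns true when at or ahead of schedule, i.e. when it is
 * safe to nap before issuing the next fsync.
 */
static bool
sync_on_schedule(int files_synced, int total_files,
				 double elapsed_secs, double sync_budget_secs)
{
	double		target;

	if (total_files <= 0 || sync_budget_secs <= 0)
		return true;			/* nothing to pace */
	target = elapsed_secs / sync_budget_secs;
	if (target > 1.0)
		target = 1.0;
	return ((double) files_synced / total_files) >= target;
}
```

The sync loop would call something like this between fsyncs and sleep only while it returns true, mirroring how the write phase decides whether it can afford its next delay.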
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Tom Lane wrote: I think the best answer is to get out of the business of using O_DIRECT by default, especially seeing that available evidence suggests it might not be a performance win anyway.

I was concerned that open_datasync might be doing a better job of forcing data out of drive write caches. But the tests I've done on RHEL6 so far suggest that's not true; the write guarantees seem to be the same as when using fdatasync. And there's certainly one performance regression possible going from fdatasync to open_datasync: the case where you overflow wal_buffers before you actually commit.

Below is a test of the troublesome behavior on the same RHEL6 system I gave test_fsync performance test results from at http://archives.postgresql.org/message-id/4ce2ebf8.4040...@2ndquadrant.com . This confirms that the problem here is the kernel defining O_DSYNC behavior as available, but not actually supporting it when the filesystem is running in journaled mode. That's clearly a kernel bug and no fault of PostgreSQL; it's just never been exposed in a default configuration before. The Red Hat bugzilla report seems a bit unclear about what's going on here, so it may be worth updating it to note the underlying cause.

Regardless, I'm now leaning heavily toward the idea of avoiding open_datasync by default given this bug, and backpatching that change to at least 8.4. I'll do some more database-level performance tests here as a final sanity check on that. My gut feeling now is that we'll eventually take something like Marti's patch, add some more documentation around it, and apply it to HEAD as well as some number of back branches.
$ mount | head -n 1
/dev/sda7 on / type ext4 (rw)
$ cat $PGDATA/postgresql.conf | grep wal_sync_method
#wal_sync_method = fdatasync		# the default is the first option
$ pg_ctl start
server starting
LOG:  database system was shut down at 2010-12-01 17:20:16 EST
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
$ psql -c "show wal_sync_method"
 wal_sync_method
-----------------
 open_datasync

[Edit /etc/fstab, change mount options to be data=journal and reboot]

$ mount | grep journal
/dev/sda7 on / type ext4 (rw,data=journal)
$ cat postgresql.conf | grep wal_sync_method
#wal_sync_method = fdatasync		# the default is the first option
$ pg_ctl start
server starting
LOG:  database system was shut down at 2010-12-01 12:14:50 EST
PANIC:  could not open file "pg_xlog/00010001" (log file 0, segment 1): Invalid argument
LOG:  startup process (PID 2690) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure
$ pg_ctl stop
$ vi $PGDATA/postgresql.conf
$ cat $PGDATA/postgresql.conf | grep wal_sync_method
wal_sync_method = fdatasync		# the default is the first option
$ pg_ctl start
server starting
LOG:  database system was shut down at 2010-12-01 12:14:40 EST
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
Re: [HACKERS] Instrument checkpoint sync calls
Jeff Janes wrote: For the individual file sync times emitted under DEBUG1, it would be very handy if the file being synced was identified, for example "relation base/16384/16523", rather than being numbered sequentially within a given checkpoint.

I was numbering them sequentially so that it's straightforward to graph the sync times in an external analysis tool, but the relation data is helpful too. A new patch reflecting all upthread suggestions is attached. The output looks like this now at DEBUG1:

LOG:  checkpoint starting: xlog
DEBUG:  checkpoint sync: number=1 file=base/16424/11645 time=11589.549000 msec
DEBUG:  checkpoint sync: number=2 file=base/16424/16438 time=16.148000 msec
DEBUG:  checkpoint sync: number=3 file=base/16424/16437 time=53.530000 msec
DEBUG:  checkpoint sync: number=4 file=base/16424/16447 time=10.214000 msec
DEBUG:  checkpoint sync: number=5 file=base/16424/11607 time=1.499000 msec
DEBUG:  checkpoint sync: number=6 file=base/16424/16425_fsm time=2.921000 msec
DEBUG:  checkpoint sync: number=7 file=base/16424/16437.1 time=4.237000 msec
DEBUG:  checkpoint sync: number=8 file=base/16424/16428_fsm time=1.654000 msec
DEBUG:  checkpoint sync: number=9 file=base/16424/16442 time=7.920000 msec
DEBUG:  checkpoint sync: number=10 file=base/16424/16428_vm time=2.613000 msec
DEBUG:  checkpoint sync: number=11 file=base/16424/11618 time=1.468000 msec
DEBUG:  checkpoint sync: number=12 file=base/16424/16437_fsm time=2.638000 msec
DEBUG:  checkpoint sync: number=13 file=base/16424/16428 time=2.883000 msec
DEBUG:  checkpoint sync: number=14 file=base/16424/16425 time=3.369000 msec
DEBUG:  checkpoint sync: number=15 file=base/16424/16437_vm time=8.686000 msec
DEBUG:  checkpoint sync: number=16 file=base/16424/16425_vm time=5.984000 msec
LOG:  checkpoint complete: wrote 2074 buffers (50.6%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=0.617 s, sync=11.715 s, total=22.167 s; sync files=16, longest=11.589 s, average=0.724 s

I kept the units for the DEBUG level ones in msec because that's a better scale for the common really short syncs during good behavior. But the summary info in seconds now appears at the end of the existing "checkpoint complete" message, so there's only one line to parse for those looking to analyze the gross checkpoint data. That looks to work well enough for finding situations like the big ext3 spikes. You can easily see one in this example by the fact that longest=11.589 s is almost the entirety of sync=11.715 s. That's the really key thing there's currently no visibility into, and it's made obvious with this patch.

This might be ready for some proper review now. I know there's at least one blatant bug still in here I haven't found yet, related to how the averages are computed. I saw this once:

LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, average=-9223372036854775808.-2147483 s

after an immediate checkpoint, so at least one path is not quite right yet.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ede6ceb..6f6eb3b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7063,10 +7063,15 @@ LogCheckpointEnd(bool restartpoint)
 {
 	long		write_secs,
 				sync_secs,
-				total_secs;
+				total_secs,
+				longest_secs,
+				average_secs;
 	int			write_usecs,
 				sync_usecs,
-				total_usecs;
+				total_usecs,
+				longest_usecs,
+				average_usecs;
+	double		average_sync_time;
 
 	CheckpointStats.ckpt_end_t = GetCurrentTimestamp();
@@ -7082,18 +7087,35 @@ LogCheckpointEnd(bool restartpoint)
 				CheckpointStats.ckpt_sync_end_t,
 				&sync_secs, &sync_usecs);
 
+	/*
+	 * Timing values returned from CheckpointStats are in milliseconds.
+	 * Convert to the second plus microsecond form that TimestampDifference
+	 * returns for homogeneous printing.
+	 */
+	longest_secs = (long) (CheckpointStats.ckpt_longest_sync / 1000);
+	longest_usecs = 1000 * (CheckpointStats.ckpt_longest_sync - longest_secs * 1000);
+
+	average_sync_time = CheckpointStats.ckpt_longest_sync / CheckpointStats.ckpt_sync_rels;
+	average_secs = (long) (average_sync_time / 1000);
+	average_usecs = 1000 * (average_sync_time - average_secs * 1000);
+
 	if (restartpoint)
 		elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
-			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
+			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
+			 "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s",
 			 CheckpointStats.ckpt_bufs_written,
 			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 			 write_secs, write_usecs / 1000,
 			 sync_secs, sync_usecs / 1000,
-			 total_secs, total_usecs / 1000);
+			 total_secs, total_usecs / 1000
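The garbage average line quoted above is consistent with the unguarded division in the patch hunk when ckpt_sync_rels is zero (and with ckpt_longest_sync being divided where the aggregate sync time was presumably intended). A guarded version of that conversion, written here as a standalone sketch with hypothetical parameter names rather than a fix to the posted patch:

```c
#include <assert.h>

/*
 * Convert an aggregate sync time in milliseconds into the second plus
 * microsecond form used by the log message, guarding the average
 * against the zero-files case (e.g. an immediate checkpoint that
 * synced nothing), which otherwise produces nonsense output.
 */
static void
average_sync_secs(double agg_sync_time_ms, int sync_rels,
				  long *secs, int *usecs)
{
	double		avg_ms = 0;

	if (sync_rels > 0)			/* guard the divide-by-zero path */
		avg_ms = agg_sync_time_ms / sync_rels;
	*secs = (long) (avg_ms / 1000);
	*usecs = (int) (1000 * (avg_ms - *secs * 1000));
}
```

With the guard in place a zero-file checkpoint simply reports average=0.000 s instead of overflowing the integer casts.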
Re: [HACKERS] Spread checkpoint sync
Ron Mayer wrote: Might smoother checkpoints be better solved by talking to the OS vendors' virtual-memory-tuning-knob authors to work with them on exposing the ideal knobs, rather than saying that our only tool is a hammer (fsync) so the problem must be handled as a nail?

Maybe, but it's hard to defend the current implementation--just doing all of the sync calls as fast as possible, one after the other--given that it produces worst-case behavior in a lot of situations. Since it's not a huge amount of code to do better, I'd rather do some work in that direction instead of presuming the kernel authors will ever make this go away. Spreading the writes out as part of the checkpoint rework in 8.3 worked better than any kernel changes I've tested since then, and I'm not real optimistic about this getting resolved at the system level. So long as the database changes aren't antagonistic toward kernel improvements, I'd prefer to have some options here that become effective as soon as the database code is done.

I've attached an updated version of the initial sync spreading patch here, one that applies cleanly on top of HEAD and over top of the sync instrumentation patch too; the conflict that made that hard before is gone now. Having the pg_stat_bgwriter.buffers_backend_fsync patch available all the time now has made me reconsider how important one potential bit of refactoring here would be.
I managed to catch one of the situations where really popular relations were being heavily updated in a way that was competing with the checkpoint on my test system (which I can happily share the logs of), with the instrumentation patch applied but not the spread sync one:

LOG:  checkpoint starting: xlog
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 7747 of relation base/16424/16442
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 42688 of relation base/16424/16437
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 9723 of relation base/16424/16442
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 58117 of relation base/16424/16437
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 165128 of relation base/16424/16437
[330 of these total, all referring to the same two relations]
DEBUG:  checkpoint sync: number=1 file=base/16424/16448_fsm time=10132.830000 msec
DEBUG:  checkpoint sync: number=2 file=base/16424/11645 time=0.001000 msec
DEBUG:  checkpoint sync: number=3 file=base/16424/16437 time=7.796000 msec
DEBUG:  checkpoint sync: number=4 file=base/16424/16448 time=4.679000 msec
DEBUG:  checkpoint sync: number=5 file=base/16424/11607 time=0.001000 msec
DEBUG:  checkpoint sync: number=6 file=base/16424/16437.1 time=3.101000 msec
DEBUG:  checkpoint sync: number=7 file=base/16424/16442 time=4.172000 msec
DEBUG:  checkpoint sync: number=8 file=base/16424/16428_vm time=0.001000 msec
DEBUG:  checkpoint sync: number=9 file=base/16424/16437_fsm time=0.001000 msec
DEBUG:  checkpoint sync: number=10 file=base/16424/16428 time=0.001000 msec
DEBUG:  checkpoint sync: number=11 file=base/16424/16425 time=0.000000 msec
DEBUG:  checkpoint sync: number=12 file=base/16424/16437_vm time=0.001000 msec
DEBUG:  checkpoint sync: number=13 file=base/16424/16425_vm time=0.001000 msec
LOG:  checkpoint complete: wrote 3032 buffers (74.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=1.742 s, sync=10.153 s, total=37.654 s; sync files=13, longest=10.132 s, average=0.779 s

Note here how the checkpoint was hung on trying to get 16448_fsm written out, while the backends were issuing constant competing fsync calls to these other relations. This is very similar to the production case this patch was written to address, which I hadn't been able to share a good example of yet. That's essentially what it looks like, except with the contention going on for minutes instead of seconds. One of the ideas Simon and I had been considering at one point was adding some better de-duplication logic to the fsync absorb code, which I'm reminded by the pattern here might be helpful independently of other improvements.

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 620b197..501cab8 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -143,8 +143,8 @@ typedef struct
 static BgWriterShmemStruct *BgWriterShmem;
 
-/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
-#define WRITES_PER_ABSORB 1000
+/* Fraction of fsync absorb queue that needs to be filled before acting
Re: [HACKERS] Spread checkpoint sync
Jeff Janes wrote: Have you tested out this absorb-during-syncing-phase code without the sleep between the syncs? I.e., so that it is still a tight loop, but the loop alternates between sync and absorb, with no intentional pause?

Yes; that's how it was developed. It helped to have just the extra absorb work without the pauses, but that alone wasn't enough to really improve things on the server we ran into this problem badly on.

Jeff Janes wrote: I ask because I don't have a mental model of how the pause can help. Given that this dirty data has been hanging around for many minutes already, what is a 3 second pause going to heal?

The difference is that once an fsync call is made, dirty data is much more likely to be forced out. It's the one thing that bypasses all the other ways the kernel might try to avoid writing the data--both the dirty ratio guidelines and the congestion control logic--and forces those writes to happen as soon as they can be scheduled. If you graph the amount of data shown as "Dirty:" by /proc/meminfo over time, once the sync calls start happening it's like a descending staircase pattern, dropping a little bit as each sync fires.
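That staircase is simple to capture; here's an illustrative helper (not part of any patch) for pulling the Dirty figure out of text in /proc/meminfo format, suitable for sampling in a loop while a checkpoint runs:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract the "Dirty:" value, in kB, from text in /proc/meminfo format.
 * Sampled repeatedly during a checkpoint's sync phase, the returned
 * value traces the descending staircase described above: each fsync
 * forces a step's worth of dirty data out.  Returns -1 if no Dirty:
 * line is present.
 */
static long
dirty_kb(const char *meminfo)
{
	const char *p = meminfo;

	while (p != NULL)
	{
		if (strncmp(p, "Dirty:", 6) == 0)
		{
			long		kb;

			if (sscanf(p + 6, "%ld", &kb) == 1)
				return kb;
			return -1;
		}
		p = strchr(p, '\n');
		if (p != NULL)
			p++;				/* advance past the newline */
	}
	return -1;
}
```

Feeding it the contents of /proc/meminfo once a second and plotting the results reproduces the graph described, with no dependencies beyond libc.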
Re: [HACKERS] Instrument checkpoint sync calls
Jeff Janes wrote: For the individual file sync times emitted under debug1, it would be very handy if the file being synced was identified, for example relation base/16384/16523. I was in the middle of looking into adding that already, so consider that part of the target for the next update I'm working on.
Re: [HACKERS] Instrument checkpoint sync calls
Robert Haas wrote: Did this get eaten by the email goblin, or you're still working on it? Fell behind due to an unfortunately timed bit of pneumonia. Hurray for the health benefits of cross country flights. Will fix this up, rebase my other patch, and head toward some more review/'Fest cleanup now that I'm feeling better.
Re: [HACKERS] Horizontal Write Scaling
Eliot Gable wrote: However, I have not seen much in the way of a serious attempt at multi-master write scaling. Scaling writes across nodes using PL/Proxy works.

Eliot Gable wrote: Of course, I am assuming the disk system would be RAID 1, RAID 10, RAID 5, or RAID 6 for reliability purposes and that it is sufficiently redundant that you don't have to worry about an outage of your storage system.

The idea that you'll have a system that needs better write scalability yet isn't limited by the storage system is an unusual one, not the expected case. And the trend everywhere in the industry is away from giant redundant systems, and toward having multiple cheaper redundant copies of all the data instead. It's impossible to protect against things like environmental failure at any single location. Once you've accepted that you have to be able to replicate this beast too if you want high availability, you're back at having a multi-node problem again. This is why the most active work is on distributed designs that start on that basis, rather than projects trying to build more scalable monoliths.
Re: [HACKERS] Spread checkpoint sync
Jeff Janes wrote: And for very large memory systems, even 1% may be too much to cache (dirty*_ratio can only be set in integer percent points), so recent kernels introduced dirty*_bytes parameters. I like these better because they do what they say. With the dirty*_ratio, I could never figure out what it was a ratio of, and the results were unpredictable without extensive experimentation.

Right, and you can't set dirty_background_ratio low enough to make this problem go away. Even attempts to set it to 1%, back when that was the right size for it, seem to be defeated by other mechanisms within the kernel. Last time I looked at the related source code, the congestion control logic that kicks in to throttle writes seemed a likely suspect. This is why I'm not real optimistic about newer mechanisms like dirty_background_bytes, added in 2.6.29, helping here; that just provides a mapping for setting lower values, and the same basic logic is under the hood. Like Jeff, I've never seen dirty_expire_centisecs help at all, possibly due to the same congestion mechanism.

Jeff Janes wrote: Yes, but how much work do we want to put into redoing the checkpoint logic so that the sysadmin on a particular OS and configuration and FS can avoid having to change the kernel parameters away from their defaults? (Assuming of course I am correctly understanding the problem, always a dangerous assumption.)

I've been trying to make this problem go away using just the available kernel tunables since 2006. I adjusted them carefully on the server that ran into this problem so badly that it motivated the submitted patch, months before this issue got bad. It didn't help. Maybe if that system were running a later kernel that supported dirty_background_bytes, that would have worked better. During the last few years, the only thing that has consistently helped in every case is the checkpoint spreading logic that went into 8.3.
I no longer expect that the kernel developers will ever make this problem go away given the way checkpoints are written out right now, whereas the last round of PostgreSQL work in this area definitely helped. The basic premise of the current checkpoint code is that if you write all of the buffers out early enough, by the time the syncs execute enough of the data should have gone out that they don't take very long to process. That was usually true for the last few years on systems with a battery-backed cache, where the amount of memory cached by the OS was small relative to the RAID cache size. That's not the case anymore, and the divergence is growing. The idea that the checkpoint sync code can run in a relatively tight loop, without stopping to do the normal background writer cleanup work, is also busted by that observation.
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: Doing all the writes and then all the fsyncs meets this requirement trivially, but I'm not so sure that's a good idea. For example, given files F1 ... Fn with dirty pages needing checkpoint writes, we could do the following: first, do any pending fsyncs for files not among F1 .. Fn; then, write all pages for F1 and fsync, write all pages for F2 and fsync, write all pages for F3 and fsync, etc. This might seem dumb because we're not really giving the OS a chance to write anything out before we fsync, but think about the ext3 case where the whole filesystem cache gets flushed anyway.

I'm not horribly interested in optimizing for the ext3 case per se, as I consider that filesystem fundamentally broken from the perspective of its ability to deliver low latency here. I wouldn't want a patch that improved behavior on filesystems with granular fsync to make the ext3 situation worse, but that's as far as I'd want the design to lean toward its quirks. Jeff Janes made a case downthread for "why not make it the admin/OS's job to worry about this?" In cases where there is a reasonable solution available, in the form of "switch to XFS or ext4", I'm happy to take that approach.

Let me throw some numbers out to give a better idea of the shape and magnitude of the problem case I've been working on. In the situation that leads to the near hour-long sync phase I've seen, checkpoints will start with about a 3GB backlog of data in the kernel write cache to deal with. That's about 4% of RAM, just under the 5% threshold set by dirty_background_ratio. Whether or not the 256MB write cache on the controller is also filled is a relatively minor detail I can't monitor easily. The checkpoint itself? 250MB each time. This proportion is why I didn't think to follow the alternate path of worrying about spacing the write and fsync calls out differently.
I shrunk shared_buffers down to make the actual checkpoints smaller, which helped to some degree; that's what got them smaller than the RAID cache size. But the amount of data cached by the operating system is the real driver of total sync time here. Whether or not you include all of the writes from the checkpoint itself before you start calling fsync didn't actually matter very much; in the case I've been chasing, those are getting cached anyway. The write storm from the fsync calls themselves forcing things out seems to be the driver of the I/O spikes, which is why I started with spacing those out.

Writes go out at a rate of around 5MB/s, so clearing the 3GB backlog takes a minimum of 10 minutes of real time. There are about 300 1GB relation files involved in the case I've been chasing. This is where the 3 second delay number came from: 300 files x 3 seconds each = 900 seconds = 15 minutes of sync spread. You can turn that math around to figure out how much delay per relation you can afford while still keeping checkpoints to a planned end time, which isn't done in the patch I submitted yet.

Ultimately what I want to do here is some sort of smarter write-behind sync operation, perhaps with an LRU on relations with pending fsync requests. The idea would be to sync relations that haven't been touched in a while even in advance of the checkpoint. I think that's similar to the general idea Robert is suggesting here, to get some sync calls flowing before all of the checkpoint writes have happened. I think the final sync calls will need to get spread out regardless, and since doing that requires a fairly small amount of code, that's why we started there.
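Turning that math around, as described, is one line of arithmetic; a trivial sketch of it (hypothetical, ahead of the autotuning work discussed upthread):

```c
#include <assert.h>

/*
 * The hardcoded 3-second pause in the posted patch corresponds to
 * 300 files * 3 s = 900 s of spread.  Turned around: given a target
 * amount of time left for the sync phase and a count of files still
 * to sync, the affordable pause per remaining file is simply their
 * ratio, floored at zero for the degenerate cases.
 */
static double
sync_pause_secs(double time_left_secs, int files_remaining)
{
	if (files_remaining <= 0 || time_left_secs <= 0)
		return 0.0;
	return time_left_secs / files_remaining;
}
```

Recomputing this after each fsync, rather than fixing it at checkpoint start, would also absorb individual slow syncs by shrinking the remaining pauses to compensate.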
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] Changes to Linux OOM killer in 2.6.36
Kevin Grittner wrote: Greg Smith wrote: oom_adj is deprecated, scheduled for removal in August 2010: That surprised me so I checked the URL. I believe you have a typo there and it's August, 2012. This is why I include references, so that when the cold medicine hits me in the middle of proofreading my message and I send it anyway, you aren't misled. Yes, 2012, only a few months before doomsday. The approaching end of the world means any remaining bugs can then be marked WONTFIX. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
[HACKERS] Changes to Linux OOM killer in 2.6.36
Last month's new Linux kernel 2.6.36 includes a rewrite of the out-of-memory killer: http://lwn.net/Articles/391222/ http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a63d83f427fbce97a6cea0db2e64b0eb8435cd10 The new badness method totals the task's RSS and swap as a percentage of RAM, where the old one's score started with the total memory used by the process. I *think* that this is an improvement for PostgreSQL, based on the sort of data I see with: ps -o pid,rss,size,vsize,args -C postgres But I haven't tested with one of the new kernels yet to be sure. Something to look at next time I get in that bleeding edge kernel kind of mood. One thing that's definitely changed is the interface used to turn off the OOM killer. There's a backward compatibility translation right now that maps the -17 value the PostgreSQL code sends to /proc/pid/oom_adj onto the new units scale. However, oom_adj is deprecated, scheduled for removal in August 2010: http://www.mjmwired.net/kernel/Documentation/feature-removal-schedule.txt So eventually, if the OOM disabling code is still necessary in PostgreSQL, it will need to do this sort of thing instead: echo -1000 > /proc/pid/oom_score_adj I've seen kernel features get deprecated ahead of the published timeline before for code-related reasons (when the compatibility bits were judged too obtrusive to keep around anymore), but since this translation bit is only a few lines of code I wouldn't expect that to happen here. I don't think it's worth doing anything to the database code until tests on the newer kernel confirm whether this whole thing is even necessary anymore. Wanted to pass along the info while I was staring at it though. Thanks to Daniel Farina for pointing this out.
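Code that wants to work across both kernel generations could probe for the newer file and fall back; a sketch of that idea follows, where the helper name and fallback logic are hypothetical, not what PostgreSQL currently does:

```python
import os

def oom_disable_target(pid):
    """Pick which /proc knob would disable the OOM killer for a process.
    Newer kernels: oom_score_adj, range -1000..1000 (-1000 disables).
    Older kernels: the deprecated oom_adj, range -17..15 (-17 disables)."""
    new_path = "/proc/%d/oom_score_adj" % pid
    if os.path.exists(new_path):
        return new_path, "-1000"
    return "/proc/%d/oom_adj" % pid, "-17"

path, value = oom_disable_target(os.getpid())
print(path, value)
```

Actually writing a negative adjustment to either file requires root privileges, which is why PostgreSQL only does this from code started with sufficient permissions.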
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] Improving prep_buildtree used in VPATH builds
Gurjeet Singh wrote: Seems like it would improve performance in general, but more so under load conditions when you actually need it. I am not sure if the -depth option is supported by all incarnations of 'find'. Given the way directory writes are cached by the filesystem, I'm not sure why the load at the time matters so much. If what you mean by that is being low on memory, that would make more sense to me. Anyway, -depth is in POSIX as of 2001. It seems to be in all the major SysV UNIX variants going back further than that, and of course it's in GNU find. But it looks like BSD-derived systems from before that POSIX standard originally called this -d instead. So there's some potential for this to not work on older systems; it works fine on my test FreeBSD 7.2 system. Maybe someone who has access to some ancient BSD-ish system can try this out? The simplest test case similar to what the patch adds is to check whether this runs, returning subdirectories in depth-first order before their parents: $ find / -depth -type d -print If that fails somewhere, it may turn out to require another configure check just to determine whether you can use this configure-time optimization. That's certainly possible to add to the patch if it gets committed and turns out to break one of the buildfarm members. It seems to me like this whole thing may be a bit more work than it's worth, given this is a fairly small and difficult to reproduce speedup in only one stage of the build process. I'd think that if configure takes longer than it has to because the system is heavily loaded, the amount by which compilation time suffers from that load would always dwarf this component of total build time. But if this was slow enough at some point to motivate you to write a patch for it, maybe that assumption is wrong.
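For anyone unsure what ordering `find -depth` produces, the same children-before-parents (bottom-up) traversal can be demonstrated portably; this is just an illustration of the ordering, not related to the patch itself:

```python
import os
import tempfile

def dirs_depth_first(root):
    """List directories bottom-up (children before parents) -- the order
    `find -depth -type d` prints on systems supporting that option."""
    return [d for d, _subdirs, _files in os.walk(root, topdown=False)]

# Tiny demo on a throwaway tree: the child directory comes out first.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "parent", "child"))
order = dirs_depth_first(root)
print(order.index(os.path.join(root, "parent", "child"))
      < order.index(os.path.join(root, "parent")))  # True
```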
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
Re: [HACKERS] SSI update
Kevin Grittner wrote: That went in an hour and a half before the CF deadline, but I just didn't feel right putting it into the CF in that shape. Then keep on working on it and we can revisit its state when you're happy with it. The purpose of the CommitFest cut-off is not to block work on long-term development just because a deadline passed; it's to make sure patches which might otherwise never get a review are looked at eventually. Finding a reviewer for these larger and more complicated patches is a very different sort of job than finding one for an average patch anyway. I could use a brief reminder of how this bit fits into the serializable lock consistency patch that's already sitting in the CF queue as Ready for Committer though. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
Re: [HACKERS] Instrument checkpoint sync calls
So my task list is: 0) Rebase against the HEAD that just had code related to this touched today 1) Assume that log_checkpoints is sufficient control over whether the timing overhead added is worth collecting, and therefore remove the half-baked idea of also wrapping with a compile-time option. 2) Have the sync summary returned upwards, so it can be put onto the same line as the rest of the log_checkpoint info. All seems reasonable to me. Will rev a new patch by tomorrow. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
Re: [HACKERS] Count backend self-sync calls
Jeff Janes wrote: Do you know where this competition is happening? Is it on the platters, or is it in the hard drive write cache (I thought high-end hardware had tagged writes to avoid that), or in the kernel? Kernel. Linux systems with lots of memory will happily queue up gigabytes of memory in their write cache, only getting serious about writing it out to disk when demanded to by fsync. This makes sense if we just need to append to a queue. But once the queue is full and we are about to do a backend fsync, might it make sense to do a little more work to look for dups? One of the paths I'd like to follow is experimenting with both sorting writes by file and looking for duplication in the queues. I think a basic, simple sync spreading approach needs to get finished first though; this sort of thing would then be an optimization on top of it. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
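Jeff's duplicate-hunting idea amounts to an order-preserving dedup pass over the request queue; a minimal sketch, assuming each request can be represented as a hashable (relfilenode, fork, segment) tuple:

```python
def compact_queue(requests):
    """Drop duplicate fsync requests while preserving arrival order;
    fsyncing a segment once covers every queued request for it."""
    seen = set()
    compacted = []
    for req in requests:
        if req not in seen:
            seen.add(req)
            compacted.append(req)
    return compacted

# Six queued requests touching only two distinct segments compact to two.
q = [(1663, 0, 0), (1663, 0, 1)] * 3
print(len(compact_queue(q)))  # 2
```

A hot table receiving repeated writes generates many requests for the same few segments, which is why compaction ratios like 32768-to-11 are plausible in practice.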
Re: [HACKERS] a new problem in MERGE
Boxuan Zhai wrote: I have a plan to fix the above two bugs together. (In fact, I have already started coding in the merge_v202 edition.) My question is how I should make my update consistent with yours. Is it possible for you to give me an edition that I can work on? I just got this reconciled with HEAD again. There are two changes I made in the code that you'll eventually want in your working copy: 1) Fix NIL/NULL confusion: https://github.com/greg2ndQuadrant/postgres/commit/9013ba9e81490e3623add1b029760817021297c0 2) Update ExecMerge to accept and pass through an oldtuple value. This is needed to make the code compatible with the PostgreSQL git HEAD after the changes made in the 2009-09 CommitFest. Bit rot updates made: https://github.com/greg2ndQuadrant/postgres/commit/be03bd201720f42a666f7e356ec8507c1357f502 I'm not sure whether the way I did (2) is correct for all cases, but at least the code compiles again now and the server will start. Attached is an updated patch that applies to HEAD as of right now, and that code has been pushed to https://github.com/greg2ndQuadrant/postgres/tree/merge-unstable with the changes rebased to be the last two commits. It fails make installcheck on my system. But as the initial diffs I looked at relate to enums and such, I don't think that's a problem with your patch. Will investigate further here with some of my own patches I'm working on today. Hopefully this is enough to unblock what you were looking for more details from me about. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD merge-v203-20101114.gz Description: GNU Zip compressed data
[HACKERS] pg_stat_bgwriter broken?
I'm behind on my list mail so maybe this has been mentioned already, but when I just tried pg_stat_bgwriter from a build against today's HEAD I got this: pgbench=# select count(*) FROM pg_stat_bgwriter; ERROR: did not find '}' at end of input node Can someone confirm if this broke recently, or is it just me? Last time I would have tested this myself was a few weeks ago. Regardless, I was thinking of adding some basic sanity checking on this view, run from the regression tests, to catch this class of problem in the future. It's kind of sloppy that this and the bgwriter counter reset aren't tested as part of make check. I think these two would always produce stable results: postgres=# SELECT count(*) FROM pg_stat_bgwriter; count --- 1 postgres=# SELECT pg_stat_reset_shared('bgwriter'); pg_stat_reset_shared -- (1 row) And the first one of those is similarly broken on my install. Thoughts on whether adding those to the regression tests would be a worthwhile patch? I'll do the work, just thinking out loud about the concept. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
Re: [HACKERS] pg_stat_bgwriter broken?
Tom Lane wrote: Worksforme. You probably need a full recompile and/or initdb. Yeah, sorry about the noise. This went away after some more intensive rebuilding. I think I still want to add some regression testing of this view as suggested. If that had been there, I'd have been a lot more confident it was my mistake, and not something that just slipped through without being tested. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
[HACKERS] Instrument checkpoint sync calls
The attached patch adds some logging for each individual fsync call made during a checkpoint, along with a summary at the end. You need to have the following to see all of the detail: log_checkpoints=on log_min_messages=debug1 And here's a sample: LOG: checkpoint starting: immediate force wait DEBUG: checkpoint sync: file=1 time=1.946000 msec DEBUG: checkpoint sync: file=2 time=0.666000 msec DEBUG: checkpoint sync: file=3 time=0.004000 msec LOG: checkpoint sync: files=3 longest=1.946000 msec average=0.872000 msec LOG: checkpoint complete: wrote 3 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.002 s, total=0.003 s I think that it's reasonable for the sort of people who turn log_checkpoints on to also get the sync summary line, thus its being logged at LOG level. The detail on individual syncs might go to DEBUG2 or lower instead of DEBUG1 where I put it; that part I don't have a strong opinion on. It's at DEBUG1 to make it easier to test the patch without a gigantic amount of other log data also coming in. Right now the code is written such that all the calls that grab timing information are wrapped in #ifdef DEBUG_FSYNC, a symbol set to 1 that could become a compile-time option like DEBUG_DEADLOCK, to allow turning this code path off at build time. I personally think that if you're already making an fsync call and have log_checkpoints on, the additional overhead of also timing that fsync is minimal even on platforms where timing is slow (I don't have such a system to test that assumption on, however). And I've seen enough value in troubleshooting nasty checkpoint sync problems using this patch to feel it's worth having even if it does add some overhead. I'm a little concerned about log_checkpoints changing on me in the middle of the execution of a checkpoint, which would cause some problems here. Not sure if that's something I should explicitly code for, given that all I think it will do is screw up one of the timing results.
It does seem a risk from the last minute self-review I just did of the code. I'll give a sample program that stresses the system, generating slow timing results and other types of bad behavior, along with the next patch I submit here shortly. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 1219fcf..c72da72 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -20,6 +20,7 @@
 #include "catalog/catalog.h"
 #include "miscadmin.h"
+#include "portability/instr_time.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
 #include "storage/bufmgr.h"
@@ -29,6 +30,11 @@
 #include "utils/memutils.h"
 #include "pg_trace.h"

+/*
+ * For now just leave this on all the time, eventually this will
+ * need to be a full compiler flag like DEBUG_DEADLOCK
+ */
+#define DEBUG_FSYNC 1

 /* interval for calling AbsorbFsyncRequests in mdsync */
 #define FSYNCS_PER_ABSORB 10
@@ -927,6 +933,15 @@ mdsync(void)
 	PendingOperationEntry *entry;
 	int			absorb_counter;

+#ifdef DEBUG_FSYNC
+	/* Statistics on sync times */
+	int			processed = 0;
+	instr_time	sync_start, sync_end, diff;
+	double		elapsed;
+	double		longest = 0;
+	double		total_elapsed = 0;
+#endif
+
 	/*
 	 * This is only called during checkpoints, and checkpoints should only
 	 * occur in processes that have created a pendingOpsTable.
@@ -1069,9 +1084,32 @@ mdsync(void)
 				seg = _mdfd_getseg(reln, entry->tag.forknum,
 							entry->tag.segno * ((BlockNumber) RELSEG_SIZE),
 							false, EXTENSION_RETURN_NULL);
+
+#ifdef DEBUG_FSYNC
+				if (log_checkpoints)
+					INSTR_TIME_SET_CURRENT(sync_start);
+#endif
 				if (seg != NULL &&
 					FileSync(seg->mdfd_vfd) >= 0)
+				{
+#ifdef DEBUG_FSYNC
+					if (log_checkpoints)
+					{
+						INSTR_TIME_SET_CURRENT(sync_end);
+						diff = sync_end;
+						INSTR_TIME_SUBTRACT(diff, sync_start);
+						elapsed = (double) INSTR_TIME_GET_MICROSEC(diff) / 1000.0;
+						if (elapsed > longest)
+							longest = elapsed;
+						total_elapsed += elapsed;
+						processed++;
+						/* Reduce the log level here to DEBUG2 in final patch? */
+						elog(DEBUG1, "checkpoint sync: file=%d time=%f msec",
+							 processed, elapsed);
+					}
+#endif
 					break;		/* success; break out of retry loop */
+				}

 				/*
 				 * XXX is there any point in allowing more than one retry?
@@ -1113,6 +1151,13 @@ mdsync(void)
 			elog(ERROR, "pendingOpsTable corrupted");
 	}							/* end loop over hashtable entries */

+#ifdef DEBUG_FSYNC
+	/* Report on sync performance metrics */
+	if (log_checkpoints && (processed > 0))
+		elog(LOG, "checkpoint sync: files=%d longest=%f msec average=%f msec",
+			 processed, longest, total_elapsed / processed);
+#endif
+
 	/* Flag successful completion of mdsync */
 	mdsync_in_progress = false;
 }
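As an illustration of the measurement pattern, here is a rough Python analogue of the timing the patch wraps around each FileSync() call with the INSTR_TIME macros; this is for explanation only, not the backend code itself:

```python
import os
import tempfile
import time

def timed_fsync(fd):
    """Time one fsync call, returning elapsed milliseconds."""
    start = time.perf_counter()
    os.fsync(fd)
    return (time.perf_counter() - start) * 1000.0

# Accumulate the same summary stats the patch reports: count, longest, average.
processed = 0
longest = total_elapsed = 0.0
fd, path = tempfile.mkstemp()
try:
    for _ in range(3):
        os.write(fd, b"x" * 8192)
        msec = timed_fsync(fd)
        longest = max(longest, msec)
        total_elapsed += msec
        processed += 1
finally:
    os.close(fd)
    os.unlink(path)
print("files=%d longest=%f msec average=%f msec"
      % (processed, longest, total_elapsed / processed))
```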