Re: [HACKERS] Spread checkpoint sync
Greg Smith wrote: I think a helpful next step here would be to put Robert's fsync compaction patch into here and see if that helps. There are enough backend syncs showing up in the difficult workloads (scale=1000, clients >= 32) that its impact should be obvious.

Initial tests show everything expected from this change and more. This took me a while to isolate because of issues where the filesystem involved degraded over time, giving a heavy bias toward a faster first test run, before anything was fragmented. I just had to do a whole new mkfs on the database/xlog disks when switching between test sets in order to eliminate that.

At a scale of 500, I see the following average behavior:

Clients  TPS  backend-fsync
16       557  155
32       587  572
64       628  843
128      621  1442
256      632  2504

On one run through with the fsync compaction patch applied, this turned into:

Clients  TPS  backend-fsync
16       637  0
32       621  0
64       721  0
128      716  0
256      841  0

So not only are all the backend fsyncs gone, there is a very clear TPS improvement too. The changes in results at >= 64 clients are well above the usual noise threshold in these tests.

The problem where individual fsync calls during checkpoints can take a long time is not appreciably better. But I think this will greatly reduce the odds of running into the truly dysfunctional breakdown, where checkpoint and backend fsync calls compete with one another, that caused the worst-case situation kicking off this whole line of research here.

--
Greg Smith  2ndQuadrant US  g...@2ndquadrant.com  Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
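[The compaction being discussed works by deduplicating the pending fsync requests when the queue fills, instead of forcing backends to fall back to doing their own fsync calls. A minimal sketch of the idea, with illustrative names and an O(n^2) scan for brevity — the actual patch operates on the background writer's shared-memory request array and uses a hash table to stay O(n):]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified fsync request: identified here by a relation
 * file node and a segment number.  The real request struct in md.c also
 * carries a fork number; these names are illustrative only. */
typedef struct FsyncRequest
{
	unsigned	rnode;			/* which relation file */
	unsigned	segno;			/* which 1GB segment of it */
} FsyncRequest;

/* Compact a full request queue in place by dropping duplicate
 * (rnode, segno) pairs, keeping the first occurrence of each.
 * Returns the new queue length. */
static size_t
compact_fsync_queue(FsyncRequest *queue, size_t n)
{
	size_t		out = 0;

	for (size_t i = 0; i < n; i++)
	{
		int			dup = 0;

		for (size_t j = 0; j < out; j++)
		{
			if (queue[j].rnode == queue[i].rnode &&
				queue[j].segno == queue[i].segno)
			{
				dup = 1;
				break;
			}
		}
		if (!dup)
			queue[out++] = queue[i];	/* keep first occurrence */
	}
	return out;
}
```

[Since checkpoints keep re-dirtying the same handful of segments, a queue of tens of thousands of entries routinely collapses to a few, which is why the 32768-entry queue mentioned downthread compacts so dramatically.]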
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: Based on what I saw looking at this, I'm thinking that the backend fsyncs probably happen in clusters - IOW, it's not 2504 backend fsyncs spread uniformly throughout the test, but clusters of 100 or more that happen in very quick succession, followed by relief when the background writer gets around to emptying the queue.

That's exactly the case. You'll be running along fine, the queue will fill, and then hundreds of them can pile up in seconds. Since the worst of that seemed to be during the sync phase of the checkpoint, adding additional queue management logic there is where we started. I thought this compaction idea would be more difficult to implement than your patch proved to be though, so doing this first is working out quite well instead.

This is what all the log messages from the patch look like here, at scale=500 and shared_buffers=256MB:

DEBUG:  compacted fsync request queue from 32768 entries to 11 entries

That's an 8GB database, and from looking at the relative sizes I'm guessing 7 entries refer to the 1GB segments of the accounts table, 2 to its main index, and the other 2 are likely branches/tellers data. Since I know the production system I ran into this on has about 400 file segments on it regularly dirtied, and a higher shared_buffers than that, I expect this will demolish this class of problem on it, too.

I'll have all the TPS over time graphs available to publish by the end of my day here, including tests at a scale of 1000 as well. Those should give a little more insight into how the patch is actually impacting high-level performance. I don't dare disturb the ongoing tests by copying all that data out of there until they're finished; it will be a few hours yet.

My only potential concern over committing this is that I haven't done a sanity check over whether it impacts the fsync mechanics in a way that might cause an issue.
Your assumptions there are documented and look reasonable on quick review; I just haven't had much time yet to look for flaws in them.
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: During each cluster, the system probably slows way down, and then recovers when the queue is emptied. So the TPS improvement isn't at all a uniform speedup, but simply relief from the stall that would otherwise result from a full queue.

That does seem to be the case here. http://www.2ndquadrant.us/pgbench-results/index.htm now has results from a long test series, at two database scales that caused many backend fsyncs during earlier tests. Set #5 is the existing server code, #6 is with the patch applied. There are zero backend fsync calls with the patch applied, which isn't surprising given how simple the schema is on this test case. An average TPS gain of 14% appears at a scale of 500 and 8% at 1000; the attached CSV file summarizes the average figures for the archives.

The gains do appear to come from smoothing out the dead periods that normally occur during the sync phase of the checkpoint. For example, here are the fastest runs at scale=1000/clients=256 with and without the patch:

http://www.2ndquadrant.us/pgbench-results/436/index.html (tps=361)
http://www.2ndquadrant.us/pgbench-results/486/index.html (tps=380)

Here the difference in how much less of a slowdown there is around the checkpoint end points is really obvious, and obviously an improvement. You can see the same thing to a lesser extent at the other end of the scale; here are the fastest runs at scale=500/clients=16:

http://www.2ndquadrant.us/pgbench-results/402/index.html (tps=590)
http://www.2ndquadrant.us/pgbench-results/462/index.html (tps=643)

While there are still very ugly maximum latency figures here in every case, these periods just aren't as wide with the patch in place.

I'm moving on to some brief testing of the newer kernel behavior here, then returning to testing the other checkpoint spreading ideas on top of this compaction patch, presuming something like it will end up being committed first.
I think it's safe to say I can throw away the changes that tried to alter the fsync absorption code present in what I submitted before, as this scheme does a much better job of avoiding that problem than those earlier queue alteration ideas. I'm glad Robert grabbed the right one from the pile of ideas I threw out for what else might help here.

P.S. Yes, I know I have other review work to do as well. Starting on the rest of that tomorrow.

,,Unmodified,,Compacted Fsync,,,
scale,clients,tps,max_latency,tps,max_latency,TPS Gain,% Gain
500,16,557,17963.41,631,17116.31,74,13.3%
500,32,587,25838.8,655,24311.54,68,11.6%
500,64,628,35198.39,727,38040.39,99,15.8%
500,128,621,41001.91,687,48195.77,66,10.6%
500,256,632,49610.39,747,46799.48,115,18.2%
,,,
1000,16,306,39298.95,321,40826.58,15,4.9%
1000,32,314,40120.35,345,27910.51,31,9.9%
1000,64,334,46244.86,358,45138.1,24,7.2%
1000,128,343,72501.57,372,47125.46,29,8.5%
1000,256,321,80588.63,350,83232.14,29,9.0%
Re: [HACKERS] Use of O_DIRECT only for open_* sync options
Bruce Momjian wrote: xlogdefs.h says:

/*
 * Because O_DIRECT bypasses the kernel buffers, and because we never
 * read those buffers except during crash recovery, it is a win to use
 * it in all cases where we sync on each write().  We could allow O_DIRECT
 * with fsync(), but because skipping the kernel buffer forces writes out
 * quickly, it seems best just to use it for O_SYNC.  It is hard to imagine
 * how fsync() could be a win for O_DIRECT compared to O_SYNC and O_DIRECT.
 * Also, O_DIRECT is never enough to force data to the drives, it merely
 * tries to bypass the kernel cache, so we still need O_SYNC or fsync().
 */

This seems wrong because fsync() can win if there are two writes before the sync call. Can kernels not issue fsync() if the write was O_DIRECT? If that is the cause, we should document it.

The comment does look busted, because you did imagine exactly a case where they might be combined. The only incompatibility that I'm aware of is that O_DIRECT requires reads and writes to be aligned properly, so you can't use it in random application code unless it's aware of that. O_DIRECT and fsync are compatible; for example, MySQL allows combining the two: http://dev.mysql.com/doc/refman/5.1/en/innodb-parameters.html (That whole bit of documentation around innodb_flush_method includes some very interesting observations about O_DIRECT actually.)

I'm starting to consider the idea that much of the performance gain seen on earlier systems with O_DIRECT came from decreased CPU usage shuffling things into the OS cache, rather than from its impact on avoiding pollution of said cache. On Linux for example, its main accomplishment is described like this: "File I/O is done directly to/from user space buffers."
http://www.kernel.org/doc/man-pages/online/pages/man2/open.2.html

The earliest paper on the implementation suggests a big decrease in CPU overhead from that: http://www.ukuug.org/events/linux2001/papers/html/AArcangeli-o_direct.html

Impossible to guess whether that's more true (CPU cache pollution is a bigger problem now) or less true (drives are much slower relative to CPUs now) today. I'm trying to remain agnostic and let the benchmarks offer an opinion instead.
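[To make the compatibility point concrete, here is a hedged sketch of combining O_DIRECT with fsync() in application code. The posix_memalign() alignment handling is the part that trips up unaware code; the buffered fallback (for filesystems such as tmpfs that reject O_DIRECT) and the fixed block size are illustrative choices, not anything from the PostgreSQL source:]

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0				/* flag unavailable on this platform */
#endif

/* Write one aligned block synchronously: O_DIRECT to bypass the kernel
 * cache where supported, then fsync() to force it to the drive, since
 * O_DIRECT alone does not guarantee durability.  Falls back to a
 * buffered write on filesystems that reject O_DIRECT.  Returns 0 on
 * success, -1 on error.  A portability sketch, not PostgreSQL code. */
static int
direct_sync_write(const char *path, size_t blocksize)
{
	void	   *buf;
	int			fd;
	int			ok = 0;

	/* O_DIRECT demands the user buffer, file offset, and transfer size
	 * all be aligned, typically to the device's logical block size. */
	if (posix_memalign(&buf, blocksize, blocksize) != 0)
		return -1;
	memset(buf, 'x', blocksize);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0600);
	if (fd >= 0 && write(fd, buf, blocksize) == (ssize_t) blocksize)
		ok = 1;
	if (!ok)
	{
		if (fd >= 0)
			close(fd);
		/* Buffered fallback: same write, still followed by fsync() */
		fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
		if (fd >= 0 && write(fd, buf, blocksize) == (ssize_t) blocksize)
			ok = 1;
	}
	if (ok && fsync(fd) != 0)
		ok = 0;

	if (fd >= 0)
		close(fd);
	free(buf);
	return ok ? 0 : -1;
}
```

[Note the fsync() happens regardless of whether the direct or buffered path was taken, matching the comment's closing point that O_DIRECT by itself never forces data to the drives.]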
Re: [HACKERS] Spread checkpoint sync
+ old kernel crowd.
Re: [HACKERS] auto-sizing wal_buffers
Fujii Masao wrote:

+/* Minimum setting used for a lower bound on wal_buffers */
+#define XLOG_BUFFER_MIN		4

Why didn't you use XLOG_BUFFER_MIN instead of XLOGbuffersMin? XLOG_BUFFER_MIN is not used anywhere for now.

That's a typo; will fix.

+	if (XLOGbuffers < (XLOGbuffersMin * 2))
+		XLOGbuffers = XLOGbuffersMin * 2;
+	}

Why is the minimum value 64kB only when wal_buffers is set to -1? This seems confusing for users.

That's because the current default on older versions is 64kB. Since the automatic selection is going to be the new default, I hope, I don't want it to be possible for it to pick a number smaller than the default of older versions. So the automatic lower limit is 64kB, while the actual manually set lower limit remains 32kB, as before.
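[As a sanity check on that reasoning, here is a sketch of the resulting selection logic, in units of 8kB XLOG pages, including the one-WAL-segment cap the related auto-tuning discussion settled on. The function and constant values are illustrative, not the actual GUC assignment code:]

```c
#include <assert.h>

/* Sketch of the auto-tuning rule under discussion: wal_buffers = -1
 * means size it as 1/32nd of shared_buffers, clamped to a 64kB floor
 * (twice the 32kB manual minimum, so automatic selection never picks
 * less than the old 64kB default) and capped at one 16MB WAL segment.
 * All values are counts of 8kB XLOG pages. */
#define XLOG_BUFFER_MIN		4		/* 32kB: manual lower limit */
#define XLOG_SEG_PAGES	 2048		/* 16MB: one WAL segment */

static int
auto_wal_buffers(int wal_buffers, int shared_buffers)
{
	if (wal_buffers != -1)
		return wal_buffers;					/* explicit setting wins */

	wal_buffers = shared_buffers / 32;
	if (wal_buffers < XLOG_BUFFER_MIN * 2)
		wal_buffers = XLOG_BUFFER_MIN * 2;	/* 64kB automatic floor */
	if (wal_buffers > XLOG_SEG_PAGES)
		wal_buffers = XLOG_SEG_PAGES;		/* 16MB ceiling */
	return wal_buffers;
}
```

[So a 256MB shared_buffers (32768 pages) would auto-select 8MB of WAL buffers, a tiny 128kB setting would still get the 64kB floor, and anything above 512MB hits the segment-sized ceiling.]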
Re: [HACKERS] auto-sizing wal_buffers
Josh Berkus wrote: I think we can be more specific on that last sentence; is there even any *theoretical* benefit to settings above 16MB, the size of a WAL segment? Certainly there have been no test results to show any.

There was the set Marti just reminded us about. The old wording suggested that big enough to fit a single transaction was big enough, and that led to many people being confused and setting this parameter way too low. Since it's possible I'm wrong about 16MB being the upper limit, I didn't want the wording to specifically rule out people testing that size to see what happens. We believe there's never any advantage above that, due to the forced WAL segment switch, but having test results to the contrary floating around keeps me from being too aggressive in how the wording there goes.
Re: [HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
David Fetter wrote: I think I haven't communicated clearly what I'm suggesting, which is that we ship with both an UPSERT and a MERGE, the former being ugly, crude and simple, and the latter festooned with dire warnings about isolation levels and locking.

I don't know that I completely agree with the idea that the UPSERT version should always be crude and the MERGE one necessarily heavy from a performance perspective. However, I do feel you raise a legitimate point that once the hard stuff is done, and the locking issues around proper SQL MERGE sorted out, having an UPSERT/REPLACE synonym that maps to it under the hood may be a useful way to provide a better user experience. The way I see this, the right goal is to first provide the full spec behavior with good performance, and get all that plumbing right. There's nothing that says we can't provide an easier syntax than the admittedly complicated MERGE one to the users though.

So, as for this patch... there's about half a dozen significant open issues with the current code, along with a smattering of smaller ones. The problems remaining are deep enough that I think it would be challenging to work through them for this CF under the best of conditions. And given that we haven't heard from Boxuan since early December, we're definitely understaffed to tackle major revisions. My hope was that we'd get an updated patch from him before the CF deadline. Even without it, maybe get some more internal help here. Given my focus on checkpoint issues, Simon on Sync Rep, and Dimitri still on his Extensions push, that second part isn't going to happen.

I am marking MERGE officially Returned With Feedback for 9.1. Lots of progress made, and a much better community understanding of the unresolved issues now than when we started, but not in shape to go into this release.
Since we still have some significant interest here in getting this finished, I'll see if I can put together a better game plan for how to get this actually done for 9.2, in time to discuss at release planning time. The main thing that's become much more obvious to me just recently is how the remaining issues left here relate to the true serialization work, so worrying about that first is probably the right order anyway.
Re: [HACKERS] [PERFORM] pgbench to the MAXINT
Euler Taveira de Oliveira wrote:

(i) If we want to support a scale factor greater than 21474 we have to convert some columns to bigint; it will change the test. From the portability point of view it is a pity, but as we have never supported it I'm not too worried about it. Why? Because it will use bigint columns only if the scale factor is greater than 21474. Is it a problem? I don't think so, because generally people compare tests with the same scale factor.

(ii) From the performance perspective, we need to test that the modifications don't impact performance. I didn't create another code path for 64-bit modifications (it is too ugly) and I'm afraid some modifications affect the 32-bit performance. I'm not in a position to test it though because I don't have a big machine ATM. Greg, could you lead these tests?

(iii) I decided to copy scanint8() (called strtoint64 there) from the backend (Robert's suggestion [1]) because Tom pointed out that strtoll() has portability issues. I replaced atoi() with strtoint64() but didn't do any performance tests.

(i): Completely agreed.

(ii): There is no such thing as a big machine that is 32 bits now; anything that's 32 bits is a tiny system here in 2011. What I can do is check for degradation on the only 32-bit system I have left here, my laptop. I'll pick a sensitive test case and take a look.

(iii): This is an important thing to test, particularly given it has the potential to impact 64-bit results too.

Thanks for picking this up again and finishing the thing off. I'll add this into my queue of performance tests to run and we can see if this is worth applying. It will probably take a little longer than the usual CF review time. But as this doesn't interfere with other code people are working on and is sort of a bug fix, I don't think it will be a problem if it takes a little longer to get this done.
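[For reference, the shape of the substitution in (iii) looks something like the following: a scanint8()-style decimal parser that avoids depending on strtoll(), whose availability was the portability concern cited. This is a simplified sketch without the real backend routine's overflow checking:]

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/* Convert a decimal string to int64 without strtoll()/atoi().
 * Returns 1 on success, storing the value in *result; returns 0 on
 * empty input or trailing junk.  Unlike the backend's scanint8(),
 * this sketch does not detect overflow. */
static int
str_to_int64(const char *s, int64_t *result)
{
	int64_t		val = 0;
	int			neg = 0;

	while (isspace((unsigned char) *s))
		s++;
	if (*s == '-')
	{
		neg = 1;
		s++;
	}
	else if (*s == '+')
		s++;

	if (!isdigit((unsigned char) *s))
		return 0;				/* no digits at all */
	while (isdigit((unsigned char) *s))
		val = val * 10 + (*s++ - '0');
	if (*s != '\0')
		return 0;				/* trailing junk */

	*result = neg ? -val : val;
	return 1;
}
```

[The point of going 64-bit is visible in the last case below: a value past INT32_MAX, which is exactly where atoi()-based scale handling breaks down.]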
Re: [HACKERS] Spread checkpoint sync
Jeff Janes wrote: Have you ever tested Robert's other idea of having a metronome process do a periodic fsync on a dummy file which is located on the same ext3fs as the table files? I think that that would be interesting to see.

To be frank, I really don't care about fixing this behavior on ext3, especially in the context of that sort of hack. That filesystem is not the future, it's not possible to ever really make it work right, and every minute spent on pandering to its limitations would be better spent elsewhere IMHO. I'm starting with the ext3 benchmarks just to provide some proper context for the worst-case behavior people can see right now, and to make sure refactoring here doesn't make things worse on it. My target is same or slightly better on ext3, much better on XFS and ext4.
Re: [HACKERS] Spread checkpoint sync
Jim Nasby wrote: Wow, that's the kind of thing that would be incredibly difficult to figure out, especially while your production system is in flames... Can we change the ereport that happens in that case from DEBUG1 to WARNING? Or provide some other means to track it?

That's why we already added pg_stat_bgwriter.buffers_backend_fsync, to track the problem before trying to improve it. It was driving me crazy on a production server not having any visibility into when it happened. I haven't seen that we need anything beyond that so far. In the context of this new patch for example, if you get to where a backend does its own sync, you'll know it did a compaction as part of that. The existing statistic would tell you enough.

There's now enough data in test set 3 at http://www.2ndquadrant.us/pgbench-results/index.htm to start to see how this breaks down on a moderately big system (well, by most people's standards, but not Jim's, for whom this is still a toy). Note the backend_sync column on the right, at the very end of the page; that's the relevant counter I'm commenting on:

scale=175: Some backend fsync with 64 clients, 2/3 runs.
scale=250: Significant backend fsync with 32 and 64 clients, every run.
scale=500: Moderate to large backend fsync at any client count >= 16. This seems to be the worst spot of those mapped. Above here, I would guess the TPS numbers start slowing enough that the fsync request queue activity drops, too.
scale=1000: Backend fsync starting at 8 clients.
scale=2000: Backend fsync starting at 16 clients. By here I think the TPS volumes are getting low enough that clients are stuck significantly more often waiting for seeks rather than fsync.

Looks like the most effective spot for me to focus testing on with this server is scales of 500 and 1000, with 16 to 64 clients. Now that I've got the scale fine-tuned better, I may crank up the client counts too and see what that does.
I'm glad these are appearing in reasonable volume here though; I was starting to get nervous about only having NDA-restricted results to work against. Some days you just have to cough up for your own hardware.

I just tagged pgbench-tools-0.6.0 and pushed to GitHub/git.postgresql.org with the changes that track and report on buffers_backend_fsync, if anyone else wants to try this out. It includes those numbers if you have a 9.1 with them, otherwise it just reports 0 for them all the time; detection of the feature wasn't hard to add. The end portion of a config file for the program (the first part specifies host/username info and the like) that would replicate the third test set here is:

MAX_WORKERS=4
SCRIPT=tpc-b.sql
SCALES="1 10 100 175 250 500 1000 2000"
SETCLIENTS="4 8 16 32 64"
SETTIMES=3
RUNTIME=600
TOTTRANS=
Re: [HACKERS] Review: compact fsync request queue on overflow
Chris Browne wrote: It was a little troublesome inducing it. I did so by cutting shared memory to minimum (128kB)... With higher shared memory, I couldn't readily induce compaction, which is probably a concurrency matter of not having enough volume of concurrent work going on.

Quite. It's taken me 12 days of machine time running pgbench to find the spots where this problem occurs on a system with a reasonably sized shared_buffers (I'm testing against 256MB). It's one of those things that's hard to reproduce with test data.

Thanks for the thorough code review. I've got a clear test plan I'm progressing through this week to beat on the performance measurement aspects of the patch.
Re: [HACKERS] test_fsync open_sync test
Bruce Momjian wrote: Is there a value to this test_fsync test?

Compare open_sync with different sizes:
(This is designed to compare the cost of one large sync'ed write and two smaller sync'ed writes.)
        open_sync 16k write     242.563 ops/sec
        2 open_sync 8k writes   752.752 ops/sec

It compares the cost of doing one larger open_sync write vs. two smaller ones. Might be some value there for determining things like the optimal WAL block size to use. All these tests are still kind of hard to use effectively; I'm not sure it's time to start trimming tests yet, until we've made more progress on interpreting results first.
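[What that comparison measures can be approximated portably as follows, using write()+fsync() per operation instead of test_fsync's O_SYNC writes. The loop counts and temp file path are arbitrary choices for illustration:]

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Time 'ops' operations, where each operation rewrites the start of the
 * file as 'chunks_per_op' writes of 'chunk' bytes followed by a single
 * fsync().  Returns elapsed wall-clock seconds, or -1.0 if setup fails.
 * A sketch of the one-large vs. two-small comparison, not test_fsync. */
static double
timed_sync_writes(const char *path, size_t chunk, int chunks_per_op, int ops)
{
	char	   *buf = calloc(1, chunk);
	int			fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	struct timeval start,
				end;

	if (fd < 0 || buf == NULL)
	{
		if (fd >= 0)
			close(fd);
		free(buf);
		return -1.0;
	}

	gettimeofday(&start, NULL);
	for (int i = 0; i < ops; i++)
	{
		lseek(fd, 0, SEEK_SET);
		for (int j = 0; j < chunks_per_op; j++)
		{
			ssize_t		w = write(fd, buf, chunk);

			(void) w;			/* short writes ignored in this sketch */
		}
		fsync(fd);				/* one sync per operation */
	}
	gettimeofday(&end, NULL);

	close(fd);
	free(buf);
	return (end.tv_sec - start.tv_sec) +
		(end.tv_usec - start.tv_usec) / 1e6;
}
```

[Comparing timed_sync_writes(path, 16384, 1, n) against timed_sync_writes(path, 8192, 2, n) reproduces the shape of the test above: same total bytes and syncs, different write sizes.]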
Re: [HACKERS] Moving test_fsync to /contrib?
Alvaro Herrera wrote: I don't understand why it would be overkill. Are you saying people would complain because you installed a 25 kB executable that they might not want to use? Just for fun I checked /usr/bin and noticed that I have a pandoc executable, weighing 17 MB, that I have never used and have no idea what it is for.

It's for converting between the various types of text-like markup, i.e. reST, LaTeX, Markdown, etc. Anyway, just because the rest of the world has no standards anymore doesn't mean we shouldn't.

The changes Bruce has made recently have gotten this tool to where its output is starting to be readable and reliable. The sort of people who want to run this will certainly be fine with installing contrib to do it, because they may want to have things like pgbench too. There's really not enough demand for this to pollute the default server install footprint with any overhead from this tool, either in bytes or increased tool name squatting. And the fact that it's still a little rough around the edges nudges it away from the standard server package too. Install it in contrib as pg_test_fsync and I think you'll reach the optimal subset of people who can be made happy here.

Not having it packaged at all before wasn't a big deal, because it was so hard to collect good data from that only developer-level people were doing it anyway. Now that it is starting to be more useful in that role for less experienced users, we need to make it easier for more people to run it, to collect feedback toward further improving its quality.
Re: [HACKERS] Spread checkpoint sync
Bruce Momjian wrote: Should we be writing until 2:30, then sleep 30 seconds and fsync at 3:00?

The idea of having a dead period doing no work at all between the write phase and the sync phase may have some merit. I don't have enough test data yet on some more fundamental issues in this area to comment on whether that smaller optimization would be valuable. It may be a worthwhile concept to throw into the sequencing.
Re: [HACKERS] We need to log aborted autovacuums
Robert Haas wrote: On Sat, Jan 15, 2011 at 11:14 AM, Tom Lane t...@sss.pgh.pa.us wrote: Greg Smith g...@2ndquadrant.com writes: Does try_relation_open need to have a lock acquisition timeout when AV is calling it? Hmm. I think when looking at the AV code, I've always subconsciously assumed that try_relation_open would fail immediately if it couldn't get the lock. That certainly seems like it would be a more appropriate way to behave than delaying indefinitely. I'm confused how that's not happening already. What does "try" mean, otherwise?

Apparently "try" means acquire the requested lock on the OID of the relation, waiting for any amount of time for that part, and then check if the relation really exists or not once it's got it. In this context, it means it will try to open the relation, but might fail if it doesn't actually exist anymore. The relation not existing when it makes the check done after the lock is acquired is the only way it will return a failure.

try_relation_open calls LockRelationOid, which blocks. There is also a ConditionalLockRelationOid, which does the same basic thing except that it exits immediately, with a false return code, if it can't acquire the lock. I think we just need to nail down in which existing cases it's acceptable to have try_relation_open use ConditionalLockRelationOid instead, which would make it do what all of us reading the code thought it did all along.
There are four callers of try_relation_open to be concerned about here:

src/backend/commands/vacuum.c, vacuum_rel:
    onerel = try_relation_open(relid, lmode);

src/backend/commands/analyze.c, analyze_rel:
    onerel = try_relation_open(relid, ShareUpdateExclusiveLock);

src/backend/commands/cluster.c, cluster_rel:
    OldHeap = try_relation_open(tableOid, AccessExclusiveLock);

src/backend/commands/lockcmds.c, LockTableRecurse ("Apply LOCK TABLE recursively over an inheritance tree"):
    rel = try_relation_open(reloid, NoLock);

I think that both the vacuum_rel and analyze_rel cases are capable of figuring out if they are the autovacuum process, and if so calling the fast non-blocking version of this. I wouldn't want to mess with the other two, which rely upon the current behavior as far as I can see.

Probably took me longer to write this e-mail than the patch will take. Since I've got trivial patch fever this weekend and already have the test case, I'll do this one next.
Re: [HACKERS] We need to log aborted autovacuums
Simon Riggs wrote: I'm fairly confused by this thread.

That's because you think it has something to do with cancellation, which it doesn't. The original report here noted a real problem but got the theorized cause wrong. It turns out the code that acquires a lock when autovacuum decides it is going to process something will wait forever for that lock to be obtained. It cannot be the case that other locks on the table are causing it to cancel, or as you say it would be visible in the logs. Instead the AV worker will just queue up and wait for its turn as long as it takes.

That does mean there are all sorts of ways that your AV workers can all get stuck waiting for a table, and there's no way to know when that's happening from the logs either; you'll only see it in ps or pg_stat_activity. Given that I think it's actually a mild denial of service attack vector, this really needs an improvement.
Re: [HACKERS] We need to log aborted autovacuums
Tom Lane wrote: No, I don't believe we should be messing with the semantics of try_relation_open. It is what it is.

With only four pretty simple callers to the thing, and two of them needing the alternate behavior, it seemed a reasonable place to modify to me. I thought the nowait boolean idea was in enough places that it was reasonable to attach to try_relation_open.

The attached patch solves the "wait for lock forever" problem, and introduces a new log message when AV or auto-analyze fail to obtain a lock on something that needs to be cleaned up:

DEBUG:  autovacuum: processing database "gsmith"
INFO:  autovacuum skipping relation 65563 --- cannot open or obtain lock
INFO:  autoanalyze skipping relation 65563 --- cannot open or obtain lock

My main concern is that this may cause AV to constantly fail to get access to a busy table, where in the current code it would queue up and eventually get the lock needed. A secondary issue is that while the autovacuum messages only show up if you have log_autovacuum_min_duration set to something other than -1, the autoanalyze ones can't be stopped.

If you don't like the way I structured the code, you can certainly do it some other way instead. I thought this approach was really simple and not unlike similar code elsewhere. Here's the test case that worked for me here again:

psql
SHOW log_autovacuum_min_duration;
DROP TABLE t;
CREATE TABLE t(s serial, i integer);
INSERT INTO t(i) SELECT generate_series(1,10);
SELECT relname, last_autovacuum, autovacuum_count FROM pg_stat_user_tables WHERE relname='t';
DELETE FROM t WHERE s < 5;
\q

psql
BEGIN;
LOCK t;

Leave that open, then go to another session with a tail -f on the logs to wait for the errors to show up.
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 25d9fde..4193fff 100644
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
*** relation_open(Oid relationId, LOCKMODE l
*** 918,936 ****
  *
  * Same as relation_open, except return NULL instead of failing
  * if the relation does not exist.
  *
  */
  Relation
! try_relation_open(Oid relationId, LOCKMODE lockmode)
  {
  	Relation	r;
!
  	Assert(lockmode >= NoLock && lockmode < MAX_LOCKMODES);

  	/* Get the lock first */
  	if (lockmode != NoLock)
! 		LockRelationOid(relationId, lockmode);
!
  	/*
  	 * Now that we have the lock, probe to see if the relation really exists
  	 * or not.
--- 918,951 ----
  *
  * Same as relation_open, except return NULL instead of failing
  * if the relation does not exist.
+ *
+ * If called with nowait enabled, this will immediately exit
+ * if a lock is requested and it can't be acquired.  The
+ * return code in this case doesn't distinguish between this
+ * situation and the one where the relation was locked, but
+ * doesn't exist.  Callers using nowait must not care that
+ * they won't be able to tell the difference.
  *
  */
  Relation
! try_relation_open(Oid relationId, LOCKMODE lockmode, bool nowait)
  {
  	Relation	r;
! 	bool		locked;

  	Assert(lockmode >= NoLock && lockmode < MAX_LOCKMODES);

  	/* Get the lock first */
  	if (lockmode != NoLock)
! 	{
! 		if (nowait)
! 		{
! 			locked = ConditionalLockRelationOid(relationId, lockmode);
! 			if (!locked)
! 				return NULL;
! 		}
! 		else
! 			LockRelationOid(relationId, lockmode);
! 	}

  	/*
  	 * Now that we have the lock, probe to see if the relation really exists
  	 * or not.
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7bc5f11..24bfb16 100644
*** a/src/backend/commands/analyze.c
--- b/src/backend/commands/analyze.c
*** analyze_rel(Oid relid, VacuumStmt *vacst
*** 147,156 ****
  	 * concurrent VACUUM, which doesn't matter much at the moment but might
  	 * matter if we ever try to accumulate stats on dead tuples.)  If the rel
  	 * has been dropped since we last saw it, we don't need to process it.
  	 */
! 	onerel = try_relation_open(relid, ShareUpdateExclusiveLock);
  	if (!onerel)
! 		return;

  	/*
  	 * Check permissions --- this should match vacuum's check!
--- 147,168 ----
  	 * concurrent VACUUM, which doesn't matter much at the moment but might
  	 * matter if we ever try to accumulate stats on dead tuples.)  If the rel
  	 * has been dropped since we last saw it, we don't need to process it.
+ 	 *
+ 	 * If this is a manual analyze, opening will wait forever to acquire
+ 	 * the requested lock on the relation.  Autovacuum will just give up
+ 	 * immediately if it can't get the lock.  This prevents a series of locked
+ 	 * relations from potentially hanging all of the AV workers waiting
+ 	 * for locks.
  	 */
! 	onerel = try_relation_open(relid
Re: [HACKERS] Spread checkpoint sync
I have finished a first run of benchmarking the current 9.1 code at various sizes. See http://www.2ndquadrant.us/pgbench-results/index.htm for many details. The interesting stuff is in Test Set 3, near the bottom. That's the first one that includes buffer_backend_fsync data. This is all on ext3 so far, but is using a newer 2.6.32 kernel, the one from Ubuntu 10.04. The results are classic Linux in 2010: latency pauses from checkpoint sync will easily leave the system at a dead halt for a minute, with the worst one observed this time standing still for 108 seconds. That one is weird, but these two are completely average cases: http://www.2ndquadrant.us/pgbench-results/210/index.html http://www.2ndquadrant.us/pgbench-results/215/index.html I think a helpful next step here would be to put Robert's fsync compaction patch into here and see if that helps. There are enough backend syncs showing up in the difficult workloads (scale=1000, clients >= 32) that its impact should be obvious. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] We need to log aborted autovacuums
Josh Berkus wrote: The lack of vacuum could be occurring for any of 4 reasons:

1) Locking
2) You have a lot of tables and not enough autovac_workers / too much sleep time
3) You need to autovac this particular table more frequently, since it gets dirtied really fast
4) The table has been set with special autovac settings which keep it from being autovac'd

We can currently distinguish between cases 2, 3, and 4 based on existing available facts. However, distinguishing case 1 from 2 or 3, in particular, isn't currently possible except by methods which require collecting a lot of ad-hoc monitoring data over a period of time. This makes the effort required for the diagnosis completely out of proportion with the magnitude of the problem. Since no one coded up something more exciting, I just did a round of self-review of the little logging patch I sent upthread to see if it was a worthy CF candidate. I agree that *something* should be done, and I'd rather have a logging-based solution than nothing at all here. So that patch is, to use a polite UK term I've adopted recently, rubbish. But unless I'm confused (it is late), I think there's a problem here that's much worse than what you described--in the case where someone has grabbed a really large lock on a table that needs cleanup, at least. I started with the following test case:

SHOW log_autovacuum_min_duration;
DROP TABLE t;
CREATE TABLE t(s serial,i integer);
INSERT INTO t(i) SELECT generate_series(1,10);
SELECT relname,last_autovacuum,autovacuum_count FROM pg_stat_user_tables WHERE relname='t';
DELETE FROM t WHERE s<5;
\q

Which, so long as the min duration is 0, does mostly what I expected. I followed immediately by starting a new session and doing:

BEGIN;
LOCK t;

before autovacuum got to the table. I left this session open, stuck there idle in a transaction holding a lock.
I threw some extra logging in the code path leading up to where we'd both like the lack of lock acquisition to be logged, and it shows the following (at DEBUG3):

DEBUG: t: vac: 4 (threshold 50), anl: 14 (threshold 50)
DEBUG: autovacuum: t: vac needed
INFO: vacuum_rel: processing 16528
INFO: vacuum_rel: trying to open relation 16528

All but the first one of those lines are not in the current code, and as noted already that one existing line is at DEBUG3. This was trying to simulate the case you were complaining about: autovacuum has been triggered, it knows there is work to be done, and it can't do said work. It hasn't triggered the error message I tried to add yet though, because it's not hitting that spot. Look where it's actually blocked at:

$ ps -eaf | grep postgres
gsmith 5490 4880 0 04:09 ? 00:00:00 postgres: autovacuum worker process gsmith waiting
$ psql -c "select procpid, now() - query_start as runtime, current_query from pg_stat_activity where current_query like 'autovacuum%'"
 procpid |     runtime     |            current_query
---------+-----------------+--------------------------------------
    5490 | 00:11:35.813727 | autovacuum: VACUUM ANALYZE public.t

I haven't just failed to vacuum the table that needs it without any entry in the logs saying as much. I've actually tied down an autovacuum worker and kept it from doing anything else until that lock is released! When vacuum_rel inside of vacuum.c does this on a relation it wants to acquire a lock on:

onerel = try_relation_open(relid, lmode);

it just hangs out there forever, if the only issue is not being able to get the lock it wants. The abort code path I tried to insert logging into only occurs if the relation was deleted. Watch what's happened while I've been typing:

 procpid |     runtime     |            current_query
---------+-----------------+--------------------------------------
    5490 | 00:36:24.426528 | autovacuum: VACUUM ANALYZE public.t

I've now hit the AV worker with a denial of service attack, essentially, for over 35 minutes.
I'd bet that if I was a malicious user, I could make AV come to a halt altogether for the whole server with this approach, even with relatively limited rights on the database. Does try_relation_open need to have a lock acquisition timeout when AV is calling it? I don't see any other way to get a log message noting when this problem passed by in there; you can only see it live in the process view. And the current behavior doesn't feel right to me. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith g...@2ndquadrant.com wrote: One of the ideas Simon and I had been considering at one point was adding some better de-duplication logic to the fsync absorb code, which I'm reminded by the pattern here might be helpful independently of other improvements. Hopefully I'm not stepping on any toes here, but I thought this was an awfully good idea and had a chance to take a look at how hard it would be today while en route from point A to point B. The answer turned out to be not very, so PFA a patch that seems to work. I tested it by attaching gdb to the background writer while running pgbench, and it eliminated the backend fsyncs without even breaking a sweat. No toe damage, this is great; I hadn't gotten to coding for this angle yet at all. Suffering from an overload of ideas and (mostly wasted) test data, so thanks for exploring this concept and proving it works. I'm not sure what to do with the rest of the work I've been doing in this area here, so I'm tempted to just combine this new bit from you with the older patch I submitted, streamline, and see if that makes sense. My "expected to be there already, then how about spending 5 minutes first checking out that autovacuum lock patch again" turned out to be a wild underestimate. Part of the problem is that it's become obvious to me over the last month that right now is a terrible time to be doing Linux benchmarks that impact filesystem sync behavior. The recent kernel changes that are showing up in the next rev of the enterprise distributions--like RHEL6 and Debian Squeeze both working to get a stable 2.6.32--have made testing a nightmare. I don't want to dump a lot of time into optimizing for 2.6.32 if this problem changes its form in newer kernels, but the distributions built around newer kernels are just not fully baked enough yet to tell. For example, the pre-release Squeeze numbers we're seeing are awful so far, but it's not really done yet either.
I expect that 3-6 months from today, all will have settled down enough that I can make some sense of it. Lately my work with the latest distributions has just been burning time installing stuff that doesn't work quite right yet. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: Idea #2: At the beginning of a checkpoint when we scan all the buffers, count the number of buffers that need to be synced for each relation. Use the same hashtable that we use for tracking pending fsync requests. Then, interleave the writes and the fsyncs... Idea #3: Stick with the idea of a fixed delay between fsyncs, but compute how many fsyncs you think you're ultimately going to need at the start of the checkpoint, and back up the target completion time by 3 s per fsync from the get-go, so that the checkpoint still finishes on schedule. What I've been working on is something halfway between these two ideas. I have a patch, and it doesn't work right yet because I just broke it, but since I have some faint hope this will all come together any minute now I'm going to share it before someone announces a deadline has passed or something. (whistling). I'm going to add this messy thing and the patch you submitted upthread to the CF list; I'll review yours, I'll either fix the remaining problem in this one myself or rewrite to one of your ideas, and then it's onto a round of benchmarking. Once upon a time we got a patch from Itagaki Takahiro whose purpose was to sort writes before sending them out: http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php This didn't really work repeatedly for everyone because of the now well understood ext3 issues--I never replicated that speedup at the time for example. And this was before the spread checkpoint code was in 8.3. The hope was that it wasn't really going to be necessary after that anyway. Back to today...instead of something complicated, it struck me that if I just had a count of exactly how many files were involved in each checkpoint, that would be helpful. I could keep the idea of a fixed delay between fsyncs, but just auto-tune that delay amount based on the count. And how do you count the number of unique things in a list? Well, you can always sort them. 
I thought that if the sorted writes patch got back to functional again, it could serve two purposes. It would group all of the writes for a file together, and if you did the syncs in the same sorted order they would have the maximum odds of discovering the data was already written. So rather than this possible order:

table block
a 1
b 1
c 1
c 2
b 2
a 2
sync a
sync b
sync c

Which has very low odds of the sync on "a" finishing quickly, we'd get this one:

table block
a 1
a 2
b 1
b 2
c 1
c 2
sync a
sync b
sync c

Which sure seems like a reasonable way to improve the odds data has been written before the associated sync comes along. Also, I could just traverse the sorted list with some simple logic to count the number of unique files, and then set the delay between fsync writes based on it. In the above, once the list was sorted, it's easy to just see how many times the table name changes on a linear scan of the sorted data: 3 files, so if the checkpoint target gives me, say, a minute of time to sync them, I can delay 20 seconds between each. Simple math, and exactly the sort I used to get reasonable behavior on the busy production system this all started on. There's some unresolved trickiness in the segment-driven checkpoint case, but one thing at a time.

So I fixed the bitrot on the old sorted patch, which was fun as it came from before the 8.3 changes. It seemed to work. I then moved the structure it uses to hold the list of buffers to write, the thing that's sorted, into shared memory. It's got a predictable maximum size, relying on palloc in the middle of the checkpoint code seems bad, and there's some potential gain from not reallocating it every time through.
Somewhere along the way, it started doing this instead of what I wanted:

BadArgument(!(((header->context) != ((void *) 0) && (((Node*)((header->context)))->type == T_AllocSetContext))), File: "mcxt.c", Line: 589)

(that's from initdb, not a good sign) And it's left me wondering whether this whole idea is a dead end I used up my window of time wandering down. There are good bits in the patch I submitted for the last CF and in the patch you wrote earlier this week. This unfinished patch may be a valuable idea to fit in there too once I fix it, or maybe it's fundamentally flawed and one of the other ideas you suggested (or I have sitting on the potential design list) will work better. There's a patch integration problem that needs to be solved here, but I think almost all the individual pieces are available. I'd hate to see this fail to get integrated now just for lack of time, considering the problem is so serious when you run into it. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index dadb49d
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: I'll believe it when I see it. How about this:

a 1
a 2
sync a
b 1
b 2
sync b
c 1
c 2
sync c

Or maybe some variant, where we become willing to fsync a file a certain number of seconds after writing the last block, or when all the writes are done, whichever comes first.

That's going to give worse performance than the current code in some cases. The goal of what's in there now is that you get a sequence like this:

a 1
b 1
a 2 [Filesystem writes a1]
b 2 [Filesystem writes b1]
sync a [Only has to write a2]
sync b [Only has to write b2]

This idea works until you get to where the filesystem write cache is so large that it becomes lazier about writing things. The fundamental idea--push writes out some time before the sync, in hopes the filesystem will get to them before that--is not unsound. On some systems, doing the sync more aggressively than that will be a regression. This approach just breaks down in some cases, and those cases are happening more now because their likelihood scales with total RAM. I don't want to screw the people with smaller systems, who may be getting considerable benefit from the existing sequence. Today's little systems--which are very similar to the high-end ones the spread checkpoint stuff was developed on during 8.3--do get some benefit from it as far as I know. Anyway, now that the ability to get logging on all this stuff went in during the last CF, it's way easier to just set up a random system to run tests in this area than it used to be. Whatever testing does happen should include, say, a 2GB laptop with a single hard drive in it. I think that's the bottom of what is reasonable to consider a target for tweaking write performance on, given the hardware 9.1 is likely to be deployed on. How does the checkpoint target give you any time to sync them? Unless you squeeze the writes together more tightly, but that seems sketchy. Obviously the checkpoint target idea needs to be shuffled around some too.
I was thinking of making the new default 0.8, and having it split the time in half for write and sync. That will make the write phase close to the speed people are seeing now, at the default of 0.5, while giving some window for spread sync too. The exact way to redistribute that around I'm not so concerned about yet. When I get to where that's the most uncertain thing left I'll benchmark the TPS vs. latency trade-off and see what happens. If the rest of the code is good enough but this just needs to be tweaked, that's a perfect thing to get beta feedback to finalize. Well you don't have to put it in shared memory on account of any of that. You can just hang it on a global variable. Hmm. Because it's so similar to other things being allocated in shared memory, I just automatically pushed it over to there. But you're right; it doesn't need to be that complicated. Nobody is touching it but the background writer. If we can find something that's a modest improvement on the status quo and we can be confident in quickly, good, but I'd rather have 9.1 go out the door on time without fully fixing this than delay the release. I'm not somebody who needs to be convinced of that. There are two near-commit-quality pieces of this out there now:

1) Keep some BGW cleaning and fsync absorption going while sync is happening, rather than starting it and ignoring everything else until it's done.
2) Compact fsync requests when the queue fills.

If that's all we can get for 9.1, it will still be a major improvement. I realize I only have a very short period of time to complete a major integration breakthrough on the pieces floating around before the goal here has to drop to something less ambitious. I head to the West Coast for a week on the 23rd; I'll be forced to throw in the towel at that point if I can't get the better ideas we have in pieces here all assembled well by then.
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] auto-sizing wal_buffers
Fujii Masao wrote: +int XLOGbuffersMin = 8; XLOGbuffersMin is a fixed value. I think that defining it as a macro rather than a variable seems better. + if (XLOGbuffers > 2048) + XLOGbuffers = 2048; Using XLOG_SEG_SIZE/XLOG_BLCKSZ rather than 2048 seems better. +#wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers Typo: s/32kB/64kB Thanks, I've fixed all these issues and attached a new full patch, pushed to github, etc. Tests give the same results back, and it's nice that it scales to reasonable behavior if someone changes their XLOG segment size. It should be possible to set the value back to the older minimum value of 32kB too. That doesn't actually seem to work though; when I try it I get:

$ psql -c "SELECT name,unit,boot_val,setting,current_setting(name) FROM pg_settings WHERE name IN ('wal_buffers','shared_buffers')"
      name      | unit | boot_val | setting | current_setting
----------------+------+----------+---------+-----------------
 shared_buffers | 8kB  | 1024     | 131072  | 1GB
 wal_buffers    | 8kB  | -1       | 8       | 64kB

Where I was expecting that setting to be 4 instead for 32kB. So there's probably some minor bug left in where I inserted this into the initialization sequence. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8e2a2c5..1c13f20 100644
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*** SET ENABLE_SEQSCAN TO OFF;
*** 1638,1649 ****
      </indexterm>
      <listitem>
       <para>
!       The amount of memory used in shared memory for WAL data.  The
!       default is 64 kilobytes (<literal>64kB</literal>).  The setting need only
!       be large enough to hold the amount of WAL data generated by one
!       typical transaction, since the data is written out to disk at
!       every transaction commit.  This parameter can only be set at server
!       start.
      </para>

      <para>
--- 1638,1659 ----
      </indexterm>
      <listitem>
      <para>
!       The amount of shared memory used for storing WAL data.
The
!       default setting of -1 adjusts this automatically based on the size
!       of <varname>shared_buffers</varname>, making it 1/32 (about 3%) of
!       the size of that normally larger shared memory block.  Automatically
!       set values are limited to a maximum sufficient to hold one WAL
!       segment worth of data, normally 16 megabytes (<literal>16MB</literal>).
!       The smallest allowable setting is 32 kilobytes (<literal>32kB</literal>).
!      </para>
!
!      <para>
!       Since the data is written out to disk at every transaction commit,
!       the setting may only need to be large enough to hold the amount
!       of WAL data generated by one typical transaction.  Larger values,
!       typically at least a few megabytes, can improve write performance
!       on a busy server where many clients are committing at once.
!       Extremely large settings are unlikely to provide additional benefit.
      </para>

      <para>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5b6a230..d057773 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 69,75 ****
  /* User-settable parameters */
  int		CheckPointSegments = 3;
  int		wal_keep_segments = 0;
! int		XLOGbuffers = 8;
  int		XLogArchiveTimeout = 0;
  bool		XLogArchiveMode = false;
  char	   *XLogArchiveCommand = NULL;
--- 69,76 ----
  /* User-settable parameters */
  int		CheckPointSegments = 3;
  int		wal_keep_segments = 0;
! int		XLOGbuffers = -1;
! int		XLOGbuffersMin = 8;
  int		XLogArchiveTimeout = 0;
  bool		XLogArchiveMode = false;
  char	   *XLogArchiveCommand = NULL;
*** GetSystemIdentifier(void)
*** 4780,4790 ****
--- 4781,4819 ----
  /*
   * Initialization of shared memory for XLOG
   */
+
+ void XLOGTuneNumBuffers(void)
+ {
+ 	int		one_segment = XLOG_SEG_SIZE / XLOG_BLCKSZ;
+
+ 	/*
+ 	 * If automatic setting was requested, use about 3% as much memory as
+ 	 * requested for the buffer cache.  Clamp the automatic maximum to the
+ 	 * size of one XLOG segment, while still allowing a larger manual
+ 	 * setting.
Don't go below the default setting in earlier versions:
+ 	 * twice the size of the minimum, which at default build options is 64kB.
+ 	 */
+ 	if (XLOGbuffers == -1)		/* set automatically */
+ 	{
+ 		XLOGbuffers = NBuffers / 32;
+ 		if (XLOGbuffers > one_segment)
+ 			XLOGbuffers = one_segment;
+ 		if (XLOGbuffers < (XLOGbuffersMin * 2))
+ 			XLOGbuffers = XLOGbuffersMin * 2;
+ 	}
+
+ 	/* Enforce a 32KB minimum in every case */
+ 	if (XLOGbuffers < XLOGbuffersMin)
+ 		XLOGbuffers
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: That seems like a bad idea - don't we routinely recommend that people crank this up to 0.9? You'd be effectively bounding the upper range of this setting to a value less than the lowest value we recommend anyone use today. I was just giving an example of how I might do an initial split. There's a checkpoint happening now at time T; we have a rough idea that it needs to be finished before some upcoming time T+D. Currently with default parameters this becomes:

Write: 0.5 * D; Sync: 0

Even though Sync obviously doesn't take zero. The slop here is enough that it usually works anyway. I was suggesting that a quick reshuffling to:

Write: 0.4 * D; Sync: 0.4 * D

might be a good first candidate for how to split the time up better. The fact that this gives less writing time than the current biggest spread possible:

Write: 0.9 * D; Sync: 0

is true. It's also true that in the case where sync time really is zero, this new default would spread writes less than the current default. I think that's optimistic, but it could happen if checkpoints are small and you have a good write cache. Step back from that for a second though. Ultimately, the person who is getting checkpoints at a 5-minute interval, and is being nailed by spikes, should have the option of just increasing the parameters to make that interval bigger. First you increase the measly default segments to a reasonable range, then checkpoint_completion_target is the second one you can try. But from there, you quickly move onto making checkpoint_timeout longer. At some point, there is no option but to give up on checkpoints every 5 minutes as being practical, and make the average interval longer. Whether or not a refactoring here makes things slightly worse for cases closer to the default doesn't bother me too much. What bothers me is the way trying to stretch checkpoints out further fails to deliver as well as it should.
I'd be OK with saying to get the exact same spread situation as in older versions, you may need to retarget for checkpoints every 6 minutes *if* in the process I get a much better sync latency distribution in most cases. Here's an interesting data point from the customer site this all started at, one I don't think they'll mind sharing since it helps make the situation more clear to the community. After applying this code to spread sync out, in order to get their server back to functional we had to move all the parameters involved up to where checkpoints were spaced 35 minutes apart. It just wasn't possible to write any faster than that without disrupting foreground activity. The whole current model where people think of this stuff in terms of segments and completion targets is a UI disaster. The direction I want to go in is where users can say make sure checkpoints happen every N minutes, and something reasonable happens without additional parameter fiddling. And if the resulting checkpoint I/O spike is too big, they just increase the timeout to N+1 or N*2 to spread the checkpoint further. Getting too bogged down thinking in terms of the current, really terrible interface is something I'm trying to break myself of. Long-term, I want there to be checkpoint_timeout, and all the other parameters are gone, replaced by an internal implementation of the best practices proven to work even on busy systems. I don't have as much clarity on exactly what that best practice is the way that, say, I just suggested exactly how to eliminate wal_buffers as an important thing to manually set. But that's the dream UI: you shoot for a checkpoint interval, and something reasonable happens; if that's too intense, you just increase the interval to spread further. There probably will be small performance regression possible vs. the current code with parameter combination that happen to work well on it. 
Preserving every one of those is something that's not as important to me as making the tuning interface simple and clear. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: What is the basis for thinking that the sync should get the same amount of time as the writes? That seems pretty arbitrary. Right now, you're allowing 3 seconds per fsync, which could be a lot more or a lot less than 40% of the total checkpoint time... Just that it's where I ended up when fighting with this for a month on the system I've seen the most problems at. The 3 second number was reverse-engineered from a computation that said aim for an interval of X minutes; we have Y relations on average involved in the checkpoint. The direction my latest patch is struggling to go is computing a reasonable time automatically in the same way--count the relations, do a time estimate, add enough delay so the sync calls should be spread linearly over the given time range. the checkpoint activity is always going to be spikey if it does anything at all, so spacing it out *more* isn't obviously useful. One of the components to the write queue is some notion that writes that have been waiting longest should eventually be flushed out. Linux has this number called dirty_expire_centisecs which suggests it enforces just that, set to a default of 30 seconds. This is why some 5-minute interval checkpoints with default parameters, effectively spreading the checkpoint over 2.5 minutes, can work under the current design. Anything you wrote at T+0 to T+2:00 *should* have been written out already when you reach T+2:30 and sync. Unfortunately, when the system gets busy, there is this congestion control logic that basically throws out any guarantee of writes starting shortly after the expiration time. It turns out that the only things that really work are the tunables that block new writes from happening once the queue is full, but they can't be set low enough to work well in earlier kernels when combined with lots of RAM.
Using the terminology of http://www.mjmwired.net/kernel/Documentation/sysctl/vm.txt at some point you hit a threshold where a process generating disk writes will itself start writeback. This is analogous to the PostgreSQL situation where backends do their own fsync calls. The kernel will eventually move to where those trying to write new data are instead recruited into being additional sources of write flushing. That's the part you just can't make aggressive enough on older kernels; dirty writers can always win. Ideally, the system never digs itself into a hole larger than you can afford to wait to write out. It's a transaction speed vs. latency thing though, and the older kernels just don't consider the latency side well enough. There is a new mechanism in the latest kernels to control this much better: dirty_bytes and dirty_background_bytes are the tunables. I haven't had a chance to test yet. As mentioned upthread, some of the bleeding edge kernels that have this feature available are showing such large general performance regressions in our tests, compared to the boring old RHEL5 kernel, that whether this feature works or not is irrelevant. I haven't tracked down which new kernel distributions work well performance-wise and which don't yet for PostgreSQL. I'm hoping that when I get there, I'll see results like http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-pages , where the ideal setting for dirty_bytes to keep latency under control with BBWC was 15MB. To put that into perspective, the lowest useful setting you can set dirty_ratio to is 5% of RAM. That's 410MB on my measly 8GB desktop, and 3.3GB on the 64GB production server I've been trying to tune.
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] auto-sizing wal_buffers
Tom Lane wrote: I think we need to keep the override capability until the autotune algorithm has proven itself in the field for a couple of years. I agree with Josh that a negative value should be used to select the autotune method. Agreed on both fronts. Attached patch does the magic. Also available in branch walbuffers from git://github.com/greg2ndQuadrant/postgres.git By changing only shared_buffers I get the following quite reasonable automatic behavior:

$ psql -c "SELECT name,unit,boot_val,setting,current_setting(name) FROM pg_settings WHERE name IN ('wal_buffers','shared_buffers')"

      name      | unit | boot_val | setting | current_setting
----------------+------+----------+---------+-----------------
 shared_buffers | 8kB  |     1024 |    3072 | 24MB
 wal_buffers    | 8kB  |       -1 |      96 | 768kB
 shared_buffers | 8kB  |     1024 |    4096 | 32MB
 wal_buffers    | 8kB  |       -1 |     128 | 1MB
 shared_buffers | 8kB  |     1024 |   16384 | 128MB
 wal_buffers    | 8kB  |       -1 |     512 | 4MB
 shared_buffers | 8kB  |     1024 |  131072 | 1GB
 wal_buffers    | 8kB  |       -1 |    2048 | 16MB
 shared_buffers | 8kB  |     1024 |  262144 | 2GB
 wal_buffers    | 8kB  |       -1 |    2048 | 16MB

If you've set it to the auto-tuning behavior, you don't see that setting of -1 in the SHOW output; you see the value it's actually been set to. The only way to know it was set automatically is to look at boot_val as I've shown here. I consider this what admins would prefer, as the easy way to expose the value that was used. I would understand if people considered it a little odd though. Since you can't change it without a postgresql.conf edit and a server start anyway, and it's tersely documented in the sample postgresql.conf what -1 does, I don't see this being a problem for anyone in the field. To try and clear up some of the confusion around how the earlier documentation suggests larger values of this aren't needed, I added the following updated description of how this has been observed to work for admins in practice: ! Since the data is written out to disk at every transaction commit, ! 
the setting may only need to be large enough to hold the amount ! of WAL data generated by one typical transaction. Larger values, ! typically at least a few megabytes, can improve write performance ! on a busy server where many clients are committing at once. ! Extremely large settings are unlikely to provide additional benefit. And to make this as easy as possible to apply if I got this right, here's some proposed commit text: Automatically set wal_buffers to be proportional to the size of shared_buffers. Make it 1/32 as large when the auto-tuned behavior, which is the default and set with a value of -1, is used. The previous default of 64kB is still enforced as a minimum value. The maximum automatic value is limited to 16MB. (Note that this is not exactly what I put in my own commit message if you grab from my repo; that had a typo) -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 8e2a2c5..c3f5632 100644 *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** SET ENABLE_SEQSCAN TO OFF; *** 1638,1649 </indexterm> <listitem> <para> ! The amount of memory used in shared memory for WAL data. The ! default is 64 kilobytes (<literal>64kB</literal>). The setting need only ! be large enough to hold the amount of WAL data generated by one ! typical transaction, since the data is written out to disk at ! every transaction commit. This parameter can only be set at server ! start. </para> <para> --- 1638,1659 </indexterm> <listitem> <para> ! The amount of shared memory used for storing WAL data. The ! default setting of -1 adjusts this automatically based on the size ! of <varname>shared_buffers</varname>, making it 1/32 (about 3%) of ! the size of that normally larger shared memory block. Automatically ! set values are limited to a maximum of 16 megabytes ! 
(<literal>16MB</literal>), sufficient to hold one WAL segment worth of data. ! The smallest allowable setting is 64 kilobytes (<literal>64kB</literal>). ! </para> ! ! <para> ! Since the data is written out to disk at every transaction commit, ! the setting may only need to be large enough to hold the amount ! of WAL data generated by one typical transaction. Larger values, ! typically at least a few megabytes, can improve write performance ! on a busy server where many clients are committing at once. ! Extremely large settings
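The auto-tuning rule in the proposed commit text--1/32 of shared_buffers, clamped between 64kB and 16MB--can be sketched as a model of the behavior shown in the pg_settings output above (not the actual C code):

```python
def auto_wal_buffers(shared_buffers_8k_pages):
    """Model of the -1 (auto) wal_buffers setting: 1/32 of
    shared_buffers, clamped to [64kB, 16MB], in 8kB pages."""
    min_pages = 64 * 1024 // 8192          # 64kB minimum  -> 8 pages
    max_pages = 16 * 1024 * 1024 // 8192   # 16MB maximum  -> 2048 pages
    return max(min_pages, min(shared_buffers_8k_pages // 32, max_pages))

# Reproduces the table: 24MB -> 768kB, and 1GB/2GB both cap at 16MB.
print(auto_wal_buffers(3072))    # 96 pages   = 768kB
print(auto_wal_buffers(131072))  # 2048 pages = 16MB
print(auto_wal_buffers(262144))  # 2048 pages = 16MB
```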
Re: [HACKERS] auto-sizing wal_buffers
Kevin Grittner wrote: I guess a manual override doesn't bother me too much, but I am a bit dubious of its value, and there is value in keeping the GUC count down... It's a risk-reward thing really. The reward for removing it is that a few lines of code and a small section of the documentation go away. It's not very big. The risk seems low, but it's not zero. Let's say this goes in, we get to 9.2 or later, and a survey suggests that no one has ever needed to set wal_buffers when deploying 9.1. At that point I think everyone would feel much better about nuking it altogether. I just looked at the code again when developing the patch, and there's really not enough benefit to removing it to be worth taking any risk right now. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] We need to log aborted autovacuums
Josh Berkus wrote: It occurs to me that another way of diagnosis would simply be a way to cause the autovac daemon to spit out output we could camp on, *without* requiring the huge volumes of output also required for DEBUG3. This brings us back to the logging idea again. Right. I don't know how much you looked at the little patch I threw out there yet, but I think it's a reasonable 90% solution to the problem of "how do I distinguish between a locking block and other reasons for AV not running?" without generating a huge amount of log data. Since there's no real overhead or code complexity added by it, it's a relatively easy thing to throw in the codebase without annoying anyone. I really don't think that argument applies to either patch--last_autovac_attempt *or* the last_stats_reset time--since neither event is expected to occur frequently. If you have last_autovac_attempt (for example) being updated frequently, you clearly had a db problem bigger than the size of the stats table. It just returns to the usual "why make everyone pay overhead for something that only happens to a small percentage of users?" argument.[1] I personally have all sorts of additional data I'd like to instrument in myriad obtrusive spots of code. All I'm saying is that this one in particular I don't quite feel strongly enough about to make it the spot where I try to establish that foothold. If a stats tracking patch for this appeared that seemed reasonably well written (i.e. it tries to keep the added value NULL whenever it's not relevant), I'd be in favor of it going in. I'm just not so excited about the idea that I'm going to write it myself. I'm lukewarm on a per-table last_stats_reset time just because there are so few places I'd actually use it in practice, relative to the guaranteed overhead it adds. [1] Silly aside: I was thinking today that I should draw a chart of all the common objections to code that show up here, looking like those maps you see when walking into a mall. 
Then we can give a copy to new submitters with a big "you are here" X marking where they have inadvertently wandered. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] We need to log aborted autovacuums
Robert Treat wrote: This is a great use case for user level tracing support. Add a probe around these bits, and you can capture the information when you need it. Sure. I would also like a pony. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] We need to log aborted autovacuums
Josh Berkus wrote: Or should it perhaps be a per-table counter in pg_stat_user_tables, given your statement above? Or even a timestamp: last_autovacuum_attempt, which would record the last time autovacuum was tried. If that's fairly recent and you have a large number of dead rows, you know what kind of problem you have and can turn on debug. These are both reasonable ideas. But there was just some kickback on Tomas's "keeping a timestamp of the last stats reset" patch recently, from the perspective of trying to limit per-table stats bloat. I think it's relatively easy to make a case that this situation is difficult enough to diagnose that a little bit of extra low-level logging is worthwhile. That Josh and I have both been bitten by it enough to be thinking about patches to make it easier to diagnose suggests it's obviously too hard to nail down. But is this so common and difficult to recognize that it's worth making all the table stats bigger? That's a harder call. It's already possible to detect the main symptom--dead row percentage is much higher than the autovacuum threshold, but there's been no recent autovacuum. That makes me less enthusiastic that there's a genuine enough need to justify the overhead of storing more table stats just to detect the same thing a little more easily. I've been playing with the Munin PG plug-in more recently, and I was just thinking of adding a dead row trend graph/threshold to it to address this general area instead. We could argue both sides of the trade-off of tracking this directly in stats for some time, and I'd never expect there to be a clear victory for either perspective. I've run into this vacuum problem a few times, but certainly less often than I've run into "why is the stats table so huge?" -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
Heikki Linnakangas wrote: You can of course LOCK TABLE as a work-around, if that's what you want. What I was trying to suggest upthread is that while there are other possible ways around this problem, the only one that has any hope of shipping with 9.1 is to do just that. So from my perspective, the rest of the discussion about the right way to proceed is moot for now. For some reason it didn't hit me until you said this that I could do the locking manually in my test case, without even touching the server-side code yet. Attached are a new pair of scripts where each pgbench UPDATE statement executes an explicit LOCK TABLE. Here's the result of a sample run here:

$ pgbench -f update-merge.sql -T 60 -c 16 -j 4 -s 2 pgbench
starting vacuum...end.
transaction type: Custom query
scaling factor: 2
query mode: simple
number of clients: 16
number of threads: 4
duration: 60 s
number of transactions actually processed: 84375
tps = 1405.953672 (including connections establishing)
tps = 1406.137456 (excluding connections establishing)

$ psql -c 'select count(*) as updated FROM pgbench_accounts WHERE NOT abalance=0' -d pgbench
 updated
---------
   68897
(1 row)

$ psql -c 'select count(*) as inserted FROM pgbench_accounts WHERE aid > 10' -d pgbench
 inserted
----------
    34497
(1 row)

No assertion crashes, no duplicate key failures. All the weird stuff I was running into is gone, so that's decent evidence the worst of the problems were all because the heavy lock I was expecting just wasn't integrated into the patch. Congratulations to Boxuan: for the first time this is starting to act like a viable feature addition to me, just one with a moderately long list of limitations and performance issues. 1400 TPS worth of UPSERT on my modest 8-core desktop (single drive with cheating fsync) isn't uselessly slow. If I add SET TRANSACTION ISOLATION LEVEL SERIALIZABLE; just after the BEGIN;, I don't see any serialization errors, and performance is exactly the same. 
Run a straight UPDATE over only the existing range of keys, and I get 7000 TPS instead. So the locking etc. is reducing performance to 20% of its normal rate, on this assertion+debug build. I can run this tomorrow (err, later today I guess, looking at the time) on a proper system with BBWC and without assertions to see if the magnitude of the difference changes, but I don't think that's the main issue here. Presuming the code quality issues and other little quirks I've documented (and new ones yet to be discovered) can get resolved here--and that's a sizeable open question--I could see shipping this with the automatic heavy LOCK TABLE in there. Then simple UPSERT could work out of the box via a straightforward MERGE. We'd need a big warning disclaiming that concurrent performance is very limited in this first release of the feature, but I don't know that this is at an unacceptable level of slow for smaller web apps and such. Until proper fine-grained concurrency is implemented, I think it would be PR suicide to release a version of this without a full table lock happening automatically though. The idea Robert advocated well--that it would be possible for advanced users to use even this rough feature in a smarter way to avoid conflicts and not suffer the full performance penalty--is true. But if you consider the main purpose here to be making it easier to get smaller MySQL apps and the like ported to PostgreSQL (which is what I see as goal #1), putting that burden on the user is just going to reinforce the old "PostgreSQL is so much harder than MySQL" stereotype. I'd much prefer to see everyone have a slow but simple to use UPSERT via MERGE available initially, rather than to worry about optimizing for the advanced user in a way that makes life harder for the newbies. The sort of people who must have optimal performance already have trigger functions available to them, which they can write and tweak for best performance. 
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

test-merge.sh Description: Bourne shell script

\set nbranches :scale
\set ntellers 10 * :scale
\set naccounts 10 * :scale
\setrandom aid 1 :naccounts
\setrandom bid 1 :nbranches
\setrandom tid 1 :ntellers
\setrandom delta -5000 5000
BEGIN;
-- Optional mode change
-- SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
LOCK TABLE pgbench_accounts;
MERGE INTO pgbench_accounts t
  USING (SELECT :aid,1+(:aid / 100)::integer,:delta,'') AS s(aid,bid,balance,filler)
  ON s.aid=t.aid
  WHEN MATCHED THEN UPDATE SET abalance=abalance + s.balance
  WHEN NOT MATCHED THEN INSERT VALUES(s.aid,s.bid,s.balance,s.filler);
COMMIT;
-- This syntax worked with MERGE v203 patch, but isn't compatible with v204
--MERGE INTO pgbench_accounts t USING (VALUES (:aid,1+(:aid / 100)::integer,:delta,'')) AS s
Re: [HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
Robert Haas wrote: And even if it isn't, the MERGE syntax is insane if what you really want to do is insert or update ONE record. If all we have is MERGE, people will keep doing it with a PL/pgsql stored procedure or some crummy application logic just so that they don't have to spend several days trying to understand the syntax. Heck, I understand the syntax (or I think I do) and I still think it's more trouble than it's worth. I hoped that the manual would have a clear example of "this is how you do UPSERT with MERGE", preferably cross-linked to the existing Example 39-2 ("Exceptions with UPDATE/INSERT") trigger implementation that's been the reference implementation for this for a long time, so people can see both alternatives. New users will cut and paste that example into their code, and in the beginning neither know nor care how MERGE actually works, so long as the example does what it claims. I would wager the majority of PL/pgsql implementations of this requirement start the exact same way. I don't think the learning curve there is really smaller; it's just that you've already been through it. I've been purposefully ignoring the larger applications of MERGE in the interest of keeping focus on a manageable subset. But the more general feature set is in fact enormously useful for some types of data warehouse applications. Build REPLACE, and you've built REPLACE. Build a MERGE that is REPLACE now and eventually full high-performance MERGE, and you've done something with a much brighter future. I don't think the concurrency hurdles here are unique to this feature either, as shown by the regular overlap noted with the other serialization work. 
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)
t...@fuzzy.cz wrote: - I really am not sure about the changes made in pg_proc.h. I'm not sure how to assign OIDs to the new functions (I've simply chosen values that were not used in this file), and I'm not sure about the other columns (I've copied and modified another function with the same parameter/return types) The description of the columns is at the beginning of pg_proc.h, the part that begins with CATALOG(pg_proc,1255)... The descriptions of some of the first 11 fields are mostly straightforward. The first fun part is that how many times the information expected in the second VARIABLE LENGTH FIELDS section repeats varies based on the parameters listed. The other thing that's usually confusing is that the types for the values are all expressed as type OID numbers. For example, if you see 25, that's the OID of the text type. You can see the whole list with: select oid,typname from pg_type; And if you go back to the file with that list in hand, a lot more of it should make sense. If you see a multiple parameter type list like "23 21", that's a function whose input values are of types (int4,int2). As for getting a new OID, if you go into src/include/catalog/ and run the unused_oids script, it will give you some help in figuring out which have been used and which haven't. It's not worth getting too stressed about the number you choose in the patch submission, because commits between when you got your free number and when your patch is considered for commit can make your choice worthless anyway. There's a process referred to as a "catversion bump", where the database catalog version gets updated to reflect things like new pg_proc information, that committers take care of as one of the last adjustments before final commit. Doing a final correction to the OID choice is a part every committer knows to look at. 
I wrote a talk that covered some additional trivia in this area, as well as other things people tend to get confused about in the source code, that you can find at http://www.pgcon.org/2010/schedule/attachments/142_HackingWithUDFs.pdf ; that might be helpful for some other things you might wonder about eventually. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] keeping a timestamp of the last stats reset (for a db, table and function)
Tomas Vondra wrote: OK, so here goes the simplified patch - it tracks one reset timestamp for a background writer and for each database. Adding timestamps like this was something I wanted to do after adding pg_stat_reset_shared to 9.0, so since you've beaten me to it I'll review your patch and make sure it all works the way I was hoping instead. The whole per-table idea was never going to fly given how few people touch this area at all, but the way you're tracking things now seems reasonable. When you post an updated version of a patch that's already being tracked on the CommitFest app, please try to remember to add that update to the tracker there. I just did that for this 12/23 update so that's covered already. Next problem is that the preferred method for submitted patches uses context diffs. See http://wiki.postgresql.org/wiki/Working_with_Git for some information about the somewhat annoying way you have to set up git to generate those. Don't worry about that for this round though. I don't care about the diff formatting given the code involved, but it's something you should sort out if you do another update. PS: I've noticed Magnus posted a patch to track recovery conflicts, adding a new view pg_stat_database_conflicts. I have not checked it yet but it should not influence this patch. I need to do some testing of that anyway, so I'll take a look at any potential clash as part of my review. I want to check how this interacts with track_functions resets too. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] pg_dump --split patch
Joel Jacobson wrote: To understand a change to my database functions, I would start by looking at the top-level, only focusing on the names of the functions modified/added/removed. At this stage, you want as little information as possible about each change, such as only the names of the functions. To do this--get a list of changed functions--you cannot compare two full schema plain text dumps using diff, as it would only reveal the lines changed, not the names of the functions, unless you are lucky enough to get the name of the function within the (by default) 3 lines of copied context. While you could increase the number of copied lines of context to a value which would ensure you would see the name of the function in the diff, that is not feasible if you want to quickly get a picture of the code areas modified, since you would then need to read through even more lines of diff output. I can agree with some use cases you've outlined, where there's merit to the general idea of your patch. But as an aside, you really should launch an investigation into some better diff tools if this is how you're doing this type of work. Last week I reviewed 3K lines worth of changes from two versions of a 12K line schema dump I'd never seen before in a couple of hours using kdiff3. I'd have killed myself before finishing if I had to do the same job with traditional diff as you're describing it here. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] Sync Rep Design
Simon Riggs wrote: Based upon that series of conversations, I've reworked the design so that there is (currently) only a single standby offering sync rep at any one time. Other standbys can request to be sync standbys but they only become the sync standby if the first one fails. Which was simple to do and bridges the challenges of an exactly identified sync standby and the fragility of too closely specifying the config. That seems like a good enough starting point to cover a lot of cases. Presuming the "two servers each at two sites" config that shows up in a lot of these discussions, people in the "I need sync to a remote spot" camp can get that, and if that site is unavailable for long enough to be kicked out they'll smoothly degrade to sync on a secondary local copy. And those who want high performance local sync and best-effort for the remote site can configure for that too; if the local secondary dies they'll just degrade to the slow remote commits. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
Kevin Grittner wrote: Greg Smith wrote: I could see shipping this with the automatic heavy LOCK TABLE in there. How would you handle or document behavior in REPEATABLE READ isolation? The lock doesn't do much good unless you acquire it before you get your snapshot, right? Hand-wave and hope you offer a suggested implementation? I haven't gotten to thinking about this part just yet--am still assimilating toward a next move after the pleasant surprise that this is actually working to some degree now. You're right that the high-level idea of "just lock the table" actually has to be mapped into exact snapshot mechanics and pitfalls before moving in that direction will get very far. I'm probably not the right person to answer just how feasible that is this week though. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
David Fetter wrote: How about implementing an UPSERT command as "take the lock, do the merge"? That way, we'd have both the simplicity for the simpler cases and a way to relax consistency guarantees for those who would like to do so. The main argument against is that that path leads to a permanent non-standard wart to support forever, just to work around what should be a short-term problem. And I'm not sure that reducing the goals to only this actually improves the ability to ship something in the near term much, either. Many of the hard problems people are bothered by don't go away; it just makes deciding which side of the speed/complexity trade-off you're more interested in more obvious. What I've been advocating is making that decision go away altogether by only worrying about the simple-to-use and slow path for now, but that's a highly debatable viewpoint, and I appreciate the resistance to it--if it's possible to do at all. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] We need to log aborted autovacuums
Josh Berkus wrote: I've been trying to diagnose in a production database why certain tables never get autovacuumed despite having a substantial % of updates. The obvious reason is locks blocking autovacuum from vacuuming the table ... Missed this discussion when it popped up, but have plenty to say about it now. What I do here is look for such anomalies using pg_stat_user_tables: check whether the dead rows number has exceeded the vacuum threshold. That comparison is logged in the code at DEBUG3:

elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), anl: %.0f (threshold %.0f)",
     NameStr(classForm->relname),
     vactuples, vacthresh, anltuples, anlthresh);

But a rough computation isn't too hard to derive in a report, so long as you haven't customized per-table parameters. As you suggested in your message, the really bad cases here stick out a whole lot. If you pay the slightest amount of attention to the dead row percentages they jump right out at you. This all works easily on any version back to 8.3. Not having as much relevant data stored in pg_stat_user_tables makes the problem cases less obvious to spot in older versions. If I start seeing these badly maintained tables and suspect locking is getting in the way, I then dump traces from pg_locks+pg_stat_activity often enough that I can estimate how often someone has an interfering lock and what they're doing. Should the log level on this message go up from DEBUG3? I could see rewriting it so that it logs at DEBUG1 instead when Log_autovacuum_min_duration is set *and* when the trigger threshold is crossed, and at DEBUG3 the rest of the time. Given you can derive this with a bit of work in userland, I don't see this even being justified as an INFO or LOG level message. Anytime I can script a SQL-level monitor for something that's easy to tie into Nagios or something, I greatly prefer that to log file scraping for it anyway. 
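The "rough computation" being referred to is the standard autovacuum trigger formula: base threshold plus scale factor times the table's row count, compared against dead tuples. A sketch of it, assuming the stock default GUC values (per-table storage parameters would override them):

```python
def autovacuum_vacuum_threshold(reltuples,
                                base_threshold=50,     # autovacuum_vacuum_threshold default
                                scale_factor=0.2):     # autovacuum_vacuum_scale_factor default
    """Dead-tuple count above which autovacuum should trigger a
    vacuum, mirroring the vacthresh value logged at DEBUG3."""
    return base_threshold + scale_factor * reltuples

def needs_vacuum(dead_tuples, reltuples):
    """True when a table looks overdue: dead rows past the threshold."""
    return dead_tuples > autovacuum_vacuum_threshold(reltuples)

# A 1M-row table with default settings needs ~200,050 dead rows:
print(autovacuum_vacuum_threshold(1_000_000))  # 200050.0
print(needs_vacuum(250_000, 1_000_000))        # True
```

Tables where needs_vacuum stays True while last_autovacuum in pg_stat_user_tables never advances are the anomaly cases being described.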
What I'd like to do is add some logging code to autovacuum.c so that if log_autovacuum is any value other than -1, failure to vacuum due to locks gets logged. Does this make sense? The general idea is interesting, and probably more productive for the situation you theorize is happening than messing with the logging discussed above. But that's not where the code needs to go--the lock isn't taken until much further down the function call stack. The attached quickie patch (only tested for compilation) probably does what you want here. Since this would eliminate the messy follow-up step I sometimes have gone through--dumping pg_locks data to confirm or rule out locking issues messing with AV processing--I can see some potential that it would have simplified situations I've run into in the past. And it's not out of line with the logging verbosity of similar failure mode tests that follow it. Right now failure to acquire a lock is just not considered a log-worthy issue, and I agree that it's worth considering whether it should be. If you could gather more info on whether this logging catches the problem cases you're seeing, that would really be the right test for the patch's usefulness. I'd give you solid 50/50 odds that you've correctly diagnosed the issue, and knowing for sure would make advocating for this logging a pretty easy sell to me at least. 
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 2f68df4..2124e25 100644
*** a/src/backend/commands/vacuum.c
--- b/src/backend/commands/vacuum.c
*** vacuum_rel(Oid relid, VacuumStmt *vacstm
*** 851,856 ****
--- 851,864 ----
  	{
  		PopActiveSnapshot();
  		CommitTransactionCommand();
+
+ 		if (IsAutoVacuumWorkerProcess() && Log_autovacuum_min_duration >= 0)
+ 		{
+ 			ereport(INFO,
+ 					(errmsg("skipping \"%s\" --- cannot open or obtain lock",
+ 							RelationGetRelationName(onerel))));
+ 		}
+
  		return;
  	}
Re: [HACKERS] management of large patches
Robert Haas wrote: - MERGE - checkpoint improvements As far as these two go, the state of MERGE is still rougher than I would like. The code itself isn't too hard to read, and the fact that the errors popping up tend to be caught by assertions (rather than just being mysterious crashes) makes me feel a little better that there's some defensive coding in there. It's still a 3648-line patch that touches grammar, planner, and executor bits though, and I've been doing mainly functional and coding style review so far. I'm afraid there are not many committers in a good position to actually consume the whole scope of this thing for a commit-level review. And the way larger patches tend to work here, I'd be surprised to find it passes through such a review without some as yet unidentified major beef appearing. Will see what we can do to help move this forward more before the CF start. The checkpoint changes I'm reworking are not really large from a code complexity or size perspective--I estimate around 350 lines of diff, with the rough version I submitted to CF2010-11 at 258. I suspect it will actually be the least complicated patch to consume from that list, from a committer perspective. The complexity there is mainly in the performance testing. I've been gearing up infrastructure the last couple of weeks to automate and easily publish all the results I collect there. The main part that hasn't gone through any serious testing yet, auto-tuning the spread interval, will also be really easy to revert if a problem is found there. With Simon and I both reviewing each other's work on this already, I hope we can keep this one from clogging the committer critical path you're worried about here. 
Re: [HACKERS] Recovery conflict monitoring
Couple of doc suggestions:

--- doc/src/sgml/high-availability.sgml
+     The number of query cancels and the reason for them can be viewed using
+     the <structname>pg_stat_database_conflicts</> system view on the slave
+     server.

For completeness' sake, this should also mention the per-database summary, even though I'm not sure how valuable that view is. Also, this should say "on a standby server" instead of "on the slave server". "slave" is mentioned once as a synonym in high-availability.sgml, but that's it, and there can be more than one standby you want to pull these stats from.

*** doc/src/sgml/monitoring.sgml
!       number of rows returned, fetched, inserted, updated and deleted, and
!       total number of queries cancelled due to conflict with recovery.

This would be clearer if it said you're talking about standby recovery here, and possibly that this info is only available on the standby. I could see someone reading this and thinking it's possible for general database crash recovery to produce cancelled queries, instead of the way connections are actually blocked until that's done.

!      <entry><structname>pg_stat_database_conflicts</></entry>
!      <entry>One row per database, showing database OID, database name and
!      the number of queries that have been cancelled in this database due to
!      dropped tablespaces, lock timeouts, old snapshots, pinned buffers and
!      deadlocks.

A clarification that you're talking about standby query cancellation here might be helpful too. I don't think that's necessary for all of the detailed pg_stat_get_* functions that regular users are less likely to care about, just these higher level ones.
Re: [HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
Marko Tiikkaja wrote: I'm confused. Are you saying that the patch is supposed to lock the table against concurrent INSERT/UPDATE/DELETE/MERGE? Because I don't see it in the patch, and the symptoms you're having are a clear indication of the fact that it's not happening. I also seem to recall that people thought locking the table would be excessive. That's exactly what it should be doing. I thought I'd seen just that in one of the versions of this patch, but maybe that's a mistaken memory on my part. In advance of the planned but not yet available ability to lock individual index key values, locking the whole table is the only implementation I'm aware of that can work correctly here. In earlier versions, I think this code was running into issues before it even got to there. If you're right that things like the duplicate key error in the current version are caused exclusively by not locking enough, adding that lock may be the next necessary step here.
Re: [HACKERS] Fwd: new patch of MERGE (merge_204) a question about duplicated ctid
Erik Rijkers wrote: I get some whitespace warnings, followed by an error:

$ git apply /home/rijkers/download/pgpatches/0091/merge/20101206/merge_204_2010DEC06.patch
/home/rijkers/download/pgpatches/0091/merge/20101206/merge_204_2010DEC06.patch:481: trailing whitespace.
/home/rijkers/download/pgpatches/0091/merge/20101206/merge_204_2010DEC06.patch:482: trailing whitespace.
        if (IsA(plan, ModifyTable)
/home/rijkers/download/pgpatches/0091/merge/20101206/merge_204_2010DEC06.patch:550: trailing whitespace.
        /*print the action qual*/
/home/rijkers/download/pgpatches/0091/merge/20101206/merge_204_2010DEC06.patch:556: trailing whitespace.
        (act_plan->operation == CMD_INSERT ||
/home/rijkers/download/pgpatches/0091/merge/20101206/merge_204_2010DEC06.patch:560: trailing whitespace.
error: patch failed: src/backend/optimizer/plan/planner.c:739
error: src/backend/optimizer/plan/planner.c: patch does not apply

Maybe I'm doing something wrong, but I've never had good luck with git apply. I took this patch and applied it to the 12/15 copy of HEAD I had checked out (trying to minimize drift in there since the patch was created) using:

patch -p 1 < merge_204_2010DEC06.patch

There was one trivial conflict it produced src/backend/optimizer/plan/planner.c.rej for, and that fix was straightforward to apply by hand. The result is now sitting as the merge204 branch in my github repo: https://github.com/greg2ndQuadrant/postgres/tree/merge204 if you did want to try this out.
[HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
I did some basic testing of the latest update here, but quickly hit a problem that wasn't in the previous version. Attached is the standalone test script that used to work, but now fails like this:

psql:simple.sql:12: ERROR: the vars in merge action tlist of qual should only belongs to the source table or target table

This test case is intended to implement the common UPSERT situation that is one of the main requests that MERGE is intended to satisfy, using this syntax:

MERGE INTO Stock t
 USING (VALUES(10,100)) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;

If you can suggest an alternate way to express this that works with the new patch, I might switch to that and retry. I was never 100% sure this was the right way to write this, and I don't have another database with MERGE support here to try against. (Aside: if someone else does, I'd be really curious to see if the attached test case works or not on another database system. I think we need to include compatibility testing with other MERGE implementations into the test mix here soon.) Regardless, this failure suggests that you need to add this sort of test to the regression test set. We need to have an example of an UPSERT using constant data in there to make sure this continues to work in the future. This is a good week for me in terms of having time for PostgreSQL hacking, so if you can suggest something here or update the patch I'll try it soon afterwards. 
DROP TABLE Stock;
CREATE TABLE Stock(item_id int UNIQUE, balance int);
INSERT INTO Stock VALUES (10, 2200);
INSERT INTO Stock VALUES (20, 1900);
SELECT * FROM Stock ORDER BY item_id;

MERGE INTO Stock t
 USING (VALUES(10,100)) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;
SELECT * FROM Stock ORDER BY item_id;

MERGE INTO Stock t
 USING (VALUES(30,2000)) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;
SELECT * FROM Stock ORDER BY item_id;
Re: [HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
Marko Tiikkaja wrote: As far as I can tell, this should work. I played around with the patch and the problem seems to be the VALUES:

MERGE INTO Stock t
 USING (SELECT 30, 2000) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;
MERGE 1

Good catch...while I think the VALUES syntax should work, that is a useful workaround so I could keep testing. I rewrote like this (original syntax commented out):

MERGE INTO Stock t
-- USING (VALUES(10,100)) AS s(item_id,balance)
 USING (SELECT 10,100) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;

And that got me back again to concurrent testing. Moving onto the next two problems...the basic MERGE feature seems to have stepped backwards a bit too. I'm now seeing these quite often:

ERROR: duplicate key value violates unique constraint "pgbench_accounts_pkey"
DETAIL: Key (aid)=(176641) already exists.
STATEMENT: MERGE INTO pgbench_accounts t USING (SELECT 176641,1+(176641 / 100)::integer,168,'') AS s(aid,bid,balance,filler) ON s.aid=t.aid WHEN MATCHED THEN UPDATE SET abalance=abalance + s.balance WHEN NOT MATCHED THEN INSERT VALUES(s.aid,s.bid,s.balance,s.filler);

On my concurrent pgbench test, which had been working before. 
Possibly causing that, the following assertion is tripping:

TRAP: FailedAssertion(!(epqstate->origslot != ((void *) 0)), File: "execMain.c", Line: 1762)

That's coming from the following code:

void
EvalPlanQualFetchRowMarks(EPQState *epqstate)
{
	ListCell   *l;

	Assert(epqstate->origslot != NULL);

	foreach(l, epqstate->rowMarks)

Stepping back to summarize...here's a list of issues I know about with the current v204 code:

1) VALUES syntax doesn't work anymore
2) Assertion failure in EvalPlanQualFetchRowMarks
3) Duplicate key bug (possibly a direct result of #2)
4) Attempts to use MERGE in a function spit back ERROR: table is not a known fuction
5) The ctid junk attr handling needs to be reviewed more carefully, based on author request.

I've attached the current revisions of all my testing code in hopes that Boxuan might try and replicate these (the attached script makes it simple to replicate #1 through #3), and therefore confirm whether changes made do better.

DROP TABLE Stock;
CREATE TABLE Stock(item_id int UNIQUE, balance int);
INSERT INTO Stock VALUES (10, 2200);
INSERT INTO Stock VALUES (20, 1900);
SELECT * FROM Stock ORDER BY item_id;

MERGE INTO Stock t
-- USING (VALUES(10,100)) AS s(item_id,balance)
 USING (SELECT 10,100) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;
SELECT * FROM Stock ORDER BY item_id;

MERGE INTO Stock t
-- USING (VALUES(30,2000)) AS s(item_id,balance)
 USING (SELECT 30,2000) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;
SELECT * FROM Stock ORDER BY item_id;

\set nbranches :scale
\set ntellers 10 * :scale
\set naccounts 10 * :scale
\setrandom aid 1 :naccounts
\setrandom bid 1 :nbranches
\setrandom tid 1 :ntellers
\setrandom delta -5000 5000
MERGE INTO pgbench_accounts t USING (SELECT :aid,1+(:aid / 100)::integer,:delta,'') AS s(aid,bid,balance,filler) ON s.aid=t.aid WHEN MATCHED THEN UPDATE SET abalance=abalance + s.balance WHEN NOT MATCHED THEN INSERT VALUES(s.aid,s.bid,s.balance,s.filler);
-- This syntax worked with MERGE v203 patch, but isn't compatible with v204
--MERGE INTO pgbench_accounts t USING (VALUES (:aid,1+(:aid / 100)::integer,:delta,'')) AS s(aid,bid,balance,filler) ON s.aid=t.aid WHEN MATCHED THEN UPDATE SET abalance=abalance + s.balance WHEN NOT MATCHED THEN INSERT VALUES(s.aid,s.bid,s.balance,s.filler);

[Attachment: test-merge.sh -- Bourne shell script]
Re: [HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
Marko Tiikkaja wrote: I have no idea why it worked in the past, but the patch was never designed to work for UPSERT. This has been discussed in the past and some people thought that that's not a huge deal. It takes an excessively large lock when doing UPSERT, which means its performance under a heavy concurrent load can't be good. The idea is that if the syntax and general implementation issues can get sorted out, fixing the locking can be a separate performance improvement to be implemented later. Using MERGE for UPSERT is the #1 use case for this feature by a gigantic margin. If that doesn't do what's expected, the whole implementation doesn't provide the community anything really worth talking about. That's why I keep hammering on this particular area in all my testing. One of the reflexive "I can't switch to PostgreSQL easily" stopping points for MySQL users is "I can't convert my ON DUPLICATE KEY UPDATE code". Every other use for MERGE is a helpful side-effect of adding the implementation in my mind, but not the primary driver of why this is important. My hints in this direction before didn't get adopted, so I'm saying it outright now: this patch must have an UPSERT implementation in its regression tests. And the first thing I'm going to do every time a new rev comes in is try and break it with the pgbench test I attached. If Boxuan can start doing that as part of his own testing, I think development here might start moving forward faster. I don't care so much about the rate at which concurrent UPSERT-style MERGE happens, so long as it doesn't crash. But that's where this has been stuck for a while now.
Re: [HACKERS] Unnecessary limit on max_standby_streaming_delay
Magnus Hagander wrote: The limit on max_standby_streaming_delay is currently 35 minutes (around) - or you have to set it to unlimited. This is because the GUC is limited to MAX_INT/1000, unit milliseconds. Is there a reason for the /1000, or is it just an oversight thinking the unit was in seconds? If we can get rid of the /1000, it would make the limit over three weeks, which seems much more reasonable. Or if we made it a 64-bit counter it would go away completely.

The code for this uses TimestampTzPlusMilliseconds to determine its deadline:

return TimestampTzPlusMilliseconds(rtime, max_standby_streaming_delay);

Which is implemented like this:

#define TimestampTzPlusMilliseconds(tz,ms) ((tz) + ((ms) * (int64) 1000))

My guess is that the range was limited at some point to avoid concerns about integer overflow in that multiplication, which I don't think actually is a risk due to the int64 cast there. I confirmed the upper limit on this thing is the roughly 36 minutes that Magnus suggests. This is a terrible limitation to have slipped through. For the pg_dump from the standby use case, the appropriate setting for most people is in the multiple hours range, but not unlimited (lest an out of control backup linger to overlap with what happens the next day). Whichever approach is taken to make this better, I think it deserves a backport too.
Re: [HACKERS] directory archive format for pg_dump
Moving onto the directory archive part of this patch, the feature seems to work as advertised; here's a quick test case:

createdb pgbench
pgbench -i -s 1 pgbench
pg_dump -F d -f test
pg_restore -k test
pg_restore -l test
createdb copy
pg_restore -d copy test

The copy made that way looked good. There's a good chunk of code in the patch that revolves around BLOB support. We need to get someone who is more familiar with those than me to suggest some tests for that part before this gets committed. If you could suggest how to test that code, that would be helpful. There are a number of small things that I'd like to see improved in a new rev of this code:

pg_dump: the help message for --file needs to mention that this is overloaded to also specify the output directory too.

pg_dump: the documentation for --file should say the directory is created, and must not exist when you start. The code catches this well, but that expectation is not clear until you try it.

pg_restore: the help message "check the directory archive" would be clearer as "check an archive in directory format".

There are some tab vs. space whitespace inconsistencies in the documentation added.

The comments at the beginning of functions could be more consistent. Early parts of the code have a header for each function that's extensive--maybe even a bit more than needed. I'm not sure why it's important to document here which of these functions is optional/mandatory, for example, and getting rid of just those would trim a decent number of lines out of the patch. But then at the end, all of the new functions added aren't documented at all. Some of those are near trivial, but it would be better to have at least a small descriptive header for them.

The comment header at the beginning of pg_backup_directory is a bit weird. I guess Philip Warner should still be credited as the author of the code this was based on, but it's weird seeing a new file attributed solely to him. 
Also, there's an XXX in the identification field there that should be filled in with the file name. There's your feedback for this round. I hope we'll see an updated patch from you as part of the next CommitFest.
Re: [HACKERS] Per-column collation
Robert Haas wrote: I don't really have a position on whether or not this patch is ready to commit... but I do think that this is the sort of patch that is very likely to have some bugs almost no matter when we commit it. I just updated the CF app to track Peter's latest update, which remains untested by anyone else as to whether it fixes all the issues brought up. It would be nice to get a re-review to confirm things are still working in advance of CF 2011-01. Given that the reviewer here is also a committer, it's possible for this to go into the tree even outside of the regular CF schedule, if everything checks out. In the interest of closing out this CF, I'm updating this one as returned for now though. That doesn't prevent it from going in anyway once it's confirmed ready, and I agree the sooner the better to help find breakage. But I don't think that's so important that it should block the critical path for the next alpha.
Re: [HACKERS] Per-column collation
Itagaki Takahiro wrote: We might need previous reviewers and active reviewers in commit-fest app. Or, should non-active reviewers delete their names? This is only really an issue with patches that get moved from one CF to the next, which doesn't happen that often. Patches that are marked Returned With Feedback instead get a new entry in the next CF, which avoids this problem.
Re: [HACKERS] Instrument checkpoint sync calls
Robert Haas wrote: On Wed, Dec 15, 2010 at 9:22 AM, Greg Smith g...@2ndquadrant.com wrote: Doesn't seem worth going through the trouble of committing that minor rework on its own, I'll slip it into the next useful thing that touches this area I do. Thanks for the hint, this would work better than what I did. Well, if I'm the one committing it, I'll pull that part out again and commit it separately. Not sure if that affects your calculus, but I much prefer patches that don't try to do ancillary things along the way. I meant that I'd bundle it into the block of time I spend on that, and likely submit it with something else that touches the same area. Obviously the correction patch would be better on its own when being handed over to a committer.
Re: [HACKERS] [PATCH] V3: Idle in transaction cancellation
Andres Freund wrote: On Thursday 02 December 2010 22:21:37 Alvaro Herrera wrote: Excerpts from Andres Freund's message of sáb oct 30 05:49:21 -0300 2010: I'll set this up for the next commitfest, I don't think I can do much more without further input. Are you reserving about 20 bits for levels, and 12 for flags? Given the relatively scarce growth of levels, I think we could live with about 6 or 7 bits for level, rest for flags. The number I picked was absolutely arbitrary I admit. Neither did I think it likely we would see more levels, nor did I foresee many flags, so I just chose some number I liked in that certain moment ;-)

This bit of detail seems to have died down without being resolved; bumping the thread to add a reminder about that. I count four issues of various sizes left with this patch right now:

1) This levels bit
2) Can the approach used be simplified or the code made cleaner?
3) What is the interaction with Hot Standby error handling?
4) The usual code formatting nitpicking; Kevin mentioned braces being an issue

Robert is already thinking about (2); I'll keep an eye out for someone who can test (3) now that it's been identified as a concern; and the other two are pretty small details once those are crossed. I don't see this as being ready to commit just yet though, so I don't see this going anywhere other than returned for now; will mark it as such. Hopefully this will gather enough additional review to continue moving forward now that the main issues are identified.
Re: [HACKERS] Tab completion for view triggers in psql
David Fetter wrote: That we're in the position of having prevN_wd for N = 1..5 as the current code exists is a sign that we need to refactor the whole thing, as you've suggested before. I'll work up a design and prototype for this by this weekend. Great. I don't think issues around tab completion are enough to block the next alpha though, and it sounds like the next stage of this needs to gel a bit more before it will be ready to commit anyway. I'm going to mark the remaining bits here as returned for now, and trust that you'll continue chugging away on this so we can get it into the next CF early.
Re: [HACKERS] Extensions, patch v17
Dimitri tells me there's a V18 of this patch due real soon now. That may very well be ready for a committer, but even if that's the case it's going to take them some time to consume what was at last count an almost 10K line long context diff. In the interest of closing this CF out without further schedule overrun, I'm going to mark this as returned for now. Extensions are such a major improvement for PostgreSQL that I certainly hope review and eventual commit for 9.1 continues during or before the next CF. Both the multiple dependency patches that have been committed in the last month and the steady flow of updates to this one are good signs of progress.
Re: [HACKERS] CommitFest wrap-up
Tom Lane wrote: - Writeable CTEs - I think we need Tom to pick this one up. - Fix snapshot taking inconsistencies - Ready for committer. Can any committer pick this up? Will take a look at these two also. I marked you down as the listed committer for them both. That leaves serializable lock consistency as the only patch left in Ready for Committer state; everything else seemed to me to still have some kinks left to work out, and I marked them returned. Given that patch has been in that state since 9/24, and Florian even took care of the bit rot yesterday, it's certainly been queued patiently enough waiting for attention. Obviously you, Robert, and other committers can work on one of the patches I bounced instead if you think they're close enough to slip in. But this serializable lock one is the only submission that I think hasn't gotten a fair share of attention yet.
Re: [HACKERS] Re: Proposed Windows-specific change: Enable crash dumps (like core files)
I've updated this entry in the CommitFest app to note that Craig had some implementation questions attached to his patch submission that I haven't seen anyone address yet, and to include a reference to Tom's latest question--which may make those questions moot, not sure. This pretty clearly needs to sit on the stove a little longer before it's done regardless. I'm marking this one Returned With Feedback, and hopefully Craig will continue hammering on this to clean up the remaining details and resubmit in the next month.
Re: [HACKERS] Instrument checkpoint sync calls
Alvaro Herrera wrote: I gave this patch a look and it seems pretty good to me, except that I'm uncomfortable with the idea of mdsync filling in the details for CheckpointStats fields directly. Would it work to pass a struct (say SmgrSyncStats) from CheckPointBuffers to smgrsync and from there to mdsync, have this function fill it, and return it back so that CheckPointBuffers copies the data from this struct into CheckpointStats? That was originally how I planned to write this bit of code. When I realized that the CheckpointStats structure was already visible there and stuffed with details that ultimately go into the same output line at the end, it just didn't seem worth the extra code complexity. The abstraction layer around md.c was not exactly airtight before I poked that extra little hole in there, and I was operating on the principle that a smaller diff is usually the better patch. If you feel strongly that the result led to a bad abstraction violation, I'll submit a patch to refactor it to pass a structure instead before the next CF. I appreciate your concern, I'm just not sure it's worth spending time on. What I'd really like to do is refactor out major parts of the leaky md/smgr layers altogether instead, but that's obviously a bigger project. Another minor nitpick: inside the block when you call FileSync, why check for log_checkpoints at all? Seems to me that just checking sync_start for zero should be enough. Alternatively, it seems simpler to just have a local var with the value of log_checkpoints at the start of mdsync and use that throughout the function. (Surely if someone turns off log_checkpoints in the middle of a checkpoint, it's not really a problem that we collect and report stats during that checkpoint.) And now you're just getting picky! This is a useful observation though, and I'll try to include that fix along with the next general checkpoint overhaul patch I submit. 
Doesn't seem worth going through the trouble of committing that minor rework on its own, I'll slip it into the next useful thing that touches this area I do. Thanks for the hint, this would work better than what I did.
Re: [HACKERS] Instrument checkpoint sync calls
Robert Haas wrote: I took a look at this and it looks generally good, but I'm wondering why md.c is converting the results from an exact value to a floating point, only to have xlog.c turn around and convert back to an integer. I think it could just return milliseconds directly, or if you're worried about a checkpoint that takes more than 24 days to complete, seconds and microseconds. Attached patch now does something like this, except without the roll-over concern. INSTR_TIME_GET_MICROSEC returns a uint64 value. I just made that the storage format for all these values until they're converted for display. Test output: LOG: checkpoint starting: xlog DEBUG: checkpoint sync: number=1 file=base/16385/16480 time=10422.859 msec DEBUG: checkpoint sync: number=2 file=base/16385/16475_vm time=2896.614 msec DEBUG: checkpoint sync: number=3 file=base/16385/16475.1 time=57.836 msec DEBUG: checkpoint sync: number=4 file=base/16385/16466 time=20.080 msec DEBUG: checkpoint sync: number=5 file=base/16385/16463 time=74.926 msec DEBUG: checkpoint sync: number=6 file=base/16385/16482 time=74.263 msec DEBUG: checkpoint sync: number=7 file=base/16385/16475_fsm time=7.062 msec DEBUG: checkpoint sync: number=8 file=base/16385/16475 time=35.164 msec LOG: checkpoint complete: wrote 2143 buffers (52.3%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=1.213 s, sync=13.589 s, total=24.744 s; sync files=8, longest=10.422 s, average=1.698 s This shows the hard truncation used, so 10422.859 msec becomes 10.422 s. I don't think allowing up to 0.999ms of error there is a problem given the expected scale. But since none of the precision is lost until the end, that could be changed with only touching the final display formatting conversion of the value. Following your general idea further, why throw away any resolution inside of md.c; let xlog.c decide how to show it. Note that I also fixed the DEBUG level lines to only show their actual precision. 
Before that was printing 6 digits to the right of the decimal point each time, the last three of which were always 0. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 1ed9687..c9778df 100644 *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *** LogCheckpointEnd(bool restartpoint) *** 6955,6964 { long write_secs, sync_secs, ! total_secs; int write_usecs, sync_usecs, ! total_usecs; CheckpointStats.ckpt_end_t = GetCurrentTimestamp(); --- 6955,6969 { long write_secs, sync_secs, ! total_secs, ! longest_secs, ! average_secs; int write_usecs, sync_usecs, ! total_usecs, ! longest_usecs, ! average_usecs; ! uint64 average_sync_time; CheckpointStats.ckpt_end_t = GetCurrentTimestamp(); *** LogCheckpointEnd(bool restartpoint) *** 6974,6991 CheckpointStats.ckpt_sync_end_t, sync_secs, sync_usecs); if (restartpoint) elog(LOG, restartpoint complete: wrote %d buffers (%.1f%%); ! write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s, CheckpointStats.ckpt_bufs_written, (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers, write_secs, write_usecs / 1000, sync_secs, sync_usecs / 1000, ! total_secs, total_usecs / 1000); else elog(LOG, checkpoint complete: wrote %d buffers (%.1f%%); %d transaction log file(s) added, %d removed, %d recycled; ! write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s, CheckpointStats.ckpt_bufs_written, (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers, CheckpointStats.ckpt_segs_added, --- 6979,7017 CheckpointStats.ckpt_sync_end_t, sync_secs, sync_usecs); + /* + * Timing values returned from CheckpointStats are in microseconds. + * Convert to the second plus microsecond form that TimestampDifference + * returns for homogeneous printing. 
+ */ + longest_secs = (long) (CheckpointStats.ckpt_longest_sync / 1000000); + longest_usecs = CheckpointStats.ckpt_longest_sync - + (uint64) longest_secs * 1000000; + + average_sync_time = 0; + if (CheckpointStats.ckpt_sync_rels > 0) + average_sync_time = CheckpointStats.ckpt_agg_sync_time / + CheckpointStats.ckpt_sync_rels; + average_secs = (long) (average_sync_time / 1000000); + average_usecs = average_sync_time - (uint64) average_secs * 1000000; + if (restartpoint) elog(LOG, restartpoint complete: wrote %d buffers (%.1f%%); ! write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; ! sync files=%d, longest=%ld.%03d s, average=%ld.%03d s, CheckpointStats.ckpt_bufs_written, (double
Re: [HACKERS] Instrument checkpoint sync calls
Robert Haas wrote: I'm wondering why md.c is converting the results from an exact value to a floating point, only to have xlog.c turn around and convert back to an integer. I think it could just return milliseconds directly, or if you're worried about a checkpoint that takes more than 24 days to complete, seconds and microseconds. md.c is printing the value as a float, so I converted early to a double and then percolated it upward from there. More an artifact of how the code grew from its initial form than an intentional decision. I see your point that making elapsed, total_elapsed, ckpt_agg_sync_time, and ckpt_longest_sync all the same type of integer that INSTR_TIME_GET_MICROSEC returns would reduce the potential for rounding aberrations. I could do another rev of the patch tomorrow with that change if you'd prefer it that way. I don't have a strong opinion about that implementation detail; given that xlog.c is printing a less fine-grained time anyway (seconds with 3 digits vs. msec with 3 digits) it seems unlikely to run into a real problem here. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CommitFest wrap-up
Robert Haas wrote: - instrument checkpoint sync calls - I plan to commit this in the next 48 hours. (Hopefully Greg Smith will do the cleanup I suggested, if not I'll do it.) Yes, doing that tonight so you can have a simple (hopefully) bit of work to commit tomorrow. - MERGE command - Returned with Feedback? Not sure where we stand with this. That's waiting for me to do another round of review. I'm getting to that soon I hope, maybe tomorrow. - SQL/MED - file_fdw - Discussion is ongoing, but I see no harm in marking this Returned with Feedback for now in anticipation of a new version before CF 2011-01. - SQL/MED - postgresql_fdw - Hasn't received as much review, I think, so should probably be moved to next CF as-is. I was thinking of just moving both of those into the next CF without adding any clear resolution code--then they can get worked on as feasible after the core goes in. All in all it's disappointing to have so many major patches outstanding at this point in the CommitFest, but I think we're just going to have to make the best of it. I've done a pretty lame job of pushing things forward here, but I don't think things have progressed that badly. The community produced several large and/or multi-part patches that the CF could have choked on, and instead they've been broken into digestible chunks and we've kept chewing through them. I'm just glad that's happening in this CF, rather than a pile like this showing up for the last one. Thanks for the wrap-up summary, I was going to do something like that myself tonight but you beat me to it. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final(?) proposal for wal_sync_method changes
Since any Windows refactoring has been postponed for now (I'll get back to performance checks on that platform later), during my testing time this week instead I did a round of pre-release review of the change Tom has now committed. All looks good to me, including the docs changes. I confirmed that: -Ubuntu system with an older kernel still has the same wal_sync_method (fdatasync) and performance after pulling the update -RHEL6 system changes as planned from using open_datasync to fdatasync once I updated to a HEAD after the commit On the RHEL6 system, I also checked the commit rate using pgbench with the attached INSERT only script, rather than relying on test_fsync. This is a 7200 RPM drive, so theoretical max of 120 commits/second, on ext4; this is the same test setup I described in more detail back in http://archives.postgresql.org/message-id/4ce2ebf8.4040...@2ndquadrant.com $ psql -c "show wal_sync_method" wal_sync_method - fdatasync (1 row) $ pgbench -i -s 10 pgbench [gsm...@meddle ~]$ pgbench -s 10 -f insert.sql -c 1 -T 60 pgbench starting vacuum...end. transaction type: Custom query scaling factor: 10 query mode: simple number of clients: 1 number of threads: 1 duration: 60 s number of transactions actually processed: 6733 tps = 112.208795 (including connections establishing) tps = 112.216904 (excluding connections establishing) And then manually switched over to test performance of the troublesome old default: [gsm...@meddle ~]$ psql -c "show wal_sync_method" wal_sync_method - open_datasync [gsm...@meddle ~]$ pgbench -s 10 -f insert.sql -c 1 -T 60 pgbench starting vacuum...end. 
transaction type: Custom query scaling factor: 10 query mode: simple number of clients: 1 number of threads: 1 duration: 60 s number of transactions actually processed: 6672 tps = 111.185802 (including connections establishing) tps = 111.195089 (excluding connections establishing) This is interesting, because test_fsync consistently reported a rate of about half this when using open_datasync instead of the equal performance I'm getting from the database. I'll see if I can reproduce that further, but it's no reason to be concerned about the change that's been made I think. Just more evidence that test_fsync has quirks left to be sorted out. But that's not backbranch material, it should be part of 9.1-only refactoring, already in progress via the patch Josh submitted. There's a bit of time left to get that done. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books \set nbranches :scale \set ntellers 10 * :scale \set naccounts 100000 * :scale \setrandom aid 1 :naccounts \setrandom bid 1 :nbranches \setrandom tid 1 :ntellers \setrandom delta -5000 5000 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Instrument checkpoint sync calls
Jeff Janes wrote: In my test cases, the syncs that the backends were doing were almost always to the same file that the checkpoint writer was already choking on (so they are entangled simply by virtue of that). So very quickly all the backends hit the same wall and thunked to a halt. This is probably a feature of trying to use pgbench as the basis to get a very artificial model. Yes--pgbench has some problems like you describe, ones that are a bit different than the way I've seen fsync writes get in each other's way in the production systems I've looked at. That's good if you really want to provoke this behavior, which is one reason why I've used it as an example for my patches so far (the other being that it's already available in everyone's installation). But it's tough to get it to act more like real-world systems, which don't have quite so many localized updates, without cranking the scale way up. And that then tends to aggravate other problems too. The 8.3 checkpoint spreading work also got some useful results using the dbt-2 benchmark. I'm at the point where I think I need to return to that test program for what I'm doing now. I'd encourage you to try that out too if you get a chance. Thanks for the feedback and the review. I hope you appreciate now why I suggested you wait for the stuff I was submitting before getting back into the sorted checkpoint topic again. That should be a lot easier to make sense of with this instrumentation in place. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final(?) proposal for wal_sync_method changes
Josh Berkus wrote: Did you rerun test_sync with O_DIRECT entabled, using my patch? The figures you had from test_fsync earlier were without O_DIRECT. No--I was just focused on testing the changes that had already been committed. The use of O_DIRECT in the server but not test_fsync could very well be the reason for the difference; don't know yet. I'm trying to get through this CF before I start getting distracted by newer patches, I'll get to yours soon I hope. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Optimize commit_siblings in two ways to improve group commit.
Tom Lane wrote: http://archives.postgresql.org/pgsql-performance/2010-12/msg00073.php Possibly it should have been posted to -hackers instead, but surely you read -performance? Trying to figure out what exactly commit_delay and commit_siblings did under the hood was actually the motivation behind my first foray into reading the PostgreSQL source code. Ever since, I've been annoyed that the behavior didn't really help the way it's intended, but was not sure what would be better. The additional input from Jignesh this week on the performance list suddenly made it crystal clear what would preserve the good behavior he had seen, even improving things for his case, while also helping a little the majority who won't benefit from the commit_delay behavior at all. I immediately wrote the patch and breathed a sigh of relief that it was finally going to get better. I then posted the patch and added it to the January CF. Unbeknownst to me until today, Simon had the same multi-year "this itches and I can't make it stop" feeling toward these parameters, and that's how it jumped the standard process. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Optimize commit_siblings in two ways to improve group commit.
Tom Lane wrote: I'm not entirely convinced that zero commit_siblings is a better default than small positive values, but it's certainly plausible. Not being allowed to set it to zero was certainly a limitation worth abolishing though; that has been the case before now, for those who didn't see the thread on the performance list. I think that on the sort of high throughput system likely to benefit from this behavior, whether commit_siblings is zero or five doesn't matter very much--those people should cross the siblings threshold very quickly regardless. The main arguments in favor of making the default lower aren't as exciting now that it jumps out of the loop early once finding the requisite number. I like keeping the default at 5 though. It keeps the person who experiments with increasing commit_delay from suffering when there are in reality not a lot of active connections. There are essentially two foot-guns you have to aim before you run into the worst case here, which is making every single commit wait for the delay when there's really only one active process committing. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Spread checkpoint sync
Alvaro Herrera wrote: Why would multiple bgwriter processes worry you? Of course, it wouldn't work to have multiple processes trying to execute a checkpoint simultaneously, but what if we separated the tasks so that one process is in charge of checkpoints, and another one is in charge of the LRU scan? I was commenting more in the context of development resource allocation. Moving toward that design would be helpful, but it alone isn't enough to improve the checkpoint sync issues. My concern is that putting work into that area will be a distraction from making progress on those. If individual syncs take so long that the background writer gets lost for a while executing them, and therefore doesn't do LRU cleanup, you've got a problem that LRU-related improvements probably aren't enough to solve. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final(?) proposal for wal_sync_method changes
Tom Lane wrote: So actually, there is no difference between selecting fsync and fsync_writethrough on Windows, this comment and the SGML documentation to the contrary. Both settings result in invoking _commit() and presumably are safe. One wonders why we bothered to invent a separate fsync_writethrough setting on Windows. Quite; I documented some of the details about mapping to _commit() long ago at http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm but forgot to suggest fixing the mistakes in the docs afterwards (Windows is not exactly my favorite platform). http://archives.postgresql.org/pgsql-hackers/2005-08/msg00227.php explains some of the history I think you're looking for here. Would someone verify via pgbench or similar test (*not* test_fsync) that on Windows, wal_sync_method = fsync or fsync_writethrough perform the same (ie tps ~= disk rotation rate) while open_datasync is too fast to be real? I'm losing confidence that I've found all the spaghetti ends here, and I don't have a Windows setup to try it myself. I can look into this tomorrow. The laptop I posted Ubuntu/RHEL6 test_fsync numbers from before also boots into Vista, so I can compare all those platforms on the same hardware. I just need to be aware of the slightly different sequential speeds on each partition of the drive. As far as your major battle plan goes, I think what we should do is find the simplest possible patch that just fixes the newer Linux kernel problem, preferably without changing any other platform, then commit that to HEAD and appropriate backports. Then the larger O_DIRECT remapping can proceed forward after that, along with cleanup to the writethrough choices and unifying test_fsync against the server. I wrote a patch that shuffled around a lot of this code last night, but the first thing I coded was junk because of some mistaken assumptions. I have been coming to the same realizations you have about how messy this really is. 
-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Tom Lane wrote: The various testing that's been reported so far is all for Linux and thus doesn't directly address the question of whether other kernels will have similar performance properties. Survey of some popular platforms: Linux: don't want O_DIRECT by default for reliability reasons, and there's no clear performance win in the default config with small wal_buffers Solaris: O_DIRECT doesn't work, there's another API that support has never been added for; see http://blogs.sun.com/jkshah/entry/postgresql_wal_sync_method_and Windows: Small reported gains for O_DIRECT, i.e. 10% at http://archives.postgresql.org/pgsql-hackers/2007-03/msg01615.php FreeBSD: It probably works there, but I've never seen good performance tests of it on this platform. Mac OS X: Like Solaris, there's a similar mechanism but it's not O_DIRECT; see http://stackoverflow.com/questions/2299402/how-does-one-do-raw-io-on-mac-os-x-ie-equivalent-to-linuxs-o-direct-flag for notes about the F_NOCACHE feature used. Same basic situation as Solaris; there's an API, but PostgreSQL doesn't use it yet. So my guess is that some small percentage of Windows users might notice a change here, and some testing on FreeBSD would be useful too. That's about it for platforms that I think anybody needs to worry about. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] WIP patch for parallel pg_dump
Joachim Wieland wrote: Regarding snapshot cloning and dump consistency, I brought this up already several months ago and asked if the feature is considered useful even without snapshot cloning. In addition, Joachim submitted a synchronized snapshot patch that looks to me like it slipped through the cracks without being fully explored. Since it's split in the official archives the easiest way to read the thread is at http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg143866.html Or you can use these two: http://archives.postgresql.org/pgsql-hackers/2010-01/msg00916.php http://archives.postgresql.org/pgsql-hackers/2010-02/msg00363.php That never made it into a CommitFest proper that I can see, it just picked up review mainly from Markus. The way I read that thread, there were two objections: 1) This mechanism isn't general enough for all use-cases outside of pg_dump, which doesn't make it wrong when the question is how to get parallel pg_dump running 2) Running as superuser is excessive. Running as the database owner was suggested as likely to be good enough for pg_dump purposes. Ultimately I think that stalled because without a client that needed it the code wasn't so interesting yet. But now there is one; should that get revived again? It seems like all of the pieces needed to build what's really desired here are available, it's just the always non-trivial task of integrating them together the right way that's needed. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] libpq changes for synchronous replication
The one time this year top-posting seems appropriate...this patch seems stalled waiting for some sort of response to the concerns Alvaro raised here. Alvaro Herrera wrote: Excerpts from Fujii Masao's message of jue nov 25 10:47:12 -0300 2010: The attached patch does s/CopyXLog/CopyBoth/g and adds the description about CopyBoth into the COPY section. I gave this a look. It seems good, but I'm not sure about this bit: + case 'W': /* Start Copy Both */ + /* +* We don't need to use getCopyStart here since CopyBothResponse +* specifies neither the copy format nor the number of columns in +* the Copy data. They should be always zero. +*/ + conn->result = PQmakeEmptyPGresult(conn, PGRES_COPY_OUT); + if (!conn->result) + return; + conn->asyncStatus = PGASYNC_COPY_BOTH; + conn->copy_already_done = 0; + break; I guess this was OK when this was conceived as CopyXlog, but since it's now a generic mechanism, this seems a bit unwise. Should this be reconsidered so that it's possible to change the format or number of columns? (The paragraph added to the docs is also a bit too specific about this being used exclusively in streaming replication, ISTM) While modifying the code, it occurred to me that we might have to add a new ExecStatusType like PGRES_COPY_BOTH and use that for CopyBoth mode, for the sake of consistency. But since it would be just an alias of PGRES_COPY_OUT for now, i.e., there is no specific behavior for that ExecStatusType, I don't think that it's worth adding that yet. I'm not so sure about this. If we think that it's worth adding a new possible state, we should do so now; we will not be able to change this behavior later.
[HACKERS] Re: new patch of MERGE (merge_204) a question about duplicated ctid
Boxuan Zhai wrote: I have updated the MERGE patch for two main problems. The patch inside the .tar.gz file you attached isn't right; that extracts to a tiny file of junk characters. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] wCTE behaviour
Marko Tiikkaja wrote: This is almost exactly the patch from 2010-02 without CommandCounterIncrement()s. It's still a bit rough around the edges and needs some more comments, but I'm posting it here anyway. This patch passes all regression tests, but feel free to try to break it, there are probably ways to do that. This one also has the "always run DMLs to completion, and exactly once" behaviour. So this patch was marked Ready for Committer, but a) no committer has picked it up yet and b) Marko has made changes here that nobody else has tested out yet that I've seen on the list. Accordingly, that classification may have been optimistic. It seems to me that another testing run-through from someone like David might be appropriate to build some confidence that this latest patch should be a commit candidate. If there is a committer intending to work on this as-is, they haven't identified themselves. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Instrument checkpoint sync calls
Jeff Janes wrote: I've attached a tiny patch to apply over yours, to deal with this and with the case where no files are synced. Thanks for that. That obvious error eluded me because in most of the patch update testing I was doing (on ext3), the longest sync was always about the same length as the total sync time. Attached patch (in correct diff form this time!) collects up all changes. That includes elimination of a potential race condition if someone changes log_checkpoints while a long sync phase is executing. I don't know whether that can happen, and it obviously won't give accurate stats going back to the beginning of the checkpoint in that case, but it tries to defend against producing complete garbage if that value changes out from under it. This is the first version of this patch I feel fairly good about; no open concerns left on my side. Jeff, since you're now the de-facto credited reviewer of this one by virtue of suggesting bug fixes, could you take this update out for a spin too? Combining this instrumentation patch with the backend sync one already committed, the net result under log_min_messages=debug1 is somewhat undesirable in that I can now see the individual sync times for the syncs done by the checkpoint writer, but I do not get times for the syncs done by the backend (I only get informed of their existence). On a properly working system, backend syncs shouldn't be happening. So if you see them, I think the question shouldn't be "how long are they taking?", it's "how do I get rid of them?" From that perspective, knowing of their existence is sufficient to suggest the necessary tuning changes, such as dropping bgwriter_delay. When you get into a situation where they are showing up, a whole lot of them can happen in a very brief period; enough that I'd actually be concerned about the added timing overhead, something I normally don't care about very much. 
That's the main reason I didn't bother instrumenting those too--the situations where they happen are ones already running low on resources. Big writes for things that can only be written out at checkpoint time, on the other hand, are unavoidable in the current design. Providing detail on them is going to be relevant unless there's a major refactoring of how those happen. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index ae9ed8f..6adc139 100644 *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *** LogCheckpointEnd(bool restartpoint) *** 6955,6964 { long write_secs, sync_secs, ! total_secs; int write_usecs, sync_usecs, ! total_usecs; CheckpointStats.ckpt_end_t = GetCurrentTimestamp(); --- 6955,6969 { long write_secs, sync_secs, ! total_secs, ! longest_secs, ! average_secs; int write_usecs, sync_usecs, ! total_usecs, ! longest_usecs, ! average_usecs; ! double average_sync_time; CheckpointStats.ckpt_end_t = GetCurrentTimestamp(); *** LogCheckpointEnd(bool restartpoint) *** 6974,6991 CheckpointStats.ckpt_sync_end_t, sync_secs, sync_usecs); if (restartpoint) elog(LOG, restartpoint complete: wrote %d buffers (%.1f%%); ! write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s, CheckpointStats.ckpt_bufs_written, (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers, write_secs, write_usecs / 1000, sync_secs, sync_usecs / 1000, ! total_secs, total_usecs / 1000); else elog(LOG, checkpoint complete: wrote %d buffers (%.1f%%); %d transaction log file(s) added, %d removed, %d recycled; ! 
write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s, CheckpointStats.ckpt_bufs_written, (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers, CheckpointStats.ckpt_segs_added, --- 6979,7017 CheckpointStats.ckpt_sync_end_t, sync_secs, sync_usecs); + /* + * Timing values returned from CheckpointStats are in milliseconds. + * Convert to the second plus microsecond form that TimestampDifference + * returns for homogeneous printing. + */ + longest_secs = (long) (CheckpointStats.ckpt_longest_sync / 1000); + longest_usecs = 1000 * (CheckpointStats.ckpt_longest_sync - + longest_secs * 1000); + + average_sync_time = 0; + if (CheckpointStats.ckpt_sync_rels > 0) + average_sync_time = CheckpointStats.ckpt_agg_sync_time / + CheckpointStats.ckpt_sync_rels; + average_secs = (long) (average_sync_time / 1000); + average_usecs = 1000 * (average_sync_time - average_secs * 1000); + if (restartpoint) elog(LOG, restartpoint complete: wrote %d buffers (%.1f%%); ! write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; ! sync files=%d
Re: [HACKERS] Spread checkpoint sync
Heikki Linnakangas wrote: If you fsync() a file with one dirty page in it, it's going to return very quickly, but a 1GB file will take a while. That could be problematic if you have a thousand small files and a couple of big ones, as you would want to reserve more time for the big ones. I'm not sure what to do about it, maybe it's not a problem in practice. It's a problem in practice all right, with the bulk-loading situation being the main place you'll hit it. If somebody is running a giant COPY to populate a table at the time the checkpoint starts, there's probably a 1GB file of dirty data that's unsynced around there somewhere. I think doing anything about that situation requires an additional leap in thinking about buffer cache eviction and fsync absorption though. Ultimately I think we'll end up doing sync calls for relations that have gone cold for a while all the time as part of BGW activity, not just at checkpoint time, to try and avoid this whole area better. That's a lot more than I'm trying to do in my first pass of improvements though. In the interest of cutting the number of messy items left in the official CommitFest, I'm going to mark my patch here Returned with Feedback and continue working in the general direction I was already going. Concept shared, underlying patches continue to advance, good discussion around it; those were my goals for this CF and I think we're there. I have a good idea how to autotune the sync spread that's hardcoded in the current patch. I'll work on finishing that up and organizing some more extensive performance tests. Right now I'm more concerned about finishing the tests around the wal_sync_method issues, which are related to this and need to get sorted out a bit more urgently. 
-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+
Josh Berkus wrote: "I modified test_fsync in two ways to run this; first, to make it support O_DIRECT, and second to make it run in the *current* directory." Patch please? I agree with the latter change; what test_fsync does now is surprising.

I suggested a while ago that we refactor test_fsync to share a common set of source code with the database itself for detecting things related to wal_sync_method, perhaps by extracting that whole set of #define macro logic to somewhere else. That suggestion came at a bad time in the development cycle (right before a freeze), and nobody ever got back to the idea afterwards. If this code is getting touched, and it's clear it is in some direction, I'd like to see things change so it's not possible for the two to diverge again afterwards.
[HACKERS] CommitFest 2010-11: Status report at 2/3 of scheduled time
After an unfortunate two weeks where I was lost in Nyquil-land and not paying as much attention as I should have, I just finished a pass checking over all 26 open CommitFest items. Just under 50% of the patches that were open at the start of the CF are still hanging around, with 6 in Ready for Committer and the rest split evenly between waiting for reviewers and waiting for the author. A few larger patches or patch sets deserve some commentary.

The multi-part Extensions patch continues to chug along. A committer needs to pick up pg_execute_from_file() next to keep progress on that series going, as the other two remaining patches depend on it (and may need to be rebased after it goes in). Help is also needed on coding review of the main Extensions patch. The as yet untouched ALTER EXTENSION ... SET SCHEMA ... is the last patch in the set to apply, and while it needs a reviewer eventually I don't think that's the blocking point on the general extension work yet.

As for the also large SQL/MED set, I'm not sure whether it's possible to get review work started on its two dependent components--file_fdw and postgresql_fdw--until the main patch is finished off. Given the size of the code involved, it wouldn't surprise me if everyone is still chewing on that series into the next CF. Glad to see it all submitted for this one though.

Most of the remaining patches in Needs Review have already gone through one round of review. There are two notable exceptions. While some libpq work has been getting finished for Sync Rep, neither of the two main patches for that have received any love for almost two months now. This is the spot where I start to get worried about seeing this finish for 9.1. I'm now trying to find time to work on those; anybody else need this feature who can help?
Re: [HACKERS] Spread checkpoint sync
Rob Wultsch wrote: Forgive me, but is all of this a step on the slippery slope to direct I/O? And is this a bad thing?

I don't really think so. There's an important difference in my head between direct I/O, where the kernel is told "write this immediately!", and what I'm trying to achieve. I want to give the kernel an opportunity to write blocks out in an efficient way, so that it can take advantage of elevator sorting, write combining, and similar tricks. But, eventually, those writes have to make it out to disk. Linux claims to have concepts like a deadline for I/O to happen, but they turn out to not be so effective once the system gets backed up with enough writes. Since fsync time is the only effective deadline, I'm progressing from the standpoint that adjusting when it happens relative to the write will help, while still allowing the kernel an opportunity to get the writes out on its own schedule.

What ends up happening if you push toward fully sync I/O is the design you see in some other databases, where you need multiple writer processes. Then requests for new pages can continue to allocate as needed, while keeping any one write from blocking things. That's one way to simulate asynchronous I/O, and you can substitute true async I/O in many of those implementations. We didn't have much luck with portability on async I/O when that was last experimented with, and having multiple background writer processes seems like overkill; that whole direction worries me.
Re: [HACKERS] Re: Proposed Windows-specific change: Enable crash dumps (like core files)
Craig Ringer wrote: On 11/24/2010 05:18 AM, Magnus Hagander wrote: "Or you set the handler always, and have the handler only actually create the dump if the directory exists. That way you can add the directory and still get a dump from both existing backends and the postmaster itself without a restart." That's way smarter. No extra filesystem access during startup, even if it is cheap.

I added a comment referencing this bit to the CF entry so it doesn't get forgotten. Magnus raised a few other issues in his earlier review too. Discussion of this patch seems to have jumped the gun a bit into talking about backpatching before the code for HEAD was completely done, then stalled there. Are we going to see an updated patch that addresses the submitted feedback in this cycle?
Re: [HACKERS] Spread checkpoint sync
Greg Stark wrote: Using sync_file_range you can specify the set of blocks to sync and then block on them only after some time has passed. But there's no documentation on how this relates to the I/O scheduler so it's not clear it would have any effect on the problem.

I believe this is the exact spot we're stalled at in regards to getting this improved on the Linux side, as I understand it at least. *The* answer for this class of problem on Linux is to use sync_file_range, and I don't think we'll ever get any sympathy from those kernel developers until we do. But that's a Linux-specific call, so using it is going to add a write-path fork with platform-specific code into the database. If I thought sync_file_range was a silver bullet guaranteed to make this better, maybe I'd go for that. I think there's some relatively low-hanging fruit on the database side that would do better before going to that extreme though, thus the patch.

Greg Stark wrote: We might still have to delay the beginning of the sync to allow the dirty blocks to be synced naturally and then when we issue it still end up catching a lot of other I/O as well. Whether it's lots or not is really workload dependent.

I work from the assumption that the blocks being written out by the checkpoint are the most popular ones in the database, the ones that accumulate a high usage count and stay there. If that's true, my guess is that the writes being done while the checkpoint is executing are a bit less likely to be touching the same files. You raise a valid concern; I just haven't seen it actually happen in practice yet.
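For readers following along, the call under discussion splits queueing writeback from waiting on it. A minimal standalone sketch of that two-phase pattern (Linux-only; hypothetical helper names, not code from any patch in this thread):

```c
#define _GNU_SOURCE             /* sync_file_range() is a GNU extension */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Phase one: ask the kernel to start writeback of every dirty page in
 * the file without blocking.  offset=0 with nbytes=0 means "through end
 * of file".  Returns 0 on success, -1 with errno set on failure.
 */
static int
start_writeback(int fd)
{
	return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
}

/*
 * Phase two, issued later: wait for the previously queued writeback to
 * finish (and catch any pages dirtied in between).
 */
static int
wait_for_writeback(int fd)
{
	return sync_file_range(fd, 0, 0,
						   SYNC_FILE_RANGE_WAIT_BEFORE |
						   SYNC_FILE_RANGE_WRITE |
						   SYNC_FILE_RANGE_WAIT_AFTER);
}
```

The idea floated here is that a checkpoint writer could issue the first call as each file's writes complete and the second during the sync phase, giving the kernel time to drain writeback in between; a non-Linux build would fall back to plain fsync().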
Re: [HACKERS] Spread checkpoint sync
Heikki Linnakangas wrote: Do you have any idea how to autotune the delay between fsyncs?

I'm thinking to start by counting the number of relations that need them at the beginning of the checkpoint. Then use the same basic math that drives the spread writes, where you assess whether you're on schedule or not based on segment/time progress relative to how many have been synced out of that total. At a high level I think that idea translates over almost directly into the existing write spread code. I was hoping for a sanity check from you in particular about whether that seems reasonable before diving into the coding.
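The progress check described can be sketched in miniature; this is an illustration under assumed names, not code from any posted patch:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of applying the checkpoint write-spreading math to the sync
 * phase: count the relations needing fsync at checkpoint start, then
 * judge progress against the fraction of the sync time budget consumed
 * so far.  Returns true when at or ahead of schedule, i.e. when it is
 * safe to nap before issuing the next fsync.
 */
static bool
sync_on_schedule(int files_synced, int total_files,
				 double elapsed_secs, double sync_budget_secs)
{
	double		target;

	if (total_files <= 0 || sync_budget_secs <= 0)
		return true;			/* nothing to pace */
	target = elapsed_secs / sync_budget_secs;
	if (target > 1.0)
		target = 1.0;
	return ((double) files_synced / total_files) >= target;
}
```

The sync loop would call something like this between fsyncs and sleep only while it returns true, mirroring how the write phase decides whether it can afford its next delay.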
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Tom Lane wrote: I think the best answer is to get out of the business of using O_DIRECT by default, especially seeing that available evidence suggests it might not be a performance win anyway.

I was concerned that open_datasync might be doing a better job of forcing data out of drive write caches. But the tests I've done on RHEL6 so far suggest that's not true; the write guarantees seem to be the same as when using fdatasync. And there's certainly one performance regression possible going from fdatasync to open_datasync: the case where you overflow wal_buffers before you actually commit.

Below is a test of the troublesome behavior on the same RHEL6 system I gave test_fsync performance test results from at http://archives.postgresql.org/message-id/4ce2ebf8.4040...@2ndquadrant.com . This confirms that the problem here is the kernel defining O_DSYNC behavior as available, but not actually supporting it when the filesystem is running in journaled mode. That's clearly a kernel bug and no fault of PostgreSQL; it's just never been exposed in a default configuration before. The Red Hat bugzilla report seems a bit unclear about what's going on here, so it may be worth updating it to note the underlying cause.

Regardless, I'm now leaning heavily toward the idea of avoiding open_datasync by default given this bug, and backpatching that change to at least 8.4. I'll do some more database-level performance tests here as a final sanity check on that. My gut feeling now is that we'll eventually take something like Marti's patch, add some more documentation around it, and apply it to HEAD as well as some number of back branches.
$ mount | head -n 1
/dev/sda7 on / type ext4 (rw)
$ cat $PGDATA/postgresql.conf | grep wal_sync_method
#wal_sync_method = fdatasync		# the default is the first option
$ pg_ctl start
server starting
LOG:  database system was shut down at 2010-12-01 17:20:16 EST
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
$ psql -c "show wal_sync_method"
 wal_sync_method
-----------------
 open_datasync

[Edit /etc/fstab, change mount options to be data=journal and reboot]

$ mount | grep journal
/dev/sda7 on / type ext4 (rw,data=journal)
$ cat postgresql.conf | grep wal_sync_method
#wal_sync_method = fdatasync		# the default is the first option
$ pg_ctl start
server starting
LOG:  database system was shut down at 2010-12-01 12:14:50 EST
PANIC:  could not open file "pg_xlog/00010001" (log file 0, segment 1): Invalid argument
LOG:  startup process (PID 2690) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure
$ pg_ctl stop
$ vi $PGDATA/postgresql.conf
$ cat $PGDATA/postgresql.conf | grep wal_sync_method
wal_sync_method = fdatasync		# the default is the first option
$ pg_ctl start
server starting
LOG:  database system was shut down at 2010-12-01 12:14:40 EST
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
Re: [HACKERS] Instrument checkpoint sync calls
Jeff Janes wrote: For the individual file sync times emitted under DEBUG1, it would be very handy if the file being synced was identified, for example "relation base/16384/16523", rather than being numbered sequentially within a given checkpoint.

I was numbering them sequentially so that it's straightforward to graph the sync times in an external analysis tool, but the relation data is helpful too. A new patch reflecting all upthread suggestions is attached. The output looks like this now at DEBUG1:

LOG:  checkpoint starting: xlog
DEBUG:  checkpoint sync: number=1 file=base/16424/11645 time=11589.549000 msec
DEBUG:  checkpoint sync: number=2 file=base/16424/16438 time=16.148000 msec
DEBUG:  checkpoint sync: number=3 file=base/16424/16437 time=53.530000 msec
DEBUG:  checkpoint sync: number=4 file=base/16424/16447 time=10.214000 msec
DEBUG:  checkpoint sync: number=5 file=base/16424/11607 time=1.499000 msec
DEBUG:  checkpoint sync: number=6 file=base/16424/16425_fsm time=2.921000 msec
DEBUG:  checkpoint sync: number=7 file=base/16424/16437.1 time=4.237000 msec
DEBUG:  checkpoint sync: number=8 file=base/16424/16428_fsm time=1.654000 msec
DEBUG:  checkpoint sync: number=9 file=base/16424/16442 time=7.920000 msec
DEBUG:  checkpoint sync: number=10 file=base/16424/16428_vm time=2.613000 msec
DEBUG:  checkpoint sync: number=11 file=base/16424/11618 time=1.468000 msec
DEBUG:  checkpoint sync: number=12 file=base/16424/16437_fsm time=2.638000 msec
DEBUG:  checkpoint sync: number=13 file=base/16424/16428 time=2.883000 msec
DEBUG:  checkpoint sync: number=14 file=base/16424/16425 time=3.369000 msec
DEBUG:  checkpoint sync: number=15 file=base/16424/16437_vm time=8.686000 msec
DEBUG:  checkpoint sync: number=16 file=base/16424/16425_vm time=5.984000 msec
LOG:  checkpoint complete: wrote 2074 buffers (50.6%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=0.617 s, sync=11.715 s, total=22.167 s; sync files=16, longest=11.589 s, average=0.724 s

I kept the units for the DEBUG level ones in msec because that's a better scale for the common really short syncs during good behavior. But the summary info in seconds now appears at the end of the existing "checkpoint complete" message, so there's only one line to parse for those looking to analyze the gross checkpoint data. That looks to work well enough for finding situations like the big ext3 spikes. You can easily see one in this example by the fact that longest=11.589 s is almost the entirety of sync=11.715 s. That's the really key thing there's currently no visibility into, and it's made obvious with this patch.

This might be ready for some proper review now. I know there's at least one blatant bug still in here I haven't found yet, related to how the averages are computed. I saw this once:

LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, average=-9223372036854775808.-2147483 s

after an immediate checkpoint, so at least one path is not quite right yet.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ede6ceb..6f6eb3b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7063,10 +7063,15 @@ LogCheckpointEnd(bool restartpoint)
 {
 	long		write_secs,
 				sync_secs,
-				total_secs;
+				total_secs,
+				longest_secs,
+				average_secs;
 	int			write_usecs,
 				sync_usecs,
-				total_usecs;
+				total_usecs,
+				longest_usecs,
+				average_usecs;
+	double		average_sync_time;
 
 	CheckpointStats.ckpt_end_t = GetCurrentTimestamp();
@@ -7082,18 +7087,35 @@ LogCheckpointEnd(bool restartpoint)
 				CheckpointStats.ckpt_sync_end_t,
 				&sync_secs, &sync_usecs);
 
+	/*
+	 * Timing values returned from CheckpointStats are in milliseconds.
+	 * Convert to the second plus microsecond form that TimestampDifference
+	 * returns for homogeneous printing.
+	 */
+	longest_secs = (long) (CheckpointStats.ckpt_longest_sync / 1000);
+	longest_usecs = 1000 * (CheckpointStats.ckpt_longest_sync - longest_secs * 1000);
+
+	average_sync_time = CheckpointStats.ckpt_longest_sync / CheckpointStats.ckpt_sync_rels;
+	average_secs = (long) (average_sync_time / 1000);
+	average_usecs = 1000 * (average_sync_time - average_secs * 1000);
+
 	if (restartpoint)
 		elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
-			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
+			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
+			 "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s",
 			 CheckpointStats.ckpt_bufs_written,
 			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 			 write_secs, write_usecs / 1000,
 			 sync_secs, sync_usecs / 1000,
-			 total_secs, total_usecs / 1000);
+			 total_secs, total_usecs / 1000
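The garbage average line quoted above is consistent with the unguarded division in the patch hunk when ckpt_sync_rels is zero (and with ckpt_longest_sync being divided where the aggregate sync time was presumably intended). A guarded version of that conversion, written here as a standalone sketch with hypothetical parameter names rather than a fix to the posted patch:

```c
#include <assert.h>

/*
 * Convert an aggregate sync time in milliseconds into the second plus
 * microsecond form used by the log message, guarding the average
 * against the zero-files case (e.g. an immediate checkpoint that
 * synced nothing), which otherwise produces nonsense output.
 */
static void
average_sync_secs(double agg_sync_time_ms, int sync_rels,
				  long *secs, int *usecs)
{
	double		avg_ms = 0;

	if (sync_rels > 0)			/* guard the divide-by-zero path */
		avg_ms = agg_sync_time_ms / sync_rels;
	*secs = (long) (avg_ms / 1000);
	*usecs = (int) (1000 * (avg_ms - *secs * 1000));
}
```

With the guard in place a zero-file checkpoint simply reports average=0.000 s instead of overflowing the integer casts.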
Re: [HACKERS] Spread checkpoint sync
Ron Mayer wrote: Might smoother checkpoints be better solved by talking to the OS vendors' virtual-memory-tuning-knob authors to work with them on exposing the ideal knobs, rather than saying that our only tool is a hammer (fsync) so the problem must be handled as a nail?

Maybe, but it's hard to defend the current implementation--just doing all of the sync calls as fast as possible, one after the other--given that it produces worst-case behavior in a lot of situations. Since it's not a huge amount of code to do better, I'd rather do some work in that direction instead of presuming the kernel authors will ever make this go away. Spreading the writes out as part of the checkpoint rework in 8.3 worked better than any kernel changes I've tested since then, and I'm not real optimistic about this getting resolved at the system level. So long as the database changes aren't antagonistic toward kernel improvements, I'd prefer to have some options here that become effective as soon as the database code is done.

I've attached an updated version of the initial sync spreading patch here, one that applies cleanly on top of HEAD and over top of the sync instrumentation patch too; the conflict that made that hard before is gone now. Having the pg_stat_bgwriter.buffers_backend_fsync patch available all the time now has made me reconsider how important one potential bit of refactoring here would be.
I managed to catch one of the situations where really popular relations were being heavily updated in a way that was competing with the checkpoint on my test system (which I can happily share the logs of), with the instrumentation patch applied but not the spread sync one:

LOG:  checkpoint starting: xlog
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 7747 of relation base/16424/16442
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 42688 of relation base/16424/16437
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 9723 of relation base/16424/16442
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 58117 of relation base/16424/16437
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 165128 of relation base/16424/16437
[330 of these total, all referring to the same two relations]
DEBUG:  checkpoint sync: number=1 file=base/16424/16448_fsm time=10132.830000 msec
DEBUG:  checkpoint sync: number=2 file=base/16424/11645 time=0.001000 msec
DEBUG:  checkpoint sync: number=3 file=base/16424/16437 time=7.796000 msec
DEBUG:  checkpoint sync: number=4 file=base/16424/16448 time=4.679000 msec
DEBUG:  checkpoint sync: number=5 file=base/16424/11607 time=0.001000 msec
DEBUG:  checkpoint sync: number=6 file=base/16424/16437.1 time=3.101000 msec
DEBUG:  checkpoint sync: number=7 file=base/16424/16442 time=4.172000 msec
DEBUG:  checkpoint sync: number=8 file=base/16424/16428_vm time=0.001000 msec
DEBUG:  checkpoint sync: number=9 file=base/16424/16437_fsm time=0.001000 msec
DEBUG:  checkpoint sync: number=10 file=base/16424/16428 time=0.001000 msec
DEBUG:  checkpoint sync: number=11 file=base/16424/16425 time=0.000000 msec
DEBUG:  checkpoint sync: number=12 file=base/16424/16437_vm time=0.001000 msec
DEBUG:  checkpoint sync: number=13 file=base/16424/16425_vm time=0.001000 msec
LOG:  checkpoint complete: wrote 3032 buffers (74.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=1.742 s, sync=10.153 s, total=37.654 s; sync files=13, longest=10.132 s, average=0.779 s

Note here how the checkpoint was hung on trying to get 16448_fsm written out, while the backends were issuing constant competing fsync calls to these other relations. This is very similar to the production case this patch was written to address, which I hadn't been able to share a good example of yet. That's essentially what it looks like, except with the contention going on for minutes instead of seconds. One of the ideas Simon and I had been considering at one point was adding some better de-duplication logic to the fsync absorb code, which I'm reminded by the pattern here might be helpful independently of other improvements.

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 620b197..501cab8 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -143,8 +143,8 @@ typedef struct
 static BgWriterShmemStruct *BgWriterShmem;
 
-/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
-#define WRITES_PER_ABSORB 1000
+/* Fraction of fsync absorb queue that needs to be filled before acting
Re: [HACKERS] Spread checkpoint sync
Jeff Janes wrote: Have you tested out this absorb-during-syncing-phase code without the sleep between the syncs? I.e., so that it is still a tight loop, but the loop alternates between sync and absorb, with no intentional pause?

Yes; that's how it was developed. It helped to have just the extra absorb work without the pauses, but that alone wasn't enough to really improve things on the server we ran into this problem badly on.

Jeff Janes wrote: I ask because I don't have a mental model of how the pause can help. Given that this dirty data has been hanging around for many minutes already, what is a 3 second pause going to heal?

The difference is that once an fsync call is made, dirty data is much more likely to be forced out. It's the one thing that bypasses all the other ways the kernel might try to avoid writing the data--both the dirty ratio guidelines and the congestion control logic--and forces those writes to happen as soon as they can be scheduled. If you graph the amount of data shown as "Dirty:" by /proc/meminfo over time, once the sync calls start happening it's like a descending staircase pattern, dropping a little bit as each sync fires.
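That staircase is simple to capture; here's an illustrative helper (not part of any patch) for pulling the Dirty figure out of text in /proc/meminfo format, suitable for sampling in a loop while a checkpoint runs:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract the "Dirty:" value, in kB, from text in /proc/meminfo format.
 * Sampled repeatedly during a checkpoint's sync phase, the returned
 * value traces the descending staircase described above: each fsync
 * forces a step's worth of dirty data out.  Returns -1 if no Dirty:
 * line is present.
 */
static long
dirty_kb(const char *meminfo)
{
	const char *p = meminfo;

	while (p != NULL)
	{
		if (strncmp(p, "Dirty:", 6) == 0)
		{
			long		kb;

			if (sscanf(p + 6, "%ld", &kb) == 1)
				return kb;
			return -1;
		}
		p = strchr(p, '\n');
		if (p != NULL)
			p++;				/* advance past the newline */
	}
	return -1;
}
```

Feeding it the contents of /proc/meminfo once a second and plotting the results reproduces the graph described, with no dependencies beyond libc.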
Re: [HACKERS] Instrument checkpoint sync calls
Jeff Janes wrote: For the individual file sync times emitted under debug1, it would be very handy if the file being synced was identified, for example relation base/16384/16523. I was in the middle of looking into adding that already, so consider that part of the target for the next update I'm working on.
Re: [HACKERS] Instrument checkpoint sync calls
Robert Haas wrote: Did this get eaten by the email goblin, or you're still working on it? Fell behind due to an unfortunately timed bit of pneumonia. Hurray for the health benefits of cross country flights. Will fix this up, rebase my other patch, and head toward some more review/'Fest cleanup now that I'm feeling better.
Re: [HACKERS] Horizontal Write Scaling
Eliot Gable wrote: However, I have not seen much in the way of a serious attempt at multi-master write scaling. Scaling writes across nodes using PL/Proxy works.

Eliot Gable wrote: Of course, I am assuming the disk system would be RAID 1, RAID 10, RAID 5, or RAID 6 for reliability purposes and that it is sufficiently redundant that you don't have to worry about an outage of your storage system.

The idea that you'll have a system that needs better write scalability yet isn't limited by the storage system is an unusual one, not the expected case. And the trend everywhere in the industry is away from giant redundant systems, and toward having multiple cheaper redundant copies of all the data instead. It's impossible to protect against things like environmental failure at any single location. Once you've accepted that you have to be able to replicate this beast too if you want high availability, you're back at having a multi-node problem again. This is why the most active work is on distributed designs that start on that basis, rather than projects trying to build more scalable monoliths.
Re: [HACKERS] Spread checkpoint sync
Jeff Janes wrote: And for very large memory systems, even 1% may be too much to cache (dirty*_ratio can only be set in integer percent points), so recent kernels introduced dirty*_bytes parameters. I like these better because they do what they say. With the dirty*_ratio, I could never figure out what it was a ratio of, and the results were unpredictable without extensive experimentation.

Right, and you can't set dirty_background_ratio low enough to make this problem go away. Even attempts to set it to 1%, back when that was the right size for it, seem to be defeated by other mechanisms within the kernel. Last time I looked at the related source code, the congestion control logic that kicks in to throttle writes seemed a likely suspect. This is why I'm not real optimistic about newer mechanisms like dirty_background_bytes, added in 2.6.29, helping here; that just provides a mapping for setting lower values, and the same basic logic is under the hood. Like Jeff, I've never seen dirty_expire_centisecs help at all, possibly due to the same congestion mechanism.

Jeff Janes wrote: Yes, but how much work do we want to put into redoing the checkpoint logic so that the sysadmin on a particular OS and configuration and FS can avoid having to change the kernel parameters away from their defaults? (Assuming of course I am correctly understanding the problem, always a dangerous assumption.)

I've been trying to make this problem go away using just the available kernel tunables since 2006. I adjusted them carefully on the server that ran into this problem so badly that it motivated the submitted patch, months before this issue got bad. It didn't help. Maybe if that system were running a later kernel that supported dirty_background_bytes, that would have worked better. During the last few years, the only thing that has consistently helped in every case is the checkpoint spreading logic that went into 8.3.
I no longer expect that the kernel developers will ever make this problem go away given the way checkpoints are written out right now, whereas the last round of PostgreSQL work in this area definitely helped. The basic premise of the current checkpoint code is that if you write all of the buffers out early enough, by the time the syncs execute enough of the data should have gone out that they don't take very long to process. That was usually true for the last few years on systems with a battery-backed cache, where the amount of memory cached by the OS was small relative to the RAID cache size. That's not the case anymore, and the divergence is growing. The idea that the checkpoint sync code can run in a relatively tight loop, without stopping to do the normal background writer cleanup work, is also busted by that observation.
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: Doing all the writes and then all the fsyncs meets this requirement trivially, but I'm not so sure that's a good idea. For example, given files F1 ... Fn with dirty pages needing checkpoint writes, we could do the following: first, do any pending fsyncs for files not among F1 .. Fn; then, write all pages for F1 and fsync, write all pages for F2 and fsync, write all pages for F3 and fsync, etc. This might seem dumb because we're not really giving the OS a chance to write anything out before we fsync, but think about the ext3 case where the whole filesystem cache gets flushed anyway.

I'm not horribly interested in optimizing for the ext3 case per se, as I consider that filesystem fundamentally broken from the perspective of its ability to deliver low latency here. I wouldn't want a patch that improved behavior on filesystems with granular fsync to make the ext3 situation worse, but that's as far as I'd want the design to lean toward its quirks. Jeff Janes made a case downthread for "why not make it the admin/OS's job to worry about this?" In cases where there is a reasonable solution available, in the form of "switch to XFS or ext4", I'm happy to take that approach.

Let me throw some numbers out to give a better idea of the shape and magnitude of the problem case I've been working on. In the situation that leads to the near hour-long sync phase I've seen, checkpoints will start with about a 3GB backlog of data in the kernel write cache to deal with. That's about 4% of RAM, just under the 5% threshold set by dirty_background_ratio. Whether or not the 256MB write cache on the controller is also filled is a relatively minor detail I can't monitor easily. The checkpoint itself? 250MB each time. This proportion is why I didn't think to follow the alternate path of worrying about spacing the write and fsync calls out differently.
I shrunk shared_buffers down to make the actual checkpoints smaller, which helped to some degree; that's what got them smaller than the RAID cache size. But the amount of data cached by the operating system is the real driver of total sync time here. Whether or not you include all of the writes from the checkpoint itself before you start calling fsync didn't actually matter very much; in the case I've been chasing, those are getting cached anyway. The write storm from the fsync calls themselves forcing things out seems to be the driver of the I/O spikes, which is why I started with spacing those out.

Writes go out at a rate of around 5MB/s, so clearing the 3GB backlog takes a minimum of 10 minutes of real time. There are about 300 1GB relation files involved in the case I've been chasing. This is where the 3 second delay number came from: 300 files x 3 seconds each = 900 seconds = 15 minutes of sync spread. You can turn that math around to figure out how much delay per relation you can afford while still keeping checkpoints to a planned end time, which isn't done in the patch I submitted yet.

Ultimately what I want to do here is some sort of smarter write-behind sync operation, perhaps with an LRU on relations with pending fsync requests. The idea would be to sync relations that haven't been touched in a while even in advance of the checkpoint. I think that's similar to the general idea Robert is suggesting here, to get some sync calls flowing before all of the checkpoint writes have happened. I think the final sync calls will need to get spread out regardless, and since doing that requires a fairly small amount of code, that's why we started there.
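Turning that math around, as described, is one line of arithmetic; a trivial sketch of it (hypothetical, ahead of the autotuning work discussed upthread):

```c
#include <assert.h>

/*
 * The hardcoded 3-second pause in the posted patch corresponds to
 * 300 files * 3 s = 900 s of spread.  Turned around: given a target
 * amount of time left for the sync phase and a count of files still
 * to sync, the affordable pause per remaining file is simply their
 * ratio, floored at zero for the degenerate cases.
 */
static double
sync_pause_secs(double time_left_secs, int files_remaining)
{
	if (files_remaining <= 0 || time_left_secs <= 0)
		return 0.0;
	return time_left_secs / files_remaining;
}
```

Recomputing this after each fsync, rather than fixing it at checkpoint start, would also absorb individual slow syncs by shrinking the remaining pauses to compensate.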
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] Changes to Linux OOM killer in 2.6.36
Kevin Grittner wrote: Greg Smith wrote: oom_adj is deprecated, scheduled for removal in August 2010: That surprised me so I checked the URL. I believe you have a typo there and it's August, 2012. This is why I include references, so that when the cold medicine hits me in the middle of proofreading my message and I send it anyway, you aren't misled. Yes, 2012, only a few months before doomsday. The approaching end of the world means any remaining bugs can then be marked WONTFIX. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
[HACKERS] Changes to Linux OOM killer in 2.6.36
Last month's new Linux kernel 2.6.36 includes a rewrite of the out-of-memory killer: http://lwn.net/Articles/391222/ http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a63d83f427fbce97a6cea0db2e64b0eb8435cd10 The new badness method totals the task's RSS and swap as a percentage of RAM, where the old one's score started with the total memory used by the process. I *think* that this is an improvement for PostgreSQL, based on the sort of data I see with: ps -o pid,rss,size,vsize,args -C postgres But I haven't tested with one of the new kernels yet to be sure. Something to look at next time I get in that bleeding edge kernel kind of mood. One thing that's definitely changed is the interface used to turn off the OOM killer. There's a backward compatibility translation right now that maps the -17 value the PostgreSQL code sends to /proc/pid/oom_adj onto the new units scale. However, oom_adj is deprecated, scheduled for removal in August 2010: http://www.mjmwired.net/kernel/Documentation/feature-removal-schedule.txt So eventually, if the OOM disabling code is still necessary in PostgreSQL, it will need to do this sort of thing instead: echo -1000 > /proc/pid/oom_score_adj I've seen kernel features get deprecated ahead of the published timeline before for code-related reasons (when the compatibility bits were judged too obtrusive to keep around anymore), but since this translation bit is only a few lines of code I wouldn't expect that to happen here. I don't think it's worth doing anything to the database code until tests on the newer kernel confirm whether this whole thing is even necessary anymore. Wanted to pass along the info while I was staring at it though. Thanks to Daniel Farina for pointing this out.
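Code that wants to work across both kernel generations could probe for the newer file and fall back; a sketch of that idea follows, where the helper name and fallback logic are hypothetical, not what PostgreSQL currently does:

```python
import os

def oom_disable_target(pid):
    """Pick which /proc knob would disable the OOM killer for a process.
    Newer kernels: oom_score_adj, range -1000..1000 (-1000 disables).
    Older kernels: the deprecated oom_adj, range -17..15 (-17 disables)."""
    new_path = "/proc/%d/oom_score_adj" % pid
    if os.path.exists(new_path):
        return new_path, "-1000"
    return "/proc/%d/oom_adj" % pid, "-17"

path, value = oom_disable_target(os.getpid())
print(path, value)
```

Actually writing a negative adjustment to either file requires root privileges, which is why PostgreSQL only does this from code started with sufficient permissions.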
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] Improving prep_buildtree used in VPATH builds
Gurjeet Singh wrote: Seems like it would improve performance in general, but more so under load conditions when you actually need it. I am not sure if the -depth option is supported by all incarnations of 'find'. Given the way directory writes are cached by the filesystem, I'm not sure why the load at the time matters so much. If what you mean by that is being low on memory, that would make more sense to me. Anyway, -depth is in POSIX as of 2001. It seems to be in all the major SysV UNIX variants going back further than that, and of course it's in GNU find. But it looks like BSD-derived systems from before that POSIX standard originally called this -d instead. So there's some potential for this to not work on older systems; it works fine on my test FreeBSD 7.2 system. Maybe someone who has access to some ancient BSD-ish system can try this out? The simplest test case similar to what the patch adds is to check whether this runs, returning subdirectories in depth-first order before their parents: $ find / -depth -type d -print If that fails somewhere, it may turn out to require another configure check just to determine whether you can use this configure-time optimization. That's certainly possible to add to the patch if it gets committed and turns out to break one of the buildfarm members. It seems to me like this whole thing may be a bit more work than it's worth, given this is a fairly small and difficult to reproduce speedup in only one stage of the build process. I'd think that if configure takes longer than it has to because the system is heavily loaded, the amount by which compilation time suffers from that load would always dwarf this component of total build time. But if this was slow enough at some point to motivate you to write a patch for it, maybe that assumption is wrong.
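For anyone unsure what ordering `find -depth` produces, the same children-before-parents (bottom-up) traversal can be demonstrated portably; this is just an illustration of the ordering, not related to the patch itself:

```python
import os
import tempfile

def dirs_depth_first(root):
    """List directories bottom-up (children before parents) -- the order
    `find -depth -type d` prints on systems supporting that option."""
    return [d for d, _subdirs, _files in os.walk(root, topdown=False)]

# Tiny demo on a throwaway tree: the child directory comes out first.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "parent", "child"))
order = dirs_depth_first(root)
print(order.index(os.path.join(root, "parent", "child"))
      < order.index(os.path.join(root, "parent")))  # True
```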
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
Re: [HACKERS] SSI update
Kevin Grittner wrote: That went in an hour and a half before the CF deadline, but I just didn't feel right putting it into the CF in that shape. Then keep on working on it and we can revisit its state when you're happy with it. The purpose of the CommitFest cut-off is not to block work on long-term development just because a deadline passed; it's to make sure patches which might otherwise never get a review are looked at eventually. Finding a reviewer for these larger and more complicated patches is a very different sort of job than finding one for an average patch anyway. I could use a brief reminder of how this bit fits into the serializable lock consistency patch that's already sitting in the CF queue as Ready for Committer though. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
Re: [HACKERS] Instrument checkpoint sync calls
So my task list is: 0) Rebase against the HEAD that just had code related to this touched today 1) Assume that log_checkpoints is sufficient control over whether the timing overhead added is worth collecting, and therefore remove the half-baked idea of also wrapping with a compile-time option. 2) Have the sync summary returned upwards, so it can be put onto the same line as the rest of the log_checkpoint info. All seems reasonable to me. Will rev a new patch by tomorrow. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
Re: [HACKERS] Count backend self-sync calls
Jeff Janes wrote: Do you know where this competition is happening? Is it on the platters, or is it in the hard drive write cache (I thought high-end hardware had tagged writes to avoid that), or in the kernel? Kernel. Linux systems with lots of memory will happily queue up gigabytes of memory in their write cache, only getting serious about writing it out to disk when demanded to by fsync. This makes sense if we just need to append to a queue. But once the queue is full and we are about to do a backend fsync, might it make sense to do a little more work to look for dups? One of the paths I'd like to follow is experimenting with both sorting writes by file and looking for duplication in the queues. I think a basic, simple sync spreading approach needs to get finished first though; this sort of thing would then be an optimization on top of it. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
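Jeff's duplicate-hunting idea amounts to an order-preserving dedup pass over the request queue; a minimal sketch, assuming each request can be represented as a hashable (relfilenode, fork, segment) tuple:

```python
def compact_queue(requests):
    """Drop duplicate fsync requests while preserving arrival order;
    fsyncing a segment once covers every queued request for it."""
    seen = set()
    compacted = []
    for req in requests:
        if req not in seen:
            seen.add(req)
            compacted.append(req)
    return compacted

# Six queued requests touching only two distinct segments compact to two.
q = [(1663, 0, 0), (1663, 0, 1)] * 3
print(len(compact_queue(q)))  # 2
```

A hot table receiving repeated writes generates many requests for the same few segments, which is why compaction ratios like 32768-to-11 are plausible in practice.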
Re: [HACKERS] a new problem in MERGE
Boxuan Zhai wrote: I have a plan to fix the above two bugs together. (In fact, I have already started coding in the merge_v202 edition.) My question is how I should make my update consistent with yours. Is it possible for you to give me an edition that I can work on? I just got this reconciled with HEAD again. There are two changes I made in the code that you'll eventually want in your working copy: 1) Fix NIL/NULL confusion: https://github.com/greg2ndQuadrant/postgres/commit/9013ba9e81490e3623add1b029760817021297c0 2) Update ExecMerge to accept and pass through an oldtuple value. This is needed to make the code compatible with the PostgreSQL git HEAD after the changes made in the 2009-09 CommitFest. Bit rot updates made: https://github.com/greg2ndQuadrant/postgres/commit/be03bd201720f42a666f7e356ec8507c1357f502 I'm not sure whether the way I did (2) is correct for all cases, but at least the code compiles again now and the server will start. Attached is an updated patch that applies to HEAD as of right now, and that code has been pushed to https://github.com/greg2ndQuadrant/postgres/tree/merge-unstable with the changes rebased to be the last two commits. It fails make installcheck on my system. But as the initial diffs I looked at relate to enums and such, I don't think that's a problem with your patch. Will investigate further here with some of my own patches I'm working on today. Hopefully this is enough to unblock what you were looking for more details from me about. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD merge-v203-20101114.gz Description: GNU Zip compressed data
[HACKERS] pg_stat_bgwriter broken?
I'm behind on my list mail so maybe this has been mentioned already, but when I just tried pg_stat_bgwriter from a build against today's HEAD I got this: pgbench=# select count(*) FROM pg_stat_bgwriter; ERROR: did not find '}' at end of input node Can someone confirm if this broke recently, or is it just me? Last time I would have tested this myself was a few weeks ago. Regardless, I was thinking of adding some basic sanity checking on this view, run from the regression tests, to catch this class of problem in the future. It's kind of sloppy that this and the bgwriter counter reset aren't tested as part of make check. I think these two would always produce stable results: postgres=# SELECT count(*) FROM pg_stat_bgwriter; count --- 1 postgres=# SELECT pg_stat_reset_shared('bgwriter'); pg_stat_reset_shared -- (1 row) And the first one of those is similarly broken on my install. Thoughts on whether adding those to the regression tests would be a worthwhile patch? I'll do the work, just thinking out loud about the concept. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
Re: [HACKERS] pg_stat_bgwriter broken?
Tom Lane wrote: Worksforme. You probably need a full recompile and/or initdb. Yeah, sorry about the noise. This went away after some more intensive rebuilding. I think I still want to add some regression testing of this view as suggested. If that had been there, I'd have been a lot more confident it was my mistake, and not something that just slipped through without being tested. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
[HACKERS] Instrument checkpoint sync calls
The attached patch adds some logging for each individual fsync call made during a checkpoint, along with a summary at the end. You need to have the following to see all of the detail: log_checkpoints=on log_min_messages=debug1 And here's a sample: LOG: checkpoint starting: immediate force wait DEBUG: checkpoint sync: file=1 time=1.946000 msec DEBUG: checkpoint sync: file=2 time=0.666000 msec DEBUG: checkpoint sync: file=3 time=0.004000 msec LOG: checkpoint sync: files=3 longest=1.946000 msec average=0.872000 msec LOG: checkpoint complete: wrote 3 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.002 s, total=0.003 s I think that it's reasonable for the sort of people who turn log_checkpoints on to also get the sync summary line, thus its being logged at LOG level. The detail on individual syncs might go to DEBUG2 or lower instead of DEBUG1 where I put it; that part I don't have a strong opinion on. It's at DEBUG1 to make it easier to test the patch without a gigantic amount of other log data also coming in. Right now the code is written such that all the calls that grab timing information are wrapped in #ifdef DEBUG_FSYNC, a symbol set to 1 that could become a compile-time option like DEBUG_DEADLOCK, to allow turning this code path off at build time. I personally think that if you're already making an fsync call and have log_checkpoints on, the additional overhead of also timing that fsync is minimal even on platforms where timing is slow (I don't have such a system to test that assumption on, however). And I've seen enough value in troubleshooting nasty checkpoint sync problems using this patch to feel it's worth having even if it does add some overhead. I'm a little concerned about log_checkpoints changing on me in the middle of the execution of a checkpoint, which would cause some problems here. Not sure if that's something I should explicitly code for, given that all I think it will do is screw up one of the timing results.
It does seem a risk from the last minute self-review I just did of the code. I'll give a sample program that stresses the system, generating slow timing results and other types of bad behavior, along with the next patch I submit here shortly. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 1219fcf..c72da72 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -20,6 +20,7 @@
 #include "catalog/catalog.h"
 #include "miscadmin.h"
+#include "portability/instr_time.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
 #include "storage/bufmgr.h"
@@ -29,6 +30,11 @@
 #include "utils/memutils.h"
 #include "pg_trace.h"

+/*
+ * For now just leave this on all the time, eventually this will
+ * need to be a full compiler flag like DEBUG_DEADLOCK
+ */
+#define DEBUG_FSYNC 1

 /* interval for calling AbsorbFsyncRequests in mdsync */
 #define FSYNCS_PER_ABSORB 10
@@ -927,6 +933,15 @@ mdsync(void)
 	PendingOperationEntry *entry;
 	int			absorb_counter;

+#ifdef DEBUG_FSYNC
+	/* Statistics on sync times */
+	int			processed = 0;
+	instr_time	sync_start, sync_end, diff;
+	double		elapsed;
+	double		longest = 0;
+	double		total_elapsed = 0;
+#endif
+
 	/*
 	 * This is only called during checkpoints, and checkpoints should only
 	 * occur in processes that have created a pendingOpsTable.
@@ -1069,9 +1084,32 @@ mdsync(void)
 				seg = _mdfd_getseg(reln, entry->tag.forknum,
 							entry->tag.segno * ((BlockNumber) RELSEG_SIZE),
 							false, EXTENSION_RETURN_NULL);
+
+#ifdef DEBUG_FSYNC
+				if (log_checkpoints)
+					INSTR_TIME_SET_CURRENT(sync_start);
+#endif
 				if (seg != NULL &&
 					FileSync(seg->mdfd_vfd) >= 0)
+				{
+#ifdef DEBUG_FSYNC
+					if (log_checkpoints)
+					{
+						INSTR_TIME_SET_CURRENT(sync_end);
+						diff = sync_end;
+						INSTR_TIME_SUBTRACT(diff, sync_start);
+						elapsed = (double) INSTR_TIME_GET_MICROSEC(diff) / 1000.0;
+						if (elapsed > longest)
+							longest = elapsed;
+						total_elapsed += elapsed;
+						processed++;
+						/* Reduce the log level here to DEBUG2 in final patch? */
+						elog(DEBUG1, "checkpoint sync: file=%d time=%f msec",
+							 processed, elapsed);
+					}
+#endif
 					break;		/* success; break out of retry loop */
+				}

 				/*
 				 * XXX is there any point in allowing more than one retry?
@@ -1113,6 +1151,13 @@ mdsync(void)
 			elog(ERROR, "pendingOpsTable corrupted");
 	}							/* end loop over hashtable entries */

+#ifdef DEBUG_FSYNC
+	/* Report on sync performance metrics */
+	if (log_checkpoints && (processed > 0))
+		elog(LOG, "checkpoint sync: files=%d longest=%f msec average=%f msec",
+			 processed, longest, total_elapsed / processed);
+#endif
+
 	/* Flag successful completion of mdsync */
 	mdsync_in_progress = false;
 }
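As an illustration of the measurement pattern, here is a rough Python analogue of the timing the patch wraps around each FileSync() call with the INSTR_TIME macros; this is for explanation only, not the backend code itself:

```python
import os
import tempfile
import time

def timed_fsync(fd):
    """Time one fsync call, returning elapsed milliseconds."""
    start = time.perf_counter()
    os.fsync(fd)
    return (time.perf_counter() - start) * 1000.0

# Accumulate the same summary stats the patch reports: count, longest, average.
processed = 0
longest = total_elapsed = 0.0
fd, path = tempfile.mkstemp()
try:
    for _ in range(3):
        os.write(fd, b"x" * 8192)
        msec = timed_fsync(fd)
        longest = max(longest, msec)
        total_elapsed += msec
        processed += 1
finally:
    os.close(fd)
    os.unlink(path)
print("files=%d longest=%f msec average=%f msec"
      % (processed, longest, total_elapsed / processed))
```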