[HACKERS] Foreign Visual Studio builds
We got an interesting documentation comment left by Christian Freund recently, in regards to http://www.postgresql.org/docs/9.0/interactive/install-windows-full.html ; it says:

Regarding 16.1.3 - "perl mkvcbuild.pl" In case you use a German version of VC change line 69 in "Solution.pm" to: if ($line !~ /^Microsoft\s*\(R\) Visual C\+\+-Projekt-Generator - \D+(\d+)\.00\.\d+/)

I'm going to approve the doc comment so that it might help someone else who runs into this in the short-term, but I thought it was worth sharing as something that might be improved in the build code instead. That regex seems a bit too specific; a sketch of a less locale-sensitive check follows below.

-- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
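[Sketch referenced above] The direction meant here, as a rough untested sketch rather than a drop-in change to Solution.pm (the surrounding loop and variable names in that file differ, and this assumes the numeric version string is the only part of the vcbuild banner that isn't translated):

    # Hypothetical locale-agnostic version check: ignore the translated
    # product name entirely and only look for the version number pattern.
    my $vsversion;
    if ($line =~ /(\d+)\.00\.\d+/)
    {
        $vsversion = $1;    # e.g. "9" for the Visual C++ 2008 tools
    }

Whether that could false-match some other line of the banner in one of the localized outputs is exactly what would need checking before changing the build code.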
Re: [HACKERS] Debugging initdb breakage
David Fetter wrote: Where should we preserve this, other than the mailing list archives? http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD has most of the other trivia in this area, so I just added Alvaro's technique to the bottom of it with a quick intro to add some context. -- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Simplifying replication
Josh Berkus wrote: It is critical that we make replication easier to set up, administrate and monitor than it currently is. In my conversations with people, this is more important to our users and the adoption of PostgreSQL than synchronous replication is. You should enjoy one of the patches we're furiously working on then, which is aiming at some of the administration and monitoring pieces here. I have my own grand vision of how easy replication should be to setup too. Visions and plans are nice, but building functional pieces of them and delivering them to the community is what actually moves PostgreSQL forward. So far, multiple people have done that for sync rep, and what we're supposed to be focused on at this stage in the development cycle is finishing the work related to the open CommitFest item that includes that. I find this launch into a new round of bike-shedding a bit distracting. If you want this to be easier to use, which it's obvious to any observer it should be because what's delivered in 9.0 is way too complicated, please work on finding development resources to assign to that problem. Because that's the bottleneck on simplifying things, not ideas about what to do. I would recommend finding or assigning a developer to work on integrating base backup in to the streaming protocol as the biggest single thing that would improve the built-in replication. All of the rest of the trivia about what knobs to set and such are tiny details that make for only a minor improvement until that's taken care of. -- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Simplifying replication
Josh Berkus wrote: *shrug*. Robert asked me to write it up for the list based on the discussions around synch rep. Now you're going to bash me for doing so?

Sorry, next time I'll make sure to bash Robert too. I don't have any problems with the basic ideas you're proposing, just concerns about when the right time to get into that whole giant subject is and who is going to work on it.

-- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_wal_senders must die
Josh Berkus wrote: Under what bizarre set of circumstances would anyone have runaway connections from replicas to the master?

Cloud computing deployments where additional replicas are brought up automatically in response to demand. It's easy to imagine a situation where a standby instance is spawned, starts to sync, and that additional load triggers *another* standby to come on board; repeat until the master is doing nothing but servicing standby sync requests. max_wal_senders provides a safety valve for that.

I think Magnus's idea to bump the default to 5 triages the worst of the annoyance here, without dropping the feature (which has uses) or waiting for new development to complete. I'd be in favor of just committing that change right now, before it gets forgotten about, and then if nobody else gets around to further work at least something improved here for 9.1.

-- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_wal_senders must die
Josh Berkus wrote: Well, now that you mention it, I also think that "hot standby" should be the default. Yes, I know about the overhead, but I also think that the number of our users who want easy replication *far* outnumber the users who care about an extra 10% WAL overhead. I think this whole situation is similar to the resistance to increasing default_statistics_target. There's additional overhead added by enabling both of these settings, in return for making it more likely for the average person to see useful behavior by default. If things are rejiggered so the advanced user has to turn things off in order to get optimal performance when not using these features, in return for the newbie being more likely to get things working in basic form, that's probably a net win for PostgreSQL advocacy. But much like default_statistics_target, there needs to be some more formal work done on quantifying just how bad each of these overheads really are first. We lost 3-7% on multiple simple benchmarks between 8.3 and 8.4[1] because of that retuning for ease of use on real-world workloads, and that's not something you want to repeat too often. [1] http://suckit.blog.hu/2009/09/29/postgresql_history and the tests Jignesh did while at Sun -- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_wal_senders must die
Josh Berkus wrote: If we could agree on some workloads, I could run some benchmarks. I'm not sure what those would be though, given that COPY and ALTER TABLE aren't generally included in most benchmarks. You can usefully and easily benchmark this by timing a simple pgbench initialization at a decently large scale. The COPY used to populate the giant accounts table takes advantage of the WAL bypass fast path if available, and you can watch performance tank the minute one of the options that disables it is turned on. -- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
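For anyone who wants to reproduce the comparison just described, a rough sketch; the scale and settings are only examples, each configuration change needs a server restart, and the first dropdb will complain harmlessly if the database doesn't exist yet:

    # Run 1: wal_level = minimal in postgresql.conf, so the COPY done by
    # pgbench -i can use the WAL bypass fast path.
    dropdb pgbench; createdb pgbench
    time pgbench -i -s 100 pgbench

    # Run 2: wal_level = archive (or hot_standby), which disables the bypass.
    dropdb pgbench; createdb pgbench
    time pgbench -i -s 100 pgbench

The difference between the two initialization times is essentially the overhead being argued about here.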
Re: [HACKERS] ask for review of MERGE
Robert Haas wrote: I think the right way to write UPSERT is something along the lines of: MERGE INTO Stock t USING (VALUES (10, 1)) s(item_id, balance) ON s.item_id = t.item_id ...

That led in the right direction; after a bit more fiddling I was finally able to get something that does what I wanted: a single table UPSERT implemented with this MERGE implementation. Here's a log of a test session, suitable for eventual inclusion in the regression tests:

CREATE TABLE Stock(item_id int UNIQUE, balance int);
INSERT INTO Stock VALUES (10, 2200);
INSERT INTO Stock VALUES (20, 1900);

SELECT * FROM Stock ORDER BY item_id;
 item_id | balance
---------+---------
      10 |    2200
      20 |    1900

MERGE INTO Stock t
 USING (VALUES(10,100)) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;
MERGE 1

SELECT * FROM Stock ORDER BY item_id;
 item_id | balance
---------+---------
      10 |    2300
      20 |    1900

MERGE INTO Stock t
 USING (VALUES(30,2000)) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;
MERGE 1

SELECT * FROM Stock ORDER BY item_id;
 item_id | balance
---------+---------
      10 |    2300
      20 |    1900
      30 |    2000

I'm still a little uncertain as to whether any of my other examples should have worked under the spec but just didn't work here, but I'll worry about that later. Here's what the query plan looks like on a MATCH:

Merge  (cost=0.00..8.29 rows=1 width=22) (actual time=0.166..0.166 rows=0 loops=1)
  Action 1: Update When Matched
  Action 2: Insert When Not Mactched
  MainPlan:
  ->  Nested Loop Left Join  (cost=0.00..8.29 rows=1 width=22) (actual time=0.050..0.061 rows=1 loops=1)
        ->  Values Scan on "*VALUES*"  (cost=0.00..0.01 rows=1 width=8) (actual time=0.009..0.010 rows=1 loops=1)
        ->  Index Scan using stock_item_id_key on stock t  (cost=0.00..8.27 rows=1 width=14) (actual time=0.026..0.030 rows=1 loops=1)
              Index Cond: ("*VALUES*".column1 = item_id)
Total runtime: 0.370 ms

And here's a miss:

Merge  (cost=0.00..8.29 rows=1 width=22) (actual time=0.145..0.145 rows=0 loops=1)
  Action 1: Update When Matched
  Action 2: Insert When Not Mactched
  MainPlan:
  ->  Nested Loop Left Join  (cost=0.00..8.29 rows=1 width=22) (actual time=0.028..0.033 rows=1 loops=1)
        ->  Values Scan on "*VALUES*"  (cost=0.00..0.01 rows=1 width=8) (actual time=0.004..0.005 rows=1 loops=1)
        ->  Index Scan using stock_item_id_key on stock t  (cost=0.00..8.27 rows=1 width=14) (actual time=0.015..0.015 rows=0 loops=1)
              Index Cond: ("*VALUES*".column1 = item_id)
Total runtime: 0.255 ms

Next steps here:
1) Performance/concurrency tests against the trigger-based UPSERT approach.
2) Finish bit rot cleanup against HEAD.
3) Work out more complicated test cases to try and find more unexpected behavior edge cases and general bugs.

-- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ask for review of MERGE
There are now two branches of MERGE code in progress available for review. http://github.com/greg2ndQuadrant/postgres/tree/merge-unstable has the bit-rotted version that doesn't quite work against HEAD yet, while http://github.com/greg2ndQuadrant/postgres/tree/merge is what I'm still testing against until I get that sorted out.

Attached is a tar file containing an initial concurrent MERGE test case. What it does is create a pgbench database with a scale of 1. Then it runs a custom test that does an UPSERT using MERGE against pgbench_accounts, telling pgbench the database scale is actually 2. This means that approximately half of the statements executed will hit the MATCHED side of the merge and update existing rows, while half will hit the NOT MATCHED one and create new account records. Did some small tests using pgbench's debug mode where you see all the statements it executes, and all the output looked correct. Successive test runs are not deterministically perfect performance comparisons, but with enough transactions I hope that averages out.

For comparison's sake, there's an almost identical test case that does the same thing using the pl/pgsql example function from the PostgreSQL manual for the UPSERT instead also in there. You just run test-merge.sh or test-function.sh and it runs the whole test, presuming you built and installed pgbench and don't mind that it will blow away any database named pgbench you already have.

Since the current MERGE implementation is known not to be optimized for concurrency, my main goal here wasn't to see how fast it was. That number is interesting though. With the sole postgresql.conf change of:

checkpoint_segments=32

And a debug/assertion build using 8 concurrent clients, I got 1607 TPS of UPSERT out of the trigger approach @ 6MB/s of writes to disk, while the MERGE one delivered 988 TPS @ 4.5MB/s of writes. Will explore this more later.

This did seem to find a bug in the implementation after running for a while:

TRAP: FailedAssertion("!(epqstate->origslot != ((void *)0))", File: "execMain.c", Line: 1732)

The line number there is relative to what you can see at http://github.com/greg2ndQuadrant/postgres/blob/merge/src/backend/executor/execMain.c and that's a null pointer check at the beginning of EvalPlanQualFetchRowMarks. Haven't explored why this happened or how repeatable it is, but since Boxuan was looking for some bugs to chase I wanted to deliver him one to chew on.

While the performance doesn't need to be great in V1, there needs to be at least some minimal protection against concurrency issues before this is commit quality. Will continue to shake this code out looking for them now that I have some basic testing that works for doing so.

-- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us

merge-test-v1.tar Description: Unix tar archive

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ask for review of MERGE
Marko Tiikkaja wrote: What's been bothering me is that so far there has not been an agreement on whether we need to protect against concurrency issues or not. In fact, there has been almost no discussion about the concurrency issues which AIUI have been the biggest single reason we don't have MERGE already.

Sure there were, they just happened a long time ago. I tried to summarize the previous discussion at http://wiki.postgresql.org/wiki/SQL_MERGE with the following: "PostgreSQL doesn't have a good way to lock access to a key value that doesn't exist yet--what other databases call key range locking...Improvements to the index implementation are needed to allow this feature." You can sift through the other archive links in there, the discussion leading up to that conclusion is in there somewhere. And we had a discussion of this at the last developer's meeting where this was identified as a key issue to sort out before this was done; see the table at the end of http://wiki.postgresql.org/wiki/PgCon_2010_Developer_Meeting and note the warning associated with this feature.

The deadlock on developing this feature has been that there's little benefit to building key range locking on its own, or even a good way to prove that it works, without a visible feature that uses it. It's much easier to justify development time on if the rest of MERGE is known to work, and just needs that plugged in to improve concurrency. And since Boxuan appeared at the right time to do the syntax and implementation part first as well, that's the order it's ended up happening in. We're only making the concurrency part a second priority right now in order to break the development into manageable chunks.

-- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ask for review of MERGE
Robert Haas wrote: I am also wondering exactly what the semantics are supposed to be under concurrency. We can't assume that it's OK to treat WHEN NOT MATCHED as WHEN MATCHED if a unique-key violation is encountered. The join condition could be something else entirely. I guess we could scan the target table with a dirty snapshot and block on any in-doubt tuples, similar to what EXCLUDE constraints do internally, but throwing MVCC out the window doesn't seem right either.

You've answered Tom's question of why not to just INSERT and trap the error here. Presuming that a unique key violation is what will kick back in this situation, and just following the other side when it does, doesn't cover all the ways this can get used.

I'm thinking of this problem now as being like the one that happens when a concurrent UPDATE happens. To quote from the docs here: "a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the would-be updater will wait for the first updating transaction to commit or roll back (if it is still in progress). If the first updater rolls back, then its effects are negated and the second updater can proceed with updating the originally found row. If the first updater commits, the second updater will ignore the row if the first updater deleted it, otherwise it will attempt to apply its operation to the updated version of the row. The search condition of the command (the WHERE clause) is re-evaluated to see if the updated version of the row still matches the search condition. If so, the second updater proceeds with its operation using the updated version of the row."

What I think we can do with adding a key reservation is apply the same sort of logic to INSERTs too, given a way to lock an index entry before the row itself is complete. If MERGE hits a WHERE condition that finds a key lock entry in the index, it has to sit and wait for that other command to finish. There isn't any other sensible behavior I can see in that situation, just like there isn't one for UPDATE right now.

If the transaction that was locking the index entry commits, it has to start over with re-evaluating the WHEN MATCHED part (the equivalent of the WHERE in the UPDATE case). But it shouldn't need to do a new JOIN again, because that condition was proven to be met already by the lock squatting on that index key, without even having the rest of the row there yet to check its data. If the other entry commits, it would then follow the WHEN MATCHED path and in the UPSERT case do an UPDATE. If the transaction that had the key reservation rolls back, the re-evaluation of the MERGE WHEN MATCHED will fail, at which point it follows the WHEN NOT MATCHED INSERT path. As it will now be the locker of the key reservation it had been waiting on earlier at that point, it will be free to do the INSERT with no concerns about race conditions or retries. In the SERIALIZABLE case, again just try to follow the same slightly different rules that exist for concurrent UPDATE commands.

There may be a tricky point here that we will need to warn people that UPSERT implementations need to make sure they only reference the columns of the primary key in the MERGE conditions for this to work as expected, because otherwise they might get back a unique key violation error anyway. I don't know how other databases deal with that particular problem.
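To make the race in question concrete, here is a sketch of the kind of two-session test being considered, reusing the Stock example from earlier in this thread; how the current patch actually behaves here is exactly what needs testing, so don't read this as a statement of its behavior:

-- Session 1
BEGIN;
MERGE INTO Stock t USING (VALUES(40,500)) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance);
-- No row for item_id=40 yet, so this takes the INSERT path,
-- but does not commit yet.

-- Session 2, concurrently
MERGE INTO Stock t USING (VALUES(40,700)) AS s(item_id,balance)
 ON s.item_id=t.item_id
 WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
 WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance);
-- Also sees no row for item_id=40; without something like a key
-- reservation it will head down the INSERT path too.

-- Session 1
COMMIT;
-- With key reservation semantics, session 2 should have blocked on the
-- reserved key, then re-checked WHEN MATCHED and done the UPDATE;
-- without it, the likely outcome is a unique key violation in session 2.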
I have considered following the somewhat distasteful path of installing SQL Server or Oracle here just to see how they handle some of the pathological test cases I'm now thinking about for this command.

This particular weak spot in MVCC is already there for UPDATE, and this approach seems to me to be the most logical way to extend the same basic implementation to also eliminate some concurrent MERGE race conditions, including the most common one the UPSERT case is stuck with. But I have no intention of halting work on the basic MERGE to wait for this problem to be solved even if I'm completely wrong about what I've outlined above. That style of thinking--don't even start until a perfect solution for every possible problem is known--is something I consider a problem in this community's development model, one that has contributed to why no work has been done on this feature until now. This one splits nicely into "get the implementation working" and "improve the concurrency" parts, and I haven't heard a good reason yet why those need to be coupled together.

-- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ask for review of MERGE
Robert Haas wrote: Well, there's no guarantee that any index at all exists on the target table, so any solution based on index locks can't be fully general.

Sure, but in the most common use case I think we're targeting at first, no indexes means there are also no unique constraints. I don't think people can reasonably expect MERGE or anything else to guarantee they won't accidentally create two rows that conflict via the terms of some non-database enforced key. I am too brain-dead this particular Sunday to think of exactly how to deal with the 3-tuple case you described, except to say that I think it's OK for complicated situations to give up and throw a serialization error. I'm already collecting a list of pathological tests and will try to add something based on your problem case, then see what we can do with it later.

-- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sorted writes for checkpoints
Itagaki Takahiro wrote: When I submitted the patch, I tested it on disk-based RAID-5 machine: http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php But there were no additional benchmarking reports at that time. We still need benchmarking before we re-examine the feature. For example, SSD and SSD-RAID was not popular at that time, but now they might be considerable.

I did multiple rounds of benchmarking that, just none of it showed any improvement, so I didn't bother reporting the results in detail. I have recently figured out why the performance testing I did of that earlier patch probably failed to produce useful results on my system when I was testing it back then though. It relates to trivia around how ext3 handles fsync that's well understood now (the whole cache flushes out when one comes in), but wasn't back then yet.

We have a working set of patches here that both rewrite the checkpoint logic to avoid several larger problems with how it works now, as well as adding instrumentation that makes it possible to directly measure and graph whether methods such as sorting writes provide any improvement or not to the process. My hope is to have those all ready for initial submission as part of CommitFest 2010-11, as the main feature addition from myself toward improving 9.1. I have a bunch of background information about this I'm presenting at PGWest next week, after which I'll start populating the wiki with more details and begin packaging the code too. I had hoped to revisit the checkpoint sorting details after that. Jeff or yourself are welcome to try your own tests in that area, I could use the help. But I think my measurement patches will help you with that considerably once I release them in another couple of weeks. Seeing a graph of latency sync times for each file is very informative for figuring out whether a change did something useful, more so than just staring at total TPS results. Such latency graphs are what I've recently started to do here, with some server-side changes that then feed into gnuplot.

The idea of making something like the sorting logic into a pluggable hook seems like a waste of time to me, particularly given that the earlier implementation really needed to be allocated a dedicated block of shared memory to work well IMHO (and I believe that's still the case). That area isn't where the real problems are at here anyway, especially on large memory systems. How the sync logic works is the increasingly troublesome part of the checkpoint code, because the problem it has to deal with grows proportionately to the size of the write cache on the system. Typical production servers I deal with have about 8X as much RAM now as they did in 2007 when I last investigated write sorting. Regular hard drives sure haven't gotten 8X faster since then, and battery-backed caches (which used to have enough memory to absorb a large portion of a checkpoint burst) have at best doubled in size.

-- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] SR fails to send existing WAL file after off-line copy
Last week we got this report from Matt Chesler: http://archives.postgresql.org/pgsql-admin/2010-10/msg00221.php that he was getting errors when trying to do a simple binary replication test. The problem is that what appears to be a perfectly good WAL segment doesn't get streamed to the standby. No one responded. One of our testers just ran into the same thing. I just investigated, and I'm baffled as to what's going on myself. Can't tell if this is a bug or an under-documented restriction, but this makes two reports of the problem now. (Mine is happening on a standard 9.0.0 RPM set, didn't notice any changes in 9.0.1 that would impact this; afraid to upgrade while I have a repeatable test case for this sitting here) The setup is intended to get a simple test replication setup going without even having to do the whole pg_start_backup shuffle, by copying the whole server directory when it's down. Basic steps are: -Follow the first set of instructions at http://wiki.postgresql.org/wiki/Binary_Replication_Tutorial to setup a master compatible with replication, then duplicate it after stopping it using rsync. Note that you may have to manually create an empty pg_xlog directory on the standby, depending on what was there before you started replication. To rule out one possible source of problems here, I made an additional change on the master not listed there: [mas...@pyramid pg_log]$ psql -d postgres -c "show wal_keep_segments" wal_keep_segments --- 10 I wondered if having this set to 0 (the default) was causing the issue, thinking perhaps it doesn't do any looking for existing segments at all in that situation. Problem still happens for me. -Create a recovery.conf pointing to the master as described in the tutorial. -Start the standby. Make sure that it has reached the point where it's requesting WAL segments from the master; you want to see it looping doing periodic "FATAL: could not connect to the primary server: could not connect to server: Connection refused" before touching the master. -Start the master What I expect to happen now is that the current WAL file that was in progress at the point the data was copied over gets streamed over. That doesn't seem to happen. Instead, I see this on the standby: LOG: streaming replication successfully connected to primary FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 0001 has already been removed This on the master: LOG: replication connection authorized: user=rep host=127.0.0.1 port=52571 FATAL: requested WAL segment 0001 has already been removed Which is confusing because that file is certainly on the master still, and hasn't even been considered archived yet much less removed: [mas...@pyramid pg_log]$ ls -l $PGDATA/pg_xlog -rw--- 1 master master 16777216 Oct 31 16:29 0001 drwx-- 2 master master 4096 Oct 4 12:28 archive_status [mas...@pyramid pg_log]$ ls -l $PGDATA/pg_xlog/archive_status/ total 0 So why isn't SR handing that data over? Is there some weird unhandled corner case this exposes, but that wasn't encountered by the systems the tutorial was tried out on? I'm not familiar enough with the SR internals to reason out what's going wrong myself yet. Wanted to validate that Matt's report wasn't a unique one though, with a bit more detail included about the state the system gets into, and one potential fix (increasing wal_keep_segments) already tried without improvement. 
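For anyone trying to reproduce this, the standby side here is just the minimal configuration from that tutorial; roughly this, with host and port being placeholders for the test box's actual values:

    # recovery.conf on the standby
    standby_mode = 'on'
    primary_conninfo = 'host=127.0.0.1 port=5432 user=rep'

The master side has the usual settings the tutorial walks through: wal_level above minimal, max_wal_senders > 0, plus the wal_keep_segments=10 shown above.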
-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] SR fails to send existing WAL file after off-line copy
Heikki Linnakangas wrote: Yes, indeed there is a corner-case bug when you try to stream the very first WAL segment, with log==seg==0. I confirmed that the bug exists in only this case by taking my problem install and doing this: psql -d postgres -c "checkpoint; select pg_switch_xlog();" To force it to the next xlog file. With only that change, everything else then works. So we'll just need to warn people about this issue and provide that workaround, as something that only trivial new installs without much data loaded into them are likely to run into, until 9.0.2 ships with your fix in it. I'll update the docs on the wiki accordingly, once I've recovered from this morning's flight out West. I forgot to credit Robert Noles here for rediscovering this bug on one of our systems and bringing it to my attention. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+
Tom Lane wrote: If open_dsync is so bad for performance on Linux, maybe it's bad everywhere? Should we be rethinking the default preference order? And I've seen the expected sync write performance gain over fdatasync on a system with a battery-backed cache running VxFS on Linux, because working open_[d]sync means O_DIRECT writes bypassing the OS cache, and therefore reducing cache pollution from WAL writes. This doesn't work by default on Solaris because they have a special system call you have to execute for direct output, but if you trick the OS into doing that via mount options you can observe it there too. The last serious tests of this area I saw on that platform were from Jignesh, and they certainly didn't show a significant performance regression running in sync mode. I vaguely recall seeing a set once that showed a minor loss compared to fdatasync, but it was too close to make any definitive statement about reordering. I haven't seen any report yet of a serious performance regression in the new Linux case that was written by someone who understands fully how fsync and drive cache flushing are supposed to interact. It's been obvious for a year now that the reports from Phoronix about this had no idea what they were actually testing. I didn't see anything from Marti's report that definitively answers whether this is anything other than Linux finally doing the right thing to flush drive caches out when sync writes happen. There may be a performance regression here related to WAL data going out in smaller chunks than it used to, but in all the reports I've seen it that hasn't been isolated well enough to consider making any changes yet--to tell if it's a performance loss or a reliability gain we're seeing. I'd like to see some output from the 9.0 test_fsync on one of these RHEL6 systems on a system without a battery backed write cache as a first step here. That should start to shed some light on what's happening. I just bumped up the priority on the pending upgrade of my spare laptop to the RHEL6 beta I had been trying to find time for, so I can investigate this further myself. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+
Tom Lane wrote: I'm hoping that Greg Smith will take the lead on testing this, since he seems to have spent the most time in the area so far. It's not coincidence that the chapter of my book I convinced the publisher to release as a sample is the one that covers this area; this mess has been visibly approaching for some time now. I'm going to put RHEL6 onto a system and start collecting some proper slowdown numbers this week, then pass along a suggested test regime for others. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sorted writes for checkpoints
Jeff Janes wrote: Assuming the ordering is useful, the only way the OS can do as good a job as the checkpoint code can, is if the OS stores the entire checkpoint worth of data as dirty blocks and doesn't start writing until an fsync comes in. This strikes me as a pathologically configured OS/FS. (And would explain problems with fsyncs) This can be exactly the situation with ext3 on Linux, which I believe is one reason the write sorting patch didn't go anywhere last time it came up--that's certainly what I tested it on. The slides for my talk "Righting Your Writes" are now up at http://projects.2ndquadrant.com/talks and an example showing this is on page 9. I'm hoping to get the 3 patches shown in action or described in that talk submitted to the list before the next CommitFest. You really need timing of individual sync calls to figure out what's going on here, and what happens is completely dependent on filesystem. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] CommitFest 2010-11: Call for Reviewers
It's the time of year again where leaves are falling, temperatures dropping, and patches cry out looking for one last review before the start of the holiday season. The patches submitted so far are listed at https://commitfest.postgresql.org/action/commitfest_view?id=8 , featuring a mix of brand new ones with some going through their second or third round of rework. Expect to see a furious weekend of last minute patches that arrive just before the deadline too. The official cut-off to start the CommitFest is Mon Nov 15 00:00:00 UTC. If you're interested in reviewing a patch but haven't done so before, the process is outlined at the following page: http://wiki.postgresql.org/wiki/Reviewing_a_Patch http://wiki.postgresql.org/wiki/RRReviewers All you have to do to "claim" a patch is update the CommitFest application page to put your name down as the reviewer, then do the review in a timely fashion. Look for the patches with the bright red "Nobody" listed in the reviewer field. If you have questions about which patch to work on, please try to keep e-mail traffic related to topics like patch selection and volunteering for a new round-robin assignment to the pgsql-rrreviewers mailing list. Whereas the pgsql-hackers list is the right destination for the actual review itself. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] a new problem in MERGE
Boxuan Zhai wrote: I have plan to fix the above two bugs together. (in fact, I have already started coding in merge_v202 edition). My question is how should I make my update be consistent with yours. Is it possible for you to give me an edition that I can work on? I just got this reconciled with HEAD again. There have been two changes I made in the code you'll eventually want in your working copy: 1) Fix NIL/NULL confusion: https://github.com/greg2ndQuadrant/postgres/commit/9013ba9e81490e3623add1b029760817021297c0 2) Update ExecMerge to accept and pass through an oldtuple value. This is needed to make the code compatible with the PostgreSQL git HEAD after the changes made in the 2009-09 CommitFest. Bit rot updates made: https://github.com/greg2ndQuadrant/postgres/commit/be03bd201720f42a666f7e356ec8507c1357f502 I'm not sure if how I did (2) is correct for all cases, but at least the code compiles again now and the server will start. Attached is an updated patch that applies to HEAD as of right now, and that code has been pushed to https://github.com/greg2ndQuadrant/postgres/tree/merge-unstable with the changes rebased to be the last two commits. It fails "make installcheck" on my system. But as the initial diffs I looked at relate to enums and such, I don't think that's a problem with your patch. Will investigate further here with some of my own patches I'm working on today. Hopefully this is enough to unblock what you were looking for more details from me about. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD merge-v203-20101114.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] pg_stat_bgwriter broken?
I'm behind on my list mail so maybe this has been mentioned already, but when I just tried pg_stat_bgwriter from a build against today's HEAD I got this: pgbench=# select count(*) FROM pg_stat_bgwriter; ERROR: did not find '}' at end of input node Can someone confirm if this broke recently, or is it just me? Last time I would have tested this myself was a few weeks ago. Regardless, I was thinking of adding some basic sanity checking on this view, that runs from the regression tests to catch this class of problem in the future. It's kind of sloppy that this and the bgwriter counter reset aren't tested as part of "make check". I think these two would always produce stable results: postgres=# SELECT count(*) FROM pg_stat_bgwriter; count --- 1 postgres=# SELECT pg_stat_reset_shared('bgwriter'); pg_stat_reset_shared -- (1 row) And the first one of those is similarly broken on my install. Thoughts on whether adding those to the regression tests would be a worthwhile patch? I'll do the work, just thinking out loud about the concept. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_stat_bgwriter broken?
Tom Lane wrote: Worksforme. You probably need a full recompile and/or initdb Yeah, sorry about the noise. This went away after some more intensive rebuilding. I think I still want to add some regression testing of this view as suggested. If that had been there, I'd have been a lot more confident it was my mistake, and not something that just slipped through without being tested. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Instrument checkpoint sync calls
Attached patch adds some logging for each individual fsync call made during a checkpoint, along with a summary at the end. You need to have the following to see all of the detail: log_checkpoints=on log_min_messages=debug1 And here's a sample: LOG: checkpoint starting: immediate force wait DEBUG: checkpoint sync: file=1 time=1.946000 msec DEBUG: checkpoint sync: file=2 time=0.666000 msec DEBUG: checkpoint sync: file=3 time=0.004000 msec LOG: checkpoint sync: files=3 longest=1.946000 msec average=0.872000 msec LOG: checkpoint complete: wrote 3 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.002 s, total=0.003 s I think that it's reasonable for the sort of people who turn log_checkpoints on to also get the sync summary line, thus it being logged at LOG level. The detail on individual syncs might go to DEBUG2 or lower instead of DEBUG1 where I put it, that part I don't have a strong opinion on. It's at DEBUG1 to make testing the patch without a gigantic amount of log data also coming in easier. Right now the code is written such that all the calls that grab timing information are wrapped around "ifdef DEBUG_FSYNC", which is a variable set to 1 that could be a compile-time option like DEBUG_DEADLOCK, to allow turning this code path off at build time. I personally think that if you're already making an fsync call and have log_checkpoints on, the additional overhead of also timing that fsync is minimal even on platforms where timing is slow (I don't have such a system to test that assumption however). And I've seen enough value in troubleshooting nasty checkpoint sync problems using this patch to feel it's worth having even if it does add some overhead. I'm a little concerned about log_checkpoints changing on me in the middle of the execution of a checkpoint, which would cause some problems here. Not sure if that's something I should explicitly code for, given that all I think it will do is screw up one of the timing results. It does seem a risk from the last minute self-review I just did of the code. I'll give a sample program that stresses the system, generating slow timing results and other types of bad behavior, along with the next patch I submit here shortly. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 1219fcf..c72da72 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -20,6 +20,7 @@ #include "catalog/catalog.h" #include "miscadmin.h" +#include "portability/instr_time.h" #include "postmaster/bgwriter.h" #include "storage/fd.h" #include "storage/bufmgr.h" @@ -29,6 +30,11 @@ #include "utils/memutils.h" #include "pg_trace.h" +/* + * For now just leave this on all the time, eventually this will + * need to be a full compiler flag like DEBUG_DEADLOCK + */ +#define DEBUG_FSYNC 1 /* interval for calling AbsorbFsyncRequests in mdsync */ #define FSYNCS_PER_ABSORB 10 @@ -927,6 +933,15 @@ mdsync(void) PendingOperationEntry *entry; int absorb_counter; +#ifdef DEBUG_FSYNC + /* Statistics on sync times */ + int processed = 0; + instr_time sync_start,sync_end,diff; + double elapsed; + double longest = 0; + double total_elapsed = 0; +#endif + /* * This is only called during checkpoints, and checkpoints should only * occur in processes that have created a pendingOpsTable. 
@@ -1069,9 +1084,32 @@ mdsync(void) seg = _mdfd_getseg(reln, entry->tag.forknum, entry->tag.segno * ((BlockNumber) RELSEG_SIZE), false, EXTENSION_RETURN_NULL); + +#ifdef DEBUG_FSYNC +if (log_checkpoints) + INSTR_TIME_SET_CURRENT(sync_start); +#endif if (seg != NULL && FileSync(seg->mdfd_vfd) >= 0) +{ + +#ifdef DEBUG_FSYNC + if (log_checkpoints) + { + INSTR_TIME_SET_CURRENT(sync_end); + diff=sync_end; + INSTR_TIME_SUBTRACT(diff, sync_start); + elapsed = (double) INSTR_TIME_GET_MICROSEC(diff) / 1000.0; + if (elapsed > longest) + longest = elapsed; + total_elapsed+=elapsed; + processed++; + /* Reduce the log level here to DEBUG2 in final patch? */ + elog(DEBUG1, "checkpoint sync: file=%d time=%f msec", processed, elapsed); + } +#endif break; /* success; break out of retry loop */ + } /* * XXX is there any point in allowing more than one retry? @@ -1113,6 +1151,13 @@ mdsync(void) elog(ERROR, "pendingOpsTable corrupted"); } /* end loop over hashtable entries */ +#ifdef DEBUG_FSYNC + /* Report on sync performance metrics */ + if (log_checkpoints && (processed > 0)) + elog(LO
[HACKERS] Count backend self-sync calls
The attached patch adds a new field to pg_stat_bgwriter, counting the number of times backends execute their own fsync calls. Normally, when a backend needs to fsync data, it passes a request to the background writer, which then absorbs the call into its own queue of work to do. However, under some types of heavy system load, the associated queue can fill. When this happens, backends are forced to do their own fsync call. This is potentially much worse than when they do a regular write.

The really nasty situation is when the background writer is busy because it's executing a checkpoint. In that case, it's possible for the backend fsync calls to start competing with the ones the background writer is trying to get done, causing the checkpoint sync phase to execute slower than it should. I've seen the sync phase take over 45 minutes on a really busy server once it got into this condition, where hundreds of clients doing their own backend fsync calls were fighting against the checkpoint fsync work. With this patch, you can observe that happening as an upwards spike in pg_stat_bgwriter.buffers_backend_sync, which as documented is an inclusive subset of the total shown in buffers_backend.

While it takes a busier system than I can usefully simulate here to show a really bad situation, I'm able to see some of these unabsorbed backend fsync calls when initializing a pgbench database, to prove they happen in the lab. The attached test program takes as its input a pgbench scale counter. It then creates a pgbench database (deleting any existing pgbench database, so watch out for that) and shows the values accumulated in pg_stat_bgwriter during that period. Here's an example, using the script's default scale of 100 on a server with 8GB of RAM and fake fsync (the hard drives are lying about it):

-[ RECORD 1 ]--------+------------------------------
now                  | 2010-11-14 16:08:41.36421-05
...
Initializing pgbench
-[ RECORD 1 ]--------+------------------------------
now                  | 2010-11-14 16:09:46.713693-05
checkpoints_timed    | 0
checkpoints_req      | 0
buffers_checkpoint   | 0
buffers_clean        | 0
maxwritten_clean     | 0
buffers_backend      | 654716
buffers_backend_sync | 90
buffers_alloc        | 803

This is with default sizing for memory structures. As you increase shared_buffers, one of the queues involved here increases proportionately, making it less likely to run into this problem. That just changes it to the kind of problem I've only seen on a larger system with a difficult to simulate workload. The production system getting hammered with this problem (running a web application) that prompted writing the patch had shared_buffers=4GB at the time.

The patch also adds some logging to the innards involved here, to help with understanding problems in this area. I don't think that should be in the version committed as is. May want to drop the logging level or make it disabled in regular builds, since it is sitting somewhere it generates a lot of log data and adds overhead. It is nice for now, as it lets you get an idea how much fsync work *is* being absorbed by the BGW, as well as showing what relation is suffering from this issue. Example of both those things, with the default config for everything except log_checkpoints (on) and log_min_messages (debug1):

DEBUG: Absorbing 4096 fsync requests
DEBUG: Absorbing 150 fsync requests
DEBUG: Unable to forward fsync request, executing directly
CONTEXT: writing block 158638 of relation base/16385/16398

Here 4096 is the most entries the BGW will ever absorb at once, and all 90 of the missed sync calls are logged so you can see what files they came from. 
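One simple way to watch for this situation is to poll the view and compute what fraction of backend writes also turned into backend syncs. A sketch, using the column name as it appears in this version of the patch (a later revision may rename it buffers_backend_fsync):

SELECT now(),
       buffers_backend,
       buffers_backend_sync,
       round(100.0 * buffers_backend_sync /
             nullif(buffers_backend, 0), 2) AS pct_self_synced
  FROM pg_stat_bgwriter;

Sampled regularly, a spike in that percentage during a checkpoint is the signature of the problem described above.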
As a high-level commentary about this patch, I'm not sure what most end users will ever do with this data. At the same time, I wasn't sure what a typical end user would do with anything else in pg_stat_bgwriter either when it was added, and it turns out the answer is "wait for people who know the internals to write things that monitor it". For example, Magnus has updated recent versions of the Munin plug-in for PostgreSQL to usefully graph pg_stat_bgwriter data over time. As I have some data to suggest checkpoint problems on Linux in particular are getting worse as total system memory increases, I expect that having a way to easily instrument for this particular problem will be correspondingly more important in the future too. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index dbf966b..f701b8f 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -264,8 +264,10 @@ postgres: user database host diff --git a/src/backend/catalog/system_vi
[HACKERS] Spread checkpoint sync
As far as concerns about this 10% setting not doing enough work, which is something I do see, you can always increase how often absorbing happens by decreasing bgwriter_delay now--giving other benefits too. For example, if you run the fsync-stress-v2.sh script I included with the last patch I sent, you'll discover the spread sync version of the server leaves just as many unabsorbed writes behind as the old code did. Those are happening because of periods the background writer is sleeping. They drop as you decrease the delay; here's a table showing some values I tested here, with all three patches installed: bgwriter_delaybuffers_backend_sync 200 ms90 50 ms28 25 ms3 There's a bunch of performance related review work that needs to be done here, in addition to the usual code review for the patch. My hope is that I can get enough of that done to validate this does what it's supposed to on public hardware that a later version of this patch is considered for the next CommitFest. It's a little more raw than I'd like still, but the idea has been tested enough here that I believe it's fundamentally sound and valuable. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c index 43a149e..0ce8e2b 100644 --- a/src/backend/postmaster/bgwriter.c +++ b/src/backend/postmaster/bgwriter.c @@ -143,8 +143,8 @@ typedef struct static BgWriterShmemStruct *BgWriterShmem; -/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */ -#define WRITES_PER_ABSORB 1000 +/* Fraction of fsync absorb queue that needs to be filled before acting */ +#define ABSORB_ACTION_DIVISOR 10 /* * GUC parameters @@ -382,7 +382,7 @@ BackgroundWriterMain(void) /* * Process any requests or signals received recently. */ - AbsorbFsyncRequests(); + AbsorbFsyncRequests(false); if (got_SIGHUP) { @@ -636,7 +636,7 @@ BgWriterNap(void) (ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested)) break; pg_usleep(100L); - AbsorbFsyncRequests(); + AbsorbFsyncRequests(true); udelay -= 100L; } @@ -684,8 +684,6 @@ ImmediateCheckpointRequested(void) void CheckpointWriteDelay(int flags, double progress) { - static int absorb_counter = WRITES_PER_ABSORB; - /* Do nothing if checkpoint is being executed by non-bgwriter process */ if (!am_bg_writer) return; @@ -705,22 +703,65 @@ CheckpointWriteDelay(int flags, double progress) ProcessConfigFile(PGC_SIGHUP); } - AbsorbFsyncRequests(); - absorb_counter = WRITES_PER_ABSORB; + AbsorbFsyncRequests(false); BgBufferSync(); CheckArchiveTimeout(); BgWriterNap(); } - else if (--absorb_counter <= 0) + else { /* - * Absorb pending fsync requests after each WRITES_PER_ABSORB write - * operations even when we don't sleep, to prevent overflow of the - * fsync request queue. + * Check for overflow of the fsync request queue. */ - AbsorbFsyncRequests(); - absorb_counter = WRITES_PER_ABSORB; + AbsorbFsyncRequests(false); + } +} + +/* + * CheckpointSyncDelay -- yield control to bgwriter during a checkpoint + * + * This function is called after each file sync performed by mdsync(). + * It is responsible for keeping the bgwriter's normal activities in + * progress during a long checkpoint. + */ +void +CheckpointSyncDelay(void) +{ + pg_time_t now; + pg_time_t sync_start_time; + int sync_delay_secs; + + /* + * Delay after each sync, in seconds. This could be a parameter. 
But + * since ideally this will be auto-tuning in the near future, not + * assigning it a GUC setting yet. + */ +#define EXTRA_SYNC_DELAY 3 + + /* Do nothing if checkpoint is being executed by non-bgwriter process */ + if (!am_bg_writer) + return; + + sync_start_time = (pg_time_t) time(NULL); + + /* + * Perform the usual bgwriter duties. + */ + for (;;) + { + AbsorbFsyncRequests(false); + BgBufferSync(); + CheckArchiveTimeout(); + BgWriterNap(); + + /* + * Are we there yet? + */ + now = (pg_time_t) time(NULL); + sync_delay_secs = now - sync_start_time; + if (sync_delay_secs >= EXTRA_SYNC_DELAY) + break; } } @@ -1116,16 +1157,41 @@ ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum, * non-bgwriter processes, do nothing if not bgwriter. */ void -AbsorbFsyncRequests(void) +AbsorbFsyncRequests(bool force) { BgWriterRequest *requests = NULL; BgWriterRequest *request; int n; + /* + * Divide the size of the request queue by this to determine when + * absorption action needs to be taken. Default here aims to empty the + * queue whenever 1 / 10 = 10% of it is full. If this isn't good enough, + * you probably need to lower bgwriter_delay, rather than presume + * this needs to be a tunable you can decrease. +
Re: [HACKERS] pg_stat_bgwriter broken?
Tom Lane wrote: If the cause was what I suggested, most likely the other stats views were broken too --- were you able to pass the regression tests in that state? I was having a lot of problems with the system that included regression test issues, and yelped about this one earlier than I normally would because it was blocking work I wanted to get submitted today. Sorry about the noise, and thanks for the info about why this was happening. The way you've explained it now, I see why it's not worth adding a specific regression test for this in the future, and what else I should have checked when I ran into my problem (that other system stats views were also working or not). -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Instrument checkpoint sync calls
Magnus Hagander wrote: I think that it's reasonable for the sort of people who turn log_checkpoints on to also get the sync summary line, thus it being logged at LOG level. In that case, wouldn't it make more sense to add a couple of more fields to the existing line? Easier to post-process the logfile that way... It might. One trade-off is that if you're looking at the sync write detail, the summary comes out in a similar form. And it was easy to put in here--I'd have to return some new data out of the sync phase call in order for that to show up in the main log. If there's general buy-in on the idea, I could do all of that. I personally think that if you're already making an fsync call and have log_checkpoints on, the additional overhead of also timing that fsync is minimal even on platforms where timing is slow (I don't have such a system to test that assumption however)... It sounds like it should be - but that should be possible to measure, no? What I was alluding to is that I know gettimeofday executes fast on my Linux system here, so even if I did measure the overhead and showed it's near zero that doesn't mean it will be so on every platform. The "how long does it take to find out the current time on every supported PostgreSQL platform?" question is one I'd like to have an answer to, but it's hard to collect properly. All I know is that I don't have any system where it's slow to properly test again here. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us
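For anyone on a platform that isn't covered here, a standalone sketch along these lines (not PostgreSQL code; the loop count is arbitrary) gives a rough per-call figure to compare across systems:

#include <stdio.h>
#include <sys/time.h>

int
main(void)
{
	struct timeval start, end, t;
	const long loops = 1000000;
	long i;
	double elapsed_usec;

	/* Time a large number of back-to-back gettimeofday() calls */
	gettimeofday(&start, NULL);
	for (i = 0; i < loops; i++)
		gettimeofday(&t, NULL);
	gettimeofday(&end, NULL);

	elapsed_usec = (end.tv_sec - start.tv_sec) * 1000000.0 +
		(end.tv_usec - start.tv_usec);
	printf("%ld calls, %.1f ns/call\n", loops, 1000.0 * elapsed_usec / loops);
	return 0;
}

That obviously doesn't answer the harder question of how each platform's clock source behaves under load, but it's enough to spot the pathologically slow cases.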
Re: [HACKERS] Count backend self-sync calls
Robert Haas wrote: I think this one could be removed: + if (n > 0) + elog(DEBUG1,"Absorbing %d fsync requests",n); Right; that one is just there to let you know the patch is working, and how much the background writer does for you per pass, mainly for the purpose of reviewing the patch. But if this is generating a lot of log data or adding a lot of overhead, then you have bigger problems anyway: + elog(DEBUG1, "Unable to forward fsync request, executing directly"); The argument against this log line even existing is that if the field is added to pg_stat_bgwriter, that's probably how you want to monitor this data anyway. I don't think there's actually much value to it, which is one reason I didn't worry too much about matching style guidelines, translation, etc. You've found the two things that I think both need to go away before commit, but I left them in because I think they're valuable for testing the patch itself does what it's supposed to. Also, how about referring to this as buffers_backend_fsync consistently throughout, instead of dropping the "f" in some places? I started out touching code that called it just "sync", but then crossed to other code that called it "fsync", and made the external UI use that name. No objections to sorting that out within my patch so it's consistent. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CommitFest 2010-11: Call for Reviewers
Despite some jerk submitting three brand new patches at the last minute by himself, the November CommitFest is now closed for new submissions and ready for review! Only 7 of the 40 patches that need review have a reviewer assigned so far. Reviewers, mark down a patch you want to look at in the CommitFest app: https://commitfest.postgresql.org/action/commitfest_view?id=8 or ask to be assigned one on the pgsql-rrreviewers list, if you want to work on something but don't have a preference for which patch. There are a few larger patches in this round, but the majority of those already have a reviewer attached to them. Many of the remaining patches are pretty approachable even if you're relatively new to reviewing PostgreSQL code. See the usual for details: http://wiki.postgresql.org/wiki/Reviewing_a_Patch http://wiki.postgresql.org/wiki/RRReviewers -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+
Marti Raudsepp wrote: PostgreSQL's default settings change when built with Linux kernel headers 2.6.33 or newer. As discussed on the pgsql-performance list, this causes a significant performance regression: http://archives.postgresql.org/pgsql-performance/2010-10/msg00602.php NB! I am not proposing to change the default -- to the contrary -- this patch restores old behavior. Following our standard community development model, I've put this patch onto our CommitFest list: https://commitfest.postgresql.org/action/patch_view?id=432 and assigned myself as the reviewer. I didn't look at this until now because I already had some patch development and review work to finish before the CommitFest deadline we just crossed. Now I can go back to reviewing other people's work. P.S. There is no pgsql-patch list anymore; everything goes through the hackers list now. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
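For anyone who doesn't want to depend on whichever compiled-in default wins out here, the setting can always be pinned explicitly in postgresql.conf; a minimal example:

# postgresql.conf
wal_sync_method = fdatasync     # instead of relying on the platform default

and "SHOW wal_sync_method;" confirms what a running server is actually using.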
Re: [HACKERS] SSI update
Kevin Grittner wrote: That went in an hour and a half before the CF deadline, but I just didn't feel right putting it into the CF in that shape. Then keep on working on it and we can revisit its state when you're happy with it. The purpose of the CommitFest cut-off is not to block work on long-term development just because a deadline passed; it's to make sure patches which might otherwise never get a review are looked at eventually. Finding a reviewer for these larger and more complicated patches is a very different sort of job than finding one for an average patch anyway. I could use a brief reminder of how this bit fits into the "serializable lock consistency" patch that's already sitting in the CF queue as "Ready for Committer" though. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Instrument checkpoint sync calls
So my task list is: 0) Rebase against HEAD, which just had code related to this touched today 1) Assume that log_checkpoints is sufficient control over whether the timing overhead added is worth collecting, and therefore remove the half-baked idea of also wrapping with a compile-time option. 2) Have the sync summary returned upwards, so it can be put onto the same line as the rest of the log_checkpoint info. All seems reasonable to me. Will rev a new patch by tomorrow. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Count backend self-sync calls
Jeff Janes wrote: Do you know where this competition is happening? Is it on the platters, or is it in the hard drive write cache (I thought high-end hardware had tagged writes to avoid that), or in the kernel? Kernel. Linux systems with lots of memory will happily queue up gigabytes of memory in their write cache, only getting serious about writing it out to disk when demanded to by fsync. This makes sense if we just need to append to a queue. But once the queue is full and we are about to do a backend fsync, might it make sense to do a little more work to look for dups? One of the paths I'd like to follow is experimenting with both sorting writes by file and looking for duplication in the queues. I think a basic, simple sync spreading approach needs to get finished first though; this sort of thing would then be an optimization on top of it. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
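To illustrate the duplicate-collapsing idea being discussed--this is only a toy sketch, not the backend's actual fsync request queue code, and the struct here is a stand-in for the real key--compacting a full queue could look like:

#include <stdio.h>

typedef struct
{
    unsigned    rel;            /* stand-in for the real relation identifier */
    unsigned    segno;          /* segment number within that relation */
} FsyncRequest;

/* Drop exact duplicates, keeping the first occurrence of each pair. */
static int
compact_fsync_queue(FsyncRequest *req, int n)
{
    int         i, j, kept = 0;

    for (i = 0; i < n; i++)
    {
        int         dup = 0;

        for (j = 0; j < kept; j++)
        {
            if (req[j].rel == req[i].rel && req[j].segno == req[i].segno)
            {
                dup = 1;
                break;
            }
        }
        if (!dup)
            req[kept++] = req[i];
    }
    return kept;                /* new, possibly shorter, queue length */
}

int
main(void)
{
    FsyncRequest queue[] = {{1, 0}, {1, 0}, {2, 3}, {1, 0}, {2, 3}};
    int         n = compact_fsync_queue(queue, 5);

    printf("queue compacted from 5 to %d entries\n", n);
    return 0;
}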
[HACKERS] Changes to Linux OOM killer in 2.6.36
Last month's new Linux kernel 2.6.36 includes a rewrite of the out of memory killer: http://lwn.net/Articles/391222/ http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a63d83f427fbce97a6cea0db2e64b0eb8435cd10 The new "badness" method totals the task's RSS and swap as a percentage of RAM, where the old one scored starting with the total memory used by the process. I *think* that this is an improvement for PostgreSQL, based on the sort of data I see with: ps -o pid,rss,size,vsize,args -C postgres But I haven't tested with one of the new kernels yet to be sure. Something to look at next time I get in that bleeding edge kernel kind of mood. One thing that's definitely changed is the interface used to control turning off the OOM killer. There's a backward compatibility translation right now that maps the current "-17" value the PostgreSQL code sends to /proc/<pid>/oom_adj into the new units scale. However, oom_adj is deprecated, scheduled for removal in August 2010: http://www.mjmwired.net/kernel/Documentation/feature-removal-schedule.txt So eventually, if the OOM disabling code is still necessary in PostgreSQL, it will need to do this sort of thing instead: echo -1000 > /proc/<pid>/oom_score_adj I've seen kernel stuff get deprecated before the published timeline for code-related reasons (when the compatibility bits were judged too obtrusive to keep around anymore), but since this translation bit is only a few lines of code I wouldn't expect that to happen here. I don't think it's worth doing anything to the database code until tests on the newer kernel confirm whether this whole thing is even necessary anymore. Wanted to pass along the info while I was staring at it though. Thanks to Daniel Farina for pointing this out. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
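As a rough example of what an init script can do today to protect just the postmaster under either interface--the data directory path here is only an assumption, and the first line of postmaster.pid is the postmaster PID:

PGPID=$(head -1 /var/lib/pgsql/data/postmaster.pid)
if [ -e /proc/$PGPID/oom_score_adj ]; then
    echo -1000 > /proc/$PGPID/oom_score_adj    # 2.6.36+ interface
else
    echo -17 > /proc/$PGPID/oom_adj            # older, now-deprecated interface
fi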
Re: [HACKERS] Improving prep_buildtree used in VPATH builds
Gurjeet Singh wrote: Seems like it would improve performance in general, but more so under load conditions when you actually need it. I am not sure if -depth option is supported by all incarnations of 'find'. Given the way directory writes are cached by the filesystem, I'm not sure why the load at the time matters so much. If what you mean by that is you're low on memory, that would make more sense to me. Anyway, "-depth" is in POSIX as of 2001. It seems to be in all the major SysV UNIX variants going back further than that, and of course it's in GNU find. But it looks like BSD-derived systems from before that POSIX standard originally called this "-d" instead. So there's some potential for this to not work on older systems; it works fine on my test FreeBSD 7.2 system. Maybe someone who has access to some ancient BSD-ish system can try this out? The simplest test case similar to what the patch adds seems to be if this runs, returning subdirectories in depth-first order before their parent: $ find / -depth -type d -print If that fails somewhere, it may turn out to require another configure check just to determine whether you can use this configuration time optimization. That's certainly possible to add to the patch if it got committed and turns out to break one of the buildfarm members. It seems to me like this whole thing may be a bit more work than it's worth, given this is a fairly small and difficult to reproduce speedup in only one stage of the build process. I'd think that if configure takes longer than it has to because the system is heavily loaded, the amount that compilation time suffers from the same load would always dwarf this component of total build time. But if this was slow enough at some point to motivate you to write a patch for it, maybe that assumption is wrong. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Changes to Linux OOM killer in 2.6.36
Kevin Grittner wrote: Greg Smith wrote: oom_adj is deprecated, scheduled for removal in August 2010: That surprised me so I checked the URL. I believe you have a typo there and it's August, 2012. This is why I include references, so that when the cold medicine hits me in the middle of proofreading my message and I send it anyway you aren't misled. Yes, 2012, only a few months before doomsday. The approaching end of the world then means any bugs left can be marked WONTFIX. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us
Re: [HACKERS] Spread checkpoint sync
Jeff Janes wrote: And for very large memory systems, even 1% may be too much to cache (dirty*_ratio can only be set in integer percent points), so recent kernels introduced dirty*_bytes parameters. I like these better because they do what they say. With the dirty*_ratio, I could never figure out what it was a ratio of, and the results were unpredictable without extensive experimentation. Right, you can't set dirty_background_ratio low enough to make this problem go away. Even attempts to set it to 1%, back when that was the right size for it, seem to be defeated by other mechanisms within the kernel. Last time I looked at the related source code, it seemed the "congestion control" logic that kicks in to throttle writes was a likely suspect. This is why I'm not real optimistic about newer mechanisms like dirty_background_bytes, added in 2.6.29, helping here, as that just gives a mapping for setting lower values; the same basic logic is under the hood. Like Jeff, I've never seen dirty_expire_centisecs help at all, possibly due to the same congestion mechanism. Yes, but how much work do we want to put into redoing the checkpoint logic so that the sysadmin on a particular OS and configuration and FS can avoid having to change the kernel parameters away from their defaults? (Assuming of course I am correctly understanding the problem, always a dangerous assumption.) I've been trying to make this problem go away using just the kernel tunables available since 2006. I adjusted them carefully on the server that ran into this problem so badly that it motivated the submitted patch, months before this issue got bad. It didn't help. Maybe if they were running a later kernel that supported dirty_background_bytes that would have worked better. During the last few years, the only thing that has consistently helped in every case is the checkpoint spreading logic that went into 8.3. I no longer expect that the kernel developers will ever make this problem go away the way checkpoints are written out right now, whereas the last good PostgreSQL work in this area definitely helped. The basic premise of the current checkpoint code is that if you write all of the buffers out early enough, by the time syncs execute enough of the data should have gone out that those don't take very long to process. That was usually true for the last few years, on systems with a battery-backed cache; the amount of memory cached by the OS was relatively small compared to the RAID cache size. That's not the case anymore, and that divergence is growing bigger. The idea that the checkpoint sync code can run in a relatively tight loop, without stopping to do the normal background writer cleanup work, is also busted by that observation. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
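For anyone who wants to experiment with the newer knobs anyway, on a kernel that has them they're set like any other VM tunable; the sizes here are examples rather than recommendations:

sysctl -w vm.dirty_background_bytes=268435456   # 256MB before background writeback starts
sysctl -w vm.dirty_bytes=1073741824             # 1GB before writers are forced to block

(Setting the _bytes form of either parameter zeroes out its corresponding _ratio setting.)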
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: Doing all the writes and then all the fsyncs meets this requirement trivially, but I'm not so sure that's a good idea. For example, given files F1 ... Fn with dirty pages needing checkpoint writes, we could do the following: first, do any pending fsyncs for files not among F1 .. Fn; then, write all pages for F1 and fsync, write all pages for F2 and fsync, write all pages for F3 and fsync, etc. This might seem dumb because we're not really giving the OS a chance to write anything out before we fsync, but think about the ext3 case where the whole filesystem cache gets flushed anyway. I'm not horribly interested in optimizing for the ext3 case per se, as I consider that filesystem fundamentally broken from the perspective of its ability to deliver low-latency here. I wouldn't want a patch that improved behavior on filesystems with granular fsync to make the ext3 situation worse. That's as much as I'd want the design to lean toward considering its quirks. Jeff Janes made a case downthread for "why not make it the admin/OS's job to worry about this?" In cases where there is a reasonable solution available, in the form of "switch to XFS or ext4", I'm happy to take that approach. Let me throw some numbers out to give a better idea of the shape and magnitude of the problem case I've been working on here. In the situation that leads to the near hour-long sync phase I've seen, checkpoints will start with about a 3GB backlog of data in the kernel write cache to deal with. That's about 4% of RAM, just under the 5% threshold set by dirty_background_ratio. Whether or not the 256MB write cache on the controller is also filled is a relatively minor detail I can't monitor easily. The checkpoint itself? <250MB each time. This proportion is why I didn't think to follow the alternate path of worrying about spacing the write and fsync calls out differently. I shrunk shared_buffers down to make the actual checkpoints smaller, which helped to some degree; that's what got them down to smaller than the RAID cache size. But the amount of data cached by the operating system is the real driver of total sync time here. Whether or not you include all of the writes from the checkpoint itself before you start calling fsync didn't actually matter very much; in the case I've been chasing, those are getting cached anyway. The write storm from the fsync calls themselves forcing things out seems to be the driver on I/O spikes, which is why I started with spacing those out. Writes go out at a rate of around 5MB/s, so clearing the 3GB backlog takes a minimum of 10 minutes of real time. There are about 300 1GB relation files involved in the case I've been chasing. This is where the 3 second delay number came from; 300 files, 3 seconds each, 900 seconds = 15 minutes of sync spread. You can turn that math around to figure out how much delay per relation you can afford while still keeping checkpoints to a planned end time, which isn't done in the patch I submitted yet. Ultimately what I want to do here is some sort of smarter write-behind sync operation, perhaps with an LRU on relations with pending fsync requests. The idea would be to sync relations that haven't been touched in a while even in advance of the checkpoint. I think that's similar to the general idea Robert is suggesting here, to get some sync calls flowing before all of the checkpoint writes have happened. 
I think that the final sync calls will need to get spread out regardless, and since doing that requires a fairly small amount of code too that's why we started with that. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
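To make the arithmetic above explicit, the calculation a time-driven version of this would need is roughly the following (a sketch only, not code from the patch):

    sync_budget    = checkpoint_timeout * checkpoint_completion_target - write_phase_time
    delay_per_file = sync_budget / files_with_pending_fsyncs

    e.g. 900 seconds of budget / 300 relation files = 3 seconds between fsync calls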
[HACKERS] Single client performance on trivial SELECTs
This week several list regulars here waded into the MySQL Convention. I decided to revisit PostgreSQL vs. MySQL performance using the sysbench program as part of that. It's not important for what I'm going to describe to understand exactly what statements sysbench runs here or how to use it, but if anyone is curious I've got some more details about how I ran the tests in my talk slides at http://projects.2ndquadrant.com/talks The program has recently gone through some fixes that make it run a bit better both in general and against PostgreSQL. The write tests are still broken against PostgreSQL, but it now seems to do a reasonable job simulating a simple SELECT-only workload. A fix from Jignesh recently made its way into the database generation side of the code that makes it less tedious to test with it too. The interesting part was how per-client scaling compared between the two databases; graph attached. On my 8 core server, PostgreSQL scales nicely up to a steady 50K TPS. I see the same curve, almost identical numbers, with PostgreSQL and pgbench--no reason to suspect sysbench is doing anything shady. The version of MySQL I used hits around 67K TPS with InnoDB when busy with lots of clients. That part doesn't bother me; nobody expects PostgreSQL to be faster on trivial SELECT statements and the gap isn't that big. The shocking part was the single client results. I'm used to seeing Postgres get around 7K TPS per core on those, which was the case here, and I never considered that an interesting limitation to think about before. MySQL turns out to hit 38K TPS doing the same work. Now that's a gap interesting enough to make me wonder what's going on. Easy enough to exercise the same sort of single client test case with pgbench and put it under a profiler:

sudo opcontrol --init
sudo opcontrol --setup --no-vmlinux
createdb pgbench
pgbench -i -s 10 pgbench
psql -d pgbench -c "vacuum"
sudo opcontrol --start
sudo opcontrol --reset
pgbench -S -n -c 1 -T 60 pgbench
sudo opcontrol --dump ; sudo opcontrol --shutdown
opreport -l image:$HOME/pgwork/inst/test/bin/postgres

Here are the top calls, from my laptop rather than the server that I generated the graph against. It does around 5.5K TPS with 1 client and 10K with 2 clients, so same basic scaling:

samples  %        image name  symbol name
53548     6.7609  postgres    AllocSetAlloc
32787     4.1396  postgres    MemoryContextAllocZeroAligned
26330     3.3244  postgres    base_yyparse
21723     2.7427  postgres    hash_search_with_hash_value
20831     2.6301  postgres    SearchCatCache
19094     2.4108  postgres    hash_seq_search
18402     2.3234  postgres    hash_any
15975     2.0170  postgres    AllocSetFreeIndex
14205     1.7935  postgres    _bt_compare
13370     1.6881  postgres    core_yylex
10455     1.3200  postgres    MemoryContextAlloc
10330     1.3042  postgres    LockAcquireExtended
10197     1.2875  postgres    ScanKeywordLookup
9312      1.1757  postgres    MemoryContextAllocZero

I don't know nearly enough about the memory allocator to comment on whether it's possible to optimize it better for this case to relieve any bottleneck. Might just get a small gain then push the limiter to the parser or hash functions. I was surprised to find that's where so much of the time was going though. P.S. When showing this graph in my talk, I pointed out that anyone who is making decisions about which database to use based on trivial SELECTs on small databases isn't going to be choosing between PostgreSQL and MySQL anyway--they'll be deploying something like MongoDB instead if that's the important metric. 
-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Single client performance on trivial SELECTs
Heikki Linnakangas wrote: In this case you could just use prepared statements and get rid of all the parser related overhead, which includes much of the allocations. Trying that gives me around 9200 TPS instead of 5500 on my laptop, so a pretty big gain here. Will have to include that in my next round of graphs across multiple client loads once I'm home again and can run easily on my server. To provide a matching profile from the same system as the one I already submitted from, for archival sake, here's what the profile I get looks like with "-M prepared":

samples  %        image name  symbol name
33093     4.8518  postgres    AllocSetAlloc
30012     4.4001  postgres    hash_seq_search
27149     3.9803  postgres    MemoryContextAllocZeroAligned
26987     3.9566  postgres    hash_search_with_hash_value
25665     3.7628  postgres    hash_any
16820     2.4660  postgres    _bt_compare
14778     2.1666  postgres    LockAcquireExtended
12263     1.7979  postgres    AllocSetFreeIndex
11727     1.7193  postgres    tas
11602     1.7010  postgres    SearchCatCache
11022     1.6159  postgres    pg_encoding_mbcliplen
10963     1.6073  postgres    MemoryContextAllocZero
9296      1.3629  postgres    MemoryContextCreate
8368      1.2268  postgres    fmgr_isbuiltin
7973      1.1689  postgres    LockReleaseAll
7423      1.0883  postgres    ExecInitExpr
7309      1.0716  postgres    pfree

-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
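For anyone who wants to reproduce that, the only change needed to the pgbench call shown upthread is switching its query mode:

pgbench -S -n -M prepared -c 1 -T 60 pgbench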
Re: [HACKERS] Single client performance on trivial SELECTs
I did some more research into the memory allocation hotspots found in the profile, in regards to what MySQL has done in those areas. (And by "research" I mean "went to the bar last night and asked Baron about it") The issue of memory allocation being a limiter on performance has been nagging that community for long enough that the underlying malloc used can even be swapped with an LD_PRELOAD trick: http://dev.mysql.com/doc/refman/5.5/en/mysqld-safe.html#option_mysqld_safe_malloc-lib Plenty of people have benchmarked that and seen a big difference between the implementations; some sample graphs: http://locklessinc.com/articles/mysql_performance/ http://blogs.sun.com/timc/entry/mysql_5_1_memory_allocator http://mysqlha.blogspot.com/2009/01/double-sysbench-throughput-with_18.html To quote from the last of those, "Malloc is a bottleneck for sysbench OLTP readonly", so this problem is not unique to PostgreSQL. As of 5.5 the better builds are all defaulting to TCMalloc, which is interesting but probably not as useful because it's focused on improving multi-threaded performance: http://goog-perftools.sourceforge.net/doc/tcmalloc.html I'm not sure exactly what is useful to be learned from that specific work. But it does suggest two things: one, this is far from an easy thing to fix. Two, the only reason MySQL does so well on it is because there was some focused work on it, taking quite a while to accomplish, and involving many people. Doing better for PostgreSQL is something I see as more of a long-term goal, rather than something it would be reasonable to expect quick progress on. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
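The same sort of allocator swap can be tried against PostgreSQL without any code changes, by preloading a different malloc when starting the server; a sketch, with the library name and path being whatever your distribution happens to install:

LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so pg_ctl -D $PGDATA -l logfile start

Whether that would move the numbers much is doubtful, though, since most of the profiled time is in palloc's own bookkeeping sitting above malloc rather than in malloc itself.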
Re: [HACKERS] MMAP Buffers
Joshua Berkus wrote: Guys, can we *please* focus on the patch for now, rather than the formatting, which is fixable with sed? Never, and that's not true. Heikki was being nice; I wouldn't have even slogged through it long enough to ask the questions he did before kicking it back as unusable. A badly formatted patch makes it impossible to evaluate whether the changes from a submission are reasonable or not without the reviewer fixing it first. And you can't automate correcting it, it takes a lot of tedious manual work. Start doing a patch review every CommitFest cycle and you very quickly realize it's not an ignorable problem. And lack of discipline in minimizing one's diff is always a sign of other code quality issues. Potential contributors to PostgreSQL should know that a badly formatted patch faces an automatic rejection, because no reviewer can work with it easily. This fact is not a mystery; in fact it's documented at http://wiki.postgresql.org/wiki/Submitting_a_Patch : "The easiest way to get your patch rejected is to make lots of unrelated changes, like reformatting lines, correcting comments you felt were poorly worded etc. Each patch should have the minimum set of changes required to fulfil the single stated objective." I think I'll go improve that text next--something like "Ways to get your patch rejected" should be its own section. The problem here isn't whether someone used an IDE or not, it's that this proves they didn't read their own patch before submitting it. Reading one's own diff and reflecting on what you've changed is one of the extremely underappreciated practices of good open-source software development. Minimizing the size of that diff is perhaps the most important thing someone can do in order to make their changes to a piece of software better. Not saying something that leads in that direction would be a disservice to the submitter. P.S. You know what else I feel should earn an automatic rejection without any reviewer even looking at the code? Submitting a patch that claims to improve performance and not attaching the test case you used, along with detailed notes about before/after tests on your own hardware. A hand wave "it's faster" is never good enough, and it's extremely wasteful of our limited reviewer resources to try and duplicate what the submitter claimed. Going to add something about that to the submission guidelines too. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MMAP Buffers
Tom Lane wrote: * On the other side of the coin, I have seen many a patch that was written to minimize the length of the diff to the detriment of readability or maintainability of the resulting code, and that's *not* a good tradeoff. Sure, that's possible. But based on the reviews I've done, I'd say that the fact someone is even aware that minimizing their diff is something important to consider automatically puts them far ahead of the average new submitter. There are a high percentage of patches where the submitter generates a diff and sends it without even looking at it. That a person would look at their diff and go too far in trying to make it small doesn't happen nearly as much. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MMAP Buffers
Robert Haas wrote: The OP says that this patch maintains the WAL-before-data rule without any explanation of how it accomplishes that seemingly quite amazing feat. I assume I'm going to have to read this patch at some point to refute this assertion, and I think that sucks. I don't think you have to read any patch that doesn't follow the submission guidelines. The fact that you do is a great contribution to the community. But if I were suggesting how your time would be best spent improving PostgreSQL, "reviewing patches that don't meet coding standards" would be at the bottom of the list. There's always something better for the project you could be working on instead. I just added http://wiki.postgresql.org/wiki/Submitting_a_Patch#Reasons_your_patch_might_be_returned , recycling some existing text, adding some new suggestions. I hope I got the tone of that text right. The intention was to have a polite but clear place to point submitters to when their suggestion doesn't meet the normal standards here, such that they might even get bounced before even entering normal CommitFest review. This MMAP patch looks like it has all 5 of the problems mentioned on that now more focused list. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Formatting Curmudgeons WAS: MMAP Buffers
Joshua Berkus wrote: Then you can say that politely and firmly with direct reference to the problem, rather than making the submitter feel bad. That's exactly what happened. And then you responded that it was possible to use a patch without fixing the formatting first. That's not true, and those of us who do patch review are tired of even trying. Our project has an earned reputation for being rejection-happy curmudgeons. This is something I heard more than once at MySQLConf, including from one student who chose to work on Drizzle instead of PostgreSQL for that reason. I think that we could stand to go out of our way to be helpful to first-time submitters. I'll trade you anecdotes by pointing out that I heard from half a dozen business people that the heavy emphasis on quality control and standards was the reason they were looking into leaving MySQL-derived distributions for PostgreSQL. I've spent days of time working on documentation to help new submitters get their patches improved to where they meet this community's standards. This thread just inspired another round of that. What doesn't help is ever telling someone they can ignore those and still do something useful we're interested in. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MMAP Buffers
Radosław Smogura wrote: Yes, but, hmm... in Netbeans I had really long gaps (probably 8 spaces, from tabs), so deeper "ifs", comments at the and of variables, went of out my screen. I really wanted to not format this, but sometimes I needed. The guide at http://www.open-source-editor.com/editors/how-to-make-netbeans-use-tabs-for-indention.html seems to cover how to fix this in Netbeans. You want it to look like that screen shot: 4 spaces per indent with matching tab size of 4, and "Expand Tabs to Spaces" unchecked. Generally, if you look at the diff you've created, and your new code doesn't line up right with what's already there, that means the tab/space setup isn't quite right when you were editing. Reading the diff is useful for catching all sorts of other issues, too, so it's just generally a good practice. As Peter already mentioned, the big problem here is that you checked in a modified configure file. I also note that you use C++ style "//" comments, which aren't allowed under the coding guidelines--even though they work fine on many common platforms. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Formatting Curmudgeons WAS: MMAP Buffers
Andrew Dunstan wrote: What makes you think this isn't possible to run pgindent? There are no secret incantations. The first hit newbies find looking for info about pgindent is http://blog.hagander.net/archives/185-pgindent-vs-dash.html which sure looks like secret incantations to me. The documentation src/tools/pgindent/README reads like a magic spell too:

find . -name '*.[ch]' -type f -print | \
egrep -v -f src/tools/pgindent/exclude_file_patterns | \
xargs -n100 pgindent src/tools/pgindent/typedefs.list

And it doesn't actually work as written unless you've installed pgindent, entab/detab, and the specially patched NetBSD indent into the system PATH somewhere--unreasonable given that this may be executing on a source only tree that has never been installed. The fact that the documentation is only in the README and not with the rest of the code conventions isn't helping either. The last time I tried to do this a few years ago I failed miserably and never came back. I know way more about building software now though, and just got this to work for the first time. Attached is a WIP wrapper script for running pgindent that builds all the requirements into temporary directories, rather than expecting you to install anything system-wide or into a PostgreSQL destination directory. Drop this into src/tools/pgindent, make it executable, and run it from that directory. Should do the right thing on any system that has "make" as an alias for "gmake" (TODO to be better about that in the file, with some other nagging things). When I just ran it against master I got a bunch of modified files, but most of them look like things that have been touched recently so I think it did the right thing. A test of my work here from someone who isn't running this for the first time would be helpful. If this works well enough, I think it would make a good helper script to include in the distribution. The loose ends to fix I can take care of easily enough once basic validation is finished. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

#!/bin/bash -ex

# Presume that we're in the pgindent directory at the start.
# TODO this should switch to the directory where this script is located at,
# in case someone calls src/tools/pgindent/run-pgindent

# Back to the base directory
pushd ../../..

# Sanity check we're in the right place
if [ ! -d src/tools/pgindent ] ; then
    echo run-pgindent can only be run from within the src/tools/pgindent directory, aborting
    popd
    exit 1
fi

echo pgindent setting up environment
wget ftp://ftp.postgresql.org/pub/dev/indent.netbsd.patched.tgz
mkdir -p indent
cd indent
zcat ../indent.netbsd.patched.tgz | tar xvf -
rm -f ../indent.netbsd.patched.tgz
make
INDENT_DIR=`pwd`
cd ..

pushd src/tools/entab
make
ln -s entab detab
ENTAB_DIR=`pwd`
popd

export PATH="$INDENT_DIR:$ENTAB_DIR:$PATH"

wget -O src/tools/pgindent/typedefs.list http://buildfarm.postgresql.org/cgi-bin/typedefs.pl

# This cleanup can only happen if there is already a makefile; assume
# that if there isn't, this tree is clean enough
if [ -f GNUmakefile ] ; then
    # TODO this may need to be "gmake" on some systems instead
    make maintainer-clean
fi

echo pgindent starting run
find . -name '*.[ch]' -type f -print | \
    egrep -v -f src/tools/pgindent/exclude_file_patterns | \
    xargs -n100 src/tools/pgindent/pgindent src/tools/pgindent/typedefs.list

# Cleanup of utilities built temporarily here
unlink src/tools/entab/detab
rm -rf indent

popd
echo pgindent run complete

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Formatting Curmudgeons WAS: MMAP Buffers
Robert Haas wrote: But it turns out that it doesn't really matter. Whitespace or no whitespace, if you don't read the diff before you hit send, it's likely to contain some irrelevant cruft, whether whitespace changes or otherwise. Right. Presuming that pgindent will actually solve anything leaps over two normally incorrect assumptions: -That the main tree was already formatted with pgindent before you started, so no stray diffs will result from it touching things the submitter isn't even involved in. -That there are no larger code formatting or diff issues except for spacing. This has been a nagging loose end for a while, so I'd like to see pgindent's rough edges get sorted out so it's easier to use. But whitespace errors because of bad editors are normally just a likely sign of a patch with bigger problems, rather than something that can get fixed and then the submission is good. There is no substitute for the discipline of reading your own diff before submission. I'll easily obsess over mine for an hour before I submit something major, and that time is always well spent. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Formatting Curmudgeons WAS: MMAP Buffers
Andrew Dunstan wrote: Now we could certainly make this quite a bit slicker. Apart from anything else, we should change the indent source code tarball so it unpacks into its own directory. Having it unpack into the current directory is ugly and unfriendly. And we should get rid of the "make clean" line in the install target of entab's makefile, which just seems totally ill-conceived. I think the script I submitted upthread has most of the additional slickness needed here. Looks like we both were working on documenting a reasonable way to do this at the same time the other day. The idea of any program here relying on being able to write to /usr/local/bin as your example did makes this harder for people to run; that's why I made everything in the build tree and just pushed the appropriate directories into the PATH. Since I see providing a script to automate this whole thing as the preferred way to make this easier, re-packaging the indent source tarball to extract to a directory doesn't seem worth the backwards compatibility trouble it will introduce. Improving the entab makefile I don't have an opinion on. It might also be worth setting it up so that instead of having to pass a path to a typedefs file on the command line, we default to a file sitting in, say, /usr/local/etc. Then you'd just be able to say "pgindent my_file.c". OK, so I need to update my script to handle either indenting a single file, or doing all of them. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Formatting Curmudgeons WAS: MMAP Buffers
On 04/21/2011 12:39 PM, Robert Haas wrote: In fact, I've been wondering if we shouldn't consider extending the support window for 8.2 past the currently-planned December 2011. There seem to be quite a lot of people running that release precisely because the casting changes in 8.3 were so painful, and I think the incremental effort on our part to extend support for another year would be reasonably small. The pending EOL for 8.2 is the only thing that keeps me sane when speaking with people who refuse to upgrade, yet complain that their 8.2 install is slow. This last month, that seems to be more than the usual number of "why does autovacuum suck so much?" complaints that would all go away with an 8.3 upgrade. Extending the EOL is not doing any of these users a favor. Every day that goes by when someone is on a version of PostgreSQL that won't ever allow in-place upgrade is just making the eventual dump and reload they face worse. The time spent porting to 8.3 is a one-time thing; the suffering you get trying to have a 2011-sized database on 2006's 8.2 just keeps adding up the longer you postpone it. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Formatting Curmudgeons WAS: MMAP Buffers
're not participating in the thing that needs the most help. This is not a problem you make better with fuzzy management directives to be nicer to people. There are real software engineering issues about how to ensure good code quality at its core. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync reliability
On 04/21/2011 04:26 AM, Simon Riggs wrote: However, that begs the question of what happens with WAL. At present, we do nothing to ensure that "the entry in the directory containing the file has also reached disk". Well, we do, but it's not obvious why that is unless you've stared at this for far too many hours. A clear description of the possible issue you and Dan are raising showed up on LKML a few years ago: http://lwn.net/Articles/270891/ Here's the most relevant part, which directly addresses the WAL case: "[fsync] is unsafe for write-ahead logging, because it doesn't really guarantee any _ordering_ for the writes at the hard storage level. So aside from losing committed data, it can also corrupt structural metadata. With ext3 it's quite easy to verify that fsync/fdatasync don't always write a journal entry. (Apart from looking at the kernel code :-) Just write some data, fsync(), and observe the number of writes in /proc/diskstats. If the current mtime second _hasn't_ changed, the inode isn't written. If you write data, say, 10 times a second to the same place followed by fsync(), you'll see a little more than 10 write I/Os, and less than 20." There's a terrible hack suggested where you run fchmod to force the journal out in the next fsync that makes me want to track the poster down and shoot him, but this part raises a reasonable question. The main issue he's complaining about here is a moot one for PostgreSQL. If the WAL rewrites have been reordered but have not completed, the minute WAL replay hits the spot with a missing block the CRC32 will be busted and replay is finished. The fact that he's assuming a database would have such a naive WAL implementation that it would corrupt the database if blocks are written out of order in between fsync calls returning is one of the reasons this whole idea never got more traction--hard to get excited about a proposal whose fundamentals rest on an assumption that doesn't turn out to be true on real databases. There's still the "fsync'd a data block but not the directory entry yet" issue as fall-out from this too. Why doesn't PostgreSQL run into this problem? Because the exact code sequence used is this one:

open
write
fsync
close

And Linux shouldn't ever screw that up, or the similar rename path. Here's what the close man page says, from http://linux.die.net/man/2/close : "A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a filesystem to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored use fsync(2). (It will depend on the disk hardware at this point.)" What this is alluding to is that if you fsync before closing, the close will write all the metadata out too. You're busted if your write cache lies, but we already know all about that issue. There was a discussion of issues around this on LKML a few years ago, with Alan Cox getting the good pull quote at http://lkml.org/lkml/2009/3/27/268 : "fsync/close() as a pair allows the user to correctly indicate their requirements." While fsync doesn't guarantee that metadata is written out, and neither does close, kernel developers seem to all agree that fsync-before-close means you want everything on disk. Filesystems that don't honor that will break all sorts of software. It is of course possible there are bugs in some part of this code path, where a clever enough test case might expose a window of strange file/metadata ordering. 
I think it's too weak of a theorized problem to go specifically chasing after though. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
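The write path being described boils down to the following pattern; a minimal standalone sketch (error handling trimmed, and obviously not the actual xlog.c code):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    const char *buf = "some WAL-ish data\n";
    int         fd = open("testfile", O_WRONLY | O_CREAT, 0600);

    if (fd < 0)
        return 1;
    if (write(fd, buf, strlen(buf)) < 0)    /* write the data... */
        return 1;
    if (fsync(fd) != 0)                     /* ...force it out before close... */
        return 1;
    if (close(fd) != 0)                     /* ...so close finds everything already flushed */
        return 1;
    printf("write/fsync/close sequence completed\n");
    return 0;
}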
Re: [HACKERS] pgbench \for or similar loop
Alvaro Herrera wrote: Why do we have pgbench at all in the first place? Surely we could rewrite it in plpgsql with proper stored procedures. pgbench gives you a driver program with the following useful properties: 1) Multiple processes are spawned and each gets its own connection 2) A time/transaction limit is enforced across all of the connections at once 3) Timing information is written to a client-side log file 4) The work of running the clients can happen on a remote system, so that it's possible to just test the server-side performance 5) The program is similar enough to any other regular client, using the standard libpq interface, that connection-related overhead should be similar to a real workload. All of those have some challenges before you could duplicate them in a stored procedure context. My opinion of this feature is similar to the one Aiden already expressed: there are already so many ways to do this sort of thing using shell-oriented approaches (as well as generate_series) that it's hard to get too excited about implementing it directly in pgbench. Part of the reason for adding the \shell and \setshell commands was to make tricky things like this possible without having to touch the pgbench code further. I for example would solve the problem you're facing like this: 1) Write a shell script that generates the file I need 2) Call it from pgbench using \shell, passing the size it needs. Have that write a delimited file with the data required. 3) Import the whole thing with COPY, as sketched below. And next thing you know you've even got the CREATE/COPY optimization as a possibility to avoid WAL, as well as the ability to avoid creating the data file more than once if the script is smart enough. Sample data file generation can be difficult; most of the time I'd rather solve that in a general programming language. The fact that simple generation cases could be done with the mechanism you propose is true. However, this only really helps cases that are too complicated to express with generate_series, yet not so complicated that you really want a full programming language to generate the data. I don't think there's that much middle ground in that use case. But if this is what you think makes your life easier, I'm not going to tell you you're wrong. And I don't feel that your desire for this feature means you must tackle a more complicated thing instead--even though what I personally would much prefer is something making this sort of thing easier to do in regression tests, too. That's a harder problem, though, and you're only volunteering to solve an easier one than that. Stepping aside from debate over usefulness, my main code concern is that each time I look at the pgbench code for yet another tacked on bit, it's getting increasingly creakier and harder to maintain. It's never going to be a good benchmark driver program capable of really complicated tasks. And making it try keeps piling on the risk of breaking it for its intended purpose of doing simple tests. If you can figure out how to keep the code contortions to implement the feature under control, there's some benefit there. I can't think of a unique reason for it; again, lots of ways to solve this already. But I'd probably use it if it were there. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
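As a concrete sketch of that workflow--the script name, table, and file path here are all made up--a custom file run once via "pgbench -n -t 1 -f load_data.sql" could contain:

\shell ./generate_data.sh :scale /tmp/generated_data.csv
COPY test_data FROM '/tmp/generated_data.csv' WITH CSV;

The COPY uses a server-side path, so it has to run as a superuser with the file somewhere the backend can read.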
Re: [HACKERS] pgbench \for or similar loop
Kevin Grittner wrote: I'm not clear on exactly what you're proposing there, but the thing I've considered doing is having threads to try to keep a FIFO queue populated with a configurable transaction mix, while a configurable number of worker threads pull those transactions off the queue and... This is like the beginning of an advertisement for how Tsung is useful for simulating complicated workloads. The thought of growing pgbench to reach that level of capabilities makes my head hurt. When faced with this same issue, the sysbench team decided to embed Lua as their scripting language; sample scripts: http://bazaar.launchpad.net/~sysbench-developers/sysbench/0.5/files/head:/sysbench/tests/db/ This isn't very well known because the whole MySQL community fracturing has impacted their ability to actually release this overhaul--seems like they spend all their time just trying to add support for each new engine of the month. I don't even like Lua, yet this still seems like a much better idea than trying to build on top of the existing pgbench codebase. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Formatting Curmudgeons WAS: MMAP Buffers
Andrew Dunstan wrote: Personally, I want pgindent installed in /usr/local/ or similar. That way I can have multiple trees and it will work in all of them without my having to build it for each. What I don't want is for the installed patched BSD indent to conflict with the system's indent, which is why I renamed it. If you still think that's a barrier to easy use, then I think we need a way to provide hooks in the makefiles for specifying the install location, so we can both be satisfied. I don't think there's a conflict here, because the sort of uses I'm worried about don't want to install the thing at all; just to run it. I don't really care what "make install" does because I never intend to run it; dumping into /usr/local is completely reasonable for the people who do. Since there's no script I know of other than your prototype, I don't think repackaging is likely to break anything. That makes it worth doing *now* rather than later. Fine, the renaming bit I'm not really opposed to. The odds there's anyone using this thing that isn't reading this message exchange are pretty low. There is the documentation backport issue if you make any serious change to it, though. Maybe put the new version in another location, leave the old one where it is? But frankly, I'd rather do without an extra script if possible. There's a fair number of steps to this though. It's possible to automate them all such that running the program is trivial. I don't know how we'd ever get that same ease of use without some sort of scripting for the whole process. Could probably do it in a makefile instead, but I don't know if that's really any better. The intersection between people who want to run this and people who don't have bash available is pretty slim I think. I might re-write in Perl to make it more portable, but I think that will be at the expense of making it harder for people to tweak if it doesn't work out of the box. More code, too. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync reliability
Simon Riggs wrote: We do issue fsync and then close, but only when we switch log files. We don't do that as part of the normal commit path. Since all these files are zero-filled before use, the space is allocated for them, and the remaining important filesystem layout metadata gets flushed during the close. The only metadata that changes after that--things like the last access time--isn't important to the WAL functioning. So the metadata doesn't need to be updated after a normal commit, it's already there. There are two main risks when crashing while fsync is in the middle of executing a push out to physical storage: torn pages due to partial data writes, and other out of order writes. The only filesystems where this isn't true are the copy on write ones, where the blocks move around on disk too. But those all have their own more careful guarantees about metadata too. The issue you raise above where "fsync is not safe for Write Ahead Logging" doesn't sound good. I don't think what you've said has fully addressed that yet. We could replace the commit path with O_DIRECT and physically order the data blocks, but I would guess the code path to durable storage has way too many bits of code tweaking it for me to feel happy that was worth it. As far as I can tell the CRC is sufficient protection against that. All of the data being torn up here is data that hasn't really been committed yet. Once you trust that the metadata problem isn't real, reordered writes are only going to destroy things that are in the middle of being flushed to disk. A synchronous commit mangled this way will be rolled back regardless because it never really finished (fsync didn't return); an asynchronous one was never guaranteed to be on disk. On many older Linux systems O_DIRECT is a less reliable code path than write/fsync is, so you're right that isn't necessarily a useful step forward. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync reliability
On 04/22/2011 09:32 AM, Simon Riggs wrote: OK, that's good, but ISTM we still have a hole during RemoveOldXlogFiles() where we don't fsync or open/close the file, just rename it. This is also something that many applications rely upon working as hoped for here, even though it's not technically part of POSIX. Early versions of ext4 broke that, and it caused a giant outcry of complaints. http://www.h-online.com/open/news/item/Ext4-data-loss-explanations-and-workarounds-740671.html has a good summary. This was broken on ext4 from around 2.6.28 to 2.6.30, but the fix for it was so demanded that it's even been ported by the relatively lazy distributions to their 2.6.28/2.6.29 kernels. There may be a small window for metadata issues here if you've put the WAL on ext2 and there's a crash in the middle of rename. That factors into why any suggestions I make about using ext2 come with a load of warnings about the risk of not journaling. It's hard to predict every type of issue that fsck might force you to come to terms with after a crash on ext2, and if there was a problem with this path I'd expect it to show up as something to be reconciled then. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync reliability
On 04/24/2011 10:06 PM, Daniel Farina wrote: On Thu, Apr 21, 2011 at 8:51 PM, Greg Smith wrote: There's still the "fsync'd a data block but not the directory entry yet" issue as fall-out from this too. Why doesn't PostgreSQL run into this problem? Because the exact code sequence used is this one: open, write, fsync, close. And Linux shouldn't ever screw that up, or the similar rename path. Here's what the close man page says, from http://linux.die.net/man/2/close : Theodore Ts'o addresses this *exact* sequence of events, and suggests if you want that rename to definitely stick that you must fsync the directory: http://www.linuxfoundation.org/news-media/blogs/browse/2009/03/don%E2%80%99t-fear-fsync Not exactly. That's talking about the sequence used for creating a file, plus a rename. When new WAL files are being created, I believe the ugly part of this is avoided. The path when WAL files are recycled using rename does seem to be the one with the most likely edge case. The difficult case Ts'o's discussion is trying to satisfy involves creating a new file and then swapping it for an old one atomically. PostgreSQL never does that exactly. It creates new files, pads them with zeros, and then starts writing to them; it also renames old files that are already of the correct length. Combined with the fact that there are always fsyncs after writes to the files, this case really isn't exactly the same as any of the others people are complaining about. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
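For reference, a rough sketch of the create-pad-fsync pattern described above; the segment size, file handling, and error messages are placeholders, not the real xlog.c code.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SEG_SIZE (16 * 1024 * 1024)   /* stand-in for the WAL segment size */

/* Create a new zero-filled segment so its space is fully allocated and the
 * important metadata can be flushed before the file is ever used for real. */
static int create_padded_segment(const char *path)
{
    char zeros[8192];
    ssize_t written = 0;
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
    {
        perror("open");
        return -1;
    }
    memset(zeros, 0, sizeof(zeros));
    while (written < SEG_SIZE)
    {
        if (write(fd, zeros, sizeof(zeros)) != (ssize_t) sizeof(zeros))
        {
            perror("write");
            close(fd);
            return -1;
        }
        written += sizeof(zeros);
    }
    if (fsync(fd) != 0)
    {
        perror("fsync");
        close(fd);
        return -1;
    }
    return close(fd);
}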
Re: [HACKERS] fsync reliability
On 04/23/2011 09:58 AM, Matthew Woodcraft wrote: As far as I can make out, the current situation is that this fix (the auto_da_alloc mount option) doesn't work as advertised, and the ext4 maintainers are not treating this as a bug. See https://bugzilla.kernel.org/show_bug.cgi?id=15910 I agree with the resolution that this isn't a bug. As pointed out there, XFS does the same thing, and this behavior isn't going away any time soon. Leaving behind zero-length files is what happens in situations where developers tried to optimize away a necessary fsync. Here's the part where the submitter goes wrong: "We first added a fsync() call for each extracted file. But scattered fsyncs resulted in a massive performance degradation during package installation (factor 10 or more, some reported that it took over an hour to unpack a linux-headers-* package!) In order to reduce the I/O performance degradation, fsync calls were deferred..." Stop right there; the slow path was the only one that had any hope of being correct. It can actually slow things by a factor of 100X or more, worst-case. "So, we currently have the choice between filesystem corruption or major performance loss": yes, you do. Writing files is tricky, and you get to pick between slow but safe and fast but risky. If you're going to avoid even trying to enforce the right thing here, you're eventually going to get burned badly. It's unfortunate that so many people have gotten used to the speed of the setup that's been common for a while now, ext3 with cheap hard drives: all writes are cached unsafely, but the filesystem resists a few bad behaviors. Much of the struggle where people say "this is so much slower, I won't put up with it" and try to code around it is futile, and it's hard to separate out the attempts to find such optimizations from the legitimate complaints. Anyway, you're right to point out that the filesystem is not necessarily going to save anyone from some of the tricky rename situations even with the improvements made to delayed allocation. They've fixed some of the worst behavior of the earlier implementation, but there are still potential issues in that area it seems. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] branching for 9.2devel
On 04/25/2011 02:26 PM, Josh Berkus wrote: Overall, I think the advantages to a faster/shorter CF cycle outweigh the disadvantages enough to make it at least worth trying. I'm willing to run the first 1-week CF, as well as several of the others during the 9.2 cycle to try and make it work. It will be interesting to see if it's even possible to get all the patches assigned a reviewer in a week. The only idea I've come up with for making this idea more feasible is to really buckle down on the idea that all submitters should also be reviewing. I would consider a fair policy to say that anyone who doesn't volunteer to review someone else's patch gets nudged toward the bottom of the reviewer priority list. Didn't offer to review someone else's patch? Don't be surprised if you get bumped out of a week long 'fest. This helps with two problems. It helps fill in short-term reviewers, and in the long-term it makes submitters more competent. The way things are setup now, anyone who submits a patch without having done a review first is very likely to get their patch bounced back; the odds of getting everything right without that background are low. Ideally submitters might even start fixing their own patches without reviewer prompting, once they get into someone else's and realize what they didn't do. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improving the memory allocator
On 04/25/2011 05:45 PM, Andres Freund wrote: The profile I used in this case was: pgbench -h /tmp/ -p5433 -s 30 pgbench -S -T 20 I'd suggest collecting data from running this with "-M prepared" at some point too, so that you can get a basic idea which of these allocations are avoided when using prepared statements. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] improvements to pgtune
Shiv wrote: On the program I hope to learn as much about professional software engineering principles as PostgreSQL. My project is aimed towards extending and hopefully improving upon pgtune. If any of you have some ideas or thoughts to share, I am all ears!! Well, first step on the software engineering side is to get a copy of the code in a form you can modify. I'd recommend grabbing it from https://github.com/gregs1104/pgtune ; while there is a copy of the program on git.postgresql.org, it's easier to work with the one on github instead. I can push updates over to the copy on postgresql.org easily enough, and that way you don't have to worry about getting an account on that server. There's a long list of suggested improvements to make at https://github.com/gregs1104/pgtune/blob/master/TODO Where I would recommend getting started is doing some of the small items on there, some of which I have already put comments into the code about but just not finished yet. Some examples:

-Validate against min/max
-Show original value in output
-Limit shared memory use on Windows (see notes on shared_buffers at http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server for more information)
-Look for postgresql.conf file using PGDATA environment variable
-Look for settings files based on path of the pgtune executable
-Save a settings reference file for newer versions of PostgreSQL (right now I only target 8.4) and allow passing in the version you're configuring.

A common mistake made by GSOC students is to dive right in to trying to make big changes. You'll be more successful if you get practice at things like preparing and sharing patches on smaller changes first. At the next level, there are a few larger features that I would consider valuable that are not really addressed by the program yet:

-Estimate how much shared memory is used by the combination of settings. See Table 17-2 at http://www.postgresql.org/docs/9.0/static/kernel-resources.html ; those numbers aren't perfect, and improving that table is its own useful project. But it gives an idea how they fit together. I have some notes at the end of the TODO file on how I think the information needed to produce this needs to be passed around the inside of pgtune.
-Use that estimate to produce a sysctl.conf file for one platform; Linux is the easiest one to start with. I've attached a prototype showing how to do that, written in bash.
-Write a Python-TK or web-based front-end for the program.

Now that I know someone is going to work on this program again, I'll see what I can do to clean some parts of it up. There are a couple of things it's easier for me to just fix rather than to describe, like the way I really want to change how it adds comments to the settings it changes. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

#!/bin/bash
# Output lines suitable for sysctl configuration based
# on total amount of RAM on the system. The output
# will allow up to 50% of physical memory to be allocated
# into shared memory.
# On Linux, you can use it as follows (as root):
#
# ./shmsetup >> /etc/sysctl.conf
# sysctl -p
# Early FreeBSD versions do not support the sysconf interface
# used here. The exact version where this works hasn't
# been confirmed yet.

page_size=`getconf PAGE_SIZE`
phys_pages=`getconf _PHYS_PAGES`

if [ -z "$page_size" ]; then
  echo Error: cannot determine page size
  exit 1
fi

if [ -z "$phys_pages" ]; then
  echo Error: cannot determine number of memory pages
  exit 2
fi

shmall=`expr $phys_pages / 2`
shmmax=`expr $shmall \* $page_size`

echo \# Maximum shared segment size in bytes
echo kernel.shmmax = $shmmax
echo \# Maximum number of shared memory segments in pages
echo kernel.shmall = $shmall

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] improvements to pgtune
Daniel Farina wrote: It seems like in general it lacks a feedback mechanism to figure out settings from workloads, instead relying on Greg Smith's sizable experience to do some arithmetic and get you off the ground in a number of common cases. To credit appropriately, the model used right now actually originated with a Josh Berkus spreadsheet, from before I was doing this sort of work full-time. That's held up pretty well, but it doesn't fully reflect how I do things nowadays. The recent realization that pgtune is actually shipping as a package for Debian/Ubuntu has made me realize this is a much higher profile project now, one that I should revisit doing a better job on. Every time I've gotten pulled into discussions of setting parameters based on live monitoring, it's turned into a giant black hole--absorbs a lot of energy, nothing useful escapes from it. I credit completely ignoring that idea altogether, and using the simplest possible static settings instead, as one reason I managed to ship code here that people find useful. I'm not closed to the idea, just not optimistic it will lead anywhere useful. That makes it hard to work on when there are so many obvious things guaranteed to improve the program that could be done instead. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Predicate locking
Dan Ports wrote: Yes, you're right -- the current implementation of SSI only locks indexes at the granularity of index pages. So although those transactions don't actually access the same records, they're detected as a conflict because they're on the same index page. Let's try to demonstrate that with an update to Vlad's example. Run this on a first client to generate the same type of table, but with an easy-to-vary number of rows in it:

drop table t;
create table t (id bigint, value bigint);
insert into t(id,value) (select s,1 from generate_series(1,348) as s);
create index t_idx on t(id);
begin transaction;
set transaction isolation level serializable;
select * from t where id = 2;
insert into t (id, value) values (-2, 1);

Execute this on the second client:

begin transaction;
set transaction isolation level serializable;
select * from t where id = 3;
insert into t (id, value) values (-3, 0);
commit;

And then return to the first one to run COMMIT. I'm getting a serialization failure in that case. However, if I increase the generate_series to create 349 rows (or more) instead, it works. I don't see where it crosses a page boundary in table or index size going from 348 to 349, so I'm not sure why I'm seeing a change happening there. In both cases, there's only one non-header index block involved. I don't think Vlad is being unreasonable here; he's provided a test case demonstrating the behavior he'd like to see, and shown it doesn't work as expected. If we can prove that test does work on non-trivially sized tables, and that it only suffers from broader locks than necessary that can be optimized later, that would make a more compelling argument against the need for proper predicate locks. I don't fully understand why this attempt at doing that works the way it does, though. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Predicate locking
Kevin Grittner wrote: I don't think Vlad is being unreasonable here; he's provided a test case demonstrating the behavior he'd like to see, and shown it doesn't work as expected. ... on a toy table with contrived values. How different is this from the often-asked question about why a query against a four-line table is not using the index they expect, and how can we expect it to scale if it doesn't? It's not, but in that case I've been known to show someone that the behavior they're seeing doesn't happen on a larger table. My point was just that no one has really done that here yet: provided an example showing SSI serialization working as a substitute for predicate locking in this sort of use case. I trust that the theory is sound here, and yet I'd still like to see that demonstrated. This is why I launched into making a less trivial test case. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Predicate locking
Kevin Grittner wrote: Check where the plan goes from a table scan to an indexed access. Also look at what is showing for SIRead locks in pg_locks as you go. Between those two bits of information, it should become apparent. OK, so this doesn't look to be an index lock related thing at all here. The updated test case does this to create the table and show some additional state:

drop table t;
create table t (id bigint, value bigint);
insert into t(id,value) (select s,1 from generate_series(1,348) as s);
create index t_idx on t(id);
begin transaction;
set transaction isolation level serializable;
explain analyze select * from t where id = 2;
select pid,locktype,relation::regclass,page,tuple from pg_locks where mode='SIReadLock';
insert into t (id, value) values (-2, 1);
select pid,locktype,relation::regclass,page,tuple from pg_locks where mode='SIReadLock';

Do the same thing as before on the second process:

begin transaction;
set transaction isolation level serializable;
select * from t where id = 3;
insert into t (id, value) values (-3, 0);
commit;

Then return to the first client to commit. When I execute that with 348 records, the case that fails, it looks like this:

gsmith=# explain analyze select * from t where id = 2;
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..6.35 rows=2 width=16) (actual time=0.106..0.286 rows=1 loops=1)
   Filter: (id = 2)
 Total runtime: 0.345 ms
(3 rows)

gsmith=# select pid,locktype,relation::regclass,page,tuple from pg_locks where mode='SIReadLock';
 pid  | locktype | relation | page | tuple
------+----------+----------+------+-------
 1495 | relation | t        |      |

So it's actually grabbing a lock on the entire table in that situation. The other client does the same thing, and they collide with the described serialization failure. The minute I try that with a table that is 349 rows instead, it switches plans:

gsmith=# explain analyze select * from t where id = 2;
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=4.27..6.29 rows=2 width=16) (actual time=0.169..0.171 rows=1 loops=1)
   Recheck Cond: (id = 2)
   ->  Bitmap Index Scan on t_idx  (cost=0.00..4.27 rows=2 width=0) (actual time=0.144..0.144 rows=1 loops=1)
         Index Cond: (id = 2)
 Total runtime: 0.270 ms
(5 rows)

gsmith=# select pid,locktype,relation::regclass,page,tuple from pg_locks where mode='SIReadLock';
 pid  | locktype | relation | page | tuple
------+----------+----------+------+-------
 1874 | page     | t_idx    |    1 |
 1874 | tuple    | t        |    0 |     2
(2 rows)

Grabbing a lock on the index page and the row, as Dan explained it would. This actually eliminates this particular serialization failure altogether here though, even with these still on the same table and index page. So the root problem with Vlad's test isn't the index lock at all; it's heavy locking from the sequential scan that's executing on the trivial cases. If he expands his tests to use a larger amount of data, such that the plan switches to a realistic one, his results with the new serialization mode may very well be more satisfying. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for new feature: Buffer Cache Hibernation
Alvaro Herrera wrote: As for gain, I have heard of test setups requiring hours of runtime in order to prime the buffer cache. And production ones too. I have multiple customers where a server restart is almost a planned multi-hour downtime. The system may be back up, but for a couple of hours performance is so terrible it's barely usable. You can watch the MB/s ramp up as the more random data fills in over time; getting that taken care of in a larger block more amenable to elevator sorting would be a huge help. I never bothered with this particular idea though because shared_buffers is only a portion of the important data. Cedric's pgfincore code digs into the OS cache, too, which can then save enough to be really useful here. And that's already got a snapshot/restore feature. The slides at http://www.pgcon.org/2010/schedule/events/261.en.html have a useful intro to that; pages 30 through 34 are the neat ones. That provides some other neat APIs for preloading popular data into cache too. I'd rather work on getting something like that into core, rather than adding something that targets just shared_buffers. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
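A small sketch of the elevator-sorting idea mentioned above: sort the saved block locations before hinting the kernel about them, so the prefetch reads can be merged into larger sequential requests. The structures and names are invented for illustration; posix_fadvise() is the only real API assumed here.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* One saved cache entry: which block of an already-opened file was hot. */
struct saved_block {
    int fd;
    off_t offset;
};

static int cmp_blocks(const void *a, const void *b)
{
    const struct saved_block *x = a, *y = b;

    if (x->fd != y->fd)
        return x->fd - y->fd;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/* Sort the saved list, then hint the kernel so the reads can be merged into
 * large sequential requests instead of a stream of random single blocks. */
static void prewarm(struct saved_block *blocks, size_t n, size_t block_size)
{
    size_t i;

    qsort(blocks, n, sizeof(blocks[0]), cmp_blocks);
    for (i = 0; i < n; i++)
        (void) posix_fadvise(blocks[i].fd, blocks[i].offset,
                             (off_t) block_size, POSIX_FADV_WILLNEED);
}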
Re: [HACKERS] Compiling a PostgreSQL 7.3.2 project with Eclipse
Krešimir Križanović wrote: However, where I compile and run the project, in the Eclipse console I get: POSTGRES backend interactive interface $Revision: 1.115 $ $Date: 2006/02/06 01:19:46 $ I don’t know what to do with this backend interface. I would like to get a postmaster running and to connect to a data base with psql. However, when i try to start psql, it says that there is no postmaster running. An example of a session with that interface is at http://archives.postgresql.org/pgsql-hackers/2000-01/msg01471.php ; you may find it useful at some point. Your problem is caused by a change made to PostgreSQL's naming convention made after the 7.3 fork you're using. In earlier versions, "postgres" meant start the server in single user mode: http://www.postgresql.org/docs/7.3/static/app-postgres.html While "postmaster" started it as a proper server: http://www.postgresql.org/docs/7.3/static/app-postmaster.html In modern versions, they are the same thing. The Eclipse example uses "postgres", which starts the regular server now, but in 7.3 only started single user mode. Change where you run the program to use "postmaster" instead and it should work more like what you're expecting. Doing something useful with the TelegraphCQ code is probably going to take you a while. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why not install pgstattuple by default?
Christopher Browne wrote: I'm getting "paper cuts" quite a bit these days over the differences between what different packaging systems decide to install. The one *I* get notably bit on, of late, is that I have written code that expects to have pg_config to do some degree of self-discovery, only to find production folk complaining that they only have "psql" available in their environment. Given the other improvements in being able to build extensions in 9.1, we really should push packagers to move pg_config from the PostgreSQL development package into the main one starting in that version. I've gotten bit by this plenty of times. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for new feature: Buffer Cache Hibernation
On 05/05/2011 05:06 AM, Mitsuru IWASAKI wrote: In summary, PgFincore's target is File System Buffer Cache, Buffer Cache Hibernation's target is DB Buffer Cache (shared buffers). Right. The thing to realize is that shared_buffers is becoming a smaller fraction of the total RAM used by the database every year. On Windows, useful settings have been stuck below 512MB for a while now. And on UNIX systems, around 8GB seems to be an effective upper limit. Best case, shared_buffers is only going to be around 25% of total RAM; worst-case, you might have a Windows server with 64GB of RAM where shared_buffers is less than 1% of total RAM. There's nothing wrong with the general idea you're suggesting. It's just only targeting a small (and shrinking) subset of the real problem here. Rebuilding cache state starts with shared_buffers, but that's not enough of the problem to be an effective tweak on many systems. I think that all the complexity with CRCs etc. is unlikely to lead anywhere too, and those two issues are not completely unrelated. The simplest, safest thing here is the right way to approach this, not the most complicated one, and a simpler format might add some flexibility here to reload more cache state too. The bottleneck on reloading the cache state is reading everything from disk. Trying to micro-optimize any other part of that is moving in the wrong direction to me. I doubt you'll ever measure a useful benefit that overcomes the expense of maintaining the code. And you seem to be moving to where someone can't restore cache state when they change shared_buffers. A simpler implementation might still work in that situation: reload until you run out of buffers if shared_buffers has shrunk, or until you've worked through everything saved from the original size otherwise. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why not install pgstattuple by default?
On 05/06/2011 05:58 PM, Josh Berkus wrote: Yeah, I wasn't thinking of including all of contrib. There's a lot of reasons not to do that. I was asking about pgstattuple in particular, since it's: (a) small (b) has no external dependancies (c) adds no stability risk or performance overhead (d) is usually needed on production systems when it's needed at all It's possible that we have one or two other diagnostic utilities which meet the above profile. pageinspect, maybe? I use pgstattuple, pageinspect, pg_freespacemap, and pg_buffercache regularly enough that I wish they were more common. Throw in pgrowlocks and you've got the whole group Robert put into the debug set. It makes me sad every time I finish a utility using one of these and realize I'll have to include the whole "make sure you have the contrib modules installed" disclaimer in its documentation again. These are the only ones I'd care about moving into a more likely place. The rest of the contrib modules are the sort where if you need them, you realize that early and get them installed. These are different by virtue of their need popping up most often during emergencies. The fact that I believe they all match the low impact criteria too makes it even easier to consider. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why not install pgstattuple by default?
On 05/07/2011 12:42 PM, Peter Eisentraut wrote: On fre, 2011-05-06 at 14:32 -0400, Greg Smith wrote: Given the other improvements in being able to build extensions in 9.1, we really should push packagers to move pg_config from the PostgreSQL development package into the main one starting in that version. I've gotten bit by this plenty of times. Do you need pg_config to install extensions? No, but you still need it to build them. PGXN is a source code distribution method, not a binary one. It presumes users can build modules they download using PGXS. No pg_config, no working PGXS, no working PGXN. For such a small binary to have that much ripple-out impact is bad. The repmgr program we distribute has the same problem, so I've been getting first-hand reports of just how many people are likely to run into this recently. You have to install postgresql-devel with RPM, and on Debian the very non-obvious postgresql-server-dev-$version. Anyway, I didn't want to hijack this thread beyond pointing out that if there is any package reshuffling that happens for contrib changes, it should check for and resolve this problem too. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why not install pgstattuple by default?
On 05/06/2011 04:06 PM, Tom Lane wrote: FWIW, I did move pg_config from -devel to the "main" (really client) postgresql package in Fedora, as of 9.0. That will ensure it's present in either client or server installations. Eventually that packaging will reach RHEL ... We should make sure that the PGDG packages adopt that for 9.1 then, so it starts catching on more. Unless Devrim has changed this to catch up since I last installed an RPM set, in those 9.0 packages it's still in the same place:

$ rpm -qf /usr/pgsql-9.0/bin/pg_config
postgresql90-devel-9.0.2-2PGDG.rhel5

While Peter's question about whether it's really all that useful is reasonable, I'd at least like to get a better error message when you don't have everything needed to compile extensions. I think the shortest path to that is making pg_config more likely to be installed, then checking whether the file "pg_config --pgxs" references exists. I'll see if I can turn that idea into an actual change to propose. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
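To show the shape of that check rather than an actual patch, here is a rough sketch; pg_config --pgxs is a real option, while the program itself and its messages are hypothetical.

#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

int main(void)
{
    char path[4096];
    struct stat st;
    /* Ask pg_config where the PGXS makefile should live. */
    FILE *p = popen("pg_config --pgxs", "r");

    if (p == NULL || fgets(path, sizeof(path), p) == NULL)
    {
        fprintf(stderr, "pg_config not found; install the development package\n");
        if (p)
            pclose(p);
        return 1;
    }
    pclose(p);
    path[strcspn(path, "\n")] = '\0';

    /* Confirm the referenced file actually exists before make tries to include it. */
    if (stat(path, &st) != 0)
    {
        fprintf(stderr, "PGXS makefile %s is missing; extensions cannot be built here\n", path);
        return 1;
    }
    printf("PGXS found at %s\n", path);
    return 0;
}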
Re: [HACKERS] Why not install pgstattuple by default?
Attached patch is a first cut at what moving one contrib module (in this case pg_buffercache) to a new directory structure might look like. The idea is that src/extension could become a place for "first-class" extensions to live. Those are ones the community is committed to providing in core, but that are just better implemented as extensions than as in-database functions, for reasons that include security. This idea has been shared by a lot of people for a while; the only problem is that it wasn't really practical to implement cleanly until the extensions code hit. I think it is now, and this attempts to prove it. Since patches involving file renaming are clunky, the changes might be easier to see at https://github.com/greg2ndQuadrant/postgres/commit/507923e21e963c873a84f1b850d64e895776574f where I just pushed this change too. The install step for the modules looks like this now:

gsmith@grace:~/pgwork/src/move-contrib/src/extension/pg_buffercache$ make install
/bin/mkdir -p '/home/gsmith/pgwork/inst/move-contrib/lib/postgresql'
/bin/mkdir -p '/home/gsmith/pgwork/inst/move-contrib/share/postgresql/extension'
/bin/sh ../../../config/install-sh -c -m 755 pg_buffercache.so '/home/gsmith/pgwork/inst/move-contrib/lib/postgresql/pg_buffercache.so'
/bin/sh ../../../config/install-sh -c -m 644 ./pg_buffercache.control '/home/gsmith/pgwork/inst/move-contrib/share/postgresql/extension/'
/bin/sh ../../../config/install-sh -c -m 644 ./pg_buffercache--1.0.sql ./pg_buffercache--unpackaged--1.0.sql '/home/gsmith/pgwork/inst/move-contrib/share/postgresql/extension/'

$ psql -c "create extension pg_buffercache"
CREATE EXTENSION

The only clunky bit I wasn't really happy with is the amount of code duplication that comes from having a src/extension/Makefile that looks almost, but not quite, identical to contrib/Makefile. The rest of the changes don't seem too bad to me, and even that's really only 36 lines that aren't touched often. Yes, the paths are different, so backports won't happen without an extra step. But the code changes required were easier than I was expecting, due to the general good modularity of the extensions infrastructure. So long as the result ends up in share/postgresql/extension/ , whether they started in contrib/ or src/extension/ doesn't really matter to CREATE EXTENSION. But having them broken out this way makes it easy for the default Makefile to build and install them all. (I recognize I didn't do that last step yet though) I'll happily go convert pgstattuple and the rest of the internal diagnostics modules to this scheme, and do the doc cleanups, this upcoming week if it means I'll be able to use those things without installing all of contrib one day. Ditto for proposing RPM and Debian packaging changes that match them. All that work will get paid back the first time I don't have to fill out a bunch of paperwork (again) at a customer site justifying why they need to install the contrib [RPM|deb] package (which has some scary stuff in it) on all their servers, just so I can get some bloat or buffer inspection module.
-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us diff --git a/contrib/Makefile b/contrib/Makefile index 6967767..04cf330 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -30,7 +30,6 @@ SUBDIRS = \ pageinspect \ passwordcheck \ pg_archivecleanup \ - pg_buffercache \ pg_freespacemap \ pg_standby \ pg_stat_statements \ diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile deleted file mode 100644 index 323c0ac..000 --- a/contrib/pg_buffercache/Makefile +++ /dev/null @@ -1,18 +0,0 @@ -# contrib/pg_buffercache/Makefile - -MODULE_big = pg_buffercache -OBJS = pg_buffercache_pages.o - -EXTENSION = pg_buffercache -DATA = pg_buffercache--1.0.sql pg_buffercache--unpackaged--1.0.sql - -ifdef USE_PGXS -PG_CONFIG = pg_config -PGXS := $(shell $(PG_CONFIG) --pgxs) -include $(PGXS) -else -subdir = contrib/pg_buffercache -top_builddir = ../.. -include $(top_builddir)/src/Makefile.global -include $(top_srcdir)/contrib/contrib-global.mk -endif diff --git a/contrib/pg_buffercache/pg_buffercache--1.0.sql b/contrib/pg_buffercache/pg_buffercache--1.0.sql deleted file mode 100644 index 9407d21..000 --- a/contrib/pg_buffercache/pg_buffercache--1.0.sql +++ /dev/null @@ -1,17 +0,0 @@ -/* contrib/pg_buffercache/pg_buffercache--1.0.sql */ - --- Register the function. -CREATE FUNCTION pg_buffercache_pages() -RETURNS SETOF RECORD -AS 'MODULE_PATHNAME', 'pg_buffercache_pages' -LANGUAGE C; - --- Create a view for convenient access. -CREATE VIEW pg_buffercache AS - SELECT P.* FROM pg_buffercache_pages() AS P - (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid, - r
Re: [HACKERS] patch for new feature: Buffer Cache Hibernation
Mitsuru IWASAKI wrote: the patch is available at: http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110508.patch We can't accept patches just based on a pointer to a web site. Please e-mail this to the mailing list so that it can be considered a submission under the project's licensing terms. I hope this would be committable and the final version. PostgreSQL has high standards for code submissions. Extremely few submissions are committed without significant revisions to them based on code review. So far you've gotten a first round of high-level design review; there are several additional steps before something is considered for a commit. The whole process is outlined at http://wiki.postgresql.org/wiki/Submitting_a_Patch From a couple of minutes of reading the patch, the first things that pop out as problems are:

-All of the ControlFile -> controlFile renaming has added a larger difference to ReadControlFile than I would consider ideal.
-Touching StrategyControl is not something this patch should be doing.
-I don't think your justification ("debugging or portability") for keeping around your original code in here is going to be sufficient to do so.
-This should not be named enable_buffer_cache_hibernation. That very large diff you ended up with in the regression tests is because all of the settings named enable_* are optimizer control settings. Using the name "buffer_cache_hibernation" instead would make a better starting point.

From a bigger picture perspective, this really hasn't addressed any of my comments about shared_buffers only being the beginning of the useful cache state to worry about here. I'd at least like the solution to the buffer cache save/restore to have a plan for how it might address that too one day. This project is also picky about only committing code that fits into the long-term picture for desired features. Having a working example of a server-side feature doing cache storage and restoration is helpful though. Don't think your work here is unappreciated--it is. Getting this feature added is just a harder problem than what you've done so far. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] improvements to pgtune
ased, while working on simpler programs that don't aim so high leads to software that ships to the world and finds users. The only reason pgtune is now available in packaged form on multiple operating systems is that I ignored all advice about aiming for a complicated tool and instead wrote a really simple one. That was hard enough to finish. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Formatting Curmudgeons WAS: MMAP Buffers
On 05/09/2011 12:06 PM, Andrew Dunstan wrote: The fact that we can do in place upgrades of the data only addresses one pain point in upgrading. Large legacy apps require large retesting efforts when upgrading, often followed by lots more work renovating the code for backwards incompatibilities. This can be a huge cost for what the suits see as little apparent gain, and making them do it more frequently in order to stay current will not win us any friends. I just had a "why a new install on 8.3?" conversation today, and it was all about the application developer not wanting to do QA all over again for a later release. Right now, one of the major drivers for "why upgrade?" has been the performance improvements in 8.3, relative to any older version. The main things pushing happy 8.3 sites to 8.4 or 9.0 that I see are either VACUUM issues (improved with partial vacuum in 8.4) or wanting real-time replication (9.0). I predict many sites that don't want either are likely to sit on 8.3 for a really long time. The community won't be able to offer a compelling reason why smaller sites in particular should go through the QA an upgrade requires. The fact that the app QA time is now the main driver--not the dump and reload time--is good, because it does make it easier for the people with the biggest data sets to move. They're the ones that need the newer versions the most anyway, and in that regard having in-place upgrade start showing up as of 8.3 was really just in time. I think 8.3 is going to be one of those releases like 7.4, where people just keep running it forever. At least shortening the upgrade path has made that concern a little bit better. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why not install pgstattuple by default?
On 05/09/2011 03:31 PM, Alvaro Herrera wrote: For executables we already have src/bin. Do we really need a separate place for, say, pg_standby or pg_upgrade? There's really no executables in contrib that I find myself regularly desperate for/angry at because they're not installed as an integral part of the server, the way I regularly find myself cursing some things that are now extensions. The only contrib executable I use often is pgbench, and that's normally early in server life--when it's still possible to get things installed easily most places. And it's something that can be removed when finished, in cases where people are nervous about the contrib package. Situations where pg_standby or pg_upgrade suddenly pop up as emergency needs seem unlikely too, which is also the case with oid2name, pg_test_fsync, pg_archivecleanup, and vacuumlo. I've had that happen with pg_archivecleanup exactly once since it appeared--running out of space and that was the easiest way to make the problem go away immediately and permanently--but since it was on an 8.4 server we had to download the source and build anyway. Also, my experience is that people are not that paranoid about running external binaries, even though they could potentially do harm to the database. Can always do a backup beforehand. But the idea of loading a piece of code that lives in the server all the time freaks them out. The way the word contrib implies (and sometimes is meant to mean) low quality, while stuff that ships with the main server package does not, has been beaten up here for years already. It's only a few cases where that's not fully justified, and the program can easily become an extension, that I feel are really worth changing here. There are 49 directories in contrib/ ; at best maybe 20% of them will ever fall into that category. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why not install pgstattuple by default?
On 05/09/2011 02:31 PM, Robert Haas wrote: I don't think we should be too obstinate about trying to twist the arm of packagers who (as Tom points out) will do whatever they want in spite of us, but the current state of contrib, with all sorts of things of varying type, complexity, and value mixed together cannot possibly be a good thing. I think the idea I'm running with for now means that packagers won't actually have to do anything. I'd expect typical packaging for 9.1 to include share/postgresql/extension from the build results without being more specific. You need to grab 3 files from there to get the plpgsql extension, and I can't imagine any packager listing them all by name. So if I dump the converted contrib extensions to put files under there, and remove them from the contrib build area destination, I suspect they will magically jump from the contrib to the extensions area of the server package at next package build; no packager level changes required. The more I look at this, the less obtrusive of a change it seems to be. Only people who will really notice are users who discover more in the basic server package, and of course committers with backporting to do. Since packaged builds of 9.1 current with beta1 seem to be in short supply still, this theory looks hard to prove just yet. I'm very excited that it's packaging week here however (rolls eyes), so I'll check it myself. I'll incorporate the suggestions made since I posted that test patch and do a bigger round of this next, end to end with an RPM set as the output. It sounds like everyone who has a strong opinion on what this change might look like has sketched a similar looking bikeshed. Once a reasonable implementation is hammered out, I'd rather jump to the big argument between not liking change vs. the advocacy benefits to PostgreSQL of doing this; they are considerable in my mind. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Formatting Curmudgeons WAS: MMAP Buffers
Josh Berkus wrote: As I don't think we can change this, I think the best answer is to tell people "Don't submit a big patch to PostgreSQL until you've done a few small patches first. You'll regret it". When I last did a talk about getting started writing patches, I had a few people ask me afterwards if I'd ever run into problems with having patch submissions rejected. I said I hadn't. When asked what my secret was, I told them my first serious submission modified exactly one line of code[1]. And *that* I had to defend in regards to its performance impact.[2] Anyway, I think the intro message should be "Don't submit a big patch to PostgreSQL until you've done a small patch and some patch review" instead though. It's both a good way to learn what not to do, and it helps with one of the patch acceptance bottlenecks. The problem is not the process itself, but that there is little documentation of that process, and that much of that documentation does not match the de facto process. Obviously, the onus is on me as much as anyone else to fix this. I know the documentation around all this has improved a lot since then. Unfortunately there are plenty of submissions done by people who never read it. Sometimes it's because people didn't know about it; in others I suspect it was seen but some hard parts ignored because it seemed like too much work. One of these days I'm going to write the "Formatting Curmudgeon Guide to Patch Submission", to give people an idea just how much diff reading and revision a patch should go through in order to keep common issues like spurious whitespace diffs out of it. Submitters can either spend a block of time sweating those details out themselves, or force the job onto a reviewer/committer; they're always there. And every minute it's sitting in someone else's hands who is doing that job instead of reading the real code, the odds of the patch being kicked back go up. [1] http://archives.postgresql.org/pgsql-patches/2007-03/msg00553.php [2] http://archives.postgresql.org/pgsql-patches/2007-02/msg00222.php -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Formatting Curmudgeons WAS: MMAP Buffers
Heikki Linnakangas wrote: Well, my first patch was two-phase commit. And I had never even used PostgreSQL before I dived into the source tree and started to work on that Well, everyone knows you're awesome. A small percentage of the people who write patches end up having the combination of background skills, mindset, and approach to pull something like that off. But there are at least a dozen submissions that start review with "I don't think this will ever work, but I can't even read your malformed patch to be sure" for every one that went like 2PC. If every submitter was a budding Heikki we wouldn't need patch submission guidelines at all. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Infinity bsearch crash on Windows
A 9.1Beta1 test report from Richard Broersma (and confirmed on another system by Mark Watson) showed up on pgsql-testers this week at http://archives.postgresql.org/pgsql-testers/2011-05/msg0.php with the following test crashing his Windows server every time: SELECT 'INFINITY'::TIMESTAMP; That works fine for me on Linux. Richard chased the error in the logs, which was a generic "you can't touch that memory" one, down to a full stack trace: http://archives.postgresql.org/pgsql-testers/2011-05/msg9.php It looks like it's losing its mind inside of src/backend/utils/adt/datetime.c , specifically at these lines in datebsearch:

3576         while (last >= base)
3577         {
3578                 position = base + ((last - base) >> 1);
3579                 result = key[0] - position->token[0];

Why crash there only on Windows? Was the problem actually introduced above this part of the code? These are all questions I have no answer for. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Infinity bsearch crash on Windows
Tom Lane wrote: SELECT 'INFINITY'::TIMESTAMP; Hmm ... I bet this is related to the recent reports about ALTER USER VALID UNTIL 'infinity' crashing on Windows. Can the people seeing this get through the regression tests? Perhaps more to the point, what is their setting of TimeZone? What does the pg_timezone_abbrevs view show for them? I must have missed that thread; I think I'm missing one of these lists (pgsql-bugs maybe?). I've cc'd Mark Watson so maybe you can get better responses without me in the middle if needed; for this one, he reports: show timezone gives US/Eastern; select * from pg_timezone_abbrevs returns zero rows. My Linux system that doesn't have this problem is in US/Eastern too, but I get 189 rows in pg_timezone_abbrevs. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
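If the empty abbreviation table is what is going wrong, the kind of guard that would keep a search like the quoted datebsearch loop from misbehaving on a zero-entry array might look like the sketch below. The struct and constant here are simplified stand-ins for the real definitions in datetime.h, and this is a guess at the failure mode rather than a confirmed fix.

#include <string.h>

#define TOKMAXLEN 10                                    /* stand-in for the real constant */
typedef struct { char token[TOKMAXLEN + 1]; } datetkn;  /* simplified stand-in */

static const datetkn *
datebsearch_sketch(const char *key, const datetkn *base, int nel)
{
    const datetkn *last;
    const datetkn *position;
    int result;

    /* An empty table must be handled before doing pointer arithmetic on it. */
    if (nel <= 0 || base == NULL)
        return NULL;

    last = base + nel - 1;
    while (last >= base)
    {
        position = base + ((last - base) >> 1);
        result = key[0] - position->token[0];
        if (result == 0)
        {
            result = strncmp(key, position->token, TOKMAXLEN);
            if (result == 0)
                return position;
        }
        if (result < 0)
            last = position - 1;
        else
            base = position + 1;
    }
    return NULL;
}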
Re: [HACKERS] performance-test farm
Tomas Vondra wrote: Actually I was not aware of how the buildfarm works, all I knew was there's something like that because some of the hackers mention a failed build on the mailing list occasionally. So I guess this is a good opportunity to investigate it a bit ;-) Anyway I'm not sure this would give us the kind of environment we need to do benchmarks ... but it's worth thinking about. The idea is that buildfarm systems that are known to have a) reasonable hardware and b) no other concurrent work going on could also do performance tests. The main benefit of this approach is it avoids duplicating all of the system management and source code building work needed for any sort of thing like this; just leverage the buildfarm parts when they solve similar enough problems. Someone has actually done all that already; source code was last sync'd to the build farm master at the end of March: https://github.com/greg2ndQuadrant/client-code By far the #1 thing needed to move this forward from where it's stuck at now is someone willing to dig into the web application side of this. We're collecting useful data. It needs to now be uploaded to the server, saved, and then reports of what happened generated. Eventually graphs of performance results over time will be straightforward to generate. But the whole idea requires that someone else (not Andrew, who has enough to do) sit down and figure out how to extend the web UI with these new elements. I guess we could run a script that collects all those important parameters and then detect changes. Anyway we still need some 'really stable' machines that are not changed at all, to get a long-term baseline. I have several such scripts I use, and know where two very serious ones developed by others are at too. This part is not a problem. If the changes are big enough to matter, they will show up as a difference on the many possible "how is the server configured?" reports; we just need to pick the most reasonable one. It's a small detail I'm not worried about yet. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] performance-test farm
Tom Lane wrote: There's no such thing as a useful performance test that runs quickly enough to be sane to incorporate in our standard regression tests. To throw a hard number out: I can get a moderately useful performance test on a SELECT-only workload from pgbench in about one minute. That's the bare minimum; stepping up to 5 minutes is really necessary before I'd want to draw any real conclusions. More importantly, a large portion of the time I'd expect regression test runs to be happening with debug/assert on. We've well established this trashes pgbench performance. One of the uglier bits of code needed to add the "performance farm" feature to the buildfarm code was hacking in a whole different set of build options for it. Anyway, what I was envisioning here was that performance farm systems would also execute the standard buildfarm tests, but not the other way around. We don't want performance numbers from some platform that is failing the basic tests. I would just expect that systems running the performance tests would cycle through regression testing much less often, as they might miss a commit because they were running a longer test then. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] cache estimates, cache access cost
Cédric Villemain wrote: http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache This rebases easily to make Cedric's changes move to the end; I just pushed a version with that change to https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone wants a cleaner one to browse. I've attached a patch too if that's more your thing. I'd recommend not getting too stuck on the particular hook Cédric has added here to compute the cache estimate, which uses mmap and mincore to figure it out. It's possible to compute similar numbers, albeit less accurate, using an approach similar to how pg_buffercache inspects things. And I even once wrote a background writer extension that collected this sort of data as it was running the LRU scan anyway. Discussions of this idea seem to focus on how the "what's in the cache?" data is collected, which as far as I'm concerned is the least important part. There are multiple options, some work better than others, and there's no reason that can't be swapped out later. The more important question is how to store the data collected and then use it for optimizing queries. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us diff --git a/contrib/Makefile b/contrib/Makefile index 6967767..47652d5 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -27,6 +27,7 @@ SUBDIRS = \ lo \ ltree \ oid2name \ + oscache \ pageinspect \ passwordcheck \ pg_archivecleanup \ diff --git a/contrib/oscache/Makefile b/contrib/oscache/Makefile new file mode 100644 index 000..8d8dcc5 --- /dev/null +++ b/contrib/oscache/Makefile @@ -0,0 +1,15 @@ +# contrib/oscache/Makefile + +MODULE_big = oscache +OBJS = oscache.o + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/oscache +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/oscache/oscache.c b/contrib/oscache/oscache.c new file mode 100644 index 000..1ad7dc2 --- /dev/null +++ b/contrib/oscache/oscache.c @@ -0,0 +1,151 @@ +/*- + * + * oscache.c + * + * + * Copyright (c) 2011, PostgreSQL Global Development Group + * + * IDENTIFICATION + * contrib/oscache/oscache.c + * + *- + */ +/* { POSIX stuff */ +#include /* exit, calloc, free */ +#include /* stat, fstat */ +#include /* size_t, mincore */ +#include /* sysconf, close */ +#include /* mmap, mincore */ +/* } */ + +/* { PostgreSQL stuff */ +#include "postgres.h" /* general Postgres declarations */ +#include "utils/rel.h" /* Relation */ +#include "storage/bufmgr.h" +#include "catalog/catalog.h" /* relpath */ +/* } */ + +PG_MODULE_MAGIC; + +void _PG_init(void); + +float4 oscache(Relation, ForkNumber); + +/* + * Module load callback + */ +void +_PG_init(void) +{ + /* Install hook. */ + OSCache_hook = &oscache; +} + +/* + * oscache process the os cache inspection for the relation. + * It returns the percentage of blocks in OS cache. 
+ */ +float4 +oscache(Relation relation, ForkNumber forkNum) +{ + int segment = 0; + char *relationpath; + char filename[MAXPGPATH]; + int fd; + int64 total_block_disk = 0; + int64 total_block_mem = 0; + + /* OS things */ + int64 pageSize = sysconf(_SC_PAGESIZE); /* Page size */ + register int64 pageIndex; + + relationpath = relpathperm(relation->rd_node, forkNum); + + /* + * For each segment of the relation + */ + snprintf(filename, MAXPGPATH, "%s", relationpath); + while ((fd = open(filename, O_RDONLY)) != -1) + { + // for stat file + struct stat st; + // for mmap file + void *pa = (char *)0; + // for calloc file + unsigned char *vec = (unsigned char *)0; + int64 block_disk = 0; + int64 block_mem = 0; + + if (fstat(fd, &st) == -1) + { + close(fd); + elog(ERROR, "Can not stat object file : %s", +filename); + return 0; + } + + /* + * if file ok + * then process + */ + if (st.st_size != 0) + { + /* number of block in the current file */ + block_disk = st.st_size/pageSize; + + /* TODO We need to split mmap size to be sure (?) to be able to mmap */ + pa = mmap(NULL, st.st_size, PROT_NONE, MAP_SHARED, fd, 0); + if (pa == MAP_FAILED) + { +close(fd); +elog(ERROR, "Can not mmap object file : %s, errno = %i,%s\nThis error can happen if there is not enought space in memory to do the projection. Please mail ced...@2ndquadrant.fr with '[oscache] ENOMEM' as subject.", + filename, errno, strerror(errno)); +return 0; + } + + /* Prepare our vector containing all blocks informatio
Re: [HACKERS] Why not install pgstattuple by default?
Greg Smith wrote: Attached is a second patch to move a number of extensions from contrib/ to src/test/. Extensions there are built by the default build target, making installation of the postgresql-XX-contrib package unnecessary for them to be available. That was supposed to be contrib/ to src/extension; no idea where that test bit came from. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why not install pgstattuple by default?
Greg Smith wrote: Any packager who grabs the shared/postgresql/extension directory in 9.1, which I expect to be all of them, shouldn't need any changes to pick up this adjustment. For example, pgstattuple installs these files:
share/postgresql/extension/pgstattuple--1.0.sql
share/postgresql/extension/pgstattuple--unpackaged--1.0.sql
share/postgresql/extension/pgstattuple.control
And these are the same locations they were already at. ...and the bit I missed here is that there's a fourth file involved:
lib/postgresql/pgstattuple.so
If you look at a 9.1 spec file, such as http://svn.pgrpms.org/browser/rpm/redhat/9.1/postgresql/EL-6/postgresql-9.1.spec , you'll find:
%files contrib
...
%{pgbaseinstdir}/lib/pgstattuple.so
Which *does* require a packager change to relocate it from the postgresql-91-contrib package to the main server one. So the theory that a change here might happen without pushing a repackaging suggestion toward packagers is busted. This does highlight that some packaging guidelines would be needed here to complete this work. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Adding an example for replication configuration to pg_hba.conf
Two things that could be changed from this example to make it more useful: -Document at least a little bit more how this is different from the "all/all" rule. I can imagine users wondering "do I use this instead of the other one? In addition? Is it redundant if I have 'all' in there?" A little more description aiming at the expected FAQs would make this much more useful; something like the example below. -The default database is based on your user name, which is postgres in most packaged builds but not if you compile your own. I don't know whether it's practical to consider substituting that into this file, or if it's just enough to mention that as an additional doc comment. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
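To make the first point concrete, the sort of contrast worth spelling out in the sample file looks roughly like this (the addresses and auth method are only placeholders, not a recommendation):

# TYPE  DATABASE      USER      ADDRESS           METHOD

# Matches connections to every regular database, but NOT the special
# replication connections a standby's walreceiver makes:
host    all           all       192.168.12.0/24   md5

# Only a line whose database field is the "replication" keyword lets a
# streaming replication standby connect:
host    replication   postgres  192.168.12.10/32  md5

In other words, the replication line is needed in addition to an "all" line, not instead of it, and having "all" in there doesn't make it redundant.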
[HACKERS] 9.2 schedule
At the developer meeting last week: http://wiki.postgresql.org/wiki/PgCon_2011_Developer_Meeting there was an initial schedule for 9.2 hammered out and dutifully transcribed at http://wiki.postgresql.org/wiki/PostgreSQL_9.2_Development_Plan , and the one part I wasn't sure I had written down correctly I see Robert already fixed. There was a suggestion to add some publicity around the schedule for this release. There's useful PR value to making it more obvious to people that the main development plan is regular and time-based, even if the release date itself isn't fixed. The right time to make an initial announcement like that is "soon", particularly if a goal here is to get more submitted into the first 9.2 CF coming in only a few weeks. Anyone have changes to suggest before this starts working its way toward an announcement? The main parts of the discussion leading to changes from the 9.1 schedule, as I recall them, are: -Shooting for a slightly earlier branch/initial 9.2 CommitFest in June helps some with patch developer bit-rot, and may let developers who are focused on new features be productive for more of the year. The perception that new development is unwelcome between the final CF and version release seems to have overshot a bit from its intention. It's not the best period to chat on this list, with many people distracted by release goals. But some people just aren't in the right position to work on alpha/beta testing and stability work then, and the intention was not to block them from doing something else if that's the case. (A similar bit brought up during one of the patch prep talks is that review is also welcome outside of a CF, which isn't really clear) -The last CF of the release is tough to reschedule usefully due to concerns about December/beginning of the year holidays. -Given that work in August is particularly difficult to line up with common summer schedules around the world, having the other >1 month gap in the schedule go there makes sense. As for why there aren't more changes, it's hard to argue that the 9.1 process was broken such that it needs heavy modification. There were a large number of new features committed, people seem satisfied with the quality of the result so far, and the schedule didn't go too far off the rails. Outside of the manpower issues (which are serious), the sections that strained the most against problems seem really hard to identify with anything other than hindsight. The tension between "ship it" and "make the release better" is a really fundamental one to software development. The two main ideas that pop up regularly, organizing more CommitFests or making them shorter, are both hard to adopt without more active volunteers working on review (both at the initial and committer level) and an increase in available CF manager time. An idea I heard a couple of people suggest is that it would take a CF manager focused exclusively on the "patch chasing" parts of the role--not someone who is also trying to develop, commit, or review during the CF--before this would be feasible to consider. Some sort of relief for making that role less demanding is needed here, before it's practical to schedule those even more often. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.2 schedule
David Fetter wrote: I thought we'd agreed on the timing for the first CF, and that I was to announce it in the PostgreSQL Weekly News, so I did just that. Yes, and excellent. The other ideas were: -Publish information about the full schedule to some of the more popular mailing lists -Link to this page more obviously from postgresql.org (fixed redirect URL is probably the right approach) to "bless" it, and potentially improve its search rank too. The specific new problem being highlighted to work on here is that the schedule and development process is actually quite good as open-source projects go, but that fact isn't visible at all unless you're already on the inside of the project. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.2 schedule
On 05/24/2011 05:03 PM, Peter Eisentraut wrote: On mån, 2011-05-23 at 22:44 -0400, Greg Smith wrote: -Given that work in August is particularly difficult to line up with common summer schedules around the world, having the other>1 month gap in the schedule go there makes sense. You might want to add a comment on the schedule page about the June/July/August timing, because it looks like a typo, and the meeting minutes are also inconsistent how they talk about June and July. Yes, I was planning to (and just did) circle back to the minutes to make everything match up. It's now self-consistent, same dates as the schedule, and explains the rationale better. I'm not sure how to address the feeling of typo you have on the schedule page beyond that. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.2 schedule
On 05/24/2011 01:35 PM, Josh Berkus wrote: I would suggest instead adding a new page to postgresql.org/developer which lists the development schedule, rather than linking to that wiki page. Maybe on this page? http://www.postgresql.org/developer/roadmap Now that I look at the roadmap page again, I think all that would really be needed here is to tweak its wording a bit. If the description on there of the link to the wiki looked like this: General development information A wiki page about various aspects of the PostgreSQL development process, including detailed schedules and submission guidelines I think that's enough info to keep there. Putting more information back onto the main site when it can live happily on the wiki seems counterproductive to me; if there's concerns about things like vandalism, we can always lock the page. I could understand the argument that "it looks more professional to have it on the main site", but perception over function only goes so far for me. The idea of adding a link back to the wiki from the https://commitfest.postgresql.org/ page would complete being able to navigate among the three major sites here, no matter which people started at. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] New/Revised TODO? Gathering actual read performance data for use by planner
Michael Nolan wrote: Based on last year's discussion of this TODO item, it seems thoughts have been focused on estimating how much data is being satisfied from PG's shared buffers. However, I think that's only part of the problem. Sure, but neither it nor what you're talking about is the real gating factor on making an improvement here. Figuring out how to expose all this information to the optimizer so it can use it when planning is the hard part. Collecting a read-time profile is just one of the many ways you can estimate what's in cache, and each of the possible methods has good and bad trade-offs. I've been suggesting that people assume that's a solved problem--I'm pretty sure what you're proposing was done by Greg Stark once, with even a prototype built--and instead ask what you're going to do next once you have this data. This data would probably need to be kept separately for each table or index, as some tables or indexes may be mostly or fully in cache or on faster physical media than others, although in the absence of other data about a specific table or index, data about other relations in the same tablespace might be of some use. This is the important part. Model how the data needs to get stored such that the optimizer can make decisions using it, and I consider it easy to figure out how it will get populated later. There are actually multiple ways to do it, and it may end up being something where people plug in an implementation that fits their workload, rather than just having one available. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
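As a strawman for the storage side, the kind of thing I have in mind is not much more complicated than this sketch (entirely hypothetical--there is no such catalog, structure, or setting today):

#include "postgres.h"
#include "storage/relfilenode.h"	/* ForkNumber */
#include "utils/timestamp.h"		/* TimestampTz */

/*
 * One entry per relation fork, refreshed by whatever collection method is
 * plugged in (mincore, shared_buffers scan, read-time profile, ...).  The
 * planner only needs a number like cached_fraction to be available when it
 * costs a scan of this relation.
 */
typedef struct RelCacheStats
{
	Oid			relid;				/* pg_class OID of the table or index */
	ForkNumber	forkNum;			/* normally MAIN_FORKNUM */
	float4		cached_fraction;	/* 0.0 - 1.0: blocks believed cached */
	TimestampTz	last_refresh;		/* when this estimate was collected */
} RelCacheStats;

/*
 * A cost model could then blend that into its per-page charge, for example:
 *
 *   page_cost = cached_fraction * cached_page_cost +
 *               (1.0 - cached_fraction) * random_page_cost
 *
 * where cached_page_cost is a hypothetical new setting, falling back to
 * tablespace or global defaults when no entry exists for a relation.
 */

Whether something like that lives in pg_class, the statistics system, or a separate shared structure is exactly the sort of design discussion that's more useful than arguing over collection methods.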
Re: [HACKERS] tackling full page writes
On 05/24/2011 04:34 PM, Robert Haas wrote: we could name this feature "partial full page writes" and enable it either with a setting of full_page_writes=partial +1 to overloading the initial name, but only if the parameter is named 'maybe', 'sometimes', or 'perhaps'. I've been looking into a similar refactoring of the names here, where we bundle all of these speed-over-safety things (fsync, full_page_writes, etc.) into one control so they're easier to turn off at once. Not sure if it should be named "web_scale" or "do_you_feel_lucky_punk". 3. Going a bit further, Greg proposed the idea of ripping out our current WAL infrastructure altogether and instead just having one WAL record that says "these byte ranges on this page changed to have these new contents". The main thing that makes this idea particularly interesting to me, over the other two, is that it might translate well into using sync_file_range to get a finer-grained fsync call on Linux than is currently possible. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
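For anyone who hasn't run into it, the Linux call I'm referring to can push out just the bytes a change touched rather than the whole file. A minimal sketch of the idea, assuming nothing about how it would actually be wired into the backend:

#define _GNU_SOURCE
#include <fcntl.h>

/*
 * Write back and wait on only the given byte range of an already-written
 * file, instead of an fsync() that covers the whole thing.  Note that
 * sync_file_range() does not flush file metadata or the drive's write
 * cache, so it is not a drop-in fsync() replacement for durability; the
 * appeal here is purely the finer granularity.
 */
static int
flush_byte_range(int fd, off_t offset, off_t nbytes)
{
	return sync_file_range(fd, offset, nbytes,
						   SYNC_FILE_RANGE_WAIT_BEFORE |
						   SYNC_FILE_RANGE_WRITE |
						   SYNC_FILE_RANGE_WAIT_AFTER);
}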
Re: [HACKERS] patch for new feature: Buffer Cache Hibernation
On 05/07/2011 03:32 AM, Mitsuru IWASAKI wrote: For 1, I've just finished my work. The latest patch is available at: http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110507.patch Reminder here--we can't accept code based on it being published to a web page. You'll need to e-mail it to the pgsql-hackers mailing list to be considered for the next PostgreSQL CommitFest, which is starting in a few weeks. Code submitted to the mailing list is considered a release of it to the project under the PostgreSQL license, which we can't just assume for things when given only a URL to them. Also, you suggested you were out of time to work on this. If that's the case, we'd like to know that so we don't keep cc'ing you about things in expectation of an answer. Someone else may pick this up as a project to continue working on. But it's going to need a fair amount of revision before it matches what people want here, and I'm not sure how much of what you've written is going to end up in any commit that may happen from this idea. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench--new transaction type
On 05/29/2011 03:11 PM, Jeff Janes wrote: If you use "pgbench -S -M prepared" at a scale where all data fits in memory, most of what you are benchmarking is network/IPC chatter, and table locking. If you profile it, you'll find a large amount of the time is actually spent doing more mundane things, like memory allocation. The network and locking issues are really not the bottleneck at all in a surprising number of these cases. Your patch isn't really dependent on your being right about the cause here, which means this doesn't impact your submission any. Just wanted to clarify that what people expect is slowing things down in this situation and what actually shows up when you profile are usually quite different things. I'm not sure whether this feature makes sense to add to pgbench, but it's interesting to have it around for developer testing. The way you've built this isn't messing with the code too much to accomplish that, and your comments about it being hard to do this using "-f" are all correct. Using a custom test file makes it easy to shoot yourself in the foot unless you take care to avoid it. some numbers for single client runs on 64-bit AMD Opteron Linux:
12,567 sps under -S
19,646 sps under -S -M prepared
58,165 sps under -P
10,000 is too big of a burst to run at once. The specific thing I'm concerned about is what happens if you try this mode when using "-T" to enforce a runtime limit, and your SELECT rate isn't high. If you're only doing 100 SELECTs/second because your scale is big enough to be seek bound, you could overrun by nearly two minutes. I think this is just a matter of turning the optimization around a bit. Rather than starting with a large batch size and presuming that's ideal, in this case a different approach is really needed. You want the smallest batch size that gives you a large win here. My guess is that most of the gain here comes from increasing the batch size to something in the 10 to 100 range; jumping to 10K is probably overkill. Could you try some smaller numbers and see where the big increases start to fall off? Obligatory code formatting nitpick: try not to overrun the right margin any further than the existing code around line 1779, where you add more ttype comments. That needs to turn into a multi-line comment. Rest of the patch looks fine, and don't worry about resubmitting for that; just something to tweak on your next update. A slightly more descriptive filename for the patch would help out those of us who look at a lot of pgbench patches, too. Something like "pgbench_loop_v1.patch" for example would make it easier for me to remember which patch this was by its name. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Getting a bug tracker for the Postgres project
On 05/29/2011 05:17 AM, Peter Eisentraut wrote: Here is a list to choose from: http://en.wikipedia.org/wiki/Comparison_of_issue_tracking_systems I turned this into a spreadsheet to sort and prune more easily; if anyone wants that let me know, it's not terribly useful beyond what I'm posting here. 44 total, 16 that are open-source. I would say that having an e-mail interface is the next major cut to make. While distasteful, it's possible for this project to adopt a solution that doesn't use PostgreSQL, and one interesting candidate is in that category. It's not feasible to adopt one that doesn't integrate well with e-mail though. List of software without listed e-mail integration: Fossil, GNATS, Liberum Help Desk, MantisBT, org-mode, Flyspray, ikiwiki, Trac. The 8 F/OSS programs left after that filter are:
OTRS
Debbugs
Request Tracker
Zentrack
LibreSource
Redmine
Roundup
Bugzilla
The next two filters you might apply are:
Support for Git: Redmine, Bugzilla
PostgreSQL back-end: OTRS, Request Tracker, LibreSource, Redmine, Roundup, Bugzilla
There are a couple of additional nice-to-have items I saw on the feature list, and applying them keeps narrowing things down to just Redmine & Bugzilla. Those are the two I've ended up using the most on PostgreSQL-related projects, too, so that isn't a surprise to me. While I'm not a strong fan of Redmine, it has repeatedly been the lesser of the evils available here for three different companies I've worked at or dealt with. Greg Stark is right that Debbugs has a lot of interesting features similar to the desired workflow here. It's not tied to just Debian anymore; the GNU project is also using it now. And the database backend isn't that terrible to consider: it's flat files with a BerkeleyDB index built on top. I think if it were perfect except for that, it would still be worth considering. Debbugs is far from a general-purpose solution though, so any customization to support differences in this project's workflow would likely end up being one-off hacks. The VCS support might be a problem, but I've gotten the impression that git is increasingly popular for other Debian work. Since the program is in Perl, I can't imagine it's a gigantic task to switch that out, and it's probably a change other people would like to see. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] How can I check the treatment of bug fixes?
On 05/27/2011 08:36 AM, MauMau wrote:
> I posted a patch for bug #6011 to pgsql-hackers several days ago. How
> can I check the status of bug fixes? I'm worried that the patch might
> be forgotten, because bug #5842 was missed for two months until Bruce
> noticed it.
Discussion here seems to have wandered far away from useful suggestions for you; let's see if it's possible to return to that. The best way to confirm when a bug is resolved is to subscribe to the pgsql-committers mailing list. If a commit for this fix appears, odds are good the original bug number will be referenced. Even if it isn't, you may recognize it via its description. Until you see that, the bug is almost certainly still open. Bugs that are considered to impact the current version under development are sometimes listed at http://wiki.postgresql.org/wiki/Open_Items Adding a bug there that's not really specific to the new version may not be considered good form by some. It is the closest thing to an open bug tracker around though, and adding items there means they won't be forgotten about; it's checked regularly by developers considering when it's a good time to release another alpha or beta. In my mind, clarifying under what circumstances it's appropriate for people to put a bug onto the Open Items list would be a useful way to spend a little time. Certainly more productive than firing more bullets at the unkillable zombie that is bug tracking software. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for new feature: Buffer Cache Hibernation
On 06/01/2011 03:03 AM, Tatsuo Ishii wrote: Also I really want to see the performance comparison between these two approaches in the real world database. Well, tell me how big of a performance improvement you want PgFincore to win by, and I'll construct a benchmark where it does that. If you pick a database size that fits in the OS cache, but is bigger than shared_buffers, the difference between the approaches is huge. The opposite--trying to find a case where this hibernation approach wins--is extremely hard to do. Anyway, further discussion of this patch is kind of a waste right now. We've never gotten the patch actually sent to the list to establish a proper contribution (just pointers to a web page), and no feedback on that or other suggestions for redesign (extension repackaging, GUC renaming, removing unused code, and a few more). Unless the author shows up again in the next two weeks, this is getting bounced back with no review as code we can't use. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers