[HACKERS] Foreign Visual Studio builds

2010-10-14 Thread Greg Smith
We got an interesting documentation comment left by Christian Freund 
recently, regarding 
http://www.postgresql.org/docs/9.0/interactive/install-windows-full.html 
; it says:


Regarding 16.1.3 - "perl mkvcbuild.pl"

In case you use a German version of VC change line 69 in "Solution.pm" to:

   if ($line !~ /^Microsoft\s*\(R\) Visual C\+\+-Projekt-Generator - 
\D+(\d+)\.00\.\d+/)



I'm going to approve the doc comment so that it might help someone else 
who runs into this in the short term, but I thought it was worth 
sharing as something that might be improved in the build code instead.  
That regex seems a bit too specific.


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Debugging initdb breakage

2010-10-17 Thread Greg Smith

David Fetter wrote:

Where should we preserve this, other than the mailing list archives?
  


http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD 
has most of the other trivia in this area, so I just added Alvaro's 
technique to the bottom of it with a quick intro to add some context.


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Simplifying replication

2010-10-19 Thread Greg Smith

Josh Berkus wrote:
It is critical that we make replication easier to set up, administrate 
and monitor than it currently is.  In my conversations with people, 
this is more important to our users and the adoption of PostgreSQL 
than synchronous replication is.


You should enjoy one of the patches we're furiously working on then, 
which is aiming at some of the administration and monitoring pieces 
here.  I have my own grand vision of how easy replication should be to 
setup too.  Visions and plans are nice, but building functional pieces 
of them and delivering them to the community is what actually moves 
PostgreSQL forward.  So far, multiple people have done that for sync 
rep, and what we're supposed to be focused on at this stage in the 
development cycle is finishing the work related to the open CommitFest 
item that includes that.


I find this launch into a new round of bike-shedding a bit distracting.  
If you want this to be easier to use, which it's obvious to any observer 
it should be because what's delivered in 9.0 is way too complicated, 
please work on finding development resources to assign to that problem.  
Because that's the bottleneck on simplifying things, not ideas about 
what to do.  I would recommend finding or assigning a developer to work 
on integrating base backup in to the streaming protocol as the biggest 
single thing that would improve the built-in replication.  All of the 
rest of the trivia about what knobs to set and such are tiny details 
that make for only a minor improvement until that's taken care of.


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Simplifying replication

2010-10-19 Thread Greg Smith

Josh Berkus wrote:
*shrug*.  Robert asked me to write it up for the list based on the 
discussions around synch rep.  Now you're going to bash me for doing so?


Sorry, next time I'll make sure to bash Robert too.  I don't have any 
problems with the basic ideas you're proposing, just concerns about when 
the right time to get into that whole giant subject is and who is going 
to work on it.


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] max_wal_senders must die

2010-10-19 Thread Greg Smith

Josh Berkus wrote:
Under what bizarre set of circumstances would anyone have runaway 
connections from replicas to the master?


Cloud computing deployments where additional replicas are brought up 
automatically in response to demand.  It's easy to imagine a situation 
where a standby instance is spawned, starts to sync, and that additional 
load triggers *another* standby to come on board; repeat until the 
master is doing nothing but servicing standby sync requests.  
max_wal_senders provides a safety valve for that.


I think Magnus's idea to bump the default to 5 triages the worst of the 
annoyance here, without dropping the feature (which has uses) or waiting 
for new development to complete.  I'd be in favor of just committing 
that change right now, before it gets forgotten about; then even if 
nobody else gets around to further work, at least something will have 
improved here for 9.1.


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] max_wal_senders must die

2010-10-19 Thread Greg Smith

Josh Berkus wrote:

Well, now that you mention it, I also think that "hot standby" should be
the default.  Yes, I know about the overhead, but I also think that the
number of our users who want easy replication *far* outnumber the users
who care about an extra 10% WAL overhead.
  


I think this whole situation is similar to the resistance to increasing 
default_statistics_target.  There's additional overhead added by 
enabling both of these settings, in return for making it more likely for 
the average person to see useful behavior by default.  If things are 
rejiggered so the advanced user has to turn things off in order to get 
optimal performance when not using these features, in return for the 
newbie being more likely to get things working in basic form, that's 
probably a net win for PostgreSQL advocacy.  But much like 
default_statistics_target, there needs to be some more formal work done 
on quantifying just how bad each of these overheads really is first.  
We lost 3-7% on multiple simple benchmarks between 8.3 and 8.4[1] 
because of that retuning for ease of use on real-world workloads, and 
that's not something you want to repeat too often.


[1] http://suckit.blog.hu/2009/09/29/postgresql_history and the tests 
Jignesh did while at Sun


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] max_wal_senders must die

2010-10-20 Thread Greg Smith

Josh Berkus wrote:

If we could agree on some workloads, I could run some benchmarks.  I'm
not sure what those would be though, given that COPY and ALTER TABLE
aren't generally included in most benchmarks.


You can usefully and easily benchmark this by timing a simple pgbench 
initialization at a decently large scale.  The COPY used to populate the 
giant accounts table takes advantage of the WAL bypass fast path if 
available, and you can watch performance tank the minute one of the 
options that disables it is turned on.
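
To see the same effect without a full pgbench run, here's a minimal 
sketch in plain SQL; the table name and row count are made up for 
illustration, and it assumes a 9.0-style server where wal_level controls 
whether the bulk-load fast path is available:

\timing on
-- With wal_level = minimal and no archiving or streaming replication
-- configured, CREATE TABLE AS goes through the same WAL-bypass fast path
-- as the COPY pgbench -i uses:  the new rows skip WAL and the table file
-- is just synced at commit.
CREATE TABLE bulk_load_test AS
    SELECT i AS id, repeat('x', 84) AS filler
    FROM generate_series(1, 1000000) AS i;
DROP TABLE bulk_load_test;
-- Rerun after setting wal_level = archive or hot_standby (restart
-- required) and the same statement gets noticeably slower, because every
-- row now has to be written to WAL as well.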


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ask for review of MERGE

2010-10-21 Thread Greg Smith

Robert Haas wrote:

I think the right way to write UPSERT is something
along the lines of:

MERGE INTO Stock t USING (VALUES (10, 1)) s(item_id, balance) ON
s.item_id = t.item_id ...
  


That led in the right direction; after a bit more fiddling I was finally 
able to get something that does what I wanted:  a single-table UPSERT 
implemented with this MERGE implementation.  Here's a log of a test 
session, suitable for eventual inclusion in the regression tests:


CREATE TABLE Stock(item_id int UNIQUE, balance int);
INSERT INTO Stock VALUES (10, 2200);
INSERT INTO Stock VALUES (20, 1900);
SELECT * FROM Stock ORDER BY item_id;

 item_id | balance
---------+---------
      10 |    2200
      20 |    1900

MERGE INTO Stock t
USING (VALUES(10,100)) AS s(item_id,balance)
ON s.item_id=t.item_id
WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;

MERGE 1

SELECT * FROM Stock ORDER BY item_id;
 item_id | balance
---------+---------
      10 |    2300
      20 |    1900

MERGE INTO Stock t
USING (VALUES(30,2000)) AS s(item_id,balance)
ON s.item_id=t.item_id
WHEN MATCHED THEN UPDATE SET balance=t.balance + s.balance
WHEN NOT MATCHED THEN INSERT VALUES(s.item_id,s.balance)
;

MERGE 1
SELECT * FROM Stock ORDER BY item_id;
 item_id | balance
---------+---------
      10 |    2300
      20 |    1900
      30 |    2000

I'm still a little uncertain as to whether any of my other examples 
should have worked under the spec but just didn't work here, but I'll 
worry about that later.


Here's what the query plan looks like on a MATCH:

Merge  (cost=0.00..8.29 rows=1 width=22) (actual time=0.166..0.166 
rows=0 loops=1)

  Action 1: Update When Matched
  Action 2: Insert When Not Mactched
  MainPlan:
  ->  Nested Loop Left Join  (cost=0.00..8.29 rows=1 width=22) (actual 
time=0.050..0.061 rows=1 loops=1)
->  Values Scan on "*VALUES*"  (cost=0.00..0.01 rows=1 width=8) 
(actual time=0.009..0.010 rows=1 loops=1)
->  Index Scan using stock_item_id_key on stock t  
(cost=0.00..8.27 rows=1 width=14) (actual time=0.026..0.030 rows=1 loops=1)

  Index Cond: ("*VALUES*".column1 = item_id)
Total runtime: 0.370 ms


And here's a miss:

Merge  (cost=0.00..8.29 rows=1 width=22) (actual time=0.145..0.145 
rows=0 loops=1)

  Action 1: Update When Matched
  Action 2: Insert When Not Mactched
  MainPlan:
  ->  Nested Loop Left Join  (cost=0.00..8.29 rows=1 width=22) (actual 
time=0.028..0.033 rows=1 loops=1)
->  Values Scan on "*VALUES*"  (cost=0.00..0.01 rows=1 width=8) 
(actual time=0.004..0.005 rows=1 loops=1)
->  Index Scan using stock_item_id_key on stock t  
(cost=0.00..8.27 rows=1 width=14) (actual time=0.015..0.015 rows=0 loops=1)

  Index Cond: ("*VALUES*".column1 = item_id)
Total runtime: 0.255 ms

Next steps here:
1) Performance/concurrency tests against trigger-based UPSERT approach.
2) Finish bit rot cleanup against HEAD.
3) Work out more complicated test cases to try and find more unexpected 
behavior, edge cases, and general bugs.


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ask for review of MERGE

2010-10-22 Thread Greg Smith
There are now two branches of MERGE code in review progress available.  
http://github.com/greg2ndQuadrant/postgres/tree/merge-unstable has the 
bit-rotted version that doesn't quite work against HEAD yet, while 
http://github.com/greg2ndQuadrant/postgres/tree/merge is what I'm still 
testing against until I get that sorted out.


Attached is a tar file containing an initial concurrent MERGE test 
case.  What it does is create a pgbench database with a scale of 1.  
Then it runs a custom test that does an UPSERT using MERGE against 
pgbench_accounts, telling pgbench the database scale is actually 2.  
This means that approximately half of the statements executed will hit 
the MATCHED side of the merge and update existing rows, while half will 
hit the NOT MATCHED one and create new account records.  I did some 
small tests using pgbench's debug mode, where you see all the statements 
it executes, and all the output looked correct.  Successive test runs 
are not deterministically perfect performance comparisons, but with 
enough transactions I hope that averages out.


For comparison's sake, the tar file also includes an almost identical 
test case that does the same thing using the pl/pgsql example function 
for UPSERT from the PostgreSQL manual instead.  You just run 
test-merge.sh or test-function.sh and it runs the whole test, presuming 
you have built and installed pgbench and don't mind that it will blow 
away any database named pgbench you already have.
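
For reference, that manual example is the familiar loop-around-exception 
approach.  Adapted to the same pgbench_accounts table used by these 
tests (my adaptation for illustration, not necessarily the exact 
function in the tar file), it looks roughly like this:

CREATE OR REPLACE FUNCTION upsert_account(p_aid int, p_delta int) RETURNS void AS
$$
BEGIN
    LOOP
        -- First try to update an existing account row.
        UPDATE pgbench_accounts SET abalance = abalance + p_delta
            WHERE aid = p_aid;
        IF found THEN
            RETURN;
        END IF;
        -- Not there, so try to insert it.  If another session inserts the
        -- same aid first, we get a unique_violation and loop back to retry
        -- the UPDATE.
        BEGIN
            INSERT INTO pgbench_accounts (aid, bid, abalance, filler)
                VALUES (p_aid, 1, p_delta, '');
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- Do nothing; fall through and try the UPDATE again.
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;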


Since the current MERGE implementation is known not to be optimized for 
concurrency, my main goal here wasn't to see how fast it was.  That 
number is interesting though.  With the sole postgresql.conf change of:


checkpoint_segments=32

And a debug/assertion build using 8 concurrent clients, I got 1607 TPS 
of UPSERT out of the trigger approach @ 6MB/s of writes to disk, while 
the MERGE one delivered 988 TPS @ 4.5MB/s of writes.  Will explore this 
more later.


This did seem to find a bug in the implementation after running for a while:

TRAP: FailedAssertion("!(epqstate->origslot != ((void *)0))", File: 
"execMain.c", Line: 1732)


Line number there is relative to what you can see at 
http://github.com/greg2ndQuadrant/postgres/blob/merge/src/backend/executor/execMain.c
and that's a null pointer check at the beginning of 
EvalPlanQualFetchRowMarks.  Haven't explored why this happened or how 
repeatable this is, but since Boxuan was looking for some bugs to chase 
I wanted to deliver him one to chew on.


While the performance doesn't need to be great in V1, there needs to be 
at least some minimal protection against concurrency issues before this 
is commit quality.  Will continue to shake this code out looking for 
them now that I have some basic testing that works for doing so.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us




merge-test-v1.tar
Description: Unix tar archive

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ask for review of MERGE

2010-10-23 Thread Greg Smith

Marko Tiikkaja wrote:
What's been bothering me is that so far there has not been an 
agreement on whether we need to protect against concurrency issues or 
not.  In fact, there has been almost no discussion about the 
concurrency issues which AIUI have been the biggest single reason we 
don't have MERGE already.


Sure there were; they just happened a long time ago.  I tried to 
summarize the previous discussion at 
http://wiki.postgresql.org/wiki/SQL_MERGE with the following:


"PostgreSQL doesn't have a good way to lock access to a key value that 
doesn't exist yet--what other databases call key range 
locking...Improvements to the index implementation are needed to allow 
this feature."


You can sift through the other archive links in there; the discussion 
leading up to that conclusion is in there somewhere.  And we had a 
discussion of this at the last developer's meeting where this was 
identified as a key issue to sort out before this was done; see the 
table at the end of 
http://wiki.postgresql.org/wiki/PgCon_2010_Developer_Meeting and note 
the warning associated with this feature.


The deadlock on developing this feature has been that there's little 
benefit to building key range locking on its own, or even a good way to 
prove that it works, without a visible feature that uses it.  It's much 
easier to justify development time on if the rest of MERGE is known to 
work, and just needs that plugged in to improve concurrency.  And since 
Boxuan appeared at the right time to do the syntax and implementation 
part first as well, that's the order it's ended up happening in.  We're 
only making the concurrency part a second priority right now in order to 
break the development into manageable chunks.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ask for review of MERGE

2010-10-24 Thread Greg Smith

Robert Haas wrote:

I am also wondering exactly what the semantics are supposed to be
under concurrency.  We can't assume that it's OK to treat WHEN NOT
MATCHED as WHEN MATCHED if a unique-key violation is encountered.  The
join condition could be something else entirely.  I guess we could
scan the target table with a dirty snapshot and block on any in-doubt
tuples, similar to what EXCLUDE constraints do internally, but
throwing MVCC out the window doesn't seem right either.
  


You've answered Tom's question of why not to just INSERT and trap the 
error here.  Presuming that a unique-key violation is what will kick 
back in this situation, and simply following the other branch when it 
does, doesn't cover all the ways this can get used.


I'm thinking of this problem now as being like the one you get when 
a concurrent UPDATE happens.  To quote from the docs here:


"a target row might have already been updated (or deleted or locked) by 
another concurrent transaction by the time it is found. In this case, 
the would-be updater will wait for the first updating transaction to 
commit or roll back (if it is still in progress). If the first updater 
rolls back, then its effects are negated and the second updater can 
proceed with updating the originally found row. If the first updater 
commits, the second updater will ignore the row if the first updater 
deleted it, otherwise it will attempt to apply its operation to the 
updated version of the row. The search condition of the command (the 
WHERE clause) is re-evaluated to see if the updated version of the row 
still matches the search condition. If so, the second updater proceeds 
with its operation using the updated version of the row."


What I think we can do with adding a key reservation is apply the same 
sort of logic to INSERTs too, given a way to lock an index entry before 
the row itself is complete.  If MERGE hits a WHERE condition that finds 
a key lock entry in the index, it has to sit and wait for that other 
command to finish.  There isn't any other sensible behavior I can see in 
that situation, just like there isn't one for UPDATE right now.


If the transaction that was locking the index entry commits, it has to 
start over with re-evaluating the WHEN MATCHED part (the equivalent of 
the WHERE in the UPDATE case).  But it shouldn't need to do a new JOIN 
again, because that condition was proven to be met already by the lock 
squatting on that index key, without even having the rest of the row 
there yet to check its data.  If the other entry commits, it would then 
follow the WHEN MATCHED path and in the UPSERT case do an UPDATE.  If 
the transaction that had the key reservation rolls back, the 
re-evaluation of the MERGE WHEN MATCHED will fail, at which point it 
follows the WHEN NOT MATCHED INSERT path.  As it will now be the locker 
of the key reservation it had been waiting on earlier, it will be free 
to do the INSERT with no concerns about race conditions or retries.  In 
the SERIALIZABLE case, again just try to follow the same slightly 
different rules that exist for concurrent UPDATE commands.


There may be a tricky point here that we will need to warn people that 
UPSERT implementations need to make sure they only reference the columns 
of the primary key in the MERGE conditions for this to work as expected, 
because otherwise they might get back a unique key violation error 
anyway.  I don't know how other databases deal with that particular 
problem.  I have considered following the somewhat distasteful path of 
installing SQL Server or Oracle here just to see how they handle some of 
the pathological test cases I'm now thinking about for this command.


This particular weak spot in MVCC is already there for UPDATE, and this 
approach seems to me to be the most logical way to extend the same basic 
implementation to also eliminate some concurrent MERGE race conditions, 
including the most common one the UPSERT case is stuck with.  But I have 
no intention of halting work on the basic MERGE to wait for this problem 
to be solved even if I'm completely wrong about what I've outlined 
above.  That style of thinking--don't even start until a perfect 
solution for every possible problem is known--is something I consider a 
problem in this community's development model, one that has contributed 
to why no work has been done on this feature until now.  This one splits 
nicely into "get the implementation working" and "improve the 
concurrency" parts, and I haven't heard a good reason yet why those need 
to be coupled together.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ask for review of MERGE

2010-10-24 Thread Greg Smith

Robert Haas wrote:

Well, there's no guarantee that any index at all exists on the target
table, so any solution based on index locks can't be fully general.
  


Sure, but in the most common use case I think we're targeting at first, 
no indexes means there are no unique constraints either.  I don't 
think people can reasonably expect MERGE or anything else to 
guarantee they won't accidentally create two rows that conflict under 
the terms of some non-database-enforced key.


I am too brain-dead this particular Sunday to think of exactly how to 
deal with the 3-tuple case you described, except to say that I think 
it's OK for complicated situations to give up and throw a serialization 
error.  I'm already collecting a list of pathological tests and will try 
to add something based on your problem case, then see what we can do 
with it later.


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sorted writes for checkpoints

2010-10-29 Thread Greg Smith

Itagaki Takahiro wrote:

When I submitted the patch, I tested it on disk-based RAID-5 machine:
http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php
But there were no additional benchmarking reports at that time. We still
need benchmarking before we re-examine the feature. For example, SSD and
SSD-RAID was not popular at that time, but now they might be considerable.
  


I did multiple rounds of benchmarking that, but none of it showed any 
improvement, so I didn't bother reporting the results in detail.  I have 
recently figured out why the performance testing I did of that earlier 
patch probably failed to produce useful results on my system back then, 
though.  It relates to trivia around how ext3 handles fsync that's well 
understood now (the whole cache flushes out when one comes in), but 
wasn't back then.


We have a working set of patches here that both rewrite the checkpoint 
logic to avoid several larger problems with how it works now, as well as 
adding instrumentation that makes it possible to directly measure and 
graph whether methods such as sorting writes provide any improvement or 
not to the process.  My hope is to have those all ready for initial 
submission as part of CommitFest 2010-11, as the main feature addition 
from myself toward improving 9.1.


I have a bunch of background information about this I'm presenting at 
PGWest next week, after which I'll start populating the wiki with more 
details and begin packaging the code too.  I had hoped to revisit the 
checkpoint sorting details after that.  You or Jeff are welcome to 
try your own tests in that area; I could use the help.  But I think my 
measurement patches will help you with that considerably once I release 
them in another couple of weeks.  Seeing a graph of sync latency times 
for each file is very informative for figuring out whether a change did 
something useful, more so than just staring at total TPS results.  Such 
latency graphs are what I've recently started producing here, with some 
server-side changes that then feed into gnuplot.


The idea of making something like the sorting logic into a pluggable 
hook seems like a waste of time to me, particularly given that the 
earlier implementation really needed to be allocated a dedicated block 
of shared memory to work well IMHO (and I believe that's still the 
case).  That area isn't where the real problems are at here anyway, 
especially on large memory systems.  How the sync logic works is the 
increasingly troublesome part of the checkpoint code, because the 
problem it has to deal with grows proportionately to the size of the 
write cache on the system.  Typical production servers I deal with have 
about 8X as much RAM now as they did in 2007 when I last investigated 
write sorting.  Regular hard drives sure haven't gotten 8X faster since 
then, and battery-backed caches (which used to have enough memory to 
absorb a large portion of a checkpoint burst) have at best doubled in size.


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] SR fails to send existing WAL file after off-line copy

2010-10-31 Thread Greg Smith
Last week we got this report from Matt Chesler:  
http://archives.postgresql.org/pgsql-admin/2010-10/msg00221.php that he 
was getting errors when trying to do a simple binary replication test.  
The problem is that what appears to be a perfectly good WAL segment 
doesn't get streamed to the standby.  No one responded. 

One of our testers just ran into the same thing.  I just investigated, 
and I'm baffled as to what's going on myself.  I can't tell if this is a 
bug or an under-documented restriction, but this makes two reports of 
the problem now.  (Mine is happening on a standard 9.0.0 RPM set; I 
didn't notice any changes in 9.0.1 that would impact this, and I'm 
afraid to upgrade while I have a repeatable test case for this sitting 
here.)


The setup is intended to get a simple test replication setup going 
without even having to do the whole pg_start_backup shuffle, by copying 
the whole server directory when it's down.  Basic steps are:


-Follow the first set of instructions at 
http://wiki.postgresql.org/wiki/Binary_Replication_Tutorial to set up a 
master compatible with replication, then duplicate it after stopping it 
using rsync.  Note that you may have to manually create an empty pg_xlog 
directory on the standby, depending on what was there before you started 
replication. 

To rule out one possible source of problems here, I made an additional 
change on the master not listed there:


[mas...@pyramid pg_log]$ psql -d postgres -c "show wal_keep_segments"
 wal_keep_segments
-------------------
 10

I wondered if having this set to 0 (the default) was causing the issue, 
thinking perhaps it doesn't do any looking for existing segments at all 
in that situation.  Problem still happens for me.


-Create a recovery.conf pointing to the master as described in the tutorial.

-Start the standby.  Make sure that it has reached the point where it's 
requesting WAL segments from the master; you want to see it looping 
doing periodic "FATAL:  could not connect to the primary server: could 
not connect to server: Connection refused" before touching the master.


-Start the master

What I expect to happen now is that the current WAL file that was in 
progress at the point the data was copied over gets streamed over.  That 
doesn't seem to happen.  Instead, I see this on the standby:


LOG:  streaming replication successfully connected to primary
FATAL:  could not receive data from WAL stream: FATAL:  requested WAL 
segment 0001 has already been removed


This on the master:

LOG:  replication connection authorized: user=rep host=127.0.0.1 port=52571
FATAL:  requested WAL segment 0001 has already been 
removed


Which is confusing because that file is certainly on the master still, 
and hasn't even been considered archived yet much less removed:


[mas...@pyramid pg_log]$ ls -l $PGDATA/pg_xlog
-rw--- 1 master master 16777216 Oct 31 16:29 0001
drwx-- 2 master master 4096 Oct  4 12:28 archive_status
[mas...@pyramid pg_log]$ ls -l $PGDATA/pg_xlog/archive_status/
total 0

So why isn't SR handing that data over?  Is there some weird unhandled 
corner case this exposes, but that wasn't encountered by the systems the 
tutorial was tried out on?  I'm not familiar enough with the SR 
internals to reason out what's going wrong myself yet.  I wanted to 
validate that Matt's report wasn't a unique one though, with a bit more 
detail included about the state the system gets into, and one potential 
fix (increasing wal_keep_segments) already tried without improvement.
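
For anyone trying to reproduce this, one quick way to confirm that the 
segment the standby is requesting really is the one the master is still 
writing (rather than something already recycled) is to ask the master 
directly, using the standard 9.0 functions:

-- On the master:  name of the WAL segment currently being written
SELECT pg_xlogfile_name(pg_current_xlog_location());

-- Compare that against the segment named in the standby's "has already
-- been removed" error and against what's actually sitting in pg_xlog.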


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SR fails to send existing WAL file after off-line copy

2010-11-01 Thread Greg Smith

Heikki Linnakangas wrote:
Yes, indeed there is a corner-case bug when you try to stream the very 
first WAL segment, with log==seg==0.


I confirmed that the bug exists in only this case by taking my problem 
install and doing this:


psql -d postgres -c "checkpoint; select pg_switch_xlog();"

To force it to the next xlog file.  With only that change, everything 
else then works.  So we'll just need to warn people about this issue and 
provide that workaround, as something that only trivial new installs 
without much data loaded into them are likely to run into, until 9.0.2 
ships with your fix in it.  I'll update the docs on the wiki 
accordingly, once I've recovered from this morning's flight out West.


I forgot to credit Robert Noles here for rediscovering this bug on one 
of our systems and bringing it to my attention.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

2010-11-05 Thread Greg Smith

Tom Lane wrote:

If open_dsync is so bad for performance on Linux, maybe it's bad
everywhere?  Should we be rethinking the default preference order?
  


And I've seen the expected sync write performance gain over fdatasync on 
a system with a battery-backed cache running VxFS on Linux, because 
working open_[d]sync means O_DIRECT writes bypassing the OS cache, and 
therefore reducing cache pollution from WAL writes.  This doesn't work 
by default on Solaris because they have a special system call you have 
to execute for direct output, but if you trick the OS into doing that 
via mount options you can observe it there too.  The last serious tests 
of this area I saw on that platform were from Jignesh, and they 
certainly didn't show a significant performance regression running in 
sync mode.  I vaguely recall seeing a set once that showed a minor loss 
compared to fdatasync, but it was too close to make any definitive 
statement about reordering.


I haven't seen any report yet of a serious performance regression in the 
new Linux case that was written by someone who understands fully how 
fsync and drive cache flushing are supposed to interact.  It's been 
obvious for a year now that the people behind the Phoronix reports about 
this had no idea what they were actually testing.  I didn't see anything from 
Marti's report that definitively answers whether this is anything other 
than Linux finally doing the right thing to flush drive caches out when 
sync writes happen.  There may be a performance regression here related 
to WAL data going out in smaller chunks than it used to, but in all the 
reports I've seen, that hasn't been isolated well enough to consider 
making any changes yet--to tell whether it's a performance loss or a 
reliability gain we're seeing.


I'd like to see some output from the 9.0 test_fsync on one of these 
RHEL6 systems without a battery-backed write cache as a first step 
here.  That should start to shed some light on what's
happening.  I just bumped up the priority on the pending upgrade of my 
spare laptop to the RHEL6 beta I had been trying to find time for, so I 
can investigate this further myself.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

2010-11-07 Thread Greg Smith

Tom Lane wrote:

I'm hoping that Greg Smith will take the lead on testing
this, since he seems to have spent the most time in the area so far.
  


It's no coincidence that the chapter of my book I convinced the 
publisher to release as a sample is the one that covers this area; this 
mess has been visibly approaching for some time now.  I'm going to put 
RHEL6 onto a system and start collecting some proper slowdown numbers 
this week, then pass along a suggested test regime for others.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sorted writes for checkpoints

2010-11-07 Thread Greg Smith

Jeff Janes wrote:


Assuming the ordering is useful, the only way the OS can do as good a
job as the checkpoint code can, is if the OS stores the entire
checkpoint worth of data as dirty blocks and doesn't start writing
until an fsync comes in.  This strikes me as a pathologically
configured OS/FS.  (And would explain problems with fsyncs)
  


This can be exactly the situation with ext3 on Linux, which I believe is 
one reason the write sorting patch didn't go anywhere last time it came 
up--that's certainly what I tested it on.  The slides for my talk 
"Righting Your Writes" are now up at 
http://projects.2ndquadrant.com/talks and an example showing this is on 
page 9.  I'm hoping to get the 3 patches shown in action or described in 
that talk submitted to the list before the next CommitFest.  You really 
need timing of individual sync calls to figure out what's going on here, 
and what happens is completely dependent on the filesystem.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] CommitFest 2010-11: Call for Reviewers

2010-11-12 Thread Greg Smith
It's the time of year again when leaves are falling, temperatures are 
dropping, and patches cry out for one last review before the 
start of the holiday season.  The patches submitted so far are listed at 
https://commitfest.postgresql.org/action/commitfest_view?id=8 , 
featuring a mix of brand new ones and some going through their second 
or third round of rework.  Expect to see a furious weekend of last 
minute patches that arrive just before the deadline too.  The official 
cut-off to start the CommitFest is Mon Nov 15 00:00:00 UTC.


If you're interested in reviewing a patch but haven't done so before, 
the process is outlined at the following page:


http://wiki.postgresql.org/wiki/Reviewing_a_Patch
http://wiki.postgresql.org/wiki/RRReviewers

All you have to do to "claim" a patch is update the CommitFest 
application page to put your name down as the reviewer, then do the 
review in a timely fashion.  Look for the patches with the bright red 
"Nobody" listed in the reviewer field.


If you have questions about which patch to work on, please try to keep 
e-mail traffic related to topics like patch selection and volunteering 
for a new round-robin assignment to the pgsql-rrreviewers mailing list; 
the pgsql-hackers list is the right destination for the actual 
review itself.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] a new problem in MERGE

2010-11-14 Thread Greg Smith

Boxuan Zhai wrote:
I have plan to fix the above two bugs together. (in fact, I have 
already started coding in merge_v202 edition). My question is how 
should I make my update be consistent with yours. Is it possible for 
you to give me an edition that I can work on? 


I just got this reconciled with HEAD again.  There have been two changes 
I made in the code you'll eventually want in your working copy:


1) Fix NIL/NULL confusion:  
https://github.com/greg2ndQuadrant/postgres/commit/9013ba9e81490e3623add1b029760817021297c0


2) Update ExecMerge to accept and pass through an oldtuple value.  This 
is needed to make the code compatible with the PostgreSQL git HEAD after 
the changes made in the 2009-09 CommitFest.  Bit rot updates made: 
https://github.com/greg2ndQuadrant/postgres/commit/be03bd201720f42a666f7e356ec8507c1357f502


I'm not sure whether the way I did (2) is correct for all cases, but at least the 
code compiles again now and the server will start.  Attached is an 
updated patch that applies to HEAD as of right now, and that code has 
been pushed to 
https://github.com/greg2ndQuadrant/postgres/tree/merge-unstable with the 
changes rebased to be the last two commits.


It fails "make installcheck" on my system.  But as the initial diffs I 
looked at relate to enums and such, I don't think that's a problem with 
your patch.  I'll investigate further here with some of my own patches 
I'm working on today.  Hopefully this is enough to unblock the work you 
were looking for more details from me about.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD




merge-v203-20101114.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_stat_bgwriter broken?

2010-11-14 Thread Greg Smith
I'm behind on my list mail so maybe this has been mentioned already, but 
when I just tried pg_stat_bgwriter from a build against today's HEAD I 
got this:


pgbench=# select count(*) FROM pg_stat_bgwriter;
ERROR:  did not find '}' at end of input node

Can someone confirm if this broke recently, or is it just me?  Last time 
I would have tested this myself was a few weeks ago.


Regardless, I was thinking of adding some basic sanity checking of this 
view, run from the regression tests, to catch this class of problem 
in the future.  It's kind of sloppy that this and the bgwriter counter 
reset aren't tested as part of "make check".  I think these two would 
always produce stable results:


postgres=# SELECT count(*) FROM pg_stat_bgwriter;
 count
-------
     1
(1 row)

postgres=# SELECT pg_stat_reset_shared('bgwriter');
 pg_stat_reset_shared
----------------------

(1 row)

And the first one of those is similarly broken on my install.  Thoughts 
on whether adding those to the regression tests would be a worthwhile 
patch?  I'll do the work, just thinking out loud about the concept.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_stat_bgwriter broken?

2010-11-14 Thread Greg Smith

Tom Lane wrote:

Worksforme.  You probably need a full recompile and/or initdb


Yeah, sorry about the noise.  This went away after some more intensive 
rebuilding.  I think I still want to add some regression testing of this 
view as suggested.  If that had been there, I'd have been a lot more 
confident it was my mistake, and not something that just slipped through 
without being tested.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Instrument checkpoint sync calls

2010-11-14 Thread Greg Smith
The attached patch adds some logging for each individual fsync call made 
during a checkpoint, along with a summary at the end.  You need to have 
the following to see all of the detail:


log_checkpoints=on
log_min_messages=debug1

And here's a sample:

LOG:  checkpoint starting: immediate force wait
DEBUG:  checkpoint sync: file=1 time=1.946000 msec
DEBUG:  checkpoint sync: file=2 time=0.666000 msec
DEBUG:  checkpoint sync: file=3 time=0.004000 msec
LOG:  checkpoint sync: files=3 longest=1.946000 msec average=0.872000 msec
LOG:  checkpoint complete: wrote 3 buffers (0.1%); 0 transaction log 
file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.002 s, 
total=0.003 s


I think that it's reasonable for the sort of people who turn 
log_checkpoints on to also get the sync summary line, thus it being 
logged at LOG level.  The detail on individual syncs might go to DEBUG2 
or lower instead of DEBUG1 where I put it; I don't have a strong opinion 
on that part.  It's at DEBUG1 for now to make it easier to test the 
patch without a gigantic amount of other log data also coming in.


Right now the code is written such that all the calls that grab timing 
information are wrapped in "#ifdef DEBUG_FSYNC", a symbol currently 
defined as 1 that could become a compile-time option like 
DEBUG_DEADLOCK, to allow turning this code path off at build time.  I 
personally think that 
if you're already making an fsync call and have log_checkpoints on, the 
additional overhead of also timing that fsync is minimal even on 
platforms where timing is slow (I don't have such a system to test that 
assumption however).  And I've seen enough value in troubleshooting 
nasty checkpoint sync problems using this patch to feel it's worth 
having even if it does add some overhead.


I'm a little concerned about log_checkpoints changing on me in the 
middle of the execution of a checkpoint, which would cause some problems 
here.  Not sure if that's something I should explicitly code for, given 
that all I think it will do is screw up one of the timing results.  It 
does seem a risk from the last minute self-review I just did of the code.


I'll include a sample program that stresses the system, generating slow 
timing results and other types of bad behavior, along with the next 
patch I submit here shortly.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us


diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 1219fcf..c72da72 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -20,6 +20,7 @@
 
 #include "catalog/catalog.h"
 #include "miscadmin.h"
+#include "portability/instr_time.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
 #include "storage/bufmgr.h"
@@ -29,6 +30,11 @@
 #include "utils/memutils.h"
 #include "pg_trace.h"
 
+/*
+ * For now just leave this on all the time, eventually this will
+ * need to be a full compiler flag like DEBUG_DEADLOCK
+ */
+#define DEBUG_FSYNC	1
 
 /* interval for calling AbsorbFsyncRequests in mdsync */
 #define FSYNCS_PER_ABSORB		10
@@ -927,6 +933,15 @@ mdsync(void)
 	PendingOperationEntry *entry;
 	int			absorb_counter;
 
+#ifdef DEBUG_FSYNC
+	/* Statistics on sync times */
+	int processed = 0;
+	instr_time sync_start,sync_end,diff;
+	double elapsed;
+	double longest = 0;
+	double total_elapsed = 0;
+#endif
+
 	/*
 	 * This is only called during checkpoints, and checkpoints should only
 	 * occur in processes that have created a pendingOpsTable.
@@ -1069,9 +1084,32 @@ mdsync(void)
 seg = _mdfd_getseg(reln, entry->tag.forknum,
 			  entry->tag.segno * ((BlockNumber) RELSEG_SIZE),
    false, EXTENSION_RETURN_NULL);
+
+#ifdef DEBUG_FSYNC
+if (log_checkpoints)
+	INSTR_TIME_SET_CURRENT(sync_start);
+#endif
 if (seg != NULL &&
 	FileSync(seg->mdfd_vfd) >= 0)
+{
+
+#ifdef DEBUG_FSYNC
+	if (log_checkpoints)
+	{
+		INSTR_TIME_SET_CURRENT(sync_end);
+		diff=sync_end;
+		INSTR_TIME_SUBTRACT(diff, sync_start);
+		elapsed = (double) INSTR_TIME_GET_MICROSEC(diff) / 1000.0;
+		if (elapsed > longest)
+			longest = elapsed;
+		total_elapsed+=elapsed;
+		processed++;
+		/* Reduce the log level here to DEBUG2 in final patch? */
+		elog(DEBUG1, "checkpoint sync: file=%d time=%f msec", processed, elapsed);
+	}
+#endif
 	break;		/* success; break out of retry loop */
+	}
 
 /*
  * XXX is there any point in allowing more than one retry?
@@ -1113,6 +1151,13 @@ mdsync(void)
 			elog(ERROR, "pendingOpsTable corrupted");
 	}			/* end loop over hashtable entries */
 
+#ifdef DEBUG_FSYNC
+	/* Report on sync performance metrics */
+	if (log_checkpoints && (processed > 0))
+		elog(LO

[HACKERS] Count backend self-sync calls

2010-11-14 Thread Greg Smith
The attached patch adds a new field to pg_stat_bgwriter, counting the 
number of times backends execute their own fsync calls.  Normally, when 
a backend needs to fsync data, it passes a request to the background 
writer, which then absorbs the call into its own queue of work to do.  
However, under some types of heavy system load, the associated queue can 
fill.  When this happens, backends are forced to do their own fsync 
call.  This is potentially much worse than when they do a regular write.


The really nasty situation is when the background writer is busy because 
it's executing a checkpoint.  In that case, it's possible for the 
backend fsync calls to start competing with the ones the background 
writer is trying to get done, causing the checkpoint sync phase to 
execute slower than it should.  I've seen the sync phase take over 45 
minutes on a really busy server once it got into this condition, where 
hundreds of clients doing their own backend fsync calls were fighting 
against the checkpoint fsync work.  With this patch, you can observe 
that happening as an upwards spike in 
pg_stat_bgwriter.buffers_backend_sync, which as documented is an 
inclusive subset of the total shown in buffers_backend.
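
As a sketch of how that might be watched once the counter exists (using 
the column name from this patch; per the later review discussion the 
final name may end up as buffers_backend_fsync):

-- What fraction of backend buffer writes also required the backend to do
-- its own fsync because the background writer's request queue was full?
SELECT buffers_backend,
       buffers_backend_sync,
       round(100.0 * buffers_backend_sync / nullif(buffers_backend, 0), 2)
           AS pct_backend_self_sync
FROM pg_stat_bgwriter;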


While it takes a busier system than I can usefully simulate here to 
show a really bad situation, I'm able to see some of these 
unabsorbed backend fsync calls when initializing a pgbench database, to 
prove they happen in the lab.  The attached test program takes as its 
input a pgbench scale counter.  It then creates a pgbench database 
(deleting any existing pgbench database, so watch out for that) and 
shows the values accumulated in pg_stat_bgwriter during that period.  
Here's an example, using the script's default scale of 100 on a server 
with 8GB of RAM and fake fsync (the hard drives are lying about it):


-[ RECORD 1 ]--------+------------------------------
now                  | 2010-11-14 16:08:41.36421-05
...
Initializing pgbench
-[ RECORD 1 ]--------+------------------------------
now                  | 2010-11-14 16:09:46.713693-05
checkpoints_timed    | 0
checkpoints_req      | 0
buffers_checkpoint   | 0
buffers_clean        | 0
maxwritten_clean     | 0
buffers_backend      | 654716
buffers_backend_sync | 90
buffers_alloc        | 803

This is with default sizing for memory structures.  As you increase 
shared_buffers, one of the queues involved here increases 
proportionately, making it less likely to run into this problem.  That 
just changes it to the kind of problem I've only seen on a larger system 
with a difficult-to-simulate workload.  The production system getting 
hammered with this problem (running a web application) that prompted 
writing the patch had shared_buffers=4GB at the time.


The patch also adds some logging to the innards involved here, to help 
with understanding problems in this area.  I don't think that should be 
in the version committed as-is.  We may want to drop the logging level or 
make it disabled in regular builds, since it is sitting somewhere it 
generates a lot of log data and adds overhead.  It is nice for now, as 
it lets you get an idea how much fsync work *is* being absorbed by the 
BGW, as well as showing what relation is suffering from this issue.  
Example of both those things, with the default config for everything 
except log_checkpoints (on) and log_min_messages (debug1):


DEBUG:  Absorbing 4096 fsync requests
DEBUG:  Absorbing 150 fsync requests
DEBUG:  Unable to forward fsync request, executing directly
CONTEXT:  writing block 158638 of relation base/16385/16398

Here 4096 is the most entries the BGW will ever absorb at once, and all 
90 of the missed sync calls are logged so you can see what files they 
came from.


As a high-level commentary about this patch, I'm not sure what most end 
users will ever do with this data.  At the same time, I wasn't sure what 
a typical end user would do with anything else in pg_stat_bgwriter 
either when it was added, and it turns out the answer is "wait for 
people who know the internals to write things that monitor it".  For 
example, Magnus has updated recent versions of the Munin plug-in for 
PostgreSQL to usefully graph pg_stat_bgwriter data over time.  As I have 
some data to suggest checkpoint problems on Linux in particular are 
getting worse as total system memory increases, I expect that having a 
way to easily instrument for this particular problem will be 
correspondingly more important in the future too.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us


diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index dbf966b..f701b8f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -264,8 +264,10 @@ postgres: user database host 
  
 
diff --git a/src/backend/catalog/system_vi

[HACKERS] Spread checkpoint sync

2010-11-14 Thread Greg Smith

As far as concerns about this 10% setting not doing enough work, which 
is something I do see, you can always increase how often absorbing 
happens by decreasing bgwriter_delay now--giving other benefits too.  
For example, if you run the fsync-stress-v2.sh script I included with 
the last patch I sent, you'll discover the spread sync version of the 
server leaves just as many unabsorbed writes behind as the old code 
did.  Those are happening because of periods when the background writer 
is sleeping.  They drop as you decrease the delay; here's a table showing 
some values I tested here, with all three patches installed:


bgwriter_delay    buffers_backend_sync
200 ms            90
50 ms             28
25 ms             3

There's a bunch of performance-related review work that needs to be done 
here, in addition to the usual code review for the patch.  My hope is 
that I can get enough of that done to validate this does what it's 
supposed to on public hardware, so that a later version of this patch 
can be considered for the next CommitFest.  It's a little more raw than I'd 
like still, but the idea has been tested enough here that I believe it's 
fundamentally sound and valuable.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us


diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 43a149e..0ce8e2b 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -143,8 +143,8 @@ typedef struct
 
 static BgWriterShmemStruct *BgWriterShmem;
 
-/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
-#define WRITES_PER_ABSORB		1000
+/* Fraction of fsync absorb queue that needs to be filled before acting */
+#define ABSORB_ACTION_DIVISOR	10
 
 /*
  * GUC parameters
@@ -382,7 +382,7 @@ BackgroundWriterMain(void)
 		/*
 		 * Process any requests or signals received recently.
 		 */
-		AbsorbFsyncRequests();
+		AbsorbFsyncRequests(false);
 
 		if (got_SIGHUP)
 		{
@@ -636,7 +636,7 @@ BgWriterNap(void)
 		(ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
 			break;
 		pg_usleep(100L);
-		AbsorbFsyncRequests();
+		AbsorbFsyncRequests(true);
 		udelay -= 100L;
 	}
 
@@ -684,8 +684,6 @@ ImmediateCheckpointRequested(void)
 void
 CheckpointWriteDelay(int flags, double progress)
 {
-	static int	absorb_counter = WRITES_PER_ABSORB;
-
 	/* Do nothing if checkpoint is being executed by non-bgwriter process */
 	if (!am_bg_writer)
 		return;
@@ -705,22 +703,65 @@ CheckpointWriteDelay(int flags, double progress)
 			ProcessConfigFile(PGC_SIGHUP);
 		}
 
-		AbsorbFsyncRequests();
-		absorb_counter = WRITES_PER_ABSORB;
+		AbsorbFsyncRequests(false);
 
 		BgBufferSync();
 		CheckArchiveTimeout();
 		BgWriterNap();
 	}
-	else if (--absorb_counter <= 0)
+	else
 	{
 		/*
-		 * Absorb pending fsync requests after each WRITES_PER_ABSORB write
-		 * operations even when we don't sleep, to prevent overflow of the
-		 * fsync request queue.
+		 * Check for overflow of the fsync request queue.
 		 */
-		AbsorbFsyncRequests();
-		absorb_counter = WRITES_PER_ABSORB;
+		AbsorbFsyncRequests(false);
+	}
+}
+
+/*
+ * CheckpointSyncDelay -- yield control to bgwriter during a checkpoint
+ *
+ * This function is called after each file sync performed by mdsync().
+ * It is responsible for keeping the bgwriter's normal activities in
+ * progress during a long checkpoint.
+ */
+void
+CheckpointSyncDelay(void)
+{
+	pg_time_t	now;
+ 	pg_time_t	sync_start_time;
+ 	int			sync_delay_secs;
+ 
+ 	/*
+ 	 * Delay after each sync, in seconds.  This could be a parameter.  But
+ 	 * since ideally this will be auto-tuning in the near future, not
+	 * assigning it a GUC setting yet.
+ 	 */
+#define EXTRA_SYNC_DELAY	3
+
+	/* Do nothing if checkpoint is being executed by non-bgwriter process */
+	if (!am_bg_writer)
+		return;
+
+ 	sync_start_time = (pg_time_t) time(NULL);
+
+	/*
+	 * Perform the usual bgwriter duties.
+	 */
+ 	for (;;)
+ 	{
+		AbsorbFsyncRequests(false);
+ 		BgBufferSync();
+ 		CheckArchiveTimeout();
+ 		BgWriterNap();
+ 
+ 		/*
+ 		 * Are we there yet?
+ 		 */
+ 		now = (pg_time_t) time(NULL);
+ 		sync_delay_secs = now - sync_start_time;
+ 		if (sync_delay_secs >= EXTRA_SYNC_DELAY)
+			break;
 	}
 }
 
@@ -1116,16 +1157,41 @@ ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
  * non-bgwriter processes, do nothing if not bgwriter.
  */
 void
-AbsorbFsyncRequests(void)
+AbsorbFsyncRequests(bool force)
 {
 	BgWriterRequest *requests = NULL;
 	BgWriterRequest *request;
 	int			n;
 
+	/* 
+	 * Divide the size of the request queue by this to determine when
+	 * absorption action needs to be taken.  Default here aims to empty the
+	 * queue whenever 1 / 10 = 10% of it is full.  If this isn't good enough,
+	 * you probably need to lower bgwriter_delay, rather than presume
+	 * this needs to be a tunable you can decrease.
+	 

Re: [HACKERS] pg_stat_bgwriter broken?

2010-11-14 Thread Greg Smith

Tom Lane wrote:

If the cause was what I suggested, most likely the other stats views
were broken too --- were you able to pass the regression tests in that
state?
  


I was having a lot of problems with the system that included regression 
test issues, and yelped about this one earlier than I normally would 
because it was blocking work I wanted to get submitted today.  Sorry 
about the noise, and thanks for the info about why this was happening.  
The way you've explained it now, I see why it's not worth adding a 
specific regression test for this in the future, and what else I should 
have checked when I ran into my problem (whether other system stats views 
were also working or not).


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-11-14 Thread Greg Smith

Magnus Hagander wrote:

I think that it's reasonable for the sort of people who turn log_checkpoints
on to also get the sync summary line, thus it being logged at LOG level.



In that case, wouldn't it make more sense to add a couple of more
fields to the existing line? Easier to post-process the logfile that
way...
  


It might.  One trade-off is that if you're looking at the sync write 
detail, the summary comes out in a similar form.  And it was easy to put 
in here--I'd have to return some new data out of the sync phase call in 
order for that to show up in the main log.  If there's general buy-in on 
the idea, I could do all of that.




I personally think that if you're
already making an fsync call and have log_checkpoints on, the additional
overhead of also timing that fsync is minimal even on platforms where timing
is slow (I don't have such a system to test that assumption however)...

It sounds like it should be - but that should be possible to measure, no?
  


What I was alluding to is that I know gettimeofday executes fast on my 
Linux system here, so even if I did measure the overhead and showed it's 
near zero that doesn't mean it will be so on every platform.  The "how 
long does it take to find out the current time on every supported 
PostgreSQL platform?" question is one I'd like to have an answer to, but 
it's hard to collect properly.  All I know is that I don't have any 
system where it's slow to properly test again here.
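
For anyone who does have access to other platforms, a micro-benchmark along
these lines (just a throwaway sketch, not anything from the patch) gives a
rough per-call figure to compare:

#include <stdio.h>
#include <sys/time.h>

int
main(void)
{
    struct timeval start, stop, scratch;
    long        i, loops = 1000000;
    double      elapsed_usec;

    gettimeofday(&start, NULL);
    for (i = 0; i < loops; i++)
        gettimeofday(&scratch, NULL);
    gettimeofday(&stop, NULL);

    elapsed_usec = (stop.tv_sec - start.tv_sec) * 1000000.0 +
        (stop.tv_usec - start.tv_usec);
    /* Report the average cost of a single timing call */
    printf("%.1f ns per gettimeofday call\n", 1000.0 * elapsed_usec / loops);
    return 0;
}

On Linux at least, the result moves around with which clock source the
kernel picked, so that's worth recording along with the number.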


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us




Re: [HACKERS] Count backend self-sync calls

2010-11-14 Thread Greg Smith

Robert Haas wrote:

I think this one could be removed:

+   if (n > 0)
+   elog(DEBUG1,"Absorbing %d fsync requests",n);
  


Right; that one is just there to let you know the patch is working, and 
how much the background writer does for you per pass, mainly for the 
purpose of reviewing the patch.




But if this is generating a lot of log data or adding a lot of
overhead, then you have bigger problems anyway:

+   elog(DEBUG1, "Unable to forward fsync request, executing directly");
  


The argument against this log line even existing is that if the field is 
added to pg_stat_bgwriter, that's probably how you want to monitor this 
data anyway.  I don't think there's actually much value to it, which is 
one reason I didn't worry too much about matching style guidelines, 
translation, etc.  You've found the two things that I think both need to 
go away before commit, but I left them in because I think they're 
valuable for testing the patch itself does what it's supposed to.



Also, how about referring to this as buffers_backend_fsync
consistently throughout, instead of dropping the "f" in some places?
  


I started out touching code that called it just "sync", but then crossed 
to other code that called it "fsync", and made the external UI use that 
name.  No objections to sorting that out within my patch so it's consistent.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CommitFest 2010-11: Call for Reviewers

2010-11-14 Thread Greg Smith
Despite some jerk submitting three brand new patches at the last minute 
by himself, the November CommitFest is now closed for new submissions 
and ready for review!  Only 7 of the 40 patches that need review have a 
reviewer assigned so far.  Reviewers, mark down a patch you want to look 
at in the CommitFest app:  
https://commitfest.postgresql.org/action/commitfest_view?id=8 or ask to 
be assigned one on the pgsql-rrreviewers list, if you want to work on 
something but don't have a preference for which patch.  There are a few 
larger patches in this round, but the majority of those already have a 
reviewer attached to them.  Many of the remaining patches are pretty 
approachable even if you're relatively new to reviewing PostgreSQL 
code.  See the usual for details:


http://wiki.postgresql.org/wiki/Reviewing_a_Patch
http://wiki.postgresql.org/wiki/RRReviewers

--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

2010-11-14 Thread Greg Smith

Marti Raudsepp wrote:

PostgreSQL's default settings change when built with Linux kernel
headers 2.6.33 or newer. As discussed on the pgsql-performance list,
this causes a significant performance regression:
http://archives.postgresql.org/pgsql-performance/2010-10/msg00602.php

NB! I am not proposing to change the default -- to the contrary --
this patch restores old behavior.


Following our standard community development model, I've put this patch 
onto our CommitFest list:  
https://commitfest.postgresql.org/action/patch_view?id=432 and assigned 
myself as the reviewer.  I didn't look at this until now because I 
already had some patch development and review work to finish before the 
CommitFest deadline we just crossed.  Now I can go back to reviewing 
other people's work.


P.S. There is no pgsql-patch list anymore; everything goes through the 
hackers list now.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI update

2010-11-15 Thread Greg Smith

Kevin Grittner wrote:

That went in an hour and a half before the CF deadline, but I just
didn't feel right putting it into the CF in that shape.


Then keep on working on it and we can revisit its state when you're 
happy with it.  The purpose of the CommitFest cut-off is not to block 
work on long-term development just because a deadline passed; it's to 
make sure patches which might otherwise never get a review are looked at 
eventually.  Finding a reviewer for these larger and complicated patches 
is a very different sort of job than finding one for an average patch 
anyway.


I could use a brief reminder of how this bit fits into the "serializable 
lock consistency" patch that's already sitting into the CF queue as 
"Ready for Committer" though.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-11-15 Thread Greg Smith

So my task list is:

0) Rebase against HEAD, since code related to this was just touched today

1) Assume that log_checkpoints is sufficient control over whether the 
timing overhead added is worth collecting, and therefore remove the 
half-baked idea of also wrapping with a compile-time option.


2) Have the sync summary returned upwards, so it can be put onto the 
same line as the rest of the log_checkpoints info.


All seems reasonable to me.  Will rev a new patch by tomorrow.

--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Count backend self-sync calls

2010-11-15 Thread Greg Smith

Jeff Janes wrote:

Do you know where this competition is happening?  Is it on the
platters, or is it in the hard drive write cache (I thought high-end
hardware had tagged writes to avoid that), or in the kernel?
  


Kernel.  Linux systems with lots of memory will happily queue up 
gigabytes of memory in their write cache, only getting serious about 
writing it out to disk when demanded to by fsync.




This makes sense if we just need to append to a queue.  But once the
queue is full and we are about to do a backend fsync, might it make
sense to do a little more work to look for dups?
  


One of the paths I'd like to follow is experimenting with both sorting 
writes by file and looking for duplication in the queues.  I think a 
basic, simple sync spreading approach needs to get finished first 
though; this sort of thing would then be an optimization on top of it.
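
Just to sketch the shape of that idea (made-up structure and field names
here, not the real fsync request queue entries, which carry relation, fork,
and segment identity), the compaction step could be as simple as a sort
followed by dropping adjacent duplicates:

#include <stdlib.h>

/* Stand-in for a queued sync request */
typedef struct SyncRequest
{
    int         rel_id;         /* which relation */
    int         segno;          /* segment number within it */
} SyncRequest;

static int
sync_request_cmp(const void *a, const void *b)
{
    const SyncRequest *ra = (const SyncRequest *) a;
    const SyncRequest *rb = (const SyncRequest *) b;

    if (ra->rel_id != rb->rel_id)
        return (ra->rel_id < rb->rel_id) ? -1 : 1;
    if (ra->segno != rb->segno)
        return (ra->segno < rb->segno) ? -1 : 1;
    return 0;
}

/* Sort the queue and drop adjacent duplicates; returns the new length. */
static int
compact_sync_requests(SyncRequest *queue, int n)
{
    int         i, kept = 0;

    if (n == 0)
        return 0;
    qsort(queue, n, sizeof(SyncRequest), sync_request_cmp);
    for (i = 1; i < n; i++)
    {
        if (sync_request_cmp(&queue[i], &queue[kept]) != 0)
            queue[++kept] = queue[i];
    }
    return kept + 1;
}

Whether cycles are better spent on that or on just draining the queue more
aggressively is exactly the sort of thing the simple version needs to be
benchmarked against first.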


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Changes to Linux OOM killer in 2.6.36

2010-11-18 Thread Greg Smith
Last month's new Linux kernel 2.6.36 includes a rewrite of the out of 
memory killer: 


http://lwn.net/Articles/391222/
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a63d83f427fbce97a6cea0db2e64b0eb8435cd10

The new "badness" method totals the task's RSS and swap as a percentage 
of RAM, where the old one scored starting with the total memory used by 
the process.  I *think* that this is an improvement for PostgreSQL, 
based on the sort of data I see with:


ps -o pid,rss,size,vsize,args -C postgres

But I haven't tested with one of the new kernels yet to be sure.  
Something to look at next time I get in that bleeding edge kernel kind 
of mood.


One thing that's definitely changed is the interface used to control 
turning off the OOM killer.  There's a backward compatibility 
translation right now that maps the current "-17" bit mask value the 
PostgreSQL code sends to /proc/<pid>/oom_adj into the new units scale.  
However, oom_adj is deprecated, scheduled for removal in August 2010:  
http://www.mjmwired.net/kernel/Documentation/feature-removal-schedule.txt


So eventually, if the OOM disabling code is still necessary in 
PostgreSQL, it will need to do this sort of thing instead:


echo -1000 > /proc/<pid>/oom_score_adj

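In C that would look something like the following sketch--hypothetical code,
not what fork_process.c does today, beyond the same general
open-and-write-a-proc-file pattern it already uses for oom_adj:

#include <stdio.h>

static void
disable_oom_killer_for_self(void)
{
    /* New interface, kernels 2.6.36 and later */
    FILE       *fp = fopen("/proc/self/oom_score_adj", "w");

    if (fp != NULL)
    {
        fprintf(fp, "-1000\n");
        fclose(fp);
    }
    /* On older kernels the file won't exist; fall back to oom_adj there */
}
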
I've seen kernel stuff get deprecated before the timeline for 
code related reasons (when the compatibility bits were judged too 
obtrusive to keep around anymore), but since this translation bit is 
only a few lines of code I wouldn't expect that to happen here.


I don't think it's worth doing anything to the database code until tests 
on the newer kernel confirm whether this whole thing is even necessary 
anymore.  Wanted to pass along the info while I was staring at it 
though.  Thanks to Daniel Farina for pointing this out.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improving prep_buildtree used in VPATH builds

2010-11-18 Thread Greg Smith

Gurjeet Singh wrote:
Seems like it would improve performance in general, but more so under 
load conditions when you actually need it. I am not sure if -depth 
option is supported by all incarnations of 'find'.


Given the way directory writes are cached by the filesystem, I'm not 
sure why the load at the time matters so much.  If what you mean by that 
is you're low on memory, that would make more sense to me.


Anyway, "-depth" is in POSIX as of 2001.  It seems to be in all the 
major SysV UNIX variants going back further than that, and of course 
it's in GNU find.  But it looks like BSD derived systems from before 
that POSIX standard originally called this "-d" instead.  So there's 
some potential for this to not work on older systems; it works fine on 
my test FreeBSD 7.2 system.  Maybe someone who has access to some 
ancient BSD-ish system can try this out?  The simplest test case similar 
to what the patch adds seems to be if this runs, returning 
subdirectories in depth-first order before their parent:


$ find / -depth -type d -print

If that fails somewhere, it may turn out to require another configure 
check just to determine whether you can use this configuration time 
optimization.  That's certainly possible to add to the patch if it got 
committed and turns out to break one of the buildfarm members.  It seems 
to me like this whole thing may be a bit more work than it's worth, 
given this is a fairly small and difficult to reproduce speedup in only 
one stage of the build process.  I'd think that if configure takes 
longer than it has to because the system is heavily loaded, the amount 
that compilation time is going to suffer from that would always dwarf this 
component of total build time.  But if this was slow enough at some 
point to motivate you to write a patch for it, maybe that assumption is 
wrong.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Changes to Linux OOM killer in 2.6.36

2010-11-19 Thread Greg Smith

Kevin Grittner wrote:

Greg Smith  wrote:

oom_adj is deprecated, scheduled for removal in August 2010:

That surprised me so I checked the URL.  I believe you have a typo
there and it's August, 2012.


This is why I include references, so that when the cold medicine hits me 
in the middle of proofreading my message and I send it anyway you aren't 
misled.  Yes, 2012, only a few months before doomsday.  The approaching 
end of the world then means any bugs left can be marked WONTFIX.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us




Re: [HACKERS] Spread checkpoint sync

2010-11-21 Thread Greg Smith

Jeff Janes wrote:

And for very large memory
systems, even 1% may be too much to cache (dirty*_ratio can only be
set in integer percent points), so recent kernels introduced
dirty*_bytes parameters.  I like these better because they do what
they say.  With the dirty*_ratio, I could never figure out what it was
a ratio of, and the results were unpredictable without extensive
experimentation.
  


Right, you can't set dirty_background_ratio low enough to make this 
problem go away.  Even attempts to set it to 1%, back when that was 
the right size for it, seem to be defeated by other mechanisms within 
the kernel.  Last time I looked at the related source code, it seemed 
the "congestion control" logic that kicks in to throttle writes was a 
likely suspect.  This is why I'm not real optimistic about newer 
mechanisms like the dirty_background_bytes added in 2.6.29 to help here, as 
that just gives a mapping to setting lower values; the same basic logic 
is under the hood.


Like Jeff, I've never seen dirty_expire_centisecs help at all, possibly 
due to the same congestion mechanism. 


Yes, but how much work do we want to put into redoing the checkpoint
logic so that the sysadmin on a particular OS and configuration and FS
can avoid having to change the kernel parameters away from their
defaults?  (Assuming of course I am correctly understanding the
problem, always a dangerous assumption.)
  


I've been trying to make this problem go away using just the kernel 
tunables available since 2006.  I adjusted them carefully on the server 
that ran into this problem so badly that it motivated the submitted 
patch, months before this issue got bad.  It didn't help.  Maybe if they 
were running a later kernel that supported dirty_background_bytes that 
would have worked better.  During the last few years, the only thing 
that has consistently helped in every case is the checkpoint spreading 
logic that went into 8.3.  I no longer expect that the kernel developers 
will ever make this problem go away the way checkpoints are written out 
right now, whereas the last good PostgreSQL work in this area definitely 
helped.


The basic premise of the current checkpoint code is that if you write 
all of the buffers out early enough, by the time syncs execute enough of 
the data should have gone out that those don't take very long to 
process.  That was usually true for the last few years, on systems with 
a battery-backed cache; the amount of memory cached by the OS was 
relatively small relative to the RAID cache size.  That's not the case 
anymore, and that divergence is growing bigger.


The idea that the checkpoint sync code can run in a relatively tight 
loop, without stopping to do the normal background writer cleanup work, 
is also busted by that observation.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spread checkpoint sync

2010-11-21 Thread Greg Smith

Robert Haas wrote:

Doing all the writes and then all the fsyncs meets this requirement
trivially, but I'm not so sure that's a good idea.  For example, given
files F1 ... Fn with dirty pages needing checkpoint writes, we could
do the following: first, do any pending fsyncs for files not among F1
.. Fn; then, write all pages for F1 and fsync, write all pages for F2
and fsync, write all pages for F3 and fsync, etc.  This might seem
dumb because we're not really giving the OS a chance to write anything
out before we fsync, but think about the ext3 case where the whole
filesystem cache gets flushed anyway.


I'm not horribly interested in optimizing for the ext3 case per se, as I 
consider that filesystem fundamentally broken from the perspective of 
its ability to deliver low-latency here.  I wouldn't want a patch that 
improved behavior on filesystems with granular fsync to make the ext3 
situation worse.  That's as far as I'd want the design to lean toward 
considering its quirks.  Jeff Janes made a case downthread for "why not 
make it the admin/OS's job to worry about this?"  In cases where there 
is a reasonable solution available, in the form of "switch to XFS or 
ext4", I'm happy to take that approach.


Let me throw some numbers out to give a better idea of the shape and 
magnitude of the problem case I've been working on here.  In the 
situation that leads to the near hour-long sync phase I've seen, 
checkpoints will start with about a 3GB backlog of data in the kernel 
write cache to deal with.  That's about 4% of RAM, just under the 5% 
threshold set by dirty_background_ratio.  Whether or not the 256MB write 
cache on the controller is also filled is a relatively minor detail I 
can't monitor easily.  The checkpoint itself?  <250MB each time. 

This proportion is why I didn't think to follow the alternate path of 
worrying about spacing the write and fsync calls out differently.  I 
shrunk shared_buffers down to make the actual checkpoints smaller, which 
helped to some degree; that's what got them down to smaller than the 
RAID cache size.  But the amount of data cached by the operating system 
is the real driver of total sync time here.  Whether or not you include 
all of the writes from the checkpoint itself before you start calling 
fsync didn't actually matter very much; in the case I've been chasing, 
those are getting cached anyway.  The write storm from the fsync calls 
themselves forcing things out seems to be the driver on I/O spikes, 
which is why I started with spacing those out.


Writes go out at a rate of around 5MB/s, so clearing the 3GB backlog 
takes a minimum of 10 minutes of real time.  There are about 300 1GB 
relation files involved in the case I've been chasing.  This is where 
the 3 second delay number came from; 300 files, 3 seconds each, 900 
seconds = 15 minutes of sync spread.  You can turn that math around to 
figure out how much delay per relation you can afford while still 
keeping checkpoints to a planned end time, which isn't done in the patch 
I submitted yet.

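Turned around, the per-file delay budget is just the planned sync spread
divided by the number of files with pending requests; something along these
lines, with illustrative names rather than code from the patch:

/*
 * How long can we afford to nap after syncing each file, given a target
 * for how far the whole sync phase should be spread out?
 */
static int
sync_delay_per_file_secs(int sync_spread_target_secs, int files_to_sync)
{
    if (files_to_sync <= 0)
        return 0;
    /* e.g. 900 seconds spread over 300 relation files = 3 seconds each */
    return sync_spread_target_secs / files_to_sync;
}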

Ultimately what I want to do here is some sort of smarter write-behind 
sync operation, perhaps with a LRU on relations with pending fsync 
requests.  The idea would be to sync relations that haven't been touched 
in a while in advance of the checkpoint even.  I think that's similar to 
the general idea Robert is suggesting here, to get some sync calls 
flowing before all of the checkpoint writes have happened.  I think that 
the final sync calls will need to get spread out regardless, and since 
doing that requires a fairly small amount of code too that's why we 
started with that.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Single client performance on trivial SELECTs

2011-04-14 Thread Greg Smith
This week several list regulars here waded into the MySQL Convention.  I 
decided to revisit PostgreSQL vs. MySQL performance using the sysbench 
program as part of that.  It's not important to what I'm going to 
describe to understand exactly what statements sysbench runs here or how 
to use it, but if anyone is curious I've got some more details about how 
I ran the tests in my talk slides at 
http://projects.2ndquadrant.com/talks  The program has recently gone 
through some fixes that make it run a bit better both in general and 
against PostgreSQL.  The write tests are still broken against 
PostgreSQL, but it now seems to do a reasonable job simulating a simple 
SELECT-only workload.  A fix from Jignesh recently made its way into the 
database generation side of the code that makes it less tedious to test 
with it too. 

The interesting part was how per-client scaling compared between the two 
databases; graph attached.  On my 8 core server, PostgreSQL scales 
nicely up to a steady 50K TPS.  I see the same curve, almost identical 
numbers, with PostgreSQL and pgbench--no reason to suspect sysbench is 
doing anything shady.  The version of MySQL I used hits around 67K TPS 
with innodb when busy with lots of clients.  That part doesn't bother 
me; nobody expects PostgreSQL to be faster on trivial SELECT statements 
and the gap isn't that big.


The shocking part was the single client results.  I'm used to seeing 
Postgres get around 7K TPS per core on those, which was the case here, 
and I never considered that an interesting limitation to think about 
before.  MySQL turns out to hit 38K TPS doing the same work.  Now that's 
a gap interesting enough to make me wonder what's going on.


Easy enough to exercise the same sort of single client test case with 
pgbench and put it under a profiler:


sudo opcontrol --init
sudo opcontrol --setup --no-vmlinux
createdb pgbench
pgbench -i -s 10 pgbench
psql -d pgbench -c "vacuum"
sudo opcontrol --start
sudo opcontrol --reset
pgbench -S -n -c 1 -T 60 pgbench
sudo opcontrol --dump ; sudo opcontrol --shutdown
opreport -l image:$HOME/pgwork/inst/test/bin/postgres

Here's the top calls, from my laptop rather than the server that I 
generated the graph against.  It does around 5.5K TPS with 1 client and 
10K with 2 clients, so same basic scaling:


samples  %image name   symbol name
53548 6.7609  postgres AllocSetAlloc
32787 4.1396  postgres MemoryContextAllocZeroAligned
26330 3.3244  postgres base_yyparse
21723 2.7427  postgres hash_search_with_hash_value
20831 2.6301  postgres SearchCatCache
19094 2.4108  postgres hash_seq_search
18402 2.3234  postgres hash_any
15975 2.0170  postgres AllocSetFreeIndex
14205 1.7935  postgres _bt_compare
13370 1.6881  postgres core_yylex
10455 1.3200  postgres MemoryContextAlloc
10330 1.3042  postgres LockAcquireExtended
10197 1.2875  postgres ScanKeywordLookup
9312  1.1757  postgres MemoryContextAllocZero

I don't know nearly enough about the memory allocator to comment on 
whether it's possible to optimize it better for this case to relieve any 
bottleneck.  Might just get a small gain then push the limiter to the 
parser or hash functions.  I was surprised to find that's where so much 
of the time was going though.


P.S. When showing this graph in my talk, I pointed out that anyone who 
is making decisions about which database to use based on trivial SELECTs 
on small databases isn't going to be choosing between PostgreSQL and 
MySQL anyway--they'll be deploying something like MongoDB instead if 
that's the important metric.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


[attachment: per-client scaling graph]
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Single client performance on trivial SELECTs

2011-04-14 Thread Greg Smith

Heikki Linnakangas wrote:
In this case you could just use prepared statements and get rid of all 
the parser related overhead, which includes much of the allocations.


Trying that gives me around 9200 TPS instead of 5500 on my laptop, so a 
pretty big gain here.  Will have to include that in my next round of 
graphs across multiple client loads once I'm home again and can run 
easily on my server.
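
For anyone wondering what "-M prepared" changes on the client side, it boils
down to this sort of libpq usage--a minimal sketch of the idea, not the
actual pgbench code:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=pgbench");
    const char *params[1] = {"42"};
    PGresult   *res;
    int         i;

    if (PQstatus(conn) != CONNECTION_OK)
        return 1;

    /* Parse and plan the statement once */
    res = PQprepare(conn, "sel",
                    "SELECT abalance FROM pgbench_accounts WHERE aid = $1",
                    1, NULL);
    PQclear(res);

    for (i = 0; i < 10; i++)
    {
        /* Each execution only binds and runs; no parsing overhead */
        res = PQexecPrepared(conn, "sel", 1, params, NULL, NULL, 0);
        PQclear(res);
    }
    PQfinish(conn);
    return 0;
}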


To provide a matching profile from the same system as the one I already 
submitted from, for archival sake, here's what the profile I get looks 
like with "-M prepared":


samples  %image name   symbol name
33093 4.8518  postgres AllocSetAlloc
30012 4.4001  postgres hash_seq_search
27149 3.9803  postgres MemoryContextAllocZeroAligned
26987 3.9566  postgres hash_search_with_hash_value
25665 3.7628  postgres hash_any
16820 2.4660  postgres _bt_compare
14778 2.1666  postgres LockAcquireExtended
12263 1.7979  postgres AllocSetFreeIndex
11727 1.7193  postgres tas
11602 1.7010  postgres SearchCatCache
11022 1.6159  postgres pg_encoding_mbcliplen
10963 1.6073  postgres MemoryContextAllocZero
9296  1.3629  postgres MemoryContextCreate
8368  1.2268  postgres fmgr_isbuiltin
7973  1.1689  postgres LockReleaseAll
7423  1.0883  postgres ExecInitExpr
7309  1.0716  postgres     pfree

--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Single client performance on trivial SELECTs

2011-04-15 Thread Greg Smith
I did some more research into the memory allocation hotspots found in 
the profile, in regards to what MySQL has done in those areas.  (And by 
"research" I mean "went to the bar last night and asked Baron about 
it")  The issue of memory allocation being a limiter on performance has 
been nagging that community for long enough that the underlying malloc 
used can even be swapped with a LD_PRELOAD trick:  
http://dev.mysql.com/doc/refman/5.5/en/mysqld-safe.html#option_mysqld_safe_malloc-lib


Plenty of people have benchmarked that and seen a big difference between 
the implementations; some sample graphs:


http://locklessinc.com/articles/mysql_performance/
http://blogs.sun.com/timc/entry/mysql_5_1_memory_allocator
http://mysqlha.blogspot.com/2009/01/double-sysbench-throughput-with_18.html

To quote from the last of those, "Malloc is a bottleneck for sysbench 
OLTP readonly", so this problem is not unique to PostgreSQL.  As of 5.5 
the better builds are all defaulting to TCMalloc, which is interesting 
but probably not as useful because it's focused on improving 
multi-threaded performance:  
http://goog-perftools.sourceforge.net/doc/tcmalloc.html


I'm not sure exactly what is useful to be learned from that specific 
work.  But it does suggest two things:  one, this is far from an easy 
thing to fix.  Two, the only reason MySQL does so well on it is because 
there was some focused work on it, taking quite a while to accomplish, 
and involving many people.  Doing better for PostgreSQL is something I 
see as more of a long-term goal, rather than something it would be 
reasonable to expect quick progress on.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MMAP Buffers

2011-04-15 Thread Greg Smith

Joshua Berkus wrote:

Guys, can we *please* focus on the patch for now, rather than the formatting, 
which is fixable with sed?
  


Never, and that's not true.  Heikki was being nice; I wouldn't have even 
slogged through it long enough to ask the questions he did before 
kicking it back as unusable.  A badly formatted patch makes it 
impossible to evaluate whether the changes from a submission are 
reasonable or not without the reviewer fixing it first.  And you can't 
automate correcting it, it takes a lot of tedious manual work.  Start 
doing a patch review every CommitFest cycle and you very quickly realize 
it's not an ignorable problem.  And lack of discipline in minimizing 
one's diff is always a sign of other code quality issues.


Potential contributors to PostgreSQL should know that a badly formatted 
patch faces an automatic rejection, because no reviewer can work with it 
easily.  This fact is not a mystery; in fact it's documented at 
http://wiki.postgresql.org/wiki/Submitting_a_Patch :  "The easiest way 
to get your patch rejected is to make lots of unrelated changes, like 
reformatting lines, correcting comments you felt were poorly worded etc. 
Each patch should have the minimum set of changes required to fulfil the 
single stated objective."  I think I'll go improve that text 
next--something like "Ways to get your patch rejected" should be its own 
section.


The problem here isn't whether someone used an IDE or not, it's that 
this proves they didn't read their own patch before submitting it.  
Reading one's own diff and reflecting on what you've changed is one of 
the extremely underappreciated practices of good open-source software 
development.  Minimizing the size of that diff is perhaps the most 
important thing someone can do in order to make their changes to a piece 
of software better.  Not saying something that leads in that direction 
would be a disservice to the submitter.


P.S. You know what else I feel should earn an automatic rejection 
without any reviewer even looking at the code?  Submitting a patch that 
claims to improve performance and not attaching the test case you used, 
along with detailed notes about before/after tests on your own 
hardware.  A hand wave "it's faster" is never good enough, and it's 
extremely wasteful of our limited reviewer resources to try and 
duplicate what the submitter claimed.  Going to add something about that 
to the submission guidelines too.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MMAP Buffers

2011-04-16 Thread Greg Smith

Tom Lane wrote:

* On the other side of the coin, I have seen many a patch that was
written to minimize the length of the diff to the detriment of
readability or maintainability of the resulting code, and that's *not*
a good tradeoff.
  


Sure. that's possible.  But based on the reviews I've done, I'd say that 
the fact someone is even aware that minimizing their diff is something 
important to consider automatically puts them far ahead of the average 
new submitter.  There are a high percentage of patches where the 
submitter generates a diff and sents it without even looking at it.  
That a person would look at their diff and go too far without trying to 
make it small doesn't happen nearly as much.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MMAP Buffers

2011-04-16 Thread Greg Smith

Robert Haas wrote:

The OP says that this patch maintains the WAL-before-data rule without any 
explanation of how it accomplishes that seemingly quite amazing feat.  I assume 
I'm going to have to read this patch at some point to refute this assertion, 
and I think that sucks.


I don't think you have to read any patch that doesn't follow the 
submission guidelines.  The fact that you do is a great contribution to 
the community.  But if I were suggesting how your time would be best 
spent improving PostgreSQL, "reviewing patches that don't meet coding 
standards" would be at the bottom of the list.  There's always something 
better for the project you could be working on instead.


I just added 
http://wiki.postgresql.org/wiki/Submitting_a_Patch#Reasons_your_patch_might_be_returned 
, recycling some existing text, adding some new suggestions.


I hope I got the tone of that text right.  The intention was to have a 
polite but clear place to point submitters to when their suggestion 
doesn't meet the normal standards here, such that they might even get 
bounced before even entering normal CommitFest review.  This MMAP patch 
looks like it has all 5 of the problems mentioned on that now more 
focused list.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Formatting Curmudgeons WAS: MMAP Buffers

2011-04-16 Thread Greg Smith

Joshua Berkus wrote:

Then you can say that politely and firmly with direct reference to the problem, 
rather than making the submitter feel bad.
  


That's exactly what happened.  And then you responded that it was 
possible to use a patch without fixing the formatting first.  That's not 
true, and those of us who do patch review are tired of even trying. 




Our project has an earned reputation for being rejection-happy curmudgeons.  
This is something I heard more than once at MySQLConf, including from one 
student who chose to work on Drizzle instead of PostgreSQL for that reason.  I 
think that we could stand to go out of our way to be helpful to first-time 
submitters.
  


I'll trade you anecdotes by pointing out that I heard from half a dozen 
business people that the heavy emphasis on quality control and standards 
was the reason they were looking into leaving MySQL derived 
distributions for PostgreSQL.


I've spent days of time working on documentation to help new submitters 
get their patches improve to where they meet this community's 
standards.  This thread just inspired another round of that.  What 
doesn't help is ever telling someone they can ignore those and still do 
something useful we're interested in.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MMAP Buffers

2011-04-16 Thread Greg Smith

Radosław Smogura wrote:
Yes, but, hmm... in Netbeans I had really long gaps (probably 8 
spaces, from tabs), so deeper "ifs", comments at the end of variables, 
went out of my screen. I really wanted to not format this, but 
sometimes I needed.


The guide at 
http://www.open-source-editor.com/editors/how-to-make-netbeans-use-tabs-for-indention.html 
seems to cover how to fix this in Netbeans.  You want it to look like 
that screen shot:  4 spaces per indent with matching tab size of 4, and 
"Expand Tabs to Spaces" unchecked.


Generally, if you look at the diff you've created, and your new code 
doesn't line up right with what's already there, that means the 
tab/space setup isn't quite right when you were editing.  Reading the 
diff is useful for catching all sorts of other issues, too, so it's just 
generally a good practice.  As Peter already mentioned, the big problem 
here is that you checked in a modified configure file.


I also note that you use C++ style "//" comments, which aren't allowed 
under the coding guidelines--even though they work fine on many common 
platforms.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Formatting Curmudgeons WAS: MMAP Buffers

2011-04-17 Thread Greg Smith

Andrew Dunstan wrote:
What makes you think this isn't possible to run pgindent? There are no 
secret incantations.


The first hit newbies find looking for info about pgindent is 
http://blog.hagander.net/archives/185-pgindent-vs-dash.html which sure 
looks like secret incantations to me.  The documentation 
src/tools/pgindent/README reads like a magic spell too:


   find . -name '*.[ch]' -type f -print | \
   egrep -v -f src/tools/pgindent/exclude_file_patterns | \
   xargs -n100 pgindent src/tools/pgindent/typedefs.list

And it doesn't actually work as written unless you've installed 
pgindent, entab/detab, and the specially patched NetBSD indent into the 
system PATH somewhere--unreasonable given that this may be executing on 
a source only tree that has never been installed.  The fact that the 
documentation is only in the README and not with the rest of the code 
conventions isn't helping either.


The last time I tried to do this a few years ago I failed miserably and 
never came back.  I know way more about building software now though, 
and just got this to work for the first time.  Attached is a WIP wrapper 
script for running pgindent that builds all the requirements into 
temporary directories, rather than expecting you to install anything 
system-wide or into a PostgreSQL destination directory.  Drop this into 
src/tools/pgindent, make it executable, and run it from that directory.  
Should do the right thing on any system that has "make" as an alias for 
"gmake" (TODO to be better about that in the file, with some other 
nagging things).


When I just ran it against master I got a bunch of modified files, but 
most of them look like things that have been touched recently so I think 
it did the right thing.  A test of my work here from someone who isn't 
running this for the first time would be helpful.  If this works well 
enough, I think it would make a good helper script to include in the 
distribution.  The loose ends to fix I can take care of easily enough 
once basic validation is finished.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


#!/bin/bash -ex

# Presume that we're in the pgindent directory at the start.
# TODO this should switch to the directory where this script is located,
# in case someone calls src/tools/pgindent/run-pgindent

# Back to the base directory
pushd ../../..

# Sanity check we're in the right place
if [ ! -d src/tools/pgindent ] ; then
  echo "run-pgindent can only be run from within the src/tools/pgindent directory, aborting"
  popd
  exit 1
fi

echo pgindent setting up environment
wget ftp://ftp.postgresql.org/pub/dev/indent.netbsd.patched.tgz
mkdir -p indent
cd indent
zcat ../indent.netbsd.patched.tgz | tar xvf -
rm -f ../indent.netbsd.patched.tgz
make
INDENT_DIR=`pwd`
cd ..

pushd src/tools/entab
make
ln -s entab detab
ENTAB_DIR=`pwd`
popd

export PATH="$INDENT_DIR:$ENTAB_DIR:$PATH"

wget -O src/tools/pgindent/typedefs.list http://buildfarm.postgresql.org/cgi-bin/typedefs.pl

# This cleanup can only happen if there is already a makefile; assume
# that if there isn't, this tree is clean enough
if [ -f GNUmakefile ] ; then
  # TODO this may need to be "gmake" on some systems instead
  make maintainer-clean
fi

echo pgindent starting run
find . -name '*.[ch]' -type f -print | \
  egrep -v -f src/tools/pgindent/exclude_file_patterns | \
  xargs -n100 src/tools/pgindent/pgindent src/tools/pgindent/typedefs.list

# Cleanup of utilities built temporarily here
unlink src/tools/entab/detab
rm -rf indent

popd
echo pgindent run complete

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Formatting Curmudgeons WAS: MMAP Buffers

2011-04-17 Thread Greg Smith

Robert Haas wrote:

But it turns out that it doesn't really matter.  Whitespace or no
whitespace, if you don't read the diff before you hit send, it's
likely to contain some irrelevant cruft, whether whitespace changes or
otherwise.
  


Right.  Presuming that pgindent will actually solve anything leaps over 
two normally incorrect assumptions:


-That the main tree was already formatted with pgindent before you 
started, so no stray diffs will result from it touching things the 
submitter isn't even involved in.


-There are no larger code formatting or diff issues except for spacing.

This has been a nagging loose end for a while, so I'd like to see 
pgindent's rough edges get sorted out so it's easier to use.  But 
whitespace errors because of bad editors are normally just a likely sign 
of a patch with bigger problems, rather than something that can get 
fixed and then the submission is good.  There is no substitute for the 
discipline of reading your own diff before submission.  I'll easily 
obsess over mine for an hour before I submit something major, and that 
time is always well spent.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Formatting Curmudgeons WAS: MMAP Buffers

2011-04-17 Thread Greg Smith

Andrew Dunstan wrote:
Now we could certainly make this quite a bit slicker. Apart from 
anything else, we should change the indent source code tarball so it 
unpacks into its own directory. Having it unpack into the current 
directory is ugly and unfriendly. And we should get rid of the "make 
clean" line in the install target of entab's makefile, which just 
seems totally ill-conceived.


I think the script I submitted upthread has most of the additional 
slickness needed here.  Looks like we both were working on documenting a 
reasonable way to do this at the same time the other day.  The idea of 
any program here relying on being able to write to /usr/local/bin as 
your example did makes this harder for people to run; that's why I made 
everything in the build tree and just pushed the appropriate directories 
into the PATH.


Since I see providing a script to automate this whole thing as the 
preferred way to make this easier, re-packaging the indent source 
tarball to extract to a directory doesn't seem worth the backwards 
compatibility trouble it will introduce.  Improving the entab makefile I 
don't have an opinion on.


It might also be worth setting it up so that instead of having to pass 
a path to a typedefs file on the command line, we default to a file 
sitting in, say, /usr/local/etc. Then you'd just be able to say 
"pgindent my_file.c".


OK, so I need to update my script to handle either indenting a single 
file, or doing all of them.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Formatting Curmudgeons WAS: MMAP Buffers

2011-04-21 Thread Greg Smith

On 04/21/2011 12:39 PM, Robert Haas wrote:

In fact, I've been wondering if we shouldn't consider extending the
support window for 8.2 past the currently-planned December 2011.
There seem to be quite a lot of people running that release precisely
because the casting changes in 8.3 were so painful, and I think the
incremental effort on our part to extend support for another year
would be reasonably small.


The pending EOL for 8.2 is the only thing that keeps me sane when 
speaking with people who refuse to upgrade, yet complain that their 8.2 
install is slow.  This last month, that seems to be more than usual "why 
does autovacuum suck so much?" complaints that would all go away with an 
8.3 upgrade.  Extending the EOL is not doing any of these users a 
favor.  Every day that goes by when someone is on a version of 
PostgreSQL that won't ever allow in-place upgrade is just making 
the eventual dump and reload they face worse.  The time spent porting to 
8.3 is a one-time thing; the suffering you get trying to have a 2011 
sized database on 2006's 8.2 just keeps adding up the longer you 
postpone it.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Formatting Curmudgeons WAS: MMAP Buffers

2011-04-21 Thread Greg Smith
're not participating in the thing that needs the 
most help.  This is not a problem you make better with fuzzy management 
directives to be nicer to people.  There are real software engineering 
issues about how to ensure good code quality at its core.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fsync reliability

2011-04-21 Thread Greg Smith

On 04/21/2011 04:26 AM, Simon Riggs wrote:

However, that begs the question of what happens with WAL. At present,
we do nothing to ensure that "the entry in the directory containing
the file has also reached disk".
   



Well, we do, but it's not obvious why that is unless you've stared at 
this for far too many hours.  A clear description of the possible issue 
you and Dan are raising showed up on LKML a few years ago:  
http://lwn.net/Articles/270891/


Here's the most relevant part, which directly addresses the WAL case:

"[fsync] is unsafe for write-ahead logging, because it doesn't really 
guarantee any _ordering_ for the writes at the hard storage level.  So 
aside from losing committed data, it can also corrupt structural 
metadata.  With ext3 it's quite easy to verify that fsync/fdatasync 
don't always write a journal entry.  (Apart from looking at the kernel 
code :-)


Just write some data, fsync(), and observe the number of writes in 
/proc/diskstats.  If the current mtime second _hasn't_ changed, the 
inode isn't written.  If you write data, say, 10 times a second to the 
same place followed by fsync(), you'll see a little more than 10 write 
I/Os, and less than 20."


There's a terrible hack suggested where you run fchmod to force the 
journal out in the next fsync that makes me want to track the poster 
down and shoot him, but this part raises a reasonable question.


The main issue he's complaining about here is a moot one for 
PostgreSQL.  If the WAL writes have been reordered but have not 
completed, the minute WAL replay hits the spot with a missing block the 
CRC32 will be busted and replay is finished.  The fact that he's 
assuming a database would have such a naive WAL implementation that it 
would corrupt the database if blocks are written out of order in between 
fsync call returning is one of the reasons this whole idea never got 
more traction--hard to get excited about a proposal whose fundamentals 
rest on an assumption that turns out not to be true on real databases.


There's still the "fsync'd a data block but not the directory entry yet" 
issue as fall-out from this too.  Why doesn't PostgreSQL run into this 
problem?  Because the exact code sequence used is this one:


open
write
fsync
close

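Spelled out in C, with a bit of error handling, that pattern is just the
following--an illustration of the sequence, not the backend code itself:

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

static int
write_durably(const char *path, const char *buf, size_t len)
{
    int         fd = open(path, O_WRONLY | O_CREAT, 0600);

    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    /* fsync before close is the part that matters for the metadata */
    return close(fd);
}
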
And Linux shouldn't ever screw that up, or the similar rename path.  
Here's what the close man page says, from 
http://linux.die.net/man/2/close :


"A successful close does not guarantee that the data has been 
successfully saved to disk, as the kernel defers writes. It is not 
common for a filesystem to flush the buffers when the stream is closed. 
If you need to be sure that the data is physically stored use fsync(2). 
(It will depend on the disk hardware at this point.)"


What this is alluding to is that if you fsync before closing, the close 
will write all the metadata out too.  You're busted if your write cache 
lies, but we already know all about that issue.


There was a discussion of issues around this on LKML a few years ago, 
with Alan Cox getting the good pull quote at 
http://lkml.org/lkml/2009/3/27/268 : "fsync/close() as a pair allows the 
user to correctly indicate their requirements."  While fsync doesn't 
guarantee that metadata is written out, and neither does close, kernel 
developers seem to all agree that fsync-before-close means you want 
everything on disk.  Filesystems that don't honor that will break all 
sorts of software.


It is of course possible there are bugs in some part of this code path, 
where a clever enough test case might expose a window of strange 
file/metadata ordering.  I think it's too weak of a theorized problem to 
go specifically chasing after though.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench \for or similar loop

2011-04-21 Thread Greg Smith

Alvaro Herrera wrote:

Why do we have pgbench at all in the first place?  Surely we could
rewrite it in plpgsql with proper stored procedures.
  


pgbench gives you a driver program with the following useful properties:

1) Multiple processes are spawned and each gets its own connection
2) A time/transaction limit is enforced across all of the connections at 
once

3) Timing information is written to a client-side log file
4) The work of running the clients can happen on a remote system, so 
that it's possible to just test the server-side performance
5) The program is similar enough to any other regular client, using the 
standard libpq interface, that connection-related overhead should be 
similar to a real workload.


All of those have some challenges before you could duplicate them in a 
stored procedure context.


My opinion of this feature is similar to the one Aiden already 
expressed:  there's already so many ways to do this sort of thing using 
shell-oriented approaches (as well as generate_series) that it's hard to 
get too excited about implementing it directly in pgbench.  Part of the 
reason for adding the \shell and \setshell commands way to make tricky 
things like this possible without having to touch the pgbench code 
further.  I for example would solve the problem you're facing like this:


1) Write a shell script that generates the file I need
2) Call it from pgbench using \shell, passing the size it needs.  Have 
that write a delimited file with the data required.

3) Import the whole thing with COPY.

And next thing you know you've even got the CREATE/COPY optimization as 
a possibility to avoid WAL, as well as the ability to avoid creating the 
data file more than once if the script is smart enough.


Sample data file generation can be difficult; most of the time I'd 
rather solve it in a general programming language.  The fact that simple 
generation cases could be done with the mechanism you propose is true.  
However, this only really helps cases that are too complicated to 
express with generate_series, yet not so complicated that you really 
want a full programming language to generate the data.  I don't think 
there's that much middle ground in that use case.


But if this is what you think makes your life easier, I'm not going to 
tell you you're wrong.  And I don't feel that your desire for this 
feature means you must tackle a more complicated thing instead--even 
though what I personally would much prefer is something making this sort 
of thing easier to do in regression tests, too.  That's a harder 
problem, though, and you're only volunteering to solve an easier one 
than that.


Stepping aside from debate over usefulness, my main code concern is that 
each time I look at the pgbench code for yet another tacked on bit, it's 
getting increasingly creakier and harder to maintain.  It's never going 
to be a good benchmark driver program capable of really complicated 
tasks.  And making it try keeps piling on the risk of breaking it for 
its intended purpose of doing simple tests.  If you can figure out how 
to keep the code contortions to implement the feature under control, 
there's some benefit there.  I can't think of a unique reason for it; 
again, lots of ways to solve this already.  But I'd probably use it if 
it were there.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench \for or similar loop

2011-04-21 Thread Greg Smith

Kevin Grittner wrote:

I'm not clear on exactly what you're proposing there, but the thing
I've considered doing is having threads to try to keep a FIFO queue
populated with a configurable transaction mix, while a configurable
number of worker threads pull those transactions off the queue and...
  


This is like the beginning of an advertisement for how Tsung is useful 
for simulating complicated workloads.  The thought of growing pgbench to 
reach that level of capabilities makes my head hurt.


When faced with this same issue, the sysbench team decided to embed Lua 
as their scripting language; sample scripts:  
http://bazaar.launchpad.net/~sysbench-developers/sysbench/0.5/files/head:/sysbench/tests/db/


This isn't very well known because the whole MySQL community fracturing 
has impacted their ability to actually release this overhaul--seems like 
they spend all their time just trying to add support for each new engine 
of the month.  I don't even like Lua, yet this still seems like a much 
better idea than trying to build on top of the existing pgbench codebase.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Formatting Curmudgeons WAS: MMAP Buffers

2011-04-22 Thread Greg Smith

Andrew Dunstan wrote:
Personally, I want pgindent installed in /usr/local/ or similar. That 
way I can have multiple trees and it will work in all of them without 
my having to build it for each. What I don't want is for the installed 
patched BSD indent to conflict with the system's indent, which is why 
I renamed it. If you still think that's a barrier to easy use, then I 
think we need a way to provide hooks in the makefiles for specifying 
the install location, so we can both be satisfied.


I don't think there's a conflict here, because the sort of uses I'm 
worried about don't want to install the thing at all; just to run it.  I 
don't really care what "make install" does because I never intend to run 
it; dumping into /usr/local is completely reasonable for the people who do.


Since there's no script I know of other than your prototype, I don't 
think repackaging is likely to break anything. That makes it worth 
doing *now* rather than later.


But frankly, I'd rather do without an extra script if possible.


Fine, the renaming bit I'm not really opposed to.  The odds that anyone 
using this thing isn't reading this message exchange are pretty low.  
There is the documentation backport issue if you make any serious change 
to it, though.  Maybe put the new version in another location and leave 
the old one where it is?


There's a fair number of steps to this though.  It's possible to 
automate them all such that running the program is trivial.  I don't 
know how we'd ever get that same ease of use without some sort of 
scripting for the whole process.  Could probably do it in a makefile 
instead, but I don't know if that's really any better.  The intersection 
between people who want to run this and people who don't have bash 
available is pretty slim I think.  I might re-write in Perl to make it 
more portable, but I think that will be at the expense of making it 
harder for people to tweak if it doesn't work out of the box.  More 
code, too.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fsync reliability

2011-04-22 Thread Greg Smith

Simon Riggs wrote:

We do issue fsync and then close, but only when we switch log files.
We don't do that as part of the normal commit path.
  


Since all these files are zero-filled before use, the space is allocated 
for them, and the remaining important filesystem layout metadata gets 
flushed during the close.  The only metadata that changes after 
that--things like the last access time--isn't important to the WAL 
functioning.  So the metadata doesn't need to be updated after a normal 
commit; it's already there.  There are two main risks when crashing 
while fsync is in the middle of executing a push out to physical 
storage: torn pages due to partial data writes, and other out-of-order 
writes.  The only filesystems where this isn't true are the copy-on-write 
ones, where the blocks move around on disk too.  But those all 
have their own more careful guarantees about metadata too.
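
To make that concrete, here is a rough sketch of the create-time pattern.  
This is not the actual xlog.c code; the segment size, flags, and error 
handling are simplified assumptions.  It just shows why the file's size 
and allocation metadata are already durable before any commit writes 
into the file:

/*
 * Simplified sketch of WAL segment creation:  allocate all of the space
 * up front, fsync once, and close.  Later commits only overwrite these
 * already-allocated blocks, so no new filesystem metadata is at risk.
 */
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SEGMENT_SIZE (16 * 1024 * 1024)

int create_wal_segment(const char *path)
{
    char zeros[8192];
    long written;
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
        return -1;
    memset(zeros, 0, sizeof(zeros));
    for (written = 0; written < SEGMENT_SIZE; written += sizeof(zeros))
    {
        if (write(fd, zeros, sizeof(zeros)) != (ssize_t) sizeof(zeros))
        {
            close(fd);
            return -1;
        }
    }
    if (fsync(fd) != 0)         /* flushes the zeros and the file's metadata */
    {
        close(fd);
        return -1;
    }
    return close(fd);
}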



The issue you raise above where "fsync is not safe for Write Ahead
Logging" doesn't sound good. I don't think what you've said has fully
addressed that yet. We could replace the commit path with O_DIRECT and
physically order the data blocks, but I would guess the code path to
durable storage has way too many bits of code tweaking it for me to
feel happy that was worth it.
  


As far as I can tell the CRC is sufficient protection against that.  
This is all data that hasn't really been committed being torn up here.  
Once you trust that the metadata problem isn't real, reordered writes 
are only going to destroy things that are in the middle of being 
flushed to disk.  A synchronous commit mangled this way will be rolled 
back regardless because it never really finished (fsync didn't return); 
an asynchronous one was never guaranteed to be on disk.
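
Here is a stripped-down illustration of that recovery rule.  The record 
layout and the checksum are placeholders I made up, not the real xlog 
record format or CRC; the point is only that replay walks records in 
order and stops at the first one whose stored checksum doesn't match its 
contents, so a torn or reordered tail is simply never applied:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct
{
    uint32_t len;               /* length of the payload that follows */
    uint32_t checksum;          /* checksum of the payload, stored with it */
} RecordHeader;

/* Stand-in for the real CRC; any strong checksum works for the argument. */
static uint32_t toy_checksum(const unsigned char *data, uint32_t len)
{
    uint32_t h = 2166136261u;   /* FNV-1a */
    uint32_t i;

    for (i = 0; i < len; i++)
        h = (h ^ data[i]) * 16777619u;
    return h;
}

/* Returns how many bytes of the log can be trusted and replayed. */
size_t usable_log_length(const unsigned char *log, size_t log_len)
{
    size_t pos = 0;

    while (pos + sizeof(RecordHeader) <= log_len)
    {
        RecordHeader hdr;

        memcpy(&hdr, log + pos, sizeof(hdr));
        if (pos + sizeof(hdr) + hdr.len > log_len)
            break;              /* record runs past the data we have */
        if (toy_checksum(log + pos + sizeof(hdr), hdr.len) != hdr.checksum)
            break;              /* torn or reordered write: stop replay here */
        pos += sizeof(hdr) + hdr.len;
    }
    return pos;
}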


On many older Linux systems O_DIRECT is a less reliable code path than 
write/fsync is, so you're right that isn't necessarily a useful 
step forward.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fsync reliability

2011-04-22 Thread Greg Smith

On 04/22/2011 09:32 AM, Simon Riggs wrote:

OK, that's good, but ISTM we still have a hole during
RemoveOldXlogFiles() where we don't fsync or open/close the file, just
rename it.
   


This is also something that many applications rely upon working as hoped 
for here, even though it's not technically part of POSIX.  Early 
versions of ext4 broke that, and it caused a giant outcry of 
complaints.  
http://www.h-online.com/open/news/item/Ext4-data-loss-explanations-and-workarounds-740671.html 
has a good summary.  This was broken on ext4 from around 2.6.28 to 
2.6.30, but the fix for it was so demanded that it's even been ported by 
the relatively lazy distributions to their 2.6.28/2.6.29 kernels.


There may be a small window for metadata issues here if you've put the 
WAL on ext2 and there's a crash in the middle of rename.  That factors 
into why any suggestions I make about using ext2 come with a load of 
warnings about the risk of not journaling.  It's hard to predict every 
type of issue that fsck might force you to come to terms with after a 
crash on ext2, and if there was a problem with this path I'd expect it 
to show up as something to be reconciled then.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fsync reliability

2011-04-25 Thread Greg Smith

On 04/24/2011 10:06 PM, Daniel Farina wrote:

On Thu, Apr 21, 2011 at 8:51 PM, Greg Smith  wrote:
   

There's still the "fsync'd a data block but not the directory entry yet"
issue as fall-out from this too.  Why doesn't PostgreSQL run into this
problem?  Because the exact code sequence used is this one:

open
write
fsync
close

And Linux shouldn't ever screw that up, or the similar rename path.  Here's
what the close man page says, from http://linux.die.net/man/2/close :
 

Theodore Ts'o addresses this *exact* sequence of events, and suggests
if you want that rename to definitely stick that you must fsync the
directory:

http://www.linuxfoundation.org/news-media/blogs/browse/2009/03/don%E2%80%99t-fear-fsync
   


Not exactly.  That's talking about the sequence used for creating a 
file, plus a rename.  When new WAL files are being created, I believe 
the ugly part of this is avoided.  The path when WAL files are recycled 
using rename does seem to be the one with the most likely edge case.


The difficult case Ts'o's discussion is trying to satisfy involves 
creating a new file and then swapping it for an old one atomically.  
PostgreSQL never does that exactly.  It creates new files, pads them 
with zeros, and then starts writing to them; it also renames old files 
that are already of the correct length.  Combine that with the fact that 
there are always fsyncs after writes to the files, and this case really 
isn't exactly the same as any of the others people are complaining about.
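
For reference, here is a hedged sketch of the general technique Ts'o 
describes for making the rename itself durable.  It is only an 
illustration of the idea, not the PostgreSQL recycling code:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int rename_durably(const char *oldpath, const char *newpath,
                   const char *dirpath)
{
    int dirfd;

    if (rename(oldpath, newpath) != 0)
        return -1;

    /*
     * rename() only changes directory metadata; fsyncing the directory
     * itself is what makes that change survive a crash.
     */
    dirfd = open(dirpath, O_RDONLY);
    if (dirfd < 0)
        return -1;
    if (fsync(dirfd) != 0)
    {
        close(dirfd);
        return -1;
    }
    return close(dirfd);
}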


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fsync reliability

2011-04-25 Thread Greg Smith

On 04/23/2011 09:58 AM, Matthew Woodcraft wrote:

As far as I can make out, the current situation is that this fix (the
auto_da_alloc mount option) doesn't work as advertised, and the ext4
maintainers are not treating this as a bug.

See https://bugzilla.kernel.org/show_bug.cgi?id=15910
   


I agree with the resolution that this isn't a bug.  As pointed out 
there, XFS does the same thing, and this behavior isn't going away any 
time soon.  Leaving behind zero-length files is what happens when 
developers try to optimize away a necessary fsync.


Here's the part where the submitter goes wrong:

"We first added a fsync() call for each extracted file. But scattered 
fsyncs resulted in a massive performance degradation during package 
installation (factor 10 or more, some reported that it took over an hour 
to unpack a linux-headers-* package!) In order to reduce the I/O 
performance degradation, fsync calls were deferred..."


Stop right there; the slow path was the only one that had any hope of 
being correct.  It can actually slow things by a factor of 100X or more, 
worst-case.  "So, we currently have the choice between filesystem 
corruption or major performance loss":  yes, you do.  Writing files is 
tricky and it can either be slow or safe.  If you're going to avoid even 
trying to enforce the right thing here, you're really going to get 
burned.


It's unfortunate that so many people are used to the speed you get in 
the common situation for a while now with ext3 and cheap hard drives:  
all writes are cached unsafely, but the filesystem resists a few bad 
behaviors.  Much of the struggle where people say "this is so much 
slower, I won't put up with it" and try to code around it is futile, and 
it's hard to separate out the attempts to find such optimizations from 
the legitimate complaints.


Anyway, you're right to point out that the filesystem is not necessarily 
going to save anyone from some of the tricky rename situations even with 
the improvements made to delayed allocation.  They've fixed some of the 
worst behavior of the earlier implementation, but there are still 
potential issues in that area it seems.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] branching for 9.2devel

2011-04-25 Thread Greg Smith

On 04/25/2011 02:26 PM, Josh Berkus wrote:

Overall, I think the advantages to a faster/shorter CF cycle outweigh
the disadvantages enough to make it at least worth trying.  I'm willing
to run the first 1-week CF, as well as several of the others during the
9.2 cycle to try and make it work.
   


It will be interesting to see if it's even possible to get all the 
patches assigned a reviewer in a week.  The only way I've come up with 
to make this more feasible is to really buckle down on the idea 
that all submitters should also be reviewing.  I would consider it a fair 
policy to say that anyone who doesn't volunteer to review someone else's 
patch gets nudged toward the bottom of the reviewer priority list.  
Didn't offer to review someone else's patch?  Don't be surprised if you 
get bumped out of a week-long 'fest.


This helps with two problems.  It helps fill in short-term reviewers, 
and in the long term it makes submitters more competent.  The way things 
are set up now, anyone who submits a patch without having done a review 
first is very likely to get their patch bounced back; the odds of 
getting everything right without that background are low.  Ideally 
submitters might even start fixing their own patches without reviewer 
prompting, once they get into someone else's and realize what they 
didn't do.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improving the memory allocator

2011-04-26 Thread Greg Smith

On 04/25/2011 05:45 PM, Andres Freund wrote:

The profile I used in this case was:

pgbench -h /tmp/ -p5433 -s 30 pgbench -S -T 20
   


I'd suggest collecting data from running this with "-M prepared" at some 
point too, so that you can get a basic idea of which of these allocations 
are avoided when using prepared statements.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] improvements to pgtune

2011-04-27 Thread Greg Smith

Shiv wrote:
 On the program I hope to learn as much about professional software 
engineering principles as PostgreSQL. My project is aimed towards 
extending and hopefully improving upon pgtune. If any of you have some 
ideas or thoughts to share. I am all ears!!


Well, first step on the software engineering side is to get a copy of 
the code in a form you can modify.  I'd recommend grabbing it from 
https://github.com/gregs1104/pgtune ; while there is a copy of the 
program on git.postgresql.org, it's easier to work with the one on 
github instead.  I can push updates over to the copy on postgresql.org 
easily enough, and that way you don't have to worry about getting an 
account on that server.


There's a long list of suggested improvements to make at 
https://github.com/gregs1104/pgtune/blob/master/TODO


Where I would recommend getting started is doing some of the small items 
on there, some of which I have already put comments into the code about 
but just not finished yet.  Some examples:


-Validate against min/max
-Show original value in output
-Limit shared memory use on Windows (see notes on shared_buffers at 
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server for more 
information)

-Look for postgresql.conf file using PGDATA environment variable
-Look for settings files based on path of the pgtune executable
-Save settings reference files for newer versions of PostgreSQL (right 
now I only target 8.4) and allow passing in the version you're configuring.


A common mistake made by GSOC students is to dive right in to trying to 
make big changes.  You'll be more successful if you get practice at 
things like preparing and sharing patches on smaller changes first.


At the next level, there are a few larger features that I would consider 
valuable that are not really addressed by the program yet:


-Estimate how much shared memory is used by the combination of 
settings.  See Table 17-2 at 
http://www.postgresql.org/docs/9.0/static/kernel-resources.html ; those 
numbers aren't perfect, and improving that table is its own useful 
project.  But it gives an idea of how they fit together; a rough sketch 
of what that computation might look like follows this list.  I have some 
notes at the end of the TODO file on how I think the information needed 
to produce this needs to be passed around the inside of pgtune.


-Use that estimate to produce a sysctl.conf file for one platform; Linux 
is the easiest one to start with.  I've attached a prototype showing how 
to do that, written in bash.


-Write a Python-TK or web-based front-end for the program.
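
Here is the rough shared memory estimate sketch mentioned above.  The 
per-item overheads are placeholder numbers I made up just to show the 
shape of the calculation; the real per-buffer, per-connection, and 
per-lock figures should come from the documentation table referenced 
earlier:

#include <stdio.h>

long estimate_shared_memory_bytes(long shared_buffers, long wal_buffers,
                                  long max_connections,
                                  long max_locks_per_transaction)
{
    const long block_size = 8192;               /* BLCKSZ */
    const long per_buffer_overhead = 200;       /* placeholder */
    const long per_connection =
        400 + 270 * max_locks_per_transaction;  /* placeholder */
    const long fixed_overhead = 800 * 1024;     /* placeholder */

    return shared_buffers * (block_size + per_buffer_overhead)
        + wal_buffers * block_size
        + max_connections * per_connection
        + fixed_overhead;
}

int main(void)
{
    /* e.g. shared_buffers=32768 (256MB), wal_buffers=64, 100 connections */
    printf("%ld bytes\n", estimate_shared_memory_bytes(32768, 64, 100, 64));
    return 0;
}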

Now that I know someone is going to work on this program again, I'll see 
what I can do to clean some parts of it up.  There are a couple of 
things it's easier for me to just fix rather than to describe, like the 
way I really want to change how it adds comments to the settings it changes.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


#!/bin/bash

# Output lines suitable for sysctl configuration based
# on total amount of RAM on the system.  The output
# will allow up to 50% of physical memory to be allocated
# into shared memory.

# On Linux, you can use it as follows (as root):
# 
# ./shmsetup >> /etc/sysctl.conf
# sysctl -p

# Early FreeBSD versions do not support the sysconf interface
# used here.  The exact version where this works hasn't
# been confirmed yet.

page_size=`getconf PAGE_SIZE`
phys_pages=`getconf _PHYS_PAGES`

if [ -z "$page_size" ]; then
  echo Error:  cannot determine page size
  exit 1
fi

if [ -z "$phys_pages" ]; then
  echo Error:  cannot determine number of memory pages
  exit 2
fi

shmall=`expr $phys_pages / 2`
shmmax=`expr $shmall \* $page_size` 

echo \# Maximum shared segment size in bytes
echo kernel.shmmax = $shmmax
echo \# Maximum number of shared memory segments in pages
echo kernel.shmall = $shmall

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] improvements to pgtune

2011-04-27 Thread Greg Smith

Daniel Farina wrote:

It seems like in general it lacks a feedback mechanism to figure out 
settings
from workloads, instead relying on Greg Smith's sizable experience to
do some arithmetic and get you off the ground in a number of common cases.
  


To credit appropriately, the model used right now actually originated 
with a Josh Berkus spreadsheet, from before I was doing this sort of 
work full-time.  That's held up pretty well, but it doesn't fully 
reflect how I do things nowadays.  The recent realization that pgtune is 
actually shipping as a package for Debian/Ubuntu has made me realize 
this is a much higher-profile project now, one that I should revisit 
doing a better job on.


Every time I've gotten pulled into discussions of setting parameters 
based on live monitoring, it's turned into a giant black hole--absorbs a 
lot of energy, nothing useful escapes from it.  I credit completely 
ignoring that idea altogether, and using the simplest possible static 
settings instead, as one reason I managed to ship code here that people 
find useful.  I'm not closed to the idea, just not optimistic it will 
lead anywhere useful.  That makes it hard to work on when there are so 
many obvious things guaranteed to improve the program that could be done 
instead.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Predicate locking

2011-05-03 Thread Greg Smith

Dan Ports wrote:

Yes, you're right -- the current implementation of SSI only locks
indexes at the granularity of index pages. So although those
transactions don't actually access the same records, they're detected
as a conflict because they're on the same index page.


Let's try to demonstrate that with an update to Vlad's example.  Run 
this on a first client to generate the same type of table, but with an 
easy-to-vary number of rows in it:


drop table t;
create table t (id bigint, value bigint);
insert into t(id,value) (select s,1 from generate_series(1,348) as s);
create index t_idx on t(id);
begin transaction;
set transaction isolation level serializable;
select * from t where id = 2;
insert into t (id, value) values (-2, 1);

Execute this on the second client:

begin transaction;
set transaction isolation level serializable;
select * from t where id = 3;
insert into t (id, value) values (-3, 0);
commit;

And then return to the first one to run COMMIT.  I'm getting a 
serialization failure in that case.  However, if I increase the 
generate_series to create 349 rows (or more) instead, it works.  I don't 
see where it crosses a page boundary in table or index size going from 
348 to 349, so I'm not sure why I'm seeing a change happening there.  In 
both cases, there's only one non-header index block involved.


I don't think Vlad is being unreasonable here; he's provided a test case 
demonstrating the behavior he'd like to see, and shown it doesn't work 
as expected.  If we can prove that test does work on tables of 
non-trivial size, and that it only suffers from to-be-optimized broader 
locks than necessary, that would make a more compelling argument against 
the need for proper predicate locks.  I don't fully understand why my 
attempt to do that here works the way it does, though.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Predicate locking

2011-05-03 Thread Greg Smith

Kevin Grittner wrote:

I don't think Vlad is being unreasonable here; he's provided a
test case demonstrating the behavior he'd like to see, and shown
it doesn't work as expected.

 
... on a toy table with contrived values.  How different is this

from the often-asked question about why a query against a four-line
table is not using the index they expect, and how can we expect it
to scale if it doesn't?


It's not, but in that case I've been known to show someone that the 
behavior they're seeing doesn't happen on a larger table.  My point was 
just that no one has really done that here yet:  provided an example 
showing SSI serialization working as a substitute for predicate locking 
in this sort of use case.  I trust that the theory is sound here, and 
yet I'd still like to see that demonstrated.  This is why I launched 
into making a less trivial test case.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Predicate locking

2011-05-03 Thread Greg Smith

Kevin Grittner wrote:

Check where the plan goes from a table scan to an indexed access.
Also look at what is showing for SIRead locks in pg_locks as you go.
Between those two bits of information, it should become apparent.


OK, so this doesn't look to be an index-lock-related thing at all here.  
Updated test case does this to create the table and show some additional 
state:


drop table t;
create table t (id bigint, value bigint);
insert into t(id,value) (select s,1 from generate_series(1,348) as s);
create index t_idx on t(id);
begin transaction;
set transaction isolation level serializable;
explain analyze select * from t where id = 2;
select pid,locktype,relation::regclass,page,tuple from pg_locks where 
mode='SIReadLock';

insert into t (id, value) values (-2, 1);
select pid,locktype,relation::regclass,page,tuple from pg_locks where 
mode='SIReadLock';


Do the same thing as before on the second process:

begin transaction;
set transaction isolation level serializable;
select * from t where id = 3;
insert into t (id, value) values (-3, 0);
commit;

Then return to the first client to commit.  When I execute that with 348 
records, the case that fails, it looks like this:


gsmith=# explain analyze select * from t where id = 2;
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..6.35 rows=2 width=16) (actual time=0.106..0.286 rows=1 loops=1)
   Filter: (id = 2)
 Total runtime: 0.345 ms
(3 rows)

gsmith=# select pid,locktype,relation::regclass,page,tuple from pg_locks
         where mode='SIReadLock';

 pid  | locktype | relation | page | tuple
------+----------+----------+------+-------
 1495 | relation | t        |      |

So it's actually grabbing a lock on the entire table in that situation.  
The other client does the same thing, and they collide with the 
described serialization failure.


The minute I try that with a table that is 349 rows instead, it switches 
plans:


gsmith=# explain analyze select * from t where id = 2;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=4.27..6.29 rows=2 width=16) (actual time=0.169..0.171 rows=1 loops=1)
   Recheck Cond: (id = 2)
   ->  Bitmap Index Scan on t_idx  (cost=0.00..4.27 rows=2 width=0) (actual time=0.144..0.144 rows=1 loops=1)
         Index Cond: (id = 2)
 Total runtime: 0.270 ms
(5 rows)

gsmith=# select pid,locktype,relation::regclass,page,tuple from pg_locks
         where mode='SIReadLock';

 pid  | locktype | relation | page | tuple
------+----------+----------+------+-------
 1874 | page     | t_idx    |    1 |
 1874 | tuple    | t        |    0 |     2
(2 rows)

Grabbing a lock on the index page and the row, as Dan explained it 
would.  This actually eliminates this particular serialization failure 
altogether here though, even with these still on the same table and 
index page.


So the root problem with Vlad's test isn't the index lock at all; it's 
heavy locking from the sequential scan that's executing on the trivial 
cases.  If he expands his tests to use a larger amount of data, such 
that the plan switches to a realistic one, his results with the new 
serialization mode may very well be more satisfying.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch for new feature: Buffer Cache Hibernation

2011-05-04 Thread Greg Smith

Alvaro Herrera wrote:

As for gain, I have heard of test setups requiring hours of runtime in
order to prime the buffer cache.
  


And production ones too.  I have multiple customers where a server 
restart is almost a planned multi-hour downtime.  The system may be back 
up, but for a couple of hours performance is so terrible it's barely 
usable.  You can watch the MB/s ramp up as the more random data fills in 
over time; getting that taken care of in a larger block more amenable to 
elevator sorting would be a huge help.


I never bothered with this particular idea though because shared_buffers 
is only a portion of the important data.  Cedric's pgfincore code digs 
into the OS cache, too, which can then save enough to be really useful 
here.  And that's already got a snapshot/restore feature.  The slides at 
http://www.pgcon.org/2010/schedule/events/261.en.html have a useful intro 
to that; pages 30 through 34 are the neat ones.  That provides some 
other neat APIs for preloading popular data into cache too.  I'd rather 
work on getting something like that into core, rather than adding 
something that only is targeting just shared_buffers.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Compiling a PostgreSQL 7.3.2 project with Eclipse

2011-05-06 Thread Greg Smith

Krešimir Križanović wrote:



However, where I compile and run the project, in the Eclipse console I 
get:


POSTGRES backend interactive interface

$Revision: 1.115 $ $Date: 2006/02/06 01:19:46 $

I don’t know what to do with this backend interface. I would like to 
get a postmaster running and to connect to a data base with psql. 
However, when i try to start psql, it says that there is no postmaster 
running.




An example of a session with that interface is at 
http://archives.postgresql.org/pgsql-hackers/2000-01/msg01471.php ; you 
may find it useful at some point.


Your problem is caused by a change made to PostgreSQL's naming 
convention after the 7.3 fork you're using. In earlier versions, 
"postgres" meant start the server in single user mode: 
http://www.postgresql.org/docs/7.3/static/app-postgres.html


While "postmaster" started it as a proper server: 
http://www.postgresql.org/docs/7.3/static/app-postmaster.html


In modern versions, they are the same thing. The Eclipse example uses 
"postgres", which starts the regular server now, but in 7.3 only started 
single user mode. Change where you run the program to use "postmaster" 
instead and it should work more like what you're expecting.


Doing something useful with the TelegraphCQ code is probably going to 
take you a while.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why not install pgstattuple by default?

2011-05-06 Thread Greg Smith

Christopher Browne wrote:

I'm getting "paper cuts" quite a bit these days over the differences
between what different packaging systems decide to install.  The one
*I* get notably bit on, of late, is that I have written code that
expects to have pg_config to do some degree of self-discovery, only to
find production folk complaining that they only have "psql" available
in their environment.
  


Given the other improvements in being able to build extensions in 9.1, 
we really should push packagers to move pg_config from the PostgreSQL 
development package into the main one starting in that version.  I've 
gotten bit by this plenty of times.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch for new feature: Buffer Cache Hibernation

2011-05-06 Thread Greg Smith

On 05/05/2011 05:06 AM, Mitsuru IWASAKI wrote:

In summary, PgFincore's target is File System Buffer Cache, Buffer
Cache Hibernation's target is DB Buffer Cache(shared buffers).
   


Right.  The thing to realize is that shared_buffers is becoming a 
smaller fraction of the total RAM used by the database every year.  On 
Windows, useful settings have been stuck below 512MB for a 
while now.  And on UNIX systems, around 8GB seems to be the effective upper 
limit.  Best case, shared_buffers is only going to be around 25% of 
total RAM; worst case, you might have a Windows server with 
64GB of RAM where shared_buffers is less than 1% of total RAM.


There's nothing wrong with the general idea you're suggesting.  It's 
just only targeting a small (and shrinking) subset of the real problem 
here.  Rebuilding cache state starts with shared_buffers, but that's not 
enough of the problem to be an effective tweak on many systems.


I think that all the complexity with CRCs etc. is unlikely to lead 
anywhere too, and those two issues are not completely unrelated.  The 
simplest, safest thing here is the right way to approach this, not the 
most complicated one, and a simpler format might add some flexibility 
here to reload more cache state too.  The bottleneck on reloading the 
cache state is reading everything from disk.  Trying to micro-optimize 
any other part of that is moving in the wrong direction to me.  I doubt 
you'll ever measure a useful benefit that overcomes the expense of 
maintaining the code.  And you seem to be moving to where someone can't 
restore cache state when they change shared_buffers.  A simpler 
implementation might still work in that situation: reload until you run 
out of buffers if shared_buffers shrinks, or until you've gone through 
everything that was saved at the original size if it grows.
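
As a purely illustrative sketch, with a made-up tag structure and file 
format rather than anything from the patch, the simpler approach could 
just be a flat file of buffer tags, with the restore loop stopping at 
whichever comes first, the end of the saved file or the current 
shared_buffers count:

#include <stdint.h>
#include <stdio.h>

typedef struct
{
    uint32_t tablespace;
    uint32_t relation;
    uint32_t block;
} SavedBufferTag;               /* hypothetical, not the server's BufferTag */

long restore_saved_buffers(const char *path, long current_shared_buffers)
{
    FILE *fp = fopen(path, "rb");
    SavedBufferTag tag;
    long restored = 0;

    if (fp == NULL)
        return 0;
    while (restored < current_shared_buffers &&
           fread(&tag, sizeof(tag), 1, fp) == 1)
    {
        /*
         * A real implementation would read the referenced block here,
         * for example with a prefetch or a plain buffered read.
         */
        restored++;
    }
    fclose(fp);
    return restored;
}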


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why not install pgstattuple by default?

2011-05-06 Thread Greg Smith

On 05/06/2011 05:58 PM, Josh Berkus wrote:

Yeah, I wasn't thinking of including all of contrib. There's a lot of
reasons not to do that.  I was asking about pgstattuple in particular,
since it's:
(a) small
(b) has no external dependencies
(c) adds no stability risk or performance overhead
(d) is usually needed on production systems when it's needed at all

It's possible that we have one or two other diagnostic utilities which
meet the above profile. pageinspect, maybe?
   


I use pgstattuple, pageinspect, pg_freespacemap, and pg_buffercache 
regularly enough that I wish they were more common.  Throw in pgrowlocks 
and you've got the whole group Robert put into the debug set.  It makes 
me sad every time I finish a utility using one of these and realize I'll 
have to include the whole "make sure you have the contrib modules 
installed" disclaimer in its documentation again.


These are the only ones I'd care about moving into a more likely place.  
The rest of the contrib modules are the sort where if you need them, you 
realize that early and get them installed.  These are different by 
virtue of their need popping up most often during emergencies.  The fact 
that I believe they all match the low impact criteria too makes it even 
easier to consider.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why not install pgstattuple by default?

2011-05-07 Thread Greg Smith

On 05/07/2011 12:42 PM, Peter Eisentraut wrote:

On fre, 2011-05-06 at 14:32 -0400, Greg Smith wrote:
   

Given the other improvements in being able to build extensions in 9.1,
we really should push packagers to move pg_config from the PostgreSQL
development package into the main one starting in that version.  I've
gotten bit by this plenty of times.
 

Do you need pg_config to install extensions?
   


No, but you still need it to build them.  PGXN is a source code 
distribution method, not a binary one.  It presumes users can build 
modules they download using PGXS.  No pg_config, no working PGXS, no 
working PGXN.  For such a small binary to ripple out to that much impact is bad.


The repmgr program we distribute has the same problem, so I've been 
getting first-hand reports of just how many people are likely to run 
into this recently.  You have to install postgresql-devel with RPM, and 
on Debian the very non-obvious postgresql-server-dev-$version package.


Anyway, didn't want to hijack this thread beyond pointing out that if 
there any package reshuffling that happens for contrib changes, it 
should check for and resolve this problem too.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why not install pgstattuple by default?

2011-05-07 Thread Greg Smith

On 05/06/2011 04:06 PM, Tom Lane wrote:

FWIW, I did move pg_config from -devel to the "main" (really client)
postgresql package in Fedora, as of 9.0.  That will ensure it's present
in either client or server installations.  Eventually that packaging
will reach RHEL ...
   


We should make sure that the PGDG packages adopt that for 9.1 then, so 
it starts catching on more.  Unless Devrim has changed it to catch up since 
I last installed an RPM set, in 9.0 it's still in the same place:


$ rpm -qf /usr/pgsql-9.0/bin/pg_config
postgresql90-devel-9.0.2-2PGDG.rhel5

While Peter's question about whether it's really all that useful is 
reasonable, I'd at least like to get a better error message when you 
don't have everything needed to compile extensions.  I think the 
shortest path to that is making pg_config more likely to be installed, 
then to check whether the file "pg_config --pgxs" references exists.  
I'll see if I can turn that idea into an actual change to propose.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why not install pgstattuple by default?

2011-05-07 Thread Greg Smith
The attached patch is a first cut at what moving one contrib module (in this 
case pg_buffercache) to a new directory structure might look like.  The 
idea is that src/extension could become a place for "first-class" 
extensions to live.  Those are ones the community is committed to providing 
in core, but are just better implemented as extensions than in-database 
functions, for reasons that include security.  This idea has been shared 
by a lot of people for a while; the only problem is that it wasn't really 
practical to implement cleanly until the extensions code hit.  I think 
it is now, and this attempts to prove it.


Since patches involving file renaming are clunky, the changes might be 
easier to see at 
https://github.com/greg2ndQuadrant/postgres/commit/507923e21e963c873a84f1b850d64e895776574f 
where I just pushed this change too.  The install step for the modules 
looks like this now:


gsmith@grace:~/pgwork/src/move-contrib/src/extension/pg_buffercache$ 
make install

/bin/mkdir -p '/home/gsmith/pgwork/inst/move-contrib/lib/postgresql'
/bin/mkdir -p 
'/home/gsmith/pgwork/inst/move-contrib/share/postgresql/extension'
/bin/sh ../../../config/install-sh -c -m 755  pg_buffercache.so 
'/home/gsmith/pgwork/inst/move-contrib/lib/postgresql/pg_buffercache.so'
/bin/sh ../../../config/install-sh -c -m 644 ./pg_buffercache.control 
'/home/gsmith/pgwork/inst/move-contrib/share/postgresql/extension/'
/bin/sh ../../../config/install-sh -c -m 644 ./pg_buffercache--1.0.sql 
./pg_buffercache--unpackaged--1.0.sql  
'/home/gsmith/pgwork/inst/move-contrib/share/postgresql/extension/'

$ psql -c "create extension pg_buffercache"
CREATE EXTENSION

The only clunky bit I wasn't really happy with is the amount of code 
duplication that comes from having a src/extension/Makefile that looks 
almost, but not quite, identical to contrib/Makefile.  The rest of the 
changes don't seem too bad to me, and even that's really only 36 lines 
that aren't touched often.  Yes, the paths are different, so backports 
won't happen without an extra step.  But the code changes required were 
easier than I was expecting, due to the general good modularity of the 
extensions infrastructure.  So long as the result ends up in 
share/postgresql/extension/ , whether they started in contrib/ 
or src/extension/ doesn't really matter to CREATE EXTENSION.  
But having them broken out this way makes it easy for the default 
Makefile to build and install them all.  (I recognize I didn't do that 
last step yet though)


I'll happily go convert pgstattuple and the rest of the internal 
diagnostics modules to this scheme, and do the doc cleanups, this 
upcoming week if it means I'll be able to use those things without 
installing all of contrib one day.  Ditto for proposing RPM and Debian 
packaging changes that match them.  All that work will get paid back the 
first time I don't have to fill out a bunch of paperwork (again) at a 
customer site justifying why they need to install the contrib [RPM|deb] 
package (which has some scary stuff in it) on all their servers, just so 
I can get some bloat or buffer inspection module.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


diff --git a/contrib/Makefile b/contrib/Makefile
index 6967767..04cf330 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -30,7 +30,6 @@ SUBDIRS = \
 		pageinspect	\
 		passwordcheck	\
 		pg_archivecleanup \
-		pg_buffercache	\
 		pg_freespacemap \
 		pg_standby	\
 		pg_stat_statements \
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
deleted file mode 100644
index 323c0ac..000
--- a/contrib/pg_buffercache/Makefile
+++ /dev/null
@@ -1,18 +0,0 @@
-# contrib/pg_buffercache/Makefile
-
-MODULE_big = pg_buffercache
-OBJS = pg_buffercache_pages.o
-
-EXTENSION = pg_buffercache
-DATA = pg_buffercache--1.0.sql pg_buffercache--unpackaged--1.0.sql
-
-ifdef USE_PGXS
-PG_CONFIG = pg_config
-PGXS := $(shell $(PG_CONFIG) --pgxs)
-include $(PGXS)
-else
-subdir = contrib/pg_buffercache
-top_builddir = ../..
-include $(top_builddir)/src/Makefile.global
-include $(top_srcdir)/contrib/contrib-global.mk
-endif
diff --git a/contrib/pg_buffercache/pg_buffercache--1.0.sql b/contrib/pg_buffercache/pg_buffercache--1.0.sql
deleted file mode 100644
index 9407d21..000
--- a/contrib/pg_buffercache/pg_buffercache--1.0.sql
+++ /dev/null
@@ -1,17 +0,0 @@
-/* contrib/pg_buffercache/pg_buffercache--1.0.sql */
-
--- Register the function.
-CREATE FUNCTION pg_buffercache_pages()
-RETURNS SETOF RECORD
-AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
-LANGUAGE C;
-
--- Create a view for convenient access.
-CREATE VIEW pg_buffercache AS
-	SELECT P.* FROM pg_buffercache_pages() AS P
-	(bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
-	 r

Re: [HACKERS] patch for new feature: Buffer Cache Hibernation

2011-05-07 Thread Greg Smith

Mitsuru IWASAKI wrote:

the patch is available at:
http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110508.patch 
  


We can't accept patches just based on a pointer to a web site.  Please 
e-mail this to the mailing list so that it can be considered a 
submission under the project's licensing terms.



I hope this would be committable and the final version.
  


PostgreSQL has high standards for code submissions.  Extremely few 
submissions are committed without significant revisions to them based on 
code review.  So far you've gotten a first round of high-level design 
review, there's several additional steps before something is considered 
for a commit.  The whole process is outlined at 
http://wiki.postgresql.org/wiki/Submitting_a_Patch


From a couple of minutes of reading the patch, the first things that 
pop out as problems are:


-All of the ControlFile -> controlFile renaming has added a larger 
difference to ReadControlFile than I would consider ideal.

-Touching StrategyControl is not something this patch should be doing.
-I don't think your justification ("debugging or portability") for 
keeping around your original code in here is going to be sufficient to 
do so.
-This should not be named enable_buffer_cache_hibernation.  That very 
large diff you ended up with in the regression tests is because all of 
the settings named enable_* are optimizer control settings.  Using the 
name "buffer_cache_hibernation" instead would make a better starting point.


From a bigger picture perspective, this really hasn't addressed any of 
my comments about shared_buffers only being the beginning of the useful 
cache state to worry about here.  I'd at least like the solution to the 
buffer cache save/restore to have a plan for how it might address that 
too one day.  This project is also picky about only committing code that 
fits into the long-term picture for desired features.


Having a working example of a server-side feature doing cache storage 
and restoration is helpful though.  Don't think your work here is 
unappreciated--it is appreciated.  Getting this feature added is just a harder 
problem than what you've done so far.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] improvements to pgtune

2011-05-08 Thread Greg Smith
ased, while 
working on simpler programs that don't aim so high leads to software 
that ships to the world and finds users.  The only reason pgtune is now 
available in packaged form on multiple operating systems is that I 
ignored all advice about aiming for a complicated tool and instead wrote 
a really simple one.  That was hard enough to finish.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Formatting Curmudgeons WAS: MMAP Buffers

2011-05-09 Thread Greg Smith

On 05/09/2011 12:06 PM, Andrew Dunstan wrote:
The fact that we can do in place upgrades of the data only addresses 
one pain point in upgrading. Large legacy apps require large retesting 
efforts when upgrading, often followed by lots more work renovating 
the code for backwards incompatibilities. This can be a huge cost for 
what the suits see as little apparent gain, and making them do it more 
frequently in order to stay current will not win us any friends.


I just had a "why a new install on 8.3?" conversation today, and it was 
all about the application developer not wanting to do QA all over again 
for a later release.


Right now, one of the major drivers for "why upgrade?" has been the 
performance improvements in 8.3, relative to any older version.  The 
main things pushing happy 8.3 sites to 8.4 or 9.0 that I see are either 
VACUUM issues (improved with partial vacuum in 8.4) or wanting real-time 
replication (9.0).  I predict many sites that don't want either are 
likely to sit on 8.3 for a really long time.  The community won't be 
able to offer a compelling reason why smaller sites in particular should 
go through the QA an upgrade requires.  The fact that the app QA time is 
now the main driver--not the dump and reload time--is good, because it 
does make it easier for the people with the biggest data sets 
to move.  They're the ones that need the newer versions the most anyway, 
and in that regard having in-place upgrade start showing up as of 8.3 
was really just in time.


I think 8.3 is going to be one of those releases like 7.4, where people 
just keep running it forever.  At least shortening the upgrade path has 
made that concern a little bit better.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why not install pgstattuple by default?

2011-05-09 Thread Greg Smith

On 05/09/2011 03:31 PM, Alvaro Herrera wrote:

For executables we already have src/bin. Do we really need a separate
place for, say, pg_standby or pg_upgrade?
   


There are really no executables in contrib that I find myself regularly 
desperate for/angry at because they're not installed as an integral part 
of the server, the way I regularly find myself cursing some things that 
are now extensions.  The only contrib executable I use often is pgbench, 
and that's normally early in server life--when it's still possible to 
get things installed easily most places.  And it's something that can be 
removed when finished, in cases where people are nervous about the 
contrib package.


Situations where pg_standby or pg_upgrade suddenly pop up as emergency 
needs seem unlikely too, which is also the case with oid2name, 
pg_test_fsync, pg_archivecleanup, and vacuumlo.  I've had that happen 
with pg_archivecleanup exactly once since it appeared--running out of 
space and that was the easiest way to make the problem go away 
immediately and permanently--but since it was on an 8.4 server we had to 
download the source and build anyway.


Also, my experience is that people are not that paranoid about running 
external binaries, even though they could potentially do harm to the 
database.  They can always do a backup beforehand.  But the idea of loading a 
piece of code that lives in the server all the time freaks them out.  
The way the word contrib implies (and sometimes is meant to mean) low 
quality, while stuff that ships with the main server package does not, 
has been beaten up here for years already.  It's only a few cases where 
that's not fully justified, and the program can easily become an 
extension, that I feel are really worth changing here.  There are 49 
directories in contrib/ ; at best maybe 20% of them will ever fall into 
that category.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why not install pgstattuple by default?

2011-05-09 Thread Greg Smith

On 05/09/2011 02:31 PM, Robert Haas wrote:

I don't think we should be too obstinate about trying to twist the arm
of packagers who (as Tom points out) will do whatever they want in
spite of us, but the current state of contrib, with all sorts of
things of varying type, complexity, and value mixed together cannot
possibly be a good thing.


I think the idea I'm running with for now means that packagers won't 
actually have to do anything.  I'd expect typical packaging for 9.1 to 
include share/postgresql/extension from the build results without being 
more specific.  You need to grab 3 files from there to get the plpgsql 
extension, and I can't imagine any packager listing them all by name.  
So if I dump the converted contrib extensions to put files under there, 
and remove them from the contrib build area destination, I suspect they 
will magically jump from the contrib to the extensions area of the 
server package at the next package build; no packager-level changes 
required.  The more I look at this, the less obtrusive a change it 
seems to be.  The only people who will really notice are users who discover 
more in the basic server package, and of course committers with 
backporting to do.


Since packaged builds of 9.1 current with beta1 seem to be in short 
supply still, this theory looks hard to prove just yet.  I'm very 
excited that it's packaging week here however (rolls eyes), so I'll 
check it myself.  I'll incorporate the suggestions made since I posted 
that test patch and do a bigger round of this next, end to end with an 
RPM set as the output.  It sounds like everyone who has a strong opinion 
on what this change might look like has sketched a similar looking 
bikeshed.  Once a reasonable implementation is hammered out, I'd rather 
jump to the big argument between not liking change vs. the advocacy 
benefits to PostgreSQL of doing this; they are considerable in my mind.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Formatting Curmudgeons WAS: MMAP Buffers

2011-05-09 Thread Greg Smith

Josh Berkus wrote:

As I don't think we can change this, I think the best answer is to tell people
"Don't submit a big patch to PostgreSQL until you've done a few small
patches first.  You'll regret it".
  


When I last did a talk about getting started writing patches, I had a 
few people ask me afterwards if I'd ever run into problems with having 
patch submissions rejected.  I said I hadn't.  When asked what my secret 
was, I told them my first serious submission modified exactly one line 
of code[1].  And *that* I had to defend in regards to its performance 
impact.[2]


Anyway, I think the intro message should be "Don't submit a big patch to 
PostgreSQL until you've done a small patch and some patch review" 
instead though.  It's both a good way to learn what not to do, and it 
helps with one of the patch acceptance bottlenecks.



The problem is not the process itself, but that there is little
documentation of that process, and that much of that documentation does
not match the defacto process.  Obviously, the onus is on me as much as
anyone else to fix this.
  


I know the documentation around all this has improved a lot since then.  
Unfortunately there are plenty of submissions done by people who never 
read it.  Sometimes it's because people didn't know about it; in others 
I suspect it was seen but some hard parts ignored because it seemed like 
too much work.


One of these days I'm going to write the "Formatting Curmudgeon Guide to 
Patch Submission", to give people an idea just how much diff reading and 
revision a patch should go through in order to keep common issues like 
spurious whitespace diffs out of it.  Submitters can either spend a 
block of time sweating those details out themselves, or force the job 
onto a reviewer/committer; they're always there.  And every minute it's 
sitting in someone else's hands who is doing that job instead of reading 
the real code, the odds of the patch being kicked back go up.


[1] http://archives.postgresql.org/pgsql-patches/2007-03/msg00553.php
[2] http://archives.postgresql.org/pgsql-patches/2007-02/msg00222.php

--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Formatting Curmudgeons WAS: MMAP Buffers

2011-05-10 Thread Greg Smith

Heikki Linnakangas wrote:
Well, my first patch was two-phase commit. And I had never even used 
PostgreSQL before I dived into the source tree and started to work on that


Well, everyone knows you're awesome.  A small percentage of the people 
who write patches end up having the combination of background skills, 
mindset, and approach to pull something like that off.  But there are at 
least a dozen submissions that start review with "I don't think this 
will ever work, but I can't even read your malformed patch to be sure" 
for every one that went like 2PC.  If every submitter was a budding 
Heikki we wouldn't need patch submission guidelines at all.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Infinity bsearch crash on Windows

2011-05-10 Thread Greg Smith
A 9.1Beta1 test report from Richard Broersma (and confirmed on another 
system by Mark Watson) showed up on pgsql-testers this week at 
http://archives.postgresql.org/pgsql-testers/2011-05/msg0.php with 
the following test crashing his Windows server every time:


SELECT 'INFINITY'::TIMESTAMP;

That works fine for me on Linux.  Richard chased the error in the logs, 
which was a generic "you can't touch that memory" one, down to a full 
stack trace:  
http://archives.postgresql.org/pgsql-testers/2011-05/msg9.php


It looks like it's losing its mind inside of 
src/backend/utils/adt/datetime.c , specifically at this line in datebsearch:


   3576 while (last >= base)
   3577 {
   3578 position = base + ((last - base) >> 1);
   3579 result = key[0] - position->token[0];

Why crash there only on Windows?  Was the problem actually introduced 
above this part of the code?  These are all questions I have no answer for.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Infinity bsearch crash on Windows

2011-05-11 Thread Greg Smith

Tom Lane wrote:

SELECT 'INFINITY'::TIMESTAMP;



Hmm ... I bet this is related to the recent reports about ALTER USER
VALID UNTIL 'infinity' crashing on Windows.  Can the people seeing this
get through the regression tests?  Perhaps more to the point, what is
their setting of TimeZone?  What does the pg_timezone_abbrevs view show
for them?
  


I must have missed that thread; I think I'm missing one of these lists 
(pgsql-bugs maybe?).  I've cc'd Mark Watson so maybe you can get better 
responses without me in the middle if needed; for this one, he reports:


Show timezone gives US/Eastern
Select * from pg_timezone_abbrevs returns zero rows


My Linux system that doesn't have this problem is also in US/Eastern, 
but I get 189 rows in pg_timezone_abbrevs.
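
For anyone else comparing an affected Windows install against a working 
one, these are the checks I'd suggest running; the view and setting names 
are standard ones, though whether they actually explain the crash is 
still an open question:

    -- Session timezone and which abbreviation set the server loaded
    SHOW timezone;
    SHOW timezone_abbreviations;

    -- 0 rows here on the affected systems vs. 189 on my Linux box
    SELECT count(*) FROM pg_timezone_abbrevs;

    -- The statement that triggers the crash
    SELECT 'infinity'::timestamp;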


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] performance-test farm

2011-05-11 Thread Greg Smith

Tomas Vondra wrote:

Actually I was not aware of how the buildfarm works, all I
knew was there's something like that because some of the hackers mention
a failed build on the mailing list occasionally.

So I guess this is a good opportunity to investigate it a bit ;-)

Anyway I'm not sure this would give us the kind of environment we need
to do benchmarks ... but it's worth to think of.
  


The idea is that buildfarm systems that are known to have a) reasonable 
hardware and b) no other concurrent work going on could also do 
performance tests.  The main benefit of this approach is it avoids 
duplicating all of the system management and source code building work 
needed for any sort of thing like this; just leverage the buildfarm 
parts when they solve similar enough problems.  Someone has actually 
done all that already; source code was last sync'd to the build farm 
master at the end of March:  https://github.com/greg2ndQuadrant/client-code


By far the #1 thing needed to move this forward from where it's stuck at 
now is someone willing to dig into the web application side of this.  
We're collecting useful data.  It needs to now be uploaded to the 
server, saved, and then reports of what happened generated.  Eventually 
graphs of performance results over time will be straightforward to 
generate.  But the whole idea requires that someone else (not Andrew, who 
has enough to do) sit down and figure out how to extend the web UI with 
these new elements.




I guess we could run a script that collects all those important
parameters and then detect changes. Anyway we still need some 'really
stable' machines that are not changed at all, to get a long-term baseline.
  


I have several such scripts I use, and know where two very serious ones 
developed by others are at too.  This part is not a problem.  If the 
changes are big enough to matter, they will show up as a difference on 
one of the many possible "how is the server configured?" reports; we just 
need to pick the most reasonable one.  It's a small detail I'm not worried 
about yet.
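
To give a flavor of what such a report might capture, a throwaway 
sketch--nothing like the more serious tools mentioned above--could be as 
small as this:

    #!/bin/sh
    # Minimal configuration fingerprint saved alongside each benchmark
    # run; diffing two of these shows whether the box or server changed.
    OUT="config-$(hostname)-$(date +%Y%m%d%H%M).txt"
    {
      uname -a
      grep -m1 'model name' /proc/cpuinfo
      grep MemTotal /proc/meminfo
      psql -At -c "SELECT version()"
      # every setting changed away from its built-in default
      psql -At -c "SELECT name || '=' || setting FROM pg_settings
                   WHERE source NOT IN ('default', 'override')"
    } > "$OUT"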


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] performance-test farm

2011-05-12 Thread Greg Smith

Tom Lane wrote:

There's no such thing as a useful performance test that runs quickly
enough to be sane to incorporate in our standard regression tests.
  


To throw a hard number out:  I can get a moderately useful performance 
test on a SELECT-only workload from pgbench in about one minute.  That's 
the bare minimum; stepping up to 5 minutes is really necessary before 
I'd want to draw any real conclusions.
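
For reference, the sort of quick test I mean is nothing fancier than the 
following; the scale and client counts are placeholders to adjust per 
machine:

    createdb pgbench
    pgbench -i -s 100 pgbench                        # one-time data load
    pgbench -S -M prepared -c 8 -j 4 -T 60 pgbench   # one-minute SELECT-only run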


More importantly, a large portion of the time I'd expect regression test 
runs to be happening with debug/assert on.  It's well established that 
this trashes pgbench performance.  One of the uglier bits needed to 
add the "performance farm" feature to the buildfarm code was hacking in 
a whole different set of build options for it.


Anyway, what I was envisioning here was that performance farm systems 
would also execute the standard buildfarm tests, but not the other way 
around.  We don't want performance numbers from some platform that is 
failing the basic tests.  I would just expect that systems running the 
performance tests would cycle through regression testing much less 
often, as they might miss a commit because they were running a longer 
test at the time.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cache estimates, cache access cost

2011-05-15 Thread Greg Smith

Cédric Villemain wrote:

http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache
  


This rebases easily to make Cedric's changes move to the end; I just 
pushed a version with that change to 
https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone 
wants a cleaner one to browse.  I've attached a patch too if that's more 
your thing.


I'd recommend not getting too stuck on the particular hook Cédric has 
added here to compute the cache estimate, which uses mmap and mincore to 
figure it out.  It's possible to compute similar numbers, albeit less 
accurate, using an approach similar to how pg_buffercache inspects 
things.  And I even once wrote a background writer extension that 
collected this sort of data as it was running the LRU scan anyway.  
Discussions of this idea seem to focus on how the "what's in the cache?" 
data is collected, which as far as I'm concerned is the least important 
part.  There are multiple options, some work better than others, and 
there's no reason that can't be swapped out later.  The more important 
question is how to store the data collected and then use it for 
optimizing queries.
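
As one example of what I mean by a less accurate alternative, a rough 
per-relation "how much of this is in shared_buffers?" figure can be 
pulled from pg_buffercache with a query like the one below; it only sees 
the PostgreSQL cache, not the OS cache that mincore reports on:

    CREATE EXTENSION pg_buffercache;

    SELECT c.relname,
           count(*) AS buffers,
           round(count(*) * current_setting('block_size')::numeric
                 / pg_relation_size(c.oid), 2) AS fraction_in_shared_buffers
    FROM pg_buffercache b
    JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
    WHERE b.reldatabase = (SELECT oid FROM pg_database
                           WHERE datname = current_database())
      AND pg_relation_size(c.oid) > 0
    GROUP BY c.relname, c.oid
    ORDER BY buffers DESC
    LIMIT 10;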


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


diff --git a/contrib/Makefile b/contrib/Makefile
index 6967767..47652d5 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -27,6 +27,7 @@ SUBDIRS = \
 		lo		\
 		ltree		\
 		oid2name	\
+		oscache	\
 		pageinspect	\
 		passwordcheck	\
 		pg_archivecleanup \
diff --git a/contrib/oscache/Makefile b/contrib/oscache/Makefile
new file mode 100644
index 000..8d8dcc5
--- /dev/null
+++ b/contrib/oscache/Makefile
@@ -0,0 +1,15 @@
+# contrib/oscache/Makefile
+
+MODULE_big = oscache
+OBJS = oscache.o
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/oscache
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/oscache/oscache.c b/contrib/oscache/oscache.c
new file mode 100644
index 000..1ad7dc2
--- /dev/null
+++ b/contrib/oscache/oscache.c
@@ -0,0 +1,151 @@
+/*-
+ *
+ * oscache.c
+ *
+ *
+ * Copyright (c) 2011, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/oscache/oscache.c
+ *
+ *-
+ */
+/* { POSIX stuff */
+#include <stdlib.h> /* exit, calloc, free */
+#include <sys/stat.h> /* stat, fstat */
+#include <sys/types.h> /* size_t, mincore */
+#include <unistd.h> /* sysconf, close */
+#include <sys/mman.h> /* mmap, mincore */
+/* } */
+
+/* { PostgreSQL stuff */
+#include "postgres.h" /* general Postgres declarations */
+#include "utils/rel.h" /* Relation */
+#include "storage/bufmgr.h"
+#include "catalog/catalog.h" /* relpath */
+/* } */
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+
+float4 oscache(Relation, ForkNumber);
+
+/*
+ * Module load callback
+ */
+void
+_PG_init(void)
+{
+	/* Install hook. */
+	OSCache_hook = &oscache;
+}
+
+/*
+ * oscache process the os cache inspection for the relation.
+ * It returns the percentage of blocks in OS cache.
+ */
+float4
+oscache(Relation relation, ForkNumber forkNum)
+{
+	int  segment = 0;
+	char *relationpath;
+	char filename[MAXPGPATH];
+	int fd;
+	int64  total_block_disk = 0;
+	int64  total_block_mem  = 0;
+
+	/* OS things */
+	int64 pageSize  = sysconf(_SC_PAGESIZE); /* Page size */
+	register int64 pageIndex;
+
+	relationpath = relpathperm(relation->rd_node, forkNum);
+
+	/*
+	 * For each segment of the relation
+	 */
+	snprintf(filename, MAXPGPATH, "%s", relationpath);
+	while ((fd = open(filename, O_RDONLY)) != -1)
+	{
+		// for stat file
+		struct stat st;
+		// for mmap file
+		void *pa = (char *)0;
+		// for calloc file
+		unsigned char *vec = (unsigned char *)0;
+		int64  block_disk = 0;
+		int64  block_mem  = 0;
+
+		if (fstat(fd, &st) == -1)
+		{
+			close(fd);
+			elog(ERROR, "Can not stat object file : %s",
+filename);
+			return 0;
+		}
+
+		/*
+		* if file ok
+		* then process
+		*/
+		if (st.st_size != 0)
+		{
+			/* number of block in the current file */
+			block_disk = st.st_size/pageSize;
+
+			/* TODO We need to split mmap size to be sure (?) to be able to mmap */
+			pa = mmap(NULL, st.st_size, PROT_NONE, MAP_SHARED, fd, 0);
+			if (pa == MAP_FAILED)
+			{
+close(fd);
+elog(ERROR, "Can not mmap object file : %s, errno = %i,%s\nThis error can happen if there is not enought space in memory to do the projection. Please mail ced...@2ndquadrant.fr with '[oscache] ENOMEM' as subject.",
+	filename, errno, strerror(errno));
+return 0;
+			}
+
+			/* Prepare our vector containing all blocks informatio

Re: [HACKERS] Why not install pgstattuple by default?

2011-05-18 Thread Greg Smith

Greg Smith wrote:
Attached is a second patch to move a number of extensions from 
contrib/ to src/test/.  Extensions there are built by the default 
build target, making installation of the postgresql-XX-contrib package 
unnecessary for them to be available.


That was supposed to be contrib/ to src/extension , no idea where that 
test bit came from.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why not install pgstattuple by default?

2011-05-18 Thread Greg Smith

Greg Smith wrote:
Any packager who grabs the shared/postgresql/extension directory in 
9.1, which I expect to be all of them, shouldn't need any changes to 
pick up this adjustment.  For example, pgstattuple installs these files:


share/postgresql/extension/pgstattuple--1.0.sql
share/postgresql/extension/pgstattuple--unpackaged--1.0.sql
share/postgresql/extension/pgstattuple.control

And these are the same locations they were already at.


...and the bit I missed here is that there's a fourth file here:

lib/postgresql/pgstattuple.so

If you look at a 9.1 spec file, such as 
http://svn.pgrpms.org/browser/rpm/redhat/9.1/postgresql/EL-6/postgresql-9.1.spec 
, you'll find:


%files contrib
...
%{pgbaseinstdir}/lib/pgstattuple.so

Which *does* require a packager change, to relocate that file from the 
contrib subpackage to the main server one.  So the theory that a 
change here might happen without pushing a repackaging suggestion toward 
packagers is busted.  This does highlight that some packaging guidelines 
would be needed here to complete this work.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Adding an example for replication configuration to pg_hba.conf

2011-05-18 Thread Greg Smith

Two things that could be changed from this example to make it more useful:

-Document at least a little bit more how this is different from the 
"all/all" rule.  I can imagine users wondering "do I use this instead of 
the other one?  In addition?  Is it redundant if I have 'all' in there?"  
A little more description aimed at the expected FAQs would make this much 
more useful; one possible shape for such an example is sketched below.


-The default database is based on your user name, which is postgres in 
most packaged builds but not if you compile your own.  I don't know 
whether it's practical to consider substituting that into this file, or 
if it's just enough to mention that as an additional doc comment.
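
For concreteness, here's one possible shape for the example, with 
comments aimed at those FAQs.  The "postgres" user is an assumption; it's 
really whatever account initdb ran as, plus any role given the 
replication privilege:

    # Allow replication connections from localhost, by a user with the
    # replication privilege.  Note that replication connections are NOT
    # matched by the "all" keyword in the database column above, so this
    # is in addition to, not instead of or redundant with, those lines.
    #local   replication     postgres                                trust
    #host    replication     postgres        127.0.0.1/32            trust
    #host    replication     postgres        ::1/128                 trust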


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] 9.2 schedule

2011-05-23 Thread Greg Smith
At the developer meeting last week:  
http://wiki.postgresql.org/wiki/PgCon_2011_Developer_Meeting there was 
an initial schedule for 9.2 hammered out and dutifully transcribed at  
http://wiki.postgresql.org/wiki/PostgreSQL_9.2_Development_Plan , and 
the one part I wasn't sure I had written down correctly I see Robert 
already fixed.


There was a suggestion to add some publicity around the schedule for 
this release.  There's useful PR value to making it more obvious to 
people that the main development plan is regular and time-based, even if 
the release date itself isn't fixed.  The right time to make an initial 
announcement like that is "soon", particularly if a goal here is to get 
more submitted into the first 9.2 CF coming in only a few weeks.  Anyone 
have changes to suggest before this starts working its way toward an 
announcement?


The main parts of the discussion leading to changes from the 9.1 
schedule, as I recall them, are:


-Shooting for a slightly earlier branch/initial 9.2 CommitFest in June 
helps some with patch developer bit-rot, and may let developers who are 
focused on new features be productive for more of the year.  The 
perception that new development is unwelcome between the final CF and 
version release seems to have overshot a bit from its intention.  It's 
not the best period to chat on this list, with many people distracted by 
release goals.  But some people just aren't in the right position to 
work on alpha/beta testing and stability work then, and the intention 
was not to block them from doing something else if that's the case.  (A 
similar bit brought up during one of the patch prep talks is that review 
is also welcome outside of a CF, which isn't really clear)


-The last CF of the release is tough to reschedule usefully due to 
concerns about December/beginning of the year holidays.


-Given that work in August is particularly difficult to line up with 
common summer schedules around the world, having the other >1 month gap 
in the schedule go there makes sense.


As for why there aren't more changes, it's hard to argue that the 9.1 
process was broken such that it needs heavy modification.  There were a 
large number of new features committed, people seem satisfied with the 
quality of the result so far, and the schedule didn't go too far off the 
rails.  Outside of the manpower issues (which are serious), the parts of 
the process that strained the most seem really hard to identify 
with anything other than hindsight.  The tension between "ship it" and 
"make the release better" is a really fundamental one to software 
development.


The two main ideas that pop up regularly, organizing more CommitFests or 
making them shorter, are both hard to adopt without more active 
volunteers working on review (both at the initial and committer level) 
and an increase in available CF manager time.  An idea I heard a couple 
of people suggest is that it would take a CF manager focused exclusively 
on the "patch chasing" parts of the role--not someone who is also trying 
to develop, commit, or review during the CF--before this would be 
feasible to consider.  Some sort of relief for making that role less 
demanding is needed here, before it's practical to schedule those even 
more often.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.2 schedule

2011-05-24 Thread Greg Smith

David Fetter wrote:

I thought we'd agreed on the timing for the first CF, and that I was
to announce it in the PostgreSQL Weekly News, so I did just that.
  


Yes, and excellent.  The other ideas were:

-Publish information about the full schedule to some of the more popular 
mailing lists


-Link to this page more obviously from postgresql.org (a fixed redirect 
URL is probably the right approach) to "bless" it, and potentially 
improve its search rank too.


The specific new problem being highlighted to work on here is that the 
schedule and development process is actually quite good as open-source 
projects go, but that fact isn't visible at all unless you're already on 
the inside of the project.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.2 schedule

2011-05-24 Thread Greg Smith

On 05/24/2011 05:03 PM, Peter Eisentraut wrote:

On mån, 2011-05-23 at 22:44 -0400, Greg Smith wrote:
   

-Given that work in August is particularly difficult to line up with
common summer schedules around the world, having the other>1 month
gap in the schedule go there makes sense.
 

You might want to add a comment on the schedule page about the
June/July/August timing, because it looks like a typo, and the meeting
minutes are also inconsistent how they talk about June and July.
   


Yes, I was planning to (and just did) circle back to the minutes to make 
everything match up.  It's now self-consistent, same dates as the 
schedule, and explains the rationale better.


I'm not sure how to address the impression of a typo on the schedule 
page beyond that.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.2 schedule

2011-05-24 Thread Greg Smith

On 05/24/2011 01:35 PM, Josh Berkus wrote:

I would suggest instead adding a new page to postgresql.org/developer
which lists the development schedule, rather than linking to that wiki
page.  Maybe on this page?

http://www.postgresql.org/developer/roadmap
   


Now that I look at the roadmap page again, I think all that would really 
be needed here is to tweak its wording a bit.  If the description on 
there of the link to the wiki looked like this:


General development information
  A wiki page about various aspects of the PostgreSQL development 
process, including detailed schedules and submission guidelines


I think that's enough info to keep there.  Putting more information back 
onto the main site when it can live happily on the wiki seems 
counterproductive to me; if there's concerns about things like 
vandalism, we can always lock the page.  I could understand the argument 
that "it looks more professional to have it on the main site", but 
perception over function only goes so far for me.


The idea of adding a link back to the wiki from the 
https://commitfest.postgresql.org/ page would complete being able to 
navigate among the three major sites here, no matter which people 
started at.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New/Revised TODO? Gathering actual read performance data for use by planner

2011-05-25 Thread Greg Smith

Michael Nolan wrote:
Based on last year's discussion of this TODO item, it seems thoughts 
have been focused on estimating how much data is
being satisfied from PG's shared buffers.  However, I think that's 
only part of the problem.  


Sure, but neither it nor what you're talking about is the real gating 
factor on making an improvement here.  Figuring out how to expose all 
this information to the optimizer so it can use it when planning is the 
hard part.  Collecting a read time profile is just one of the many ways 
you can estimate what's in cache, and each of the possible methods has 
good and bad trade-offs.  I've been suggesting that people assume that's 
a solved problem--I'm pretty sure what you're proposing was done by Greg 
Stark once, even built into a prototype--and instead ask what you would 
do next if you had this data.


This data would probably need to be kept separately for each table or 
index, as some tables or indexes
may be mostly or fully in cache or on faster physical media than 
others, although in the absence of other
data about a specific table or index, data about other relations in 
the same tablespace might be of some use. 


This is the important part.  Model how the data needs to get stored such 
that the optimizer can make decisions using it, and I consider it easy 
to figure out how it will get populated later.  There are actually 
multiple ways to do it, and it may end up being something where people 
plug in an implementation that fits their workload, rather than just 
having one available.
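
As a straw-man for that modeling discussion only--the type names assume 
PostgreSQL's usual internal typedefs, and none of this is proposed 
code--the per-relation summary the optimizer reads might be no more than:

    /* Straw-man per-relation cache summary for the planner to read;
     * how it gets populated (timing data, mincore, pg_buffercache
     * scans, ...) is deliberately left out of the picture. */
    typedef struct RelCacheStats
    {
        Oid         relid;              /* table or index */
        float4      os_cached_frac;     /* fraction of blocks in OS cache */
        float4      pg_cached_frac;     /* fraction in shared_buffers */
        float4      miss_read_ms;       /* observed cost of a cache-miss read */
        TimestampTz sampled_at;         /* lets stale samples be discounted */
    } RelCacheStats;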


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tackling full page writes

2011-05-25 Thread Greg Smith

On 05/24/2011 04:34 PM, Robert Haas wrote:

we could name this feature "partial full page writes" and enable it
either with a setting of full_page_writes=partial


+1 to overloading the initial name, but only if the parameter is named 
'maybe', 'sometimes', or 'perhaps'.


I've been looking into a similar refactoring of the names here, where we 
bundle all of these speed over safety things (fsync, full_page_writes, 
etc.) into one control so they're easier to turn off at once.  Not sure 
if it should be named "web_scale" or "do_you_feel_lucky_punk".



3. Going a bit further, Greg proposed the idea of ripping out our
current WAL infrastructure altogether and instead just having one WAL
record that says "these byte ranges on this page changed to have these
new contents".


The main thing that makes this idea particularly interesting to me, over 
the other two, is that it might translate well into using sync_file_range 
to get a finer-grained fsync on Linux than is currently 
possible.
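
To make that concrete, the Linux call in question flushes just a byte 
range of a file; a sketch of what a finer-grained flush might look like 
is below.  Note it doesn't flush file metadata, so it's not a drop-in 
fsync replacement:

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* Flush only [start, start + nbytes) of the file and wait for the
     * write-out to finish -- the closest range-level analogue to fsync. */
    static int
    flush_file_range(int fd, off_t start, off_t nbytes)
    {
        return sync_file_range(fd, start, nbytes,
                               SYNC_FILE_RANGE_WAIT_BEFORE |
                               SYNC_FILE_RANGE_WRITE |
                               SYNC_FILE_RANGE_WAIT_AFTER);
    }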


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch for new feature: Buffer Cache Hibernation

2011-05-26 Thread Greg Smith

On 05/07/2011 03:32 AM, Mitsuru IWASAKI wrote:

For 1, I've just finish my work.  The latest patch is available at:
http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110507.patch
   


Reminder here--we can't accept code based on it being published to a web 
page.  You'll need to e-mail it to the pgsql-hackers mailing list to be 
considered for the next PostgreSQL CommitFest, which is starting in a 
few weeks.  Code submitted to the mailing list is considered a release 
of it to the project under the PostgreSQL license, which we can't just 
assume for things when given only a URL to them.


Also, you suggested you were out of time to work on this.  If that's the 
case, we'd like to know that so we don't keep cc'ing you about things in 
expectation of an answer.  Someone else may pick this up as a project to 
continue working on.  But it's going to need a fair amount of revision 
before it matches what people want here, and I'm not sure how much of 
what you've written is going to end up in any commit that may happen 
from this idea.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench--new transaction type

2011-05-29 Thread Greg Smith

On 05/29/2011 03:11 PM, Jeff Janes wrote:

If you use "pgbench -S -M prepared" at a scale where all data fits in
memory, most of what you are benchmarking is network/IPC chatter, and
table locking.


If you profile it, you'll find a large amount of the time is actually 
spent doing more mundane things, like memory allocation.  The network 
and locking issues are really not the bottleneck at all in a surprising 
number of these cases.  Your patch isn't really dependent on your being 
right about the cause here, which means this doesn't impact your 
submissions any.  Just wanted to clarify that what people expect to be 
slowing things down in this situation and what actually shows up when 
you profile are usually quite different.
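
The sort of quick look I'm describing, on a Linux system with perf 
available and a "pgbench -S -M prepared" run active, is just this; the 
exact breakdown will obviously vary by system:

    # sample all CPUs with call graphs for 30 seconds, then summarize
    perf record -a -g -- sleep 30
    perf report --sort comm,dso,symbol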


I'm not sure whether this feature makes sense to add to pgbench, but 
it's interesting to have it around for developer testing.  The way 
you've built this isn't messing with the code too much to accomplish 
that, and your comments about it being hard to do this using "-f" are 
all correct.  Using a custom test file makes it easy to shoot yourself in 
the foot unless you're quite careful to avoid that.



some numbers for single client runs on 64-bit AMD Opteron Linux:
12,567 sps  under -S
19,646 sps  under -S -M prepared
58,165 sps  under -P
   


10,000 is too big of a burst to run at once.  The specific thing I'm 
concerned about is what happens if you try this mode when using "-T" to 
enforce a runtime limit, and your SELECT rate isn't high.  If you're 
only doing 100 SELECTs/second because your scale is big enough to be 
seek bound, you could overrun by nearly two minutes.


I think this is just a matter of turning the optimization around a bit.  
Rather than starting with a large batch size and presuming that's ideal, 
in this case a different approach is really needed.  You want the 
smallest batch size that gives you a large win here.  My guess is that 
most of the gain here comes from increasing batch size to something in 
the 10 to 100 range; jumping to 10K is probably overkill.  Could you try 
some smaller numbers and see where the big increases start to fall off?


Obligatory code formatting nitpick:  try not to overrun the right margin 
any further than the existing code around line 1779, where you add more 
ttype comments.  That needs to turn into a multi-line comment.  Rest of 
the patch looks fine, and don't worry about resubmitting for that; just 
something to tweak on your next update.  A slightly more descriptive 
filename for the patch would help out those of us who look at a lot of 
pgbench patches, too.  Something like "pgbench_loop_v1.patch" for 
example would make it easier for me to remember which patch this was by 
its name.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Getting a bug tracker for the Postgres project

2011-05-30 Thread Greg Smith

On 05/29/2011 05:17 AM, Peter Eisentraut wrote:

Here is a list to choose from:
http://en.wikipedia.org/wiki/Comparison_of_issue_tracking_systems
   


I turned this into a spreadsheet to sort and prune more easily; if 
anyone wants that let me know, it's not terribly useful beyond what I'm 
posting here.  44 total, 16 of which are open-source.  I would say that 
having an e-mail interface is the next major cut to make.  While 
distasteful, it's possible for this project to adopt a solution that 
doesn't use PostgreSQL, and one interesting candidate is in that 
category.  It's not feasible to adopt one that doesn't integrate well 
with e-mail though.


List of software without listed e-mail integration:  Fossil, GNATS, 
Liberum Help Desk, MantisBT, org-mode, Flyspray, ikiwiki, Trac.


The 8 F/OSS programs left after that filter are:

OTRS
Debbugs
Request Tracker
Zentrack
LibreSource
Redmine
Roundup
Bugzilla

The next two filters you might apply are:

Support for Git:  Redmine, Bugzilla
PostgreSQL back-end:  OTRS, Request Tracker, LibreSource, Redmine, 
Roundup, Bugzilla


There are a couple of additional nice to have items I saw on the feature 
list, and they all seem to spit out just Redmine & Bugzilla.  Those are 
the two I've ended up using the most on PostgreSQL related projects, 
too, so that isn't a surprise to me.  While I'm not a strong fan of 
Redmine, it has repeatedly been the lesser of the evils available here 
for three different companies I've worked at or dealt with.


Greg Stark is right that Debbugs has a lot of interesting features 
similar to the desired workflow here.  It's not tied to just Debian 
anymore; the GNU project is also using it now.  And the database backend 
isn't that terrible to consider:  it's flat files with a BerkeleyDB index 
built on top.  I think if it was perfect except for that, it would still 
be worth considering.  Debbugs is far from a general purpose solution 
though, so any customization to support differences in this project's 
workflow would likely end up being one-off hacks.  The VCS support might 
be a problem, but I've gotten the impression that git is increasingly 
popular for other Debian work.  Since the program is in Perl, I can't 
imagine it's a gigantic task to switch that out, and probably one other 
people would like to see.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How can I check the treatment of bug fixes?

2011-05-30 Thread Greg Smith
On 05/27/2011 08:36 AM, MauMau wrote:
> I posted a patch for bug #6011 to pgsql-hackers several days ago. How
> can I check the status of bug fixes? I'm worried that the patch might
> be forgotten, because bug #5842 was missed for two months until Bruce
> noticed it.

Discussion here seems to have wandered far away from useful suggestions
for you; let's see if it's possible to return to that. The best way to
confirm when a bug is resolved is to subscribe to the pgsql-committers
mailing list. If a commit for this fix appears, odds are good the
original bug number will be referenced. Even if it isn't, you may
recognize it via its description. Until you see that, the bug is almost
certainly still open.

Bugs that are considered to impact the current version under development
are sometimes listed at http://wiki.postgresql.org/wiki/Open_Items
Adding a bug to there that's not really specific to the new version may
not be considered good form by some. It is the closest thing to an open
bug tracker around though, and adding items to there means they won't be
forgotten about; it's checked regularly by developers considering when
it's a good time to release another alpha or beta.

In my mind, clarifying under what circumstances it's appropriate for
people to put a bug onto the Open Items list would be a useful way to
spend a little time. Certainly more productive than firing more bullets
at the unkillable zombie that is bug tracking software.

-- 
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch for new feature: Buffer Cache Hibernation

2011-06-01 Thread Greg Smith

On 06/01/2011 03:03 AM, Tatsuo Ishii wrote:

Also I really want to see the performance comparison between these two
approaches in the real world database.
   


Well, tell me how big of a performance improvement you want PgFincore to 
win by, and I'll construct a benchmark where it does that.  If you pick 
a database size that fits in the OS cache, but is bigger than 
shared_buffers, the difference between the approaches is huge.  The 
opposite--trying to find a case where this hibernation approach wins--is 
extremely hard to do.
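
To show how easy that setup is to construct, using the usual rule of 
thumb of roughly 15MB per pgbench scale factor increment (all numbers 
here are illustrative):

    # 8GB RAM, shared_buffers = 256MB: aim for a ~4GB database, which is
    # far larger than shared_buffers but still fits in the OS cache.
    #   4096MB / ~15MB per scale unit  =>  scale factor around 270
    createdb bench
    pgbench -i -s 270 bench
    pgbench -S -c 8 -j 4 -T 300 bench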


Anyway, further discussion of this patch is kind of a waste right now.  
We've never gotten the patch actually sent to the list to establish a 
proper contribution (just pointers to a web page), and there's been no 
feedback on that or the other suggestions for redesign (extension 
repackaging, GUC 
renaming, removing unused code, and a few more).  Unless the author 
shows up again in the next two weeks, this is getting bounced back with 
no review as code we can't use.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

