Re: [GENERAL] explain plan visibility

2012-11-12 Thread Albe Laurenz
高健 wrote:
 Please let me dig it a little more:
 
 I think that when a session is accessing a PostgreSQL table, it will be 
 influenced by the following:
 
 
 
 Really old data (which needs to be vacuumed, e.g. a row image from one day ago).
 
 Recent data (committed and uncommitted), because both are in the data 
 blocks.
 
 
 
 Isn’t it strange that I have to access my data among somebody else’s 
 undecided data?
 
 What if there is a busy system with a table accessed by many sessions at the 
 same time?
 
 They will all be slowed down because of uncommitted but already flushed-out 
 data, I think.
 
 
 
 I hope that in the future the architecture of PostgreSQL can keep committed 
 and uncommitted data apart,
 
 or even put them on separate physical disks. That would help to improve 
 performance, I think.

I don't think that this design choice will ever be changed.

You'll always be influenced by other people's data, no
matter how you handle MVCC.

Let's imagine a design where yet uncommitted data are
stored somewhere else.  How do you handle the following
problems:

1) The space for uncommitted data runs out.  That will keep you
   from doing any transaction exceeding the size of this storage
   space.  Note that Oracle suffers from that problem (though
   they do the opposite of what you propose:  they store uncommitted
   data in the table and move old data to the undo space).

2) Every COMMIT will require that everything is moved from the
   uncommitted space to the regular table.  So you'd double the
   I/O cost for the normal case (I guess that's the reason why
   Oracle does it the other way round).

And you wouldn't even get rid of the most unpleasant side effect
of the way that PostgreSQL manages MVCC, the need to VACUUM away
old tuples.
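
(As an aside, you can watch those dead row versions pile up with the
statistics views -- a minimal illustration, assuming the statistics
collector is enabled:

  SELECT relname, n_live_tup, n_dead_tup
  FROM pg_stat_user_tables
  ORDER BY n_dead_tup DESC
  LIMIT 5;

Every UPDATE or DELETE leaves a dead tuple behind in the table until
VACUUM removes it.)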

As for Oracle's approach to MVCC, which I think is better than your proposal,
it has its pros and cons, but I don't think it is demonstrably better
than PostgreSQL's.  No matter how you implement MVCC, you'll have to
pay the price for the benefits somehow.

Yours,
Laurenz Albe



Re: [GENERAL] Understanding streaming replication

2012-11-12 Thread Albe Laurenz
I'll try to answer the questions I can.

Pawel Veselov wrote:
 I've been struggling with understanding all the necessary pieces for 
 streaming replication. So I put
 down the pieces as I did understand them, and would appreciate if you guys 
 could point out any of the
 stuff I understood or have done wrong.
 
 The set up is pgpool + streaming replication + hot stand by. No load 
 balancing, stand-by nodes will
 not receive any application queries (I don't have that big of a query load, 
 and I don't want to risk
 inconsistent reads). There are no shared file systems, but there is a way to 
 rsync/scp files between
 nodes. Fail-over is automatic, and should kick in within reasonably small 
 period after master failure.
 
 1. Archiving. Should be turned on on all the nodes. The archive command 
 should copy the archive file
 to the local archive directory, and rsync archive directory between all the 
 nodes. My understanding is
 that archiving is necessary if a stand-by node ever missed enough WAL 
 updates to need an old enough
 WAL that might have been removed from pg_xlog.

You don't give details about how the rsync is triggered,
but I'd advise against having rsync as part of archive_command.
First, it is slow and if there is a lot of activity, the
archiver will not be able to keep up.
Second, if rsync fails, the WAL file will not be considered
archived.

Both these things will keep the WAL files from being deleted
from pg_xlog.

I'd schedule rsync as a cron job or similar.
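
A minimal sketch of what I mean (directory names are only examples):

  # postgresql.conf -- archive locally only
  archive_mode = on
  archive_command = 'cp %p /var/lib/pgsql/wal_archive/%f'

  # cron -- push the local archive to the other nodes separately
  */5 * * * *  rsync -a /var/lib/pgsql/wal_archive/ standby1:/var/lib/pgsql/wal_archive/

That way a slow or failing rsync cannot keep WAL files stuck in pg_xlog.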

 QUESTION: After the failover, the new master will start archiving its WAL 
 files. These archived WALs
 will not collide in any way with the archived WALs generated by previous 
 master(s)?

They will not, because the standby starts a new time line
when it is promoted to primary, which will result in new
WAL file names.
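
You can see the time line right in the segment names: the first eight hex
digits of a WAL file name are the time line ID, so the old primary archives
names like 000000010000000000000042, while the promoted standby writes
000000020000000000000042 and following -- the two sets cannot clash.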

 QUESTION: What is a good policy for archive clean up? From the perspective to 
 only remove archive
 files that are guaranteed to never be required by any nodes.

You cannot tell from the primary's side.
Since you also need the archives to restore an online backup,
I'd keep them as long as your backup policy dictates.
I hope you don't rely on standby databases for backup (just
imagine an accidental DROP TABLE that gets propagated to all
standbys within seconds).

 2. Failover. On master failure, pgpool will automatically select a new 
 master, and degenerate all
 other nodes. The cluster is now in the emergency state and requires manual 
 intervention for
 reconfiguration and recovery. pgpool executes a script to promote a node, 
 that script will create a
 trigger file on a newly selected master node, and postgres will exit 
 stand-by mode.
 
 QUESTION: If multiple pgpools are running, and if there are no network 
 problems, and configuration
 files are identical, is there any guarantee that the same stand-by node will 
 be selected for
 promotion? Concern here is that with configuration of (M-SB0-SB1) one pgpool 
 decides to promote SB0
 and another - SB1, causing both of them to enter master mode, and splitting 
 the cluster. It does look
 that pgpool will always select next alive node for promotion, but I 
 couldn't find a definitive
 statement on that.

I don't know about pgpool and its abilities to handle
cluster failover, but I wouldn't go this way at all.
Even if the answer were that in the circumstances you
describe things would work, you can depend on it that
things will go wrong in ways different from what you
expect, e.g. a broken network card.
The consequences would be worse than I'd like to imagine.

If you want reliable automatic failover, consider cluster
software.

 3. Recovery. That part is a bit confusing. The majority of the documentation 
 says that in this case,
 the node should be re-loaded from the base backup, obtained from the master. 
 I'm not sure why this is
 necessary, if there are enough archived WALs.

Because of the new time line; streaming replication cannot
(yet) recover across a time line change.
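
The usual way to re-seed a standby from the new primary is pg_basebackup --
roughly like this, with host and directory as placeholders (-X stream needs
9.2 or later):

  pg_basebackup -h new-master -U repluser -D /var/lib/pgsql/9.2/data -X stream -P

then point recovery.conf at the new primary and start the standby again.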

 QUESTION: Is there any metric to understand whether hauling base will be 
 slower/faster than replaying
 missed WALs? Anyway, pgpool only has one recovery mechanism, and it does 
 invoke a base restore from
 whatever current master is.
 
 PROBLEM: This I see as a problem. The only way that I see to re-attach a node 
 to the pgpool, short of
 restarting it, is to call pcp_recovery_node. This will make the master take a 
 base back up, push it to
 the stand-by that needs recovery, and re-start the stand-by node. I am not 
 sure if there is a good way
 to check if that node has already been recovered. That because if there are 
 more than 2 pgpools, they
 both will attempt to recover the same stand-by, and this will probably get 
 ugly.

Yours,
Laurenz Albe



[GENERAL] PostgreSQL and a clustered file system

2012-11-12 Thread Ivan Voras
Hello,

Is anyone running PostgreSQL on a clustered file system on Linux? By
clustered I actually mean shared, such that the same storage is
mounted by different servers at the same time (of course, only one
instance of PostgreSQL on only one server can be running on such a
setup, and there are a lot of other precautions that need to be satisfied).






Re: [GENERAL] integer instead of 'double precision'?

2012-11-12 Thread Willy-Bas Loos
On Fri, Sep 9, 2011 at 5:09 PM, Guillaume Lelarge guilla...@lelarge.infowrote:

 You divide an integer with an integer, that should give you an integer.


Can you tell me the reasoning behind that idea?
Is it a rule that the output type of an operator must equal the input type?
In this case that doesn't seem logical. I think that the / operator
should return something that allows fractions, since the operator creates
fractions so frequently.
If you should need it to be an integer, e.g. when you update an integer
column, casting should  be done just-in-time.
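
Just to illustrate what I mean:

  SELECT 7 / 2;            -- 3, the fraction is silently discarded
  SELECT 7 / 2.0;          -- 3.5..., numeric as soon as one operand is
  SELECT 7::numeric / 2;   -- 3.5..., the just-in-time cast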

But I don't know much about the internals and the reasoning behind these
matters; I would be grateful if you could explain.

Cheers,

WBL

-- 
Quality comes from focus and clarity of purpose -- Mark Shuttleworth


Re: [GENERAL] integer instead of 'double precision'?

2012-11-12 Thread Chris Angelico
On Tue, Nov 13, 2012 at 12:16 AM, Willy-Bas Loos willy...@gmail.com wrote:
 On Fri, Sep 9, 2011 at 5:09 PM, Guillaume Lelarge guilla...@lelarge.info
 wrote:

 You divide an integer with an integer, that should give you an integer.

 Can you tell me the reasoning behind that idea?
 Is it a rule that the output type of an operator must equal the input type?
 In this case that doesn't seem logical. I think that the / operator should
 return something that allows fractions, since the operator creates fractions
 so frequently.

This is an argument that comes up regularly on the Python list, partly
because version 2 had int/int -> int, but version 3 declared that
int/int -> float made more sense.

One of the problems of going to floating point is that it's not a
superset of integers - especially not when your integer type supports
arbitrary precision. It might seem obvious that 7/2 should yield 3.5
and not 3, but what about when the numbers are so large that you lose
precision by going float? Or are there to be some cases where int/int
makes float and some where it makes int? That would be nicely
confusing.
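
A quick way to see that boundary in PostgreSQL (double precision is exact
only up to 2^53):

  SELECT 9007199254740993::bigint::double precision
       = 9007199254740992::bigint::double precision;   -- true, both collapse to the same float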

I'm generally happy with either behaviour, as long as it's consistent,
and as long as it can be overridden with an explicit type cast when
the other is needed.

ChrisA




Re: [GENERAL] Can dml realize the partition table's rule and make good execution plan?

2012-11-12 Thread Tom Lane
Craig Ringer cr...@2ndquadrant.com writes:
 It'd be really interesting to enhance the query planner to be smarter
 about this particular case,

At least for this particular example, it's not at all clear to me that
it'd be worth the cost.  Getting rid of accesses to tables altogether is
worthwhile, but avoiding one very-low-cost arm of a BitmapOr is less so.

regards, tom lane




[GENERAL] PSA: XFS and Linux Cache Poisoning

2012-11-12 Thread Shaun Thomas

Hey everyone,

We recently got bit by this, and I wanted to make sure it was known to 
the general community.


In new(er) Linux kernels, including late versions of the 2.6 tree, XFS 
has introduced dynamic speculative preallocation. What does this do? It 
was added to prevent filesystem fragmentation by preallocating a large 
chunk of space to files so that extensions to those files can go on the same 
allocation. The dynamic part just means it adjusts the size of this 
preallocation based on internal heuristics.


Unfortunately, they also changed the logic in how this extra space is 
tracked. At least in previous kernels, this space would eventually be 
deallocated. Now, it survives as long as there are any in-memory 
references to a file, such as in a busy PG database. The filesystem 
itself sees this space as used and will be reported as such with tools 
such as df or du.


How do you check if this is affecting you?

du -sm --apparent-size /your/pg/dir; du -sm /your/pg/dir

If you're using XFS, and there is a large difference in these numbers, 
you've been bitten by the speculative preallocation system.


But where does it go while allocated? Why, to your OS system cache, of 
course. Systems with several GB of RAM may experience extreme phantom 
database bloat because of the dynamic aspect of the preallocation 
system. So there are actually two problems:


1. Data files are reported as larger than their actual size and have 
extra space around just in case. Since PG has a maximum file size of 
1GB, this is basically pointless.
2. Blocks that could be used for inode caching to improve query 
performance are reserved instead for caching empty segments for XFS.


The first can theoretically exhaust the free space on a file system. We 
were seeing 45GB(!) of bloat on one of our databases caused directly by 
this. The second, due to the new and improved PG planner, can result in 
terrible query performance and high system load since the OS cache does 
not match assumptions.


So how is this fixed? Luckily, the dynamic allocator can be disabled by 
choosing an allocation size. Add allocsize to your mount options. We 
used a size of 1m (for 1 megabyte) to retain some of the defragmentation 
benefits, while still blocking the dynamic allocator. The minimum size 
is 64k, so some experimentation is probably warranted.
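
For reference, such an fstab entry would look roughly like this (device and
mount point are only examples):

  /dev/sdb1  /var/lib/pgsql  xfs  noatime,allocsize=1m  0 0

and remember that a plain mount -o remount will not pick it up; the
filesystem has to be unmounted and mounted again.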


This mount option *is not compatible* with the remount mount option, 
so you'll need to completely shut everything down and unmount the 
filesystem to apply.


We spent days trying to track down the reason our systems were reporting 
a load of 20-30 after a recent OS upgrade. I figured it was only fair to 
share this to save others the same effort.


Good luck!

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
stho...@optionshouse.com



Re: [GENERAL] PSA: XFS and Linux Cache Poisoning

2012-11-12 Thread Ben Chobot
On Nov 12, 2012, at 7:37 AM, Shaun Thomas wrote:

 Hey everyone,
 
 We recently got bit by this, and I wanted to make sure it was known to the 
 general community.
 
 In new(er) Linux kernels, including late versions of the 2.6 tree, XFS has 
 introduced dynamic speculative preallocation. What does this do? It was added 
 to prevent filesystem fragmentation by preallocating a large chunk of memory 
 to files so extensions to those files can go on the same allocation. The 
 dynamic part just means it adjusts the size of this preallocation based on 
 internal heuristics.
 
 Unfortunately, they also changed the logic in how this extra space is 
 tracked. At least in previous kernels, this space would eventually be 
 deallocated. Now, it survives as long as there are any in-memory references 
 to a file, such as in a busy PG database. The filesystem itself sees this 
 space as used and will be reported as such with tools such as df or du.
 
 How do you check if this is affecting you?
 
 du -sm --apparent-size /your/pg/dir; du -sm /your/pg/dir
 
 If you're using XFS, and there is a large difference in these numbers, you've 
 been bitten by the speculative preallocation system.
 
 But where does it go while allocated? Why, to your OS system cache, of 
 course. Systems with several GB of RAM may experience extreme phantom 
 database bloat, because of the dynamic aspect of the preallocation system, 
 So there are actually two problems:
 
 1. Data files are reported as larger than their actual size and have extra 
 space around just in case. Since PG has a maximum file size of 1GB, this is 
 basically pointless.
 2. Blocks that could be used for inode caching to improve query performance 
 are reserved instead for caching empty segments for XFS.
 
 The first can theoretically exhaust the free space on a file system. We were 
 seeing 45GB(!) of bloat on one of our databases caused directly by this. The 
 second, due to the new and improved PG planner, can result in terrible query 
 performance and high system load since the OS cache does not match 
 assumptions.
 
 So how is this fixed? Luckily, the dynamic allocator can be disabled by 
 choosing an allocation size. Add allocsize to your mount options. We used a 
 size of 1m (for 1 megabyte) to retain some of the defragmentation benefits, 
 while still blocking the dynamic allocator. The minimum size is 64k, so some 
 experimentation is probably warranted.
 
 This mount option *is not compatible* with the remount mount option, so 
 you'll need to completely shut everything down and unmount the filesystem to 
 apply.
 
 We spent days trying to track down the reason our systems were reporting a 
 load of 20-30 after a recent OS upgrade. I figured it was only fair to share 
 this to save others the same effort.
 
 Good luck!


Oh hey, I've been wondering for a while why our master dbs seem to be using so 
much more space than their slaves. This appears to be the reason. Thanks for 
the work in tracking it down!





Re: [GENERAL] Understanding streaming replication

2012-11-12 Thread Philippe Amelant

Hello all,
I read this thread with interest but I still have some questions
about cascading replication as you describe it.

Le 12/11/2012 10:36, Albe Laurenz a écrit :

I'll try to answer the questions I can.



3. Recovery. That part is a bit confusing. The majority of the documentation 
says that in this case,
the node should be re-loaded from the base backup, obtained from the master. 
I'm not sure why this is
necessary, if there are enough archived WALs.

Because of the new time line; streaming replication cannot
(yet) recover across a time line change.





I'm setting up a 3-node cluster and after some tests
I just discovered that the cascading slave does not recover.

As far as I can see in the 9.2 documentation it should work after
an automatic reconnect to the new master.

Is there any chance to get this fixed in 9.2.x ?

In case of disaster on the master and on the standby, can I just restart the 
cascading slave after removing recovery.conf?

Or would it be better to copy all archived WAL files from the master into 
pg_xlog on the third node and then restart it?

What is the best way to get back this node with minimal loss?

Thanks for your advice

Regards






Re: [GENERAL] PostgreSQL and a clustered file system

2012-11-12 Thread Achilleas Mantzios
Hello Ivan,

this sounds so mainframe-ish. I recall that in IBM MVS (circa 1990+) we used to 
attach two systems to the same DASD storage, and then employ disk serialization 
provided by the OS to achieve some integrity for the data. (Do not get me wrong, 
I had adequate Unix/SunOS/Ultrix experience before I had to go through all that 
MVS stuff, which made it even more painful!)

On Mon 12 Nov 2012 11:03:14 Ivan Voras wrote:
 Hello,
 
 Is anyone running PostgreSQL on a clustered file system on Linux? By
 clustered I actually mean shared, such that the same storage is
 mounted by different servers at the same time (of course, only one
 instance of PostgreSQL on only one server can be running on such a
 setup, and there are a lot of other precautions that need to be satisfied).
 
 
-
Achilleas Mantzios
IT DEPT




[GENERAL] FATAL: sorry, too many clients already when I start the replication

2012-11-12 Thread Velislav Stoychev
I have a strange problem. I am trying to achieve streaming replication
between 2 PostgreSQL servers with version 9.2.1. The replication worked
just fine when the servers were without load. The problem is that now that
the servers are loaded I cannot start the replication without receiving
this error:
FATAL:  sorry, too many clients already

The slave server is out of the equation and is stopped. I am changing only
the master configuration from this setup (I will paste only the end of
postgresql.conf, because everything else is at the default values except
the listen configuration):

#--
# CUSTOMIZED OPTIONS
#--

# Add settings for extensions here
# Postgres replication
#archive_command = 'cp -i %p /data/PostgreSQL/PostgresArchive/%f'
#archive_mode = on
#max_wal_senders = 3 # Maximum 'wal_senders', processes responsible for
managing a connection with a standby server
#wal_keep_segments = 32 # How many WAL segments (=files) should be kept on
the primary, whatever may happen (you won't have to copy them manually on
the standby if the standby gets too far behind)
#wal_level = hot_standby
#synchronous_standby_names = 'sync_replication'

#
max_locks_per_transaction = 500
#

#--
# pgtune wizard run on 2012-10-04
# Based on 65958512 KB RAM in the server
#--

default_statistics_target = 100
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
effective_cache_size = 44GB
work_mem = 384MB
wal_buffers = 8MB
checkpoint_segments = 16
shared_buffers = 15GB
max_connections = 80


With this setup everything is OK and no errors are observed. The command
netstat -aonlp | grep 5432 | wc -l returns an average of 30 connections to
PostgreSQL. When I change the configuration to:


#--
# CUSTOMIZED OPTIONS
#--

# Add settings for extensions here
# Postgres replication
archive_command = 'cp -i %p /data/PostgreSQL/PostgresArchive/%f'
archive_mode = on
max_wal_senders = 3 # Maximum 'wal_senders', processes responsible for
managing a connection with a standby server
wal_keep_segments = 32 # How many WAL segments (=files) should be kept on
the primary, whatever may happen (you won't have to copy them manually on
the standby if the standby gets too far behind)
wal_level = hot_standby
synchronous_standby_names = 'sync_replication'

#
max_locks_per_transaction = 500
#

#--
# pgtune wizard run on 2012-10-04
# Based on 65958512 KB RAM in the server
#--

default_statistics_target = 100
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
effective_cache_size = 44GB
work_mem = 384MB
wal_buffers = 8MB
checkpoint_segments = 16
shared_buffers = 15GB
max_connections = 80

In the log I can see the error FATAL:  sorry, too many clients already
spamming the log non-stop, and netstat -aonlp | grep 5432 | wc -l returns
around 250 connections to PostgreSQL. I tried to change max_connections = 80
to max_connections = 300, but that does not help, because within a short time
netstat -aonlp | grep 5432 | wc -l returns 700 connections to PostgreSQL.

The slave server is stopped all the time. What is causing this problem, and
why does the server work just fine without the replication settings but
freak out when I add them?


Re: [GENERAL] Understanding streaming replication

2012-11-12 Thread Pawel Veselov
On Mon, Nov 12, 2012 at 1:36 AM, Albe Laurenz laurenz.a...@wien.gv.atwrote:

 I'll try to answer the questions I can.


Thank you!


 Pawel Veselov wrote:
  I've been struggling with understanding all the necessary pieces for
 streaming replication. So I put
  down the pieces as I did understand them, and would appreciate if you
 guys could point out any of the
  stuff I understood or have done wrong.
 
  The set up is pgpool + streaming replication + hot stand by. No load
 balancing, stand-by nodes will
  not receive any application queries (I don't have that big of a query
 load, and I don't want to risk
  inconsistent reads). There are no shared file systems, but there is a
 way to rsync/scp files between
  nodes. Fail-over is automatic, and should kick in within reasonably
 small period after master failure.
 
  1. Archiving. Should be turned on on all the nodes. The archive command
 should copy the archive file
  to the local archive directory, and rsync archive directory between all
 the nodes. My understanding is
  that archiving is necessary if a stand-by node ever missed enough WAL
 updates to need an old enough
  WAL that might have been removed from pg_xlog.
 You don't give details about how the rsync is triggered,

but I'd advise against having rsync as part of archive_command.
 First, it is slow and if there is a lot of activity, the
 archiver will not be able to keep up.
 Second, if rsync fails, the WAL file will not be considered
 archived.

 Both these things will keep the WAL files from being deleted
 from pg_xlog.

 I'd schedule rsync as a cron job or similar.


From your later comments, it's also apparent that these archived WALs will
be useless after failover (for the purpose of recovery), so there is no
reason to send them to all the nodes after all.



  QUESTION: After the failover, the new master will start archiving its
 WAL files. These archived WALs
  will not collide in any way with the archived WALs generated by previous
 master(s)?

 They will not, because the standby starts a new time line
 when it is promoted to primary, which will result in new
 WAL file names.

  QUESTION: What is a good policy for archive clean up? From the
 perspective to only remove archive
  files that are guaranteed to never be required by any nodes.

 You cannot tell from the primary's side.
 Since you also need the archives to restore an online backup,
 I'd keep them a long as your backup policy dictates.
 I hope you don't rely on standby databases for backup (just
 imagine an accidental DROP TABLE that gets propagated to all
 standbys withing seconds).


I don't rely on stand-bys for backup. But that timeline establishment
business is a key piece that I didn't realize.



  2. Failover. On master failure, pgpool will automatically select a new
 master, and degenerate all
  other nodes. The cluster is now in the emergency state and requires
 manual intervention for
  reconfiguration and recovery. pgpool executes a script to promote a
 node, that script will create a
  trigger file on a newly selected master node, and postgres will exist
 stand-by mode.
 
  QUESTION: If multiple pgpools are running, and if there are no network
 problems, and configuration
  files are identical, is there any guarantee that the same stand-by node
 will be selected for
  promotion? Concern here is that with configuration of (M-SB0-SB1) one
 pgpool decides to promote SB0
  and another - SB1, causing both of them to enter master mode, and
 splitting the cluster. It does look
  that pgpool will always select next alive node for promotion, but I
 couldn't find a definitive
  statement on that.

 I don't know about pgpool and its abilities to handle
 cluster failover, but I wouldn't go this way at all.
 Even if the answer were that in the circumstances you
 describe things would work, you can depend on it that
 things will go wrong in ways different from what you
 expect, e.g. a broken network card.
 The consequences would be worse than I'd like to imagine.


I would imagine this situation will happen in any case; I don't logically
see how it's avoidable. If you only have one agent that has the power to
promote a node to be a new master, you have a SPOF. If you have multiple
agents that can do the promotion, there is always a risk that they fall out
of sync.


 If you want reliable automatic failover, consider cluster
 software.


Anything you could please recommend?

[skipped]


  Yours,
 Laurenz Albe



Re: [GENERAL] Understanding streaming replication

2012-11-12 Thread Pawel Veselov
On Mon, Nov 12, 2012 at 10:11 AM, Pawel Veselov pawel.vese...@gmail.comwrote:


 On Mon, Nov 12, 2012 at 1:36 AM, Albe Laurenz laurenz.a...@wien.gv.atwrote:

 I'll try to answer the questions I can.


 Thank you!


 Pawel Veselov wrote:
  I've been struggling with understanding all the necessary pieces for
 streaming replication. So I put
  down the pieces as I did understand them, and would appreciate if you
 guys could point out any of the
  stuff I understood or have done wrong.
 
  The set up is pgpool + streaming replication + hot stand by. No load
 balancing, stand-by nodes will
  not receive any application queries (I don't have that big of a query
 load, and I don't want to risk
  inconsistent reads). There are no shared file systems, but there is a
 way to rsync/scp files between
  nodes. Fail-over is automatic, and should kick in within reasonably
 small period after master failure.
 
  1. Archiving. Should be turned on on all the nodes. The archive command
 should copy the archive file
  to the local archive directory, and rsync archive directory between all
 the nodes. My understanding is
  that archiving is necessary if a stand-by node ever missed enough WAL
 updates to need an old enough
  WAL that might have been removed from pg_xlog.
 You don't give details about how the rsync is triggered,

 but I'd advise against having rsync as part of archive_command.
 First, it is slow and if there is a lot of activity, the
 archiver will not be able to keep up.
 Second, if rsync fails, the WAL file will not be considered
 archived.

 Both these things will keep the WAL files from being deleted
 from pg_xlog.

 I'd schedule rsync as a cron job or similar.


 From your later comments, it's also apparent that these archived WALs will
 be useless after failover (for the purpose of recovery), so there is no
 reason to send them to all the nodes after all.


I obviously lost it here. The archives do need to be synchronized, for the
purpose of recovering slaves. If a slave dies, and I want to recover it, it
may need the archived WALs, and for this, the archives should be available
on the node. So, rsync (or something like that) is necessary. But it's a
bad idea to run the rsync from the archive command itself.
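
So the standby's recovery.conf then points restore_command at that locally
synced archive, roughly like this (paths are placeholders):

  standby_mode = 'on'
  primary_conninfo = 'host=master port=5432 user=replication'
  restore_command = 'cp /var/lib/pgsql/wal_archive/%f %p'

and any segment that has already been recycled out of the master's pg_xlog
can still be replayed from the archive.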


Re: [GENERAL] Unexpectedly high disk space usage RESOLVED (Manual reindex/vacuum)

2012-11-12 Thread Lists

On 11/10/2012 02:21 PM, Jeff Janes wrote:

On Fri, Nov 9, 2012 at 4:28 PM, Lists li...@benjamindsmith.com wrote:


...


3) For each of the tables from #2, run the commands
REINDEX TABLE $table;
VACUUM FULL ANALYZE $table;

The end result is a squeaky-clean database server with expected disk usage.

NOTES:

...


2) It was sheer chance that I discovered the need to reindex prior to vacuum
in order to get the disk space back.

As of 9.0, a vacuum full inherently does a reindex, so doing an
explicit one is neither necessary nor beneficial.

I don't know if your discovery is based on a non-full vacuum, or on an
older server.


I can only state that merely doing a vacuum full or vacuum full 
$tables sequentially did not free the space, whereas the sequential 
reindex $table, each followed immediately by a vacuum full $table, did. 
If you'd like I can easily recreate the scenario by simply not cleaning 
up one of the DB servers until it bloats up and make available (limit 
distribution) a binary copy of the database (EG: rsync the 
/var/lib/pgsql/ filesystem late at night) in order to help identify why 
it didn't work as expected.





5) I don't yet know if the full option for the vacuum is necessary to free
up all space. I will experiment with this and post results if useful.

The answer to this is mostly non-deterministic.  non-full vacuum can
only free space from the end of the table.

If all of your long-lived objects were created before pg_attribute got
bloated and so the bloat was due only to short-lived objects, then
non-full vacuum (if run often enough) should eventually be able to
return that space as the short-lived objects near the end start to go
away.  However, if even a single long-live object finds itself at the
end of the table, then only a vacuum full will ever be able to reclaim
that space.



Since the time period involved (weeks/months) would have included both a 
large number of created/destroyed temp tables and occasionally altered 
persistent objects, it would appear that the full option is a very good 
idea, at least periodically.


-Ben




[GENERAL] Enabling Autovacuum Postgres 9.1 (was Unexpectedly high disk space usage)

2012-11-12 Thread Lists
The good news is that we have now resolved our critical problem (disk 
space overuse) with a somewhat hackish, slow answer that is nonetheless 
good enough for now.


Now I'd like to work out how to get autovacuum to work smoothly within 
our cluster. I'm happy to try to clarify my notes and post them either 
here or on the PG wiki as I get them resolved.


There's a wealth of how to tune PG instruction that's old and (based on 
this thread alone) often stale enough to be classified as 
disinformative. For example, as near as I can tell, the entirety of this 
page is just wrong and/or irrelevant for 9.x and up: 
http://wiki.postgresql.org/wiki/VACUUM_FULL. In contrast to the advice 
to avoid Vacuum Full ( 
http://wiki.postgresql.org/wiki/Introduction_to_VACUUM,_ANALYZE,_EXPLAIN,_and_COUNT) 
comes the insight that vacuum full is necessary to clean up stale data 
that is not at the end of the table. (See Jeff Janes' 11/10/2012 email.)



non-full vacuum can
only free space from the end of the table.
This would imply that a full vacuum (VACUUM FULL ANALYZE) is a good idea, at 
least periodically (weekly/monthly/quarterly) in a database that combines the 
use of temp tables and periodic changes to persistent objects. Does 
autovacuum ever do a full vacuum? What about autovacuum and the 
reindex question at the end of this email?


On 11/10/2012 02:49 PM, Adrian Klaver wrote:


Seems to have changed in 8.3:

http://www.postgresql.org/docs/8.3/interactive/routine-vacuuming.html

Beginning in PostgreSQL 8.3, autovacuum has a multiprocess 
architecture: There is a daemon process, called the autovacuum 
launcher, which is in charge of starting autovacuum worker processes 
for all databases. The launcher will distribute the work across time, 
but attempt to start one worker on each database every 
autovacuum_naptime seconds. One worker will be launched for each 
database, with a maximum of autovacuum_max_workers processes running 
at the same time...




Sadly, this change means that I can no longer be certain of the utility of 
the otherwise excellent-sounding advice originally offered by Scott, quoted 
below. It appears that naptime is (as of 9.x) almost irrelevant, since 
it's defined per database, and dropping this from 1 minute to 5 seconds 
would have very little overall impact.



These two can be killers.  Long running transactions can cause
autovacuum processes to stall out or be autocancelled.
Long running transactions - how long is long? In our system it's rare to 
have a transaction (even a prepared transaction) last much longer than a 
few minutes. Is that enough time to cause problems with AutoVacuum?



As well, since the default nap time is 1 minute, it will take at least
50 minutes to vacuum each db as nap time is how long autovac waits
between databases.

Reducing autovacuum nap time to 5 or 10 seconds would be a good move
here, also possibly making it more aggressive by increasing max worker
threads, decreasing cost delay (possibly to zero or close to it) and /
or increasing cost limit.  After making such a change then watching
iostat when vacuum is running to see how hard its hitting your IO
subsystem.  I'm guessing that with SSDs it isn't gonna be a big
problem.

As Greg Smith has pointed out in the past, usually the answer to an
autovacuum problem is making it more, not less aggressive.  Unless
you're flooding your IO this is almost always the right answer.  Keep
in mind that autovacuum by default is setup to be VERY unaggressive
because it may be running on a netbook for all it knows.

To tune autovacuum with 50 databases, start by dropping nap time to
something much lower, like 10s.  Then if you need to, drop cost delay
until you get to 0.  If you get to 0 and it's still not hitting your
IO too hard, but not keeping up, then increase cost limit.  If you get
to something in the 5000 to 10000 range, and it's still not keeping up
then start bumping the thread count


Should I increase the max_workers field from the default of 3 to 
(perhaps) 10? Noting that my solution to the disk space problem is 
effectively a max_worker of 1 since it's all done sequentially, I wonder 
if reducing max_workers would actually be better?


Also, what's the thread count? Is that max_workers?

Why would I want to reduce the cost delay to 0, and how does this relate 
to cost_limit? Careful reading of the docs: 
http://www.postgresql.org/docs/9.1/static/runtime-config-resource.html 
makes me believe that, given my substantial I/O subsystem, I'd want to 
drop cost_delay to near zero and set the cost_limit really high, which 
is a rough restatement of the last quoted paragraph above. (I think)


Assuming that I make these suggestions and notice a subsequent system 
load problem, what information should I be gathering in order to provide 
better post-incident forensics? We have statistics turned on, and aren't 
using replication. (yet)


Lastly, there's the question of reindexing before full vacuum. I've 
observed that not doing a manual reindex prior to 

[GENERAL] plpgsql cursor reuse

2012-11-12 Thread David Greco
Came across this problem when trying to assign to a variable a field from a 
record that could come from multiple cursors. PG throws an error - 
ERROR:  type of parameter 7 (bigint) does not match that when preparing the 
plan (unknown). If I make the null column in c1 null::bigint to match cursor 
c2, it works fine.

Where is this plan coming from? Why would it match c1 to a plan coming from c2? 
In reality, the two cursors in question are wildly different- a join of about 
10 completely different tables. When I saw the text of the error I was a bit 
concerned that it was being overly flexible in matching the current cursor to 
another.

It errors out on the assignment to i, not the fetch. (Maybe the fetch isn't 
actually being done until the data in r is used.)




CREATE OR REPLACE FUNCTION demo.test_cursor_bug ( a IN integer ) RETURNS void AS
$BODY$
DECLARE
    c1 cursor FOR SELECT 1 as shipmentid, null as olmid;
    c2 cursor FOR SELECT 2 as shipmentid, 32::bigint as olmid;
    r record;
    i bigint;

BEGIN
    IF ( a = 0 ) THEN
        open c1;
        fetch c1 INTO r;
        close c1;
    END IF;
    IF ( a = 1 ) THEN
        open c2;
        fetch c2 INTO r;
        close c2;
    END IF;
    i := r.olmid;

END;
$BODY$
LANGUAGE plpgsql;


select demo.test_cursor_bug(0);
select demo.test_cursor_bug(1);





Re: [GENERAL] plpgsql cursor reuse

2012-11-12 Thread David Johnston
From: pgsql-general-ow...@postgresql.org
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of David Greco
Sent: Monday, November 12, 2012 3:35 PM
To: pgsql-general@postgresql.org
Subject: [GENERAL] plpgsql cursor reuse

 

Came across this problem when trying to assign to a variable a field from a
record that could come from multiple cursors. PG throws an error -
ERROR:  type of parameter 7 (bigint) does not match that when preparing the
plan (unknown). If I make the null column in c1 null::bigint to match
cursor c2, it works fine.

Where is this plan coming from? Why would it match c1 to a plan coming from
c2? In reality, the two cursors in question are wildly different- a join of
about 10 completely different tables. When I saw the text of the error I was
a bit concerned that it was being overly flexible in matching the current
cursor to another.

It errors out on the assignment to i, not the fetch. (Maybe the fetch isn't
actually being done until the data in r is used.)

CREATE OR REPLACE FUNCTION demo.test_cursor_bug ( a IN integer ) RETURNS
void AS
$BODY$
DECLARE
    c1 cursor FOR SELECT 1 as shipmentid, null as olmid;
    c2 cursor FOR SELECT 2 as shipmentid, 32::bigint as olmid;
    r record;
    i bigint;

BEGIN
    IF ( a = 0 ) THEN
        open c1;
        fetch c1 INTO r;
        close c1;
    END IF;
    IF ( a = 1 ) THEN
        open c2;
        fetch c2 INTO r;
        close c2;
    END IF;
    i := r.olmid;

END;
$BODY$
LANGUAGE plpgsql;

select demo.test_cursor_bug(0);
select demo.test_cursor_bug(1);

i := r.olmid;

 

I'm guessing that the system evaluates the fact that since i is a bigint
that r.olmid must be a bigint as well.  When it goes to find and retrieve
olmid at parameter position 7 it expects to find a bigint but instead finds
an undefined type with a NULL and so it throws an error.

 

The only solution to your example is to cast the explicit NULL to a bigint.
In your real-life situation there are possible other options but generally
speaking you need to remove the ambiguity explicitly.

 

The exact mechanics behind why this specific message is being thrown is
beyond me but regardless the fundamental reason and solution remain
constant.  You would get a different but similar message if you avoided
cursors and used direct SQL instead.  The use of cursors just causes a
slightly different ordering of operations to occur.  But, it is not
match[ing] c1 to a plan coming from c2 but rather it is matching the plan
for c1 to the declared type for i.

 

David J.

 

 

 



Re: [GENERAL] PostgreSQL and a clustered file system

2012-11-12 Thread Gunnar Nick Bluth

Am 12.11.2012 11:03, schrieb Ivan Voras:

Hello,

Is anyone running PostgreSQL on a clustered file system on Linux? By
clustered I actually mean shared, such that the same storage is
mounted by different servers at the same time (of course, only one
instance of PostgreSQL on only one server can be running on such a
setup, and there are a lot of other precautions that need to be satisfied).


TBTH, I don't see the potential benefit. Clustered filesystems have 
benefits for special use cases (e.g. redundant fileservers or 
applications that actually can work in parallel, relying on file 
locking, DB clusters that coordinate writes themselves, ...), but PG 
certainly is not one of these...


I'm not even sure if PG will lock any files at all (I'd guess not, as 
SHM is used for process coordination), so you'd actually have to roll 
your own safety measures. OTOH, you pay a huge price for the clustering 
part (google for DLM), which will slow your I/O down significantly.


If you're just looking for shared storage (e.g. because SAN space is 
fricking expensive and you don't want to allocate it twice to your 
primary and secondary server ;-), it is probably safer and better 
performing to implement the volumes and filesystems as underlying cluster 
resources. That's something your cluster software should be able to 
handle... ;-)


A newly imported filesystem will also not carry any leftover file locks 
around, should your primary server fail. The journal replay should not 
hurt you too much, as the relevant I/O PG does is sync anyway.


Just my 2p,

--
Gunnar Nick Bluth
RHCE/SCLA

Mobil   +49 172 8853339
Email: gunnar.bl...@pro-open.de
__
In 1984 mainstream users were choosing VMS over UNIX.  Ten years later
they are choosing Windows over UNIX.  What part of that message aren't you
getting? - Tom Payne





Re: [GENERAL] Unexpectedly high disk space usage RESOLVED (Manual reindex/vacuum)

2012-11-12 Thread Jeff Janes
On Mon, Nov 12, 2012 at 10:38 AM, Lists li...@benjamindsmith.com wrote:
 On 11/10/2012 02:21 PM, Jeff Janes wrote:

 On Fri, Nov 9, 2012 at 4:28 PM, Lists li...@benjamindsmith.com wrote:


 2) It was sheer chance that I discovered the need to reindex prior to
 vacuum
 in order to get the disk space back.

 As of 9.0, a vacuum full inherently does a reindex, so doing an
 explicit one is neither necessary nor beneficial.

 I don't know if your discovery is based on a non-full vacuum, or on an
 older server.


 I can only state that merely doing a vacuum full or vacuum full $tables
 sequentially did not free the space, whereas the sequential  reindex $table,
 each followed immediately by a vacuum full $table) did.

With what version?


 If you'd like I can
 easily recreate the scenario by simply not cleaning up one of the DB
 servers until it bloats up and make available (limit distribution) a binary
 copy of the database (EG: rsync the /var/lib/pgsql/ filesystem late at
 night) in order to help identify why it didn't work as expected.

Do you think you can make an easily script-able way to re-create the
resistant bloat?  That would be better than trying to disseminate
binary files, I think.

What I did was just create and drop temp tables in a tight loop, with
autovacuum off, and then once pg_attribute got good and bloated, did a
vacuum full as the database owner or superuser.


 If all of your long-lived objects were created before pg_attribute got
 bloated and so the bloat was due only to short-lived objects, then
 non-full vacuum (if run often enough) should eventually be able to
 return that space as the short-lived objects near the end start to go
 away.  However, if even a single long-live object finds itself at the
 end of the table, then only a vacuum full will ever be able to reclaim
 that space.


 Since the time period involved (weeks/months) would have included both a
 large number of created/destroyed temp tables and occasionally altered
 persistent objects it would appear that the full option a very good idea, at
 least periodically.

If you can prevent the extreme bloat from occurring in the first
place, then the end of the table would not be so far away from its
desired size that it needs to get reset by a vacuum full.

If you find your self in need of a vacuum full, then you should do
one.  But you should ask yourself what went wrong that you got into
that situation in the first place.

Cheers,

Jeff




Re: [GENERAL] plpgsql cursor reuse

2012-11-12 Thread salah jubeh
Hello David,

Well, I think this is normal: you can't assign an untyped null to a variable 
without a proper cast. In your example you can do something like this:

 c1 cursor FOR SELECT 1 as shipmentid, null::bigint as olmid;

Regards 






 From: David Greco david_gr...@harte-hanks.com
To: pgsql-general@postgresql.org pgsql-general@postgresql.org 
Sent: Monday, November 12, 2012 9:34 PM
Subject: [GENERAL] plpgsql cursor reuse
 

 
Came across this problem when trying to assign to a variable a field from a 
record that could come from multiple cursors. PG throws an error – “
ERROR:  type of parameter 7 (bigint) does not match that when preparing the 
plan (unknown)”. If I make the null column in c1 null::bigint to match cursor 
c2, it works fine.
 
Where is this plan coming from? Why would it match c1 to a plan coming from c2? 
In reality, the two cursors in question are wildly different- a join of about 
10 completely different tables. When I saw the text of the error I was a bit 
concerned that it was being overly flexible in matching the current cursor to 
another.
 
It errors out on the assignment to I, not the fetch. (maybe the fetch isn’t 
actually being done until the data in r is used).
 
 
 
 
CREATE OR REPLACE FUNCTION demo.test_cursor_bug ( a IN integer ) RETURNS void 
AS 
$BODY$
DECLARE
    c1 cursor FOR SELECT 1 as shipmentid, null as olmid;
    c2 cursor FOR SELECT 2 as shipmentid, 32::bigint as olmid;
    r record;
    i bigint;
 
BEGIN
    IF ( a = 0 ) THEN
    open c1;
    fetch c1 INTO r;
    close c1;
    END IF;
    IF ( a = 1 ) THEN
    open c2;
    fetch c2 INTO r;
    close c2;
    END IF;
    i := r.olmid;
 
END;
$BODY$
LANGUAGE plpgsql;
 
 
select demo.test_cursor_bug(0);
select demo.test_cursor_bug(1);

Re: [GENERAL] PostgreSQL and a clustered file system

2012-11-12 Thread David Boreham

On 11/12/2012 1:52 PM, Gunnar Nick Bluth wrote:

Am 12.11.2012 11:03, schrieb Ivan Voras:



Is anyone running PostgreSQL on a clustered file system on Linux? By
clustered I actually mean shared, such that the same storage is
mounted by different servers at the same time (of course, only one
instance of PostgreSQL on only one server can be running on such a
setup, and there are a lot of other precautions that need to be 
satisfied).



TBTH, I don't see the potential benefit. Clustered filesystems have 
benefits for special use cases (e.g. redundant fileservers or 
applications that actually can work in parallel, relying on file 
locking, DB clusters that coordinate writes themselves, ...), but PG 
certainly is not one of these...


Although I'm also a non-fan of database over clustered filesystems, I 
wanted to speak up here since I have a hunch the OP wasn't looking for 
the solution you're thinking of. I think he's asking if folk have run an 
HA setup with PG where the database files are stored in a 
dual-ported/clustered filesystem, the idea being that you can have a 
failure in the hardware for one DB server, and make the other one take 
over using the same underlying files. I've seen this done. You need some 
plumbing to ensure that only one PG instance can run at the same time. 
From memory, there are HA monitor tools available that do this in a 
generic way. As far as PG is concerned it has no idea there is a shared 
access filesystem, and so it has no need to perform file-level locking.







Re: [GENERAL] Unexpectedly high disk space usage RESOLVED (Manual reindex/vacuum)

2012-11-12 Thread Lists

On 11/12/2012 01:31 PM, Jeff Janes wrote:

On Mon, Nov 12, 2012 at 10:38 AM, Lists li...@benjamindsmith.com wrote:

On 11/10/2012 02:21 PM, Jeff Janes wrote:

On Fri, Nov 9, 2012 at 4:28 PM, Lists li...@benjamindsmith.com wrote:


2) It was sheer chance that I discovered the need to reindex prior to
vacuum
in order to get the disk space back.

As of 9.0, a vacuum full inherently does a reindex, so doing an
explicit one is neither necessary nor beneficial.

I don't know if your discovery is based on a non-full vacuum, or on an
older server.


I can only state that merely doing a vacuum full or vacuum full $tables
sequentially did not free the space, whereas the sequential  reindex $table,
each followed immediately by a vacuum full $table) did.

With what version?

[root@alpha ~]# rpm -qi postgresql91-server
Name: postgresql91-server  Relocations: (not relocatable)
Version : 9.1.5 Vendor: (none)
Release : 3PGDG.rhel6   Build Date: Sun 02 Sep 2012 
12:13:18 PM UTC
Install Date: Wed 12 Sep 2012 03:04:41 AM UTC  Build Host: 
koji-sl6-x86-64-pg91
Group   : Applications/DatabasesSource RPM: 
postgresql91-9.1.5-3PGDG.rhel6.src.rpm

Size: 15191132 License: PostgreSQL
Signature   : DSA/SHA1, Sun 02 Sep 2012 12:13:24 PM UTC, Key ID 
1f16d2e1442df0f8

URL : http://www.postgresql.org/





If you'd like I can
easily recreate the scenario by simply not cleaning up one of the DB
servers until it bloats up and make available (limit distribution) a binary
copy of the database (EG: rsync the /var/lib/pgsql/ filesystem late at
night) in order to help identify why it didn't work as expected.

Do you think can make an easily script-able way to re-create the
resistant bloat?  That would be better than trying to disseminate
binary files, I think.
It would only be better if it actually created the situation that caused 
the space to not be freed. But, until you know the actual cause of a 
problem, I've found that it's often not productive to create simulations 
that may or may not be actually related to the problem.




What I did was just create and drop temp tables in a tight loop, with
autovacuum off, and then once pg_attribute got good and bloated, did a
vacuum full as the database owner or superuser.
Based on my understanding, if your loop included an intermittent schema 
change from within a transaction it might better approximate my actual 
scenario. Merely creating temp tables and then dropping them would 
create lots of activity at the end of the table which would free 
correctly. This still does not explain why reindex $table works when 
reindex is supposedly implicit in the vacuum.



If all of your long-lived objects were created before pg_attribute got
bloated and so the bloat was due only to short-lived objects, then
non-full vacuum (if run often enough) should eventually be able to
return that space as the short-lived objects near the end start to go
away.  However, if even a single long-live object finds itself at the
end of the table, then only a vacuum full will ever be able to reclaim
that space.


Since the time period involved (weeks/months) would have included both a
large number of created/destroyed temp tables and occasionally altered
persistent objects it would appear that the full option a very good idea, at
least periodically.

If you can prevent the extreme bloat from occurring in the first
place, then the end of the table would not be so far away from its
desired size that it needs to get reset by a vacuum full.

If you find your self in need of a vacuum full, then you should do
one.  But you should ask yourself what went wrong that you got into
that situation in the first place.

I agree; this is why I asked my questions about enabling autovacuum in a 
related thread.





Re: [GENERAL] Enabling Autovacuum Postgres 9.1 (was Unexpectedly high disk space usage)

2012-11-12 Thread Kevin Grittner
Lists wrote:
 There's a wealth of how to tune PG instruction that's old and
 (based on this thread alone) often stale enough to be classified
 as disinformative. For example, nearest I can tell, the entirety of
 this page is just wrong and/or irrelevant for 9.x and up: 
 http://wiki.postgresql.org/wiki/VACUUM_FULL. In constrast to the
 advice to avoid Vacuum Full [...] comes the insight that vacuum
 full is necessary to clean up stale data that is not at the end
 of the table. (See Jeff Janes 11/10/2012 email)

I thought about posting a clarification to what Jeff said there -- it
is only necessary to run VACUUM FULL to free up space in the sense
of having the database give disk space back to the OS so that the
next time you need space in the table it must be re-allocated from
the OS. This can be a bad thing in terms of slower allocation of
space for new tuples and allocation of non-contiguous disk space. You
are much better off running autovacuum aggressively enough that you
don't need to run VACUUM FULL than to run it periodically. I have
gone for years at a time on large databases (up to several TB)
without ever running VACUUM FULL on a table. It is a valuable tool to
recover from a bad state due to failure to properly maintain the
database; it is not a substitute for doing things right.

 Long running transactions can cause autovacuum processes to stall
 out or be autocancelled. Long running transactions - is now
 long? In our system it's rare to have a transaction (even a
 prepared transaction) last much longer than a few minutes. Is that
 enough time to cause problems with AutoVacuum?

The only situation where I would expect that to be a problem is in a
very small table which is updated hundreds of times per second.

 Should I increase the max_workers field from the default of 3 to 
 (perhaps) 10? Noting that my solution to the disk space problem is 
 effectively a max_worker of 1 since it's all done sequentially, I
 wonder if reducing max_workers would actually be better?

I would increase autovacuum_max_workers.

 Also, what's the thread count ? Is that max_workers?

I believe that's what was meant.

 Why would I want to reduce the cost delay to 0, and how does this
 relate to cost_limit?

I've never gone all the way to 0. I would be reluctant to go below
10; rather I would increase the cost limit. Autovacuum adds costs for
what it is doing, and when it hits the limit it sleeps for the cost
delay interval. I would think you would want to allow the process to
work on other things periodically.

 given my substantial I/O subsystem, I'd want to drop cost_delay to
 near zero and set the cost_limit really high, which is a rough
 restatement of the last quoted paragraph above. (I think)

I agree.
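
Concretely, something along these lines in postgresql.conf would be a
reasonable starting point for hardware like yours (the numbers are only a
sketch and should be adjusted while watching iostat):

  autovacuum = on
  autovacuum_max_workers = 10
  autovacuum_naptime = 10s
  autovacuum_vacuum_cost_delay = 10ms
  autovacuum_vacuum_cost_limit = 5000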

 Assuming that I make these suggestions and notice a subsequent
 system load problem, what information should I be gathering in
 order to provide better post-incident forensics?

Run a report on bloat periodically, to make sure it is staying under
control. You might want to monitor for long-running transactions and
prepared transactions which the transaction manager has lost track
of. They can be real killers.
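
For the long-running and prepared transaction checks, queries along these
lines will do (column names as of 9.1; later releases use pid and query
instead of procpid and current_query):

  SELECT procpid, now() - xact_start AS xact_age, current_query
  FROM pg_stat_activity
  WHERE xact_start IS NOT NULL
  ORDER BY xact_start LIMIT 10;

  SELECT gid, prepared, now() - prepared AS age
  FROM pg_prepared_xacts
  ORDER BY prepared;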

During any incident of poor performance, it is good to gather iostat
or vmstat samples at an interval of 1 to 10 seconds for a few
minutes, and to capture the contents of pg_stat_activity and
pg_locks.

Turning on logging of checkpoint activity, autovacuum activity, and
slow queries can provide useful information when you match times from
the logging up against the times of slow periods.
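
All of those are plain configuration settings that take effect on reload;
the thresholds below are only examples:

    -- in postgresql.conf:
    --   log_checkpoints = on                -- checkpoint timing and buffer counts
    --   log_autovacuum_min_duration = 1000  -- log autovacuum runs taking over 1s (in ms)
    --   log_min_duration_statement = 500    -- log statements slower than 500 ms
    SELECT pg_reload_conf();   -- apply the changes without a restart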

 Lastly, there's the question of reindexing before full vacuum.
 I've observed that not doing a manual reindex prior to vacuum full
 did not, in fact, free up the space, even though I've been told
 that reindex is implicit in the vacuum process. (?!) I'm confident
 that I can reproduce this behavior given a bit of time to allow one
 of our database servers to bloat back up.

You never did say what version that was on. In 9.0 and later, VACUUM
FULL will drop and recreate all indexes from scratch after rebuilding
the heap, so it's pretty hard to imagine why dropping and recreating
all indexes right *before* rebuilding the heap would have much
impact. Prior to 9.0, VACUUM FULL would bloat indexes, so it was
always a good idea to REINDEX in addition to running VACUUM FULL,
although it was much more effective to run it *after* the VACUUM
FULL.
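
In other words, on 9.0 and later the only reason to reach for REINDEX by
hand is index bloat you want to fix without rewriting the whole table,
something like (hypothetical table name):

    REINDEX TABLE some_bloated_table;   -- rebuilds all of the table's indexes,
                                        -- blocking writes while it runs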

-Kevin




Re: [GENERAL] PostgreSQL and a clustered file system

2012-11-12 Thread Stephen Frost
Ivan,

* Ivan Voras (ivo...@freebsd.org) wrote:
 Is anyone running PostgreSQL on a clustered file system on Linux? By
 "clustered" I actually mean "shared", such that the same storage is
 mounted by different servers at the same time (of course, only one
 instance of PostgreSQL on only one server can be running on such a
 setup, and there are a lot of other precautions that need to be satisfied).

Yes.  I've seen PG used across both DRBD and OCFS2.  It can be made to
work quite well as an HA setup, provided you do appropriate fencing to
make sure that two postmasters don't end up trying to access the data
directory at the same time, etc, etc.

Thanks,

Stephen




Re: [GENERAL] Enabling Autovacuum Postgres 9.1 (was Unexpectedly high disk space usage)

2012-11-12 Thread Greg Williamson
Kevin --

You wrote:
...
 running transactions can cause autovacuum processes to stall
 out or be autocancelled. Long running transactions - how long is
 long? In our system it's rare to have a transaction (even a
 prepared transaction) last much longer than a few minutes. Is that
 enough time to cause problems with AutoVacuum?

The only situation where I would expect that to be a problem is in a
very small table which is updated hundreds of times per second.


Could you elaborate on this, or point me to a previous thread ?

I've got precisely such a database; currently we do an hourly reindex on all 
user tables in addition to some fairly standard autovac settings.

The tables themselves don't seem to bloat much during ordinary operations but 
if we don't reindex performance tanks; when the site has issues we sometimes 
see table bloat but it seems to be dealt with by autovac.

This is a postgres 9.1 instance w/ SSDs, lots of RAM (24 gigs) and relatively 
small tables (maybe a few thousand rows in most cases, total footprint of 
the database is 240 megs) and being updated in the low thousands of times per 
second.

Thanks!

Greg Williamson





Re: [GENERAL] Running out of memory while making a join

2012-11-12 Thread Carlos Henrique Reimer
Hi,

That is what I got from gdb:

TopMemoryContext: 88992 total in 10 blocks; 10336 free (7 chunks); 78656
used
  Type information cache: 24576 total in 2 blocks; 11888 free (5 chunks);
12688 used
  Operator lookup cache: 24576 total in 2 blocks; 11888 free (5 chunks);
12688 used
  Operator class cache: 8192 total in 1 blocks; 1680 free (0 chunks); 6512
used
  TopTransactionContext: 8192 total in 1 blocks; 7648 free (1 chunks); 544
used
  MessageContext: 73728 total in 4 blocks; 14752 free (3 chunks); 58976 used
  smgr relation table: 24576 total in 2 blocks; 13904 free (4 chunks);
10672 used
  TransactionAbortContext: 32768 total in 1 blocks; 32736 free (0 chunks);
32 used
  Portal hash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
  PortalMemory: 8192 total in 1 blocks; 7888 free (0 chunks); 304 used
PortalHeapMemory: 1024 total in 1 blocks; 744 free (0 chunks); 280 used
  ExecutorState: 11586756656 total in 1391 blocks; 4938408 free (6
chunks); 11581818248 used
TIDBitmap: 57344 total in 3 blocks; 23936 free (8 chunks); 33408
used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 8192 total in 1 blocks; 8016 free (0 chunks); 176 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
  Relcache by OID: 24576 total in 2 blocks; 14912 free (3 chunks); 9664 used
  CacheMemoryContext: 817392 total in 20 blocks; 145728 free (0 chunks);
671664 used
uni_codauten: 2048 total in 1 blocks; 752 free (0 chunks); 1296 used
pk_nfe: 2048 total in 1 blocks; 632 free (0 chunks); 1416 used
idx_wm_nfsp_dtemis: 2048 total in 1 blocks; 752 free (0 chunks); 1296
used
idx_wm_nfsp_codpre: 2048 total in 1 blocks; 704 free (0 chunks); 1344
used
wm_nfsp_pkey: 2048 total in 1 blocks; 440 free (0 chunks); 1608 used
pg_index_indrelid_index: 2048 total in 1 blocks; 704 free (0 chunks);
1344 used
pg_attrdef_adrelid_adnum_index: 2048 total in 1 blocks; 608 free (0
chunks); 1440 used
pg_ts_dict_oid_index: 3072 total in 2 blocks; 1744 free (3 chunks);
1328 used
pg_aggregate_fnoid_index: 3072 total in 2 blocks; 1744 free (3 chunks);
1328 used
pg_language_name_index: 3072 total in 2 blocks; 1744 free (3 chunks);
1328 used
pg_statistic_relid_att_index: 3072 total in 2 blocks; 1600 free (2
chunks); 1472 used
pg_ts_dict_dictname_index: 3072 total in 2 blocks; 1648 free (2
chunks); 1424 used
pg_namespace_nspname_index: 3072 total in 2 blocks; 1696 free (2
chunks); 1376 used
pg_opfamily_oid_index: 3072 total in 2 blocks; 1744 free (3 chunks);
1328 used
pg_opclass_oid_index: 3072 total in 2 blocks; 1696 free (2 chunks);
1376 used
pg_ts_parser_prsname_index: 3072 total in 2 blocks; 1648 free (2
chunks); 1424 used
pg_amop_fam_strat_index: 3072 total in 2 blocks; 1384 free (2 chunks);
1688 used
pg_opclass_am_name_nsp_index: 3072 total in 2 blocks; 1624 free (3
chunks); 1448 used
pg_trigger_tgrelid_tgname_index: 3072 total in 2 blocks; 1600 free (2
chunks); 1472 used
pg_cast_source_target_index: 3072 total in 2 blocks; 1600 free (2
chunks); 1472 used
pg_auth_members_role_member_index: 3072 total in 2 blocks; 1648 free (2
chunks); 1424 used
pg_attribute_relid_attnum_index: 3072 total in 2 blocks; 1600 free (2
chunks); 1472 used
pg_ts_config_cfgname_index: 3072 total in 2 blocks; 1648 free (2
chunks); 1424 used
pg_authid_oid_index: 3072 total in 2 blocks; 1696 free (2 chunks); 1376
used
pg_ts_config_oid_index: 3072 total in 2 blocks; 1744 free (3 chunks);
1328 used
pg_conversion_default_index: 3072 total in 2 blocks; 1432 free (3
chunks); 1640 used
pg_language_oid_index: 3072 total in 2 blocks; 1744 free (3 chunks);
1328 used
pg_enum_oid_index: 3072 total in 2 blocks; 1744 free (3 chunks); 1328
used
pg_proc_proname_args_nsp_index: 3072 total in 2 blocks; 1624 free (3
chunks); 1448 used
pg_ts_parser_oid_index: 3072 total in 2 blocks; 1744 free (3 chunks);
1328 used
pg_database_oid_index: 3072 total in 2 blocks; 1696 free (2 chunks);
1376 used
pg_conversion_name_nsp_index: 3072 total in 2 blocks; 1648 free (2
chunks); 1424 used
pg_class_relname_nsp_index: 3072 total in 2 blocks; 1600 free (2
chunks); 1472 used
pg_attribute_relid_attnam_index: 3072 total in 2 blocks; 1648 free (2
chunks); 1424 used
pg_class_oid_index: 3072 total in 2 blocks; 1696 free (2 chunks); 1376
used
pg_amproc_fam_proc_index: 3072 total in 2 blocks; 1384 free (2 chunks);
1688 used
pg_operator_oprname_l_r_n_index: 3072 total in 2 blocks; 1384 free (2
chunks); 1688 used
pg_index_indexrelid_index: 3072 total in 2 blocks; 1696 free (2
chunks); 1376 used
pg_type_oid_index: 3072 total in 2 blocks; 1696 free (2 chunks); 1376
used
pg_rewrite_rel_rulename_index: 3072 total in 2 blocks; 1648 free (2
chunks); 1424 used
pg_authid_rolname_index: 3072 total in 2 blocks; 1696 free (2 chunks);
1376 used
pg_auth_members_member_role_index: 3072 total 

Re: [GENERAL] Enabling Autovacuum Postgres 9.1 (was Unexpectedly high disk space usage)

2012-11-12 Thread Craig Ringer
On 11/13/2012 04:04 AM, Lists wrote:

 There's a wealth of how to tune PG instruction that's old and (based
 on this thread alone) often stale enough to be classified as
 disinformative. For example, nearest I can tell, the entirety of this
 page is just wrong and/or irrelevant for 9.x and up:
 http://wiki.postgresql.org/wiki/VACUUM_FULL.
Agreed, that needs fixing. I'll sort that out. That information was
important before the VACUUM FULL rewrite, but is now severely outdated.
It needs to be split into pre-9.1 and 9.1+ sections.

 In contrast to the advice to avoid Vacuum Full (
 http://wiki.postgresql.org/wiki/Introduction_to_VACUUM,_ANALYZE,_EXPLAIN,_and_COUNT)
 comes the insight that vacuum full is necessary to clean up stale data
 that is not at the end of the table. (See Jeff Janes' 11/10/2012 email)
That part isn't quite right AFAIK.

If you keep table bloat under control, ordinary VACUUM is perfectly
sufficient. You only need/want to VACUUM FULL if you wish to truncate a
table, reducing its size by compacting rows. In an actively used
database that's usually pretty pointless, since new rows will then be
added to the end, growing the table. You might as well just keep on
re-using the space, rather than going through those compact-and-expand
cycles. You'll suffer from less file fragmentation that way and won't be
paying the costs of file system allocations.

If you have a table that you've dramatically reduced in size (say, by
deleting the vast majority of it) and you won't be adding more rows to
replace the old ones, that's when VACUUM FULL makes sense.

It's a bit like those utilities that claim to "free" or "clean" or
"de-fragment" memory. They seem good, but they're actually grossly
counter-productive, because the system then has to re-read cached data
and otherwise fight to get back to its old equilibrium. It's typically
the same for Pg: you want to aim for equilibrium, not free space that'll
just promptly get re-allocated.

If you do have a bad bloat problem, I'd set a non-default FILLFACTOR
before doing a VACUUM FULL, so you still have some free space within the
table after vacuum completes. That way you won't be immediately paying
the cost of allocating space for new rows as soon as any UPDATEs or
INSERTs come in.
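
A minimal sketch of that, with a hypothetical table name and a fill factor
picked purely for illustration (the right value depends on your update
pattern):

    ALTER TABLE bloated_table SET (fillfactor = 90);  -- leave ~10% free per page
    VACUUM FULL bloated_table;   -- rewrites the table and its indexes (9.0+)
    ANALYZE bloated_table;       -- refresh planner statistics afterwards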


 non-full vacuum can
 only free space from the end of the table.
 This would imply that a full analyze is a good idea, at least
 periodically (weekly/monthly/quarterly) in a database that combines
 the use of temp tables and periodic changes to persistent objects.
 Does autovacuum ever do a full analyze? What about autovacuum and
 the reindex question at the end of this email?
I'm not sure there's any such thing as a "full analyze". VACUUM FULL
ANALYZE is "Do a VACUUM FULL and an ANALYZE", not "Do a full analyze".

Autovacuum should be taking care of analyze and table statistics. If it
isn't, adjust autovacuum parameters so that it does.
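
That can be done globally in postgresql.conf or, for a particularly hot
table, per table; a hypothetical example of the per-table form:

    -- vacuum once ~1% of the rows are dead (default 20%),
    -- analyze after ~2% of the rows change (default 10%)
    ALTER TABLE hot_table SET (
        autovacuum_vacuum_scale_factor  = 0.01,
        autovacuum_analyze_scale_factor = 0.02
    );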

 These two can be killers.  Long running transactions can cause
 autovacuum processes to stall out or be autocancelled.
 Long running transactions - how long is long? In our system it's rare to
 have a transaction (even a prepared transaction) last much longer than
 a few minutes. Is that enough time to cause problems with AutoVacuum?

Not generally, no.

PostgreSQL can't clean up rows that are still visible to a transaction.
So if your transactions are three minutes long, that's a three minute
delay before VACUUM can clean up DELETEd rows or dead rows left by
UPDATEs. Not a biggie even on a pretty high load DB.

You should generally be concerned only when transactions are open over
user think time or are abandoned by buggy applications - cases where
the transaction length is many minutes or hours, potentially unbounded.
Uncommitted prepared transactions are also a problem for similar reasons.

 Lastly, there's the question of reindexing before full vacuum. I've
 observed that not doing a manual reindex prior to vacuum full did not,
 in fact, free up the space, even though I've been told that reindex is
 implicit in the vacuum process. (?!)

VACUUM FULL, post-9.1, should take care of index bloat.

Pre-9.1 VACUUM FULL could make index bloat worse.

Ordinary VACUUM will not truncate indexes AFAIK, only mark free space
within them so it can be re-used. Same deal as with the table itself:
this is usually what you want.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [GENERAL] Enabling Autovacuum Postgres 9.1 (was Unexpectedly high disk space usage)

2012-11-12 Thread Craig Ringer
On 11/13/2012 10:29 AM, Craig Ringer wrote:
 On 11/13/2012 04:04 AM, Lists wrote:

 There's a wealth of how to tune PG instruction that's old and (based
 on this thread alone) often stale enough to be classified as
 disinformative. For example, nearest I can tell, the entirety of this
 page is just wrong and/or irrelevant for 9.x and up:
 http://wiki.postgresql.org/wiki/VACUUM_FULL.
 Agreed, that needs fixing. I'll sort that out. That information was
 important before the VACUUM FULL rewrite, but is now severely
 outdated. It needs to be split into pre-9.1 and 9.1+ sections.
Correction - pre-9.0 and 9.0+

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [GENERAL] integer instead of 'double precision'?

2012-11-12 Thread Martijn van Oosterhout
On Mon, Nov 12, 2012 at 02:16:21PM +0100, Willy-Bas Loos wrote:
 On Fri, Sep 9, 2011 at 5:09 PM, Guillaume Lelarge 
 guilla...@lelarge.info wrote:
 
  You divide an integer by an integer; that should give you an integer.
 
 
 Can you tell me the reasoning behind that idea?
 Is it a rule that the output type of an operator must equal the input type?
  In this case that doesn't seem logical. I think that the / operator
 should return something that allows fractions, since the operator creates
 fractions so frequently.

The thing is, you often do need the version that truncates. It's
supported by the underlying system and if you want a float as output
you can cast one of the arguments to float to do that.  It's been like
this forever (C does it too for example).

For integers it may help if you think of it in combination with the
modulus operator (%).

Python 3 recently changed to give float output by default, but also
provides a // operator to access the truncated version.
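
A quick illustration from psql:

    SELECT 7 / 2;          -- 3    (integer / integer truncates)
    SELECT 7 % 2;          -- 1    (the matching remainder)
    SELECT 7 / 2.0;        -- 3.5  (numeric, because one operand is numeric)
    SELECT 7::float8 / 2;  -- 3.5  (double precision after the cast)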

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer

