Re: [GENERAL] window function ordering not working as expected
On Tue, Feb 17, 2015 at 4:18 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Lonni J Friedman netll...@gmail.com writes:
>> I'm interested in seeing:
>> * the date for the most recent result
>> * test name (identifier)
>> * most recent result (decimal value)
>> * the worst (lowest decimal value) test result from the past 21 days
>> * the date which corresponds with the worst test result from the past 21 days
>> * the 2nd worst (2nd lowest decimal value) test result
>> ...
>> The problem that I'm seeing is in the prv_score column. It should show a
>> value of 0.6, which corresponds with 2015-02-13; however, it is instead
>> returning 0.7. I thought that by ordering by metrics->'PT TWBR' I'd
>> always be sorting by the scores, and as a result lead(metrics->'PT TWBR', 1)
>> would give me the next greatest value of the score. Thus my confusion as
>> to why ORDER BY metrics->'PT TWBR' isn't working as expected.
>
> lead() and lag() retrieve values from rows that are N away from the
> current row in the specified ordering. That isn't what you want here,
> AFAICS. I think the worst test result would be obtained with
>
>     nth_value(metrics->'PT TWBR', 1)
>
> which is equivalent to what you used,
>
>     first_value(metrics->'PT TWBR')
>
> while the 2nd worst result would be obtained with
>
>     nth_value(metrics->'PT TWBR', 2)
>
> However, "worst" and "2nd worst" with this implementation would mean
> worst and 2nd worst within the partition, which isn't the stated goal
> either, at least not with the partition definition you're using. What
> you really want for the worst in the last 21 days is something like
>
>     min(metrics->'PT TWBR') OVER (
>         PARTITION BY ... that same mess you used ...
>         ORDER BY tstamp
>         RANGE BETWEEN '21 days'::interval PRECEDING AND CURRENT ROW)
>
> However, Postgres doesn't implement RANGE x PRECEDING yet. You could get
> worst in the last 21 observations easily:
>
>     min(metrics->'PT TWBR') OVER (
>         PARTITION BY ... that mess ...
>         ORDER BY tstamp
>         ROWS BETWEEN 20 PRECEDING AND CURRENT ROW)
>
> and maybe that's close enough. I do not know an easy way to get second
> worst :-(. You could build a user-defined aggregate to produce "second
> smallest value among the inputs" and then apply it in the same way as I
> used min() here.

Thanks Tom, much appreciate the fast reply. I'll chew this over and see if I have any other questions.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
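The user-defined "second smallest" aggregate that Tom suggests could look roughly like the sketch below. The function and type choices are illustrative, not from the thread: the state is a two-element array holding the smallest and second-smallest non-null values seen so far, and since any PostgreSQL aggregate can be used as a window function, the result can be computed over the same moving frame as min().

```sql
-- Transition function: fold the next value into {smallest, second smallest}.
CREATE FUNCTION second_min_trans(state numeric[], val numeric)
RETURNS numeric[] LANGUAGE sql IMMUTABLE AS $$
    SELECT CASE
        WHEN val IS NULL THEN state
        WHEN state[1] IS NULL OR val < state[1] THEN ARRAY[val, state[1]]
        WHEN state[2] IS NULL OR val < state[2] THEN ARRAY[state[1], val]
        ELSE state
    END;
$$;

-- Final function: report only the second-smallest slot.
CREATE FUNCTION second_min_final(state numeric[])
RETURNS numeric LANGUAGE sql IMMUTABLE AS $$
    SELECT state[2];
$$;

CREATE AGGREGATE second_min(numeric) (
    SFUNC = second_min_trans,
    STYPE = numeric[],
    FINALFUNC = second_min_final,
    INITCOND = '{NULL,NULL}'
);
```

Assuming the score is extracted as text and cast (the `->>` cast is my assumption; the thread compares the json value directly), usage would mirror Tom's min() example: `second_min((metrics->>'PT TWBR')::numeric) OVER (PARTITION BY ... ORDER BY tstamp ROWS BETWEEN 20 PRECEDING AND CURRENT ROW)`.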
[GENERAL] window function ordering not working as expected
Greetings,
I have a postgresql-9.3.x database with a table with a variety of date stamped test results, some of which are stored in json format (natively in the database). I'm attempting to use some window functions to pull out specific data from the test results over a time window, but part of the results are not making sense. Some tests run every day, others less frequently. For each unique test's results, I'm interested in seeing:
* the date for the most recent result
* test name (identifier)
* most recent result (decimal value)
* the worst (lowest decimal value) test result from the past 21 days
* the date which corresponds with the worst test result from the past 21 days
* the 2nd worst (2nd lowest decimal value) test result

Here's a sample of the data and resulting score for one test (tname) from the past few weeks:

   tstamp   |      tname      | score
------------+-----------------+-------
 2015-02-17 | dfw001.ix-cr-02 |   0.7
 2015-02-15 | dfw001.ix-cr-02 |   0.6
 2015-02-14 | dfw001.ix-cr-02 |   0.6
 2015-02-14 | dfw001.ix-cr-02 |   0.7
 2015-02-13 | dfw001.ix-cr-02 |   0.6
 2015-02-12 | dfw001.ix-cr-02 |   0.7
 2015-02-11 | dfw001.ix-cr-02 |   0.7
 2015-02-10 | dfw001.ix-cr-02 |   0.7
 2015-02-09 | dfw001.ix-cr-02 |   0.7
 2015-02-08 | dfw001.ix-cr-02 |   0.7
 2015-02-08 | dfw001.ix-cr-02 |   0.5
 2015-02-07 | dfw001.ix-cr-02 |   0.7
 2015-02-07 | dfw001.ix-cr-02 |   0.5
 2015-02-06 | dfw001.ix-cr-02 |   0.7
 2015-02-05 | dfw001.ix-cr-02 |   0.7
 2015-02-04 | dfw001.ix-cr-02 |   0.7
 2015-01-30 | dfw001.ix-cr-02 |   0.7

Here's the SQL query that I'm running:

SELECT * FROM
  (SELECT tstamp,
          concat_ws('/', attrs->'Route Cluster ID',
            regexp_replace(replace(replace(attrs->'ASN HTML','</a>',''),'<a href= ''http://ncapp100.prod.com/Cluster3.php?asn=',''),'\d+(d=5d'' target=''_blank'')','')) AS tname,
          metrics->'PT TWBR' AS score,
          first_value(metrics->'PT TWBR') OVER (PARTITION BY concat_ws('/', attrs->'Route Cluster ID',
            regexp_replace(replace(replace(attrs->'ASN HTML','</a>',''),'<a href= ''http://ncapp100.prod.com/Cluster3.php?asn=',''),'\d+(d=5d'' target=''_blank'')',''))
            ORDER BY metrics->'PT TWBR') AS worst_score,
          first_value(tstamp) OVER (PARTITION BY concat_ws('/', attrs->'Route Cluster ID',
            regexp_replace(replace(replace(attrs->'ASN HTML','</a>',''),'<a href= ''http://ncapp100.prod.com/Cluster3.php?asn=',''),'\d+(d=5d'' target=''_blank'')',''))
            ORDER BY metrics->'PT TWBR') AS worst_date,
          lead(metrics->'PT TWBR', 1) OVER (PARTITION BY concat_ws('/', attrs->'Route Cluster ID',
            regexp_replace(replace(replace(attrs->'ASN HTML','</a>',''),'<a href= ''http://ncapp100.prod.com/Cluster3.php?asn=',''),'\d+(d=5d'' target=''_blank'')',''))
            ORDER BY metrics->'PT TWBR') AS prv_score
   FROM btworks
   WHERE age(now(),tstamp) < '21 days'
   ORDER BY tstamp DESC, rank
  ) AS stuff
WHERE tstamp = '2015-02-17';

Here's the data from the above query as it pertains to the data (tname='dfw001.ix-cr-02') set that I posted above:

   tstamp   |      tname      | score | worst_score | worst_date | prv_score
------------+-----------------+-------+-------------+------------+-----------
 2015-02-17 | dfw001.ix-cr-02 |   0.7 |         0.5 | 2015-02-08 |       0.7

The problem that I'm seeing is in the prv_score column. It should show a value of 0.6, which corresponds with 2015-02-13; however, it is instead returning 0.7. I thought that by ordering by metrics->'PT TWBR' I'd always be sorting by the scores, and as a result lead(metrics->'PT TWBR', 1) would give me the next greatest value of the score. Thus my confusion as to why ORDER BY metrics->'PT TWBR' isn't working as expected.

thanks in advance for any pointers.
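The prv_score surprise can be reproduced in isolation. This self-contained illustration (not from the thread; the VALUES rows mimic the sample data) shows that with duplicate ORDER BY keys, lead() simply returns the value from the next row of the sorted partition, which can be another copy of the current score rather than the next distinct one:

```sql
SELECT tstamp, score,
       lead(score, 1) OVER (ORDER BY score) AS next_in_score_order
FROM (VALUES
        (DATE '2015-02-08', 0.5),
        (DATE '2015-02-07', 0.5),
        (DATE '2015-02-13', 0.6),
        (DATE '2015-02-17', 0.7),
        (DATE '2015-02-12', 0.7)
     ) AS t(tstamp, score)
ORDER BY score;
-- For the first 0.5 row, next_in_score_order is the duplicate 0.5, not 0.6.
```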
Re: [GENERAL] partitioned table + postgres_FDW not working in 9.3
Hi Shigeru,
Thanks for your reply. This sounds like a relatively simple workaround, so I'll give it a try. Is the search_path of the remote session that postgres_fdw forces considered to be intentional, expected behavior, or is it a bug?

thanks!

On Wed, Sep 25, 2013 at 7:13 PM, Shigeru Hanada shigeru.han...@gmail.com wrote:
> Hi Lonni,
>
> 2013/9/25 Lonni J Friedman netll...@gmail.com:
>> The problem that I'm experiencing is if I attempt to perform an INSERT
>> on the foreign nppsmoke table on cluster a, it fails claiming that the
>> table partition which should hold the data in the INSERT does not exist:
>>
>> ERROR: relation nppsmoke_2013_09 does not exist
>> CONTEXT: Remote SQL command: INSERT INTO public.nppsmoke(id, date_created, last_update, build_type, current_status, info, cudacode, gpu, subtest, os, osversion, arch, cl, dispvers, branch, pass, fail, oldfail, newfail, failureslog, totdriver, ddcl, buildid, testdcmd, pclog, filtercount, filterlog, error) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28)
>> PL/pgSQL function public.nppsmoke_insert_trigger() line 30 at SQL statement
>
> I could reproduce the problem.
>
>> If I run the same exact SQL INSERT on cluster b (not using the foreign
>> table), then it works. So whatever is going wrong seems to be related to
>> the foreign table. Initially I thought that perhaps the problem was that
>> I needed to create all of the partitions as foreign tables on cluster a,
>> but that doesn't help. Am I hitting some kind of foreign data wrapper
>> limitation, or am I doing something wrong?
>
> The cause of the problem is the search_path setting of the remote
> session. For some reason, postgres_fdw forces the search_path on the
> remote side to be 'pg_catalog', so all objects used in the session
> established by postgres_fdw have to be schema-qualified. The trigger
> function is executed in such a context, so you need to qualify all
> objects in your trigger function with a schema name, like
> 'public.nppsmoke_2013_09'.
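Applied to this thread's trigger, the fix amounts to schema-qualifying the partition name inside the function body. A minimal sketch: only the table and function names come from the thread, and the assumption that date_created is the partition key is mine.

```sql
CREATE OR REPLACE FUNCTION public.nppsmoke_insert_trigger()
RETURNS trigger AS $$
BEGIN
    -- Schema-qualify the partition so the INSERT still resolves when
    -- postgres_fdw runs this with search_path forced to 'pg_catalog'.
    IF NEW.date_created >= DATE '2013-09-01' AND
       NEW.date_created <  DATE '2013-10-01' THEN
        INSERT INTO public.nppsmoke_2013_09 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'date_created out of range; add a new partition';
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;
```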
Re: [GENERAL] partitioned table + postgres_FDW not working in 9.3
On Thu, Sep 26, 2013 at 8:52 AM, Tom Lane t...@sss.pgh.pa.us wrote:
> Lonni J Friedman netll...@gmail.com writes:
>> Thanks for your reply. This sounds like a relatively simple workaround,
>> so I'll give it a try. Is the search_path of the remote session that
>> postgres_fdw forces considered to be intentional, expected behavior, or
>> is it a bug?
>
> It's intentional. Possibly more to the point, don't you think your
> trigger function is rather fragile if it assumes the caller has provided
> a particular search path setting?

To be honest, I don't have much experience with functions, and was using the trigger function from the official documentation:
http://www.postgresql.org/docs/9.3/static/ddl-partitioning.html
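One way to address the fragility Tom points out, without schema-qualifying every statement in the body (my suggestion, not from the thread), is to pin a search_path on the trigger function itself, so it no longer depends on whatever the caller set:

```sql
-- PostgreSQL applies this setting for the duration of each call to the
-- function, regardless of the session's (or postgres_fdw's) search_path.
ALTER FUNCTION public.nppsmoke_insert_trigger() SET search_path = public;
```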
[GENERAL] pg_basebackup: ERROR: could not find any WAL files (9.3)
Greetings,
I've recently pushed a new postgres-9.3 (Linux-x86_64/RHEL6) cluster into production, with one master and two hot standby streaming replication slaves. Everything seems to be working ok, however roughly half of my pg_basebackup attempts are failing at the very end with the error:

pg_basebackup: could not get transaction log end position from server: ERROR: could not find any WAL files

I should note that I'm running pg_basebackup on one of the two slaves, and not the master. However, I've got an older, separate 9.3 cluster with the same setup, and pg_basebackup never fails there.

I thought that the WAL files in question were coming from the pg_xlog subdirectory, but I don't see any lack of files there on the server running pg_basebackup. They are being generated continuously (as expected) before, during, and after the pg_basebackup. I scanned the source ( http://doxygen.postgresql.org/basebackup_8c_source.html ), and it seems to back up my understanding of the expected behavior:

306     /*
307      * There must be at least one xlog file in the pg_xlog directory,
308      * since we are doing backup-including-xlog.
309      */
310     if (nWalFiles < 1)
311         ereport(ERROR,
312                 (errmsg("could not find any WAL files")));

However, what I see on the server conflicts with the error. pg_basebackup was invoked on Thu Sep 26 01:00:01 PDT 2013, and failed on Thu Sep 26 02:09:12 PDT 2013. In the pg_xlog subdirectory, I see lots of WAL files present before, during, and after pg_basebackup was run:

-rw------- 1 postgres postgres 16777216 Sep 26 00:38 0001208A00E3
-rw------- 1 postgres postgres 16777216 Sep 26 00:43 0001208A00E4
-rw------- 1 postgres postgres 16777216 Sep 26 00:48 0001208A00E5
-rw------- 1 postgres postgres 16777216 Sep 26 00:53 0001208A00E6
-rw------- 1 postgres postgres 16777216 Sep 26 00:58 0001208A00E7
-rw------- 1 postgres postgres 16777216 Sep 26 01:03 0001208A00E8
-rw------- 1 postgres postgres 16777216 Sep 26 01:08 0001208A00E9
-rw------- 1 postgres postgres 16777216 Sep 26 01:14 0001208A00EA
-rw------- 1 postgres postgres 16777216 Sep 26 01:19 0001208A00EB
-rw------- 1 postgres postgres 16777216 Sep 26 01:24 0001208A00EC
-rw------- 1 postgres postgres 16777216 Sep 26 01:29 0001208A00ED
-rw------- 1 postgres postgres 16777216 Sep 26 01:34 0001208A00EE
-rw------- 1 postgres postgres 16777216 Sep 26 01:38 0001208A00EF
-rw------- 1 postgres postgres 16777216 Sep 26 01:43 0001208A00F0
-rw------- 1 postgres postgres 16777216 Sep 26 01:48 0001208A00F1
-rw------- 1 postgres postgres 16777216 Sep 26 01:53 0001208A00F2
-rw------- 1 postgres postgres 16777216 Sep 26 01:58 0001208A00F3
-rw------- 1 postgres postgres 16777216 Sep 26 02:03 0001208A00F4
-rw------- 1 postgres postgres 16777216 Sep 26 02:08 0001208A00F5
-rw------- 1 postgres postgres 16777216 Sep 26 02:14 0001208A00F6

Thanks in advance for any pointers.
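One thing that may be worth trying (my suggestion, not something established in the thread): the error above comes from the server-side path that bundles WAL into the backup at the end (the default fetch mode). Streaming the WAL over a second connection sidesteps that code path. Host, user, and target directory below are illustrative:

```shell
# Stream WAL concurrently with the base backup instead of fetching it
# at the end (-X stream is available since 9.2).
pg_basebackup -h standby-host -U repluser -D /backups/base-$(date +%F) -X stream -P -v
```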
[GENERAL] postgres FDW doesn't support sequences?
I've got two 9.3 clusters, with a postgres foreign data wrapper (FDW) setup to point from one cluster to the other. One of the (foreign) tables associated with the foreign server has a bigint sequence for its primary key, defined as:

id | bigint | not null default nextval('nppsmoke_id_seq1'::regclass)

If I INSERT a new row into the local table (not the foreign table version) without specifying the 'id' column explicitly, it is automatically assigned the nextval in the sequence counter. However, if I attempt to run the same INSERT using the foreign table, it always fails, complaining that "null value in column id violates not-null constraint". It seems like the FDW is somehow ignoring the existence of the sequence default value, and rewriting the SQL query to explicitly attempt to insert a NULL value. Here's the full query and resulting error output:

nightly=# INSERT into nppsmoke (date_created,last_update,build_type,current_status,info,cudacode,gpu,subtest,os,arch,cl,dispvers,branch,totdriver,ddcl,testdcmd,osversion) VALUES ((date_trunc('second',now())),(date_trunc('second',now())),'release','Building','npp-release-gpu-buildCUDA-2013-09-24-1380041350.log','2013-09-24.cuda-linux64-test42.release.gpu','380','CUDA build','Linux','x86_64','16935289','CBS_cuda_a_2013-09-24_16935289','cuda_a','1','16935289','./npp-smoke.sh --testtype release --amodel f --vc g --drvpath /home/lfriedman/cuda-stuff/sw/dev/gpu_drv/cuda_a/drivers/gpgpu --cudaroot /home/lfriedman/cuda-stuff/sw/gpgpu --totdriver t --email lfriedman','2.6.32-358.el6.x86_64');
ERROR: null value in column "id" violates not-null constraint
DETAIL: Failing row contains (null, 2013-09-25 08:00:46, 2013-09-25 08:00:46, release, Building, npp-release-gpu-buildCUDA-2013-09-24-1380041350.log, 2013-09-24.cuda-linux64-test42.release.gpu, 380, CUDA build, Linux, 2.6.32-358.el6.x86_64, x86_64, 16935289, CBS_cuda_a_2013-09-24_16935289, cuda_a, null, null, null, null, null, t, 16935289, null, ./npp-smoke.sh --testtype release --amodel f --vc g --drvpath /h..., null, null, null, null, g).
CONTEXT: Remote SQL command: INSERT INTO public.nppsmoke(id, date_created, last_update, build_type, current_status, info, cudacode, gpu, subtest, os, osversion, arch, cl, dispvers, branch, pass, fail, oldfail, newfail, failureslog, totdriver, ddcl, buildid, testdcmd, pclog, filtercount, filterlog, error) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28)

I tried to recreate the foreign table definition with a primary key, and that failed:

ERROR: constraints are not supported on foreign tables

Are sequences supported with the postgres FDW? If not, is there any workaround for inserting into a foreign table that doesn't require me to explicitly specify a value for the primary key sequence column in my INSERT statements?

thanks!
Re: [GENERAL] postgres FDW doesn't support sequences?
On Wed, Sep 25, 2013 at 2:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Lonni J Friedman netll...@gmail.com writes:
>> If I INSERT a new row into the local table (not the foreign table
>> version), without specifying the 'id' column explicitly, it
>> automatically is assigned the nextval in the sequence counter. However,
>> if I attempt to run the same INSERT using the foreign table, it always
>> fails complaining that "null value in column id violates not-null
>> constraint". It seems like the FDW is somehow ignoring the existence of
>> the sequence default value, and rewriting the SQL query to explicitly
>> attempt to insert a NULL value.
>
> Yeah, there was quite a bit of discussion about that back in February or
> so. The short of it is that column default values that are defined on
> the foreign server are not respected by operations on a foreign table;
> rather, you have to attach a DEFAULT specification to the foreign table
> definition if you want inserts into the foreign table to use that
> default. The default expression is executed locally, too, which means
> that if you'd like it to read like nextval('some_seq') then some_seq has
> to be a local sequence, not one on the foreign server.

Is there some elegant mechanism for keeping the local and foreign sequences in sync?

> I realize that this isn't ideal for serial-like columns, but honoring
> default expressions that would execute on the foreign server turned out
> to be a huge can of worms. We might figure out how to fix that some day;
> but if we'd insisted on a solution now, there wouldn't be writable
> foreign tables at all in 9.3.

Understood. Other than reading the code, is there somewhere that these limitations are documented that I overlooked?
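Concretely, the workaround Tom describes could look like the sketch below (the local sequence name is illustrative). Note the caveat implied by the question about keeping sequences in sync: the local and remote sequences advance independently, so anything inserting directly on the foreign server can hand out colliding ids.

```sql
-- A local sequence, evaluated locally at INSERT time.
CREATE SEQUENCE nppsmoke_id_seq_local;

-- Attach it as the foreign table's locally evaluated default for id.
ALTER FOREIGN TABLE nppsmoke
    ALTER COLUMN id SET DEFAULT nextval('nppsmoke_id_seq_local');
```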
[GENERAL] partitioned table + postgres_FDW not working in 9.3
Greetings,
I've got two different 9.3 clusters setup, a and b (on Linux, if that matters). On cluster b, I have a table (nppsmoke) that is partitioned by date (month), which uses a function called by a trigger to manage INSERTs (exactly as documented in the official documentation for partitioning of tables). I've setup a postgres foreign data wrapper server on cluster a which points to cluster b, and then setup a foreign table (nppsmoke) on cluster a which points to the actual partitioned (nppsmoke) table on cluster b. The partitions on cluster b use the naming scheme nppsmoke_$YYYY_$MM (where YYYY=4 digit year, and MM=2 digit month). For example, the current month's partition is named nppsmoke_2013_09.

The problem that I'm experiencing is if I attempt to perform an INSERT on the foreign nppsmoke table on cluster a, it fails claiming that the table partition which should hold the data in the INSERT does not exist:

ERROR: relation nppsmoke_2013_09 does not exist
CONTEXT: Remote SQL command: INSERT INTO public.nppsmoke(id, date_created, last_update, build_type, current_status, info, cudacode, gpu, subtest, os, osversion, arch, cl, dispvers, branch, pass, fail, oldfail, newfail, failureslog, totdriver, ddcl, buildid, testdcmd, pclog, filtercount, filterlog, error) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28)
PL/pgSQL function public.nppsmoke_insert_trigger() line 30 at SQL statement

If I run the same exact SQL INSERT on cluster b (not using the foreign table), then it works. So whatever is going wrong seems to be related to the foreign table. Initially I thought that perhaps the problem was that I needed to create all of the partitions as foreign tables on cluster a, but that doesn't help. Am I hitting some kind of foreign data wrapper limitation, or am I doing something wrong?

thanks
Re: [GENERAL] upgrade from 9.2.x to 9.3 causes significant performance degradation
On Wed, Sep 18, 2013 at 2:02 AM, Kevin Grittner kgri...@ymail.com wrote:
> Lonni J Friedman netll...@gmail.com wrote:
>> top shows over 90% of the load is in sys space. vmstat output seems to
>> suggest that it's CPU bound (or bouncing back and forth):
>
> Can you run `perf top` during an episode and see what kernel functions
> are using all that CPU?

Oddly, the problem went away on its own yesterday just after 4PM, and performance has remained 'normal' since that time. I changed absolutely nothing. If/when it returns, I'll certainly capture that output.

> This looks similar to cases I've seen of THP defrag going wild. Did the
> OS version or configuration change? Did the PostgreSQL memory settings
> (like shared_buffers) change?

Nothing changed other than the version of postgres. I re-used the same postgresql.conf that was in place when running 9.2.x. Anyway, here are the current THP related settings on the server:

[root@cuda-db7 ~]# grep AnonHugePages /proc/meminfo
AnonHugePages:    548864 kB
[root@cuda-db7 ~]# egrep 'trans|thp' /proc/vmstat
nr_anon_transparent_hugepages 272
thp_fault_alloc 129173889
thp_fault_fallback 17462551
thp_collapse_alloc 148437
thp_collapse_alloc_failed 15143
thp_split 242
Re: [GENERAL] upgrade from 9.2.x to 9.3 causes significant performance degradation
On Wed, Sep 18, 2013 at 2:02 AM, Kevin Grittner kgri...@ymail.com wrote:
> Lonni J Friedman netll...@gmail.com wrote:
>> top shows over 90% of the load is in sys space. vmstat output seems to
>> suggest that it's CPU bound (or bouncing back and forth):
>
> Can you run `perf top` during an episode and see what kernel functions
> are using all that CPU?

I take back what I said earlier. While the master is currently back to normal performance, the two hot standby slaves are still churning something awful. If I run 'perf top' on either slave, after a few seconds these are consistently the top three in the list:

84.57% [kernel]  [k] _spin_lock_irqsave
 6.21% [unknown] [.] 0x00659f60
 4.69% [kernel]  [k] compaction_alloc

> This looks similar to cases I've seen of THP defrag going wild. Did the
> OS version or configuration change? Did the PostgreSQL memory settings
> (like shared_buffers) change?

I think you're onto something here with respect to THP defrag going wild. I set /sys/kernel/mm/transparent_hugepage/defrag to 'never', and immediately the load dropped on both slaves from over 5.00 to under 1.00. So this raises the question: is this a kernel bug, or is there some other solution to the problem? Also, it seems weird that the problem didn't happen until I switched from 9.2 to 9.3. Is it possible this is somehow related to the change from using SysV shared memory to using Posix shared memory and mmap for memory management?
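For reference, the change described in this message can be applied at runtime as follows (RHEL 6 sysfs path, as used in the thread; the setting does not survive a reboot, so persisting it in something like /etc/rc.local is a common follow-up):

```shell
# Stop the kernel from defragmenting/compacting memory on behalf of
# transparent huge pages.
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Verify: the active value is shown in square brackets.
cat /sys/kernel/mm/transparent_hugepage/defrag
```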
[GENERAL] upgrade from 9.2.x to 9.3 causes significant performance degradation
Greetings,
I'm running a PostgreSQL 9.3.0 cluster (1 master with two streaming replication hot standby slaves) on RHEL6-x86_64. Yesterday I upgraded from 9.2.4 to 9.3.0, and since the upgrade I'm seeing a significant performance degradation. PostgreSQL simply feels slower. Nothing other than the version of PostgreSQL changed yesterday. I used pg_upgrade to perform the upgrade, and ran the generated analyze_new_cluster.sh immediately afterwards, which completed successfully.

Prior to the upgrade, I'd generally expect a load average of less than 2.00 on the master, and less than 1.00 on each of the slaves. Since the upgrade, the load average on the master has been in double digits (hitting 100.00 for a few minutes), and the slaves are consistently above 5.00.

There are a few things that are jumping out at me as behaving differently since the upgrade. The vmstat "processes waiting for runtime" count has increased dramatically: prior to the upgrade it would be consistently less than 10, however since upgrading it hovers between 40 and 60 at all times. The "Local timer interrupts" rate in /proc/interrupts has increased dramatically as well; it used to hover around 6000 and is now over 20k much of the time. However, I'm starting to suspect that these are both symptoms of the problem rather than the cause.

At this point, I'm looking for guidance on how to debug this problem more effectively.

thanks
Re: [GENERAL] upgrade from 9.2.x to 9.3 causes significant performance degradation
On Tue, Sep 17, 2013 at 9:54 AM, Eduardo Morras emorr...@yahoo.es wrote:
> On Tue, 17 Sep 2013 09:19:29 -0700
> Lonni J Friedman netll...@gmail.com wrote:
>> Greetings, I'm running a PostgreSQL 9.3.0 cluster (1 master with two
>> streaming replication hot standby slaves) on RHEL6-x86_64. Yesterday I
>> upgraded from 9.2.4 to 9.3.0, and since the upgrade I'm seeing a
>> significant performance degradation. PostgreSQL simply feels slower.
>> Nothing other than the version of PostgreSQL changed yesterday. I used
>> pg_upgrade to perform the upgrade, and ran the generated
>> analyze_new_cluster.sh immediately afterwards, which completed
>> successfully. Prior to the upgrade, I'd generally expect a load average
>> of less than 2.00 on the master, and less than 1.00 on each of the
>> slaves. Since the upgrade, the load average on the master has been in
>> double digits (hitting 100.00 for a few minutes), and the slaves are
>> consistently above 5.00. There are a few things that are jumping out at
>> me as behaving differently since the upgrade. vmstat "processes waiting
>> for runtime" counts have increased dramatically. Prior to the upgrade
>> the process count would be consistently less than 10, however since
>> upgrading it hovers between 40 and 60 at all times. "Local timer
>> interrupts" in /proc/interrupts has increased dramatically as well. It
>> used to hover around 6000 and is now over 20k much of the time. However,
>> I'm starting to suspect that they are both symptoms of the problem
>> rather than the cause. At this point, I'm looking for guidance on how to
>> debug this problem more effectively.
>
> Don't know what happens but:
>
> a) Does analyze_new_cluster.sh include a reindex? If not, indexes are
> useless because analyze statistics says so.

No, it doesn't include a reindex. It merely invokes "vacuumdb --all --analyze-only" with different values for default_statistics_target, plus -c vacuum_cost_delay=0. According to the documentation for pg_upgrade, post-upgrade scripts to rebuild tables and indexes will be generated automatically. Nothing was generated for this purpose, at least not in any obvious place. The analyze_new_cluster.sh script is the only one that was automatically generated as far as I can tell.

> b) Did you configure postgresql.conf on 9.3.0 for your server/load?
> Perhaps it has default install values.

Yes, I'm using the same postgresql.conf as I was using when running 9.2.4. It's definitely not running with default install values.

> c) What does logs say?

The postgres server logs look perfectly normal, minus a non-trivial slower run time for most queries. There's nothing unusual in any of the OS level logs (/var/log/messages, etc) or dmesg.

Do you have any other suggestions?
Re: [GENERAL] upgrade from 9.2.x to 9.3 causes significant performance degradation
Thanks for your reply. Comments/answers inline below.

On Tue, Sep 17, 2013 at 11:28 AM, Jeff Janes jeff.ja...@gmail.com wrote:
> On Tue, Sep 17, 2013 at 11:22 AM, Lonni J Friedman netll...@gmail.com wrote:
>>> c) What does logs say?
>>
>> The postgres server logs look perfectly normal, minus a non-trivial
>> slower run time for most queries. There's nothing unusual in any of the
>> OS level logs (/var/log/messages, etc) or dmesg.
>
> Are you generally CPU limited or IO limited?

top shows over 90% of the load is in sys space. vmstat output seems to suggest that it's CPU bound (or bouncing back and forth):

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
  r  b  swpd   free   buff  cache   si   so   bi   bo   in   cs  us sy id wa st
  1  0 17308 852016 141104 12707419200101800 6 4 90 0 0
  0  0 17308 872316 141104 12707420000 0 988 940 564 1 0 99 0 0
  0  0 17308 884288 141104 12707420800 0 1921 1202 2132 1 0 99 0 0
  0  0 17308 898728 141104 12707420800 0 0 1064 577 1 0 99 0 0
  2  0 17308 914920 141104 12707422400 044 820 427 1 0 99 0 0
  0  0 17308 926524 141104 12707427200 048 1173 585 1 0 99 0 0
108  1 17308 753648 141104 12707422400 0 236 9825 3901 12 5 83 0 0
 50  0 17308 723156 141104 12707440000 0 144 43481 9105 20 79 1 0 0
 45  0 17308 722860 141104 12707441600 0 8 32969 1998 1 97 2 0 0
 47  0 17308 738996 141104 12707441600 0 0 34099 1739 1 99 0 0 0
101  0 17308 770220 141104 12707448000 032 38550 5998 7 93 0 0 0
101  0 17308 775732 141104 12707451200 0 156 33889 5809 4 96 0 0 0
 99  0 17308 791232 141104 12707454400 0 0 32385 4981 0 100 0 0 0
 96  0 17308 803156 141104 12707454400 024 32413 4824 0 100 0 0 0
 87  0 17308 811624 141104 12707454400 0 0 32438 4470 0 100 0 0 0
 83  0 17308 815500 141104 12707454400 0 0 32489 4159 0 100 0 0 0
 80  0 17308 826572 141104 12707455200 033 32582 3948 0 100 0 0 0
 73  0 17308 853264 141108 12707455200 052 32833 3840 0 100 0 0 0
 73  0 17308 882240 141108 12707456000 0 4 32820 3594 0 100 0 0 0
 72  0 17308 892256 141108 12707456000 0 0 32368 3516 0 100 0 0 0

iostat consistently shows %util under 1.00, which also suggests that disk IO is not the bottleneck:

# iostat -dx /dev/sdb 5
Linux 2.6.32-358.6.2.el6.x86_64 (cuda-db7) 09/17/2013 _x86_64_ (32 CPU)

Device: rrqm/s wrqm/s   r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz await svctm %util
sdb       0.02   0.21  6.91  31.33  651.60 1121.85    46.38     0.09  2.25  0.08  0.31

Device: rrqm/s wrqm/s   r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz await svctm %util
sdb       0.00   0.00  0.00   8.00    0.00   93.00    11.62     0.00  0.28  0.20  0.16

Device: rrqm/s wrqm/s   r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz await svctm %util
sdb       0.00   0.00  0.00  11.00    0.00  125.40    11.40     0.00  0.16  0.16  0.18

Device: rrqm/s wrqm/s   r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz await svctm %util
sdb       0.00   0.00  0.00 105.00    0.00 3380.40    32.19     0.29  2.76  0.03  0.34

Device: rrqm/s wrqm/s   r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz await svctm %util
sdb       0.00   0.00  0.00  14.80    0.00 2430.60   164.23     0.00  0.12  0.09  0.14

Device: rrqm/s wrqm/s   r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz await svctm %util
sdb       0.00   1.20  0.00  41.60    0.00 1819.40    43.74     0.02  0.45  0.05  0.20

Device: rrqm/s wrqm/s   r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz await svctm %util
sdb       0.00   0.00  0.00   2.80    0.00   32.00    11.43     0.00  0.00  0.00  0.00

mpstat also shows a virtually 0 iowait, with a ton of sys (CPU) time:

# mpstat 2 10
Linux 2.6.32-358.6.2.el6.x86_64 (cuda-db7) 09/17/2013 _x86_64_ (32 CPU)

12:53:19 PM  CPU  %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %idle
12:53:21 PM  all  7.36   0.00  92.58     0.00  0.00   0.03    0.00    0.00   0.03
12:53:23 PM  all  6.35   0.00  90.43     0.00  0.00   0.03    0.00    0.00   3.19
12:53:25 PM  all  3.13   0.00  68.20     0.00  0.00   0.02    0.00    0.00  28.66
12:53:27 PM  all  6.07   0.00  68.46     0.00  0.00   0.03    0.00    0.00  25.44
12:53:29 PM  all  5.83   0.00  94.14     0.00  0.00   0.03    0.00    0.00   0.00
12:53:31 PM  all  5.75   0.00  94.14     0.00  0.00   0.11    0.00    0.00   0.00
12:53:33 PM  all  7.65   0.00  40.32
Re: [GENERAL] upgrade from 9.2.x to 9.3 causes significant performance degradation
On Tue, Sep 17, 2013 at 3:47 PM, Andres Freund and...@2ndquadrant.com wrote:
> Hi,
>
> On 2013-09-17 09:19:29 -0700, Lonni J Friedman wrote:
>> I'm running a PostgreSQL 9.3.0 cluster (1 master with two streaming
>> replication hot standby slaves) on RHEL6-x86_64. Yesterday I upgraded
>> from 9.2.4 to 9.3.0, and since the upgrade I'm seeing a significant
>> performance degradation. PostgreSQL simply feels slower. Nothing other
>> than the version of PostgreSQL changed yesterday. I used pg_upgrade to
>> perform the upgrade, and ran the generated analyze_new_cluster.sh
>> immediately afterwards, which completed successfully.
>
> Where did you get 9.3.0 from? Compiled it yourself? Any chance you
> compiled it with --enable-cassert or somesuch?

Directly from http://yum.postgresql.org. So unless the RPMs on there are built weird/wrong, I don't think that's the problem.
Re: [GENERAL] WAL Replication Working but Not Working
The first thing to do is look at your server logs around the time when it stopped working.

On Wed, Aug 21, 2013 at 7:08 AM, Joseph Marlin jmar...@saucontech.com wrote:
> We're having an issue with our warm standby server. About 9:30 last
> night, it stopped applying changes it received in WAL files that are
> shipped over to it as they are created. It is still reading WAL files as
> they are delivered, as the startup_log.txt shows, but the changes in the
> primary database aren't actually being made to the standby, and haven't
> been since last night.
>
> Is there any way we can figure out what is going on here? We'd like to
> recover somehow without having to restore from a base backup, and we'd
> like to figure out what is happening so we can prevent it in the future.
>
> Thanks!
Re: [GENERAL] Streaming Replication Randomly Locking Up
I've never seen this happen. Looks like you might be using 9.1? Are you up to date on all the 9.1.x releases? Do you have just 1 slave syncing from the master? Which OS are you using? Did you verify that there aren't any network problems between the slave and master? Or hardware problems (like the NIC dying, or dropping packets)? On Thu, Aug 15, 2013 at 11:07 AM, Andrew Berman rexx...@gmail.com wrote: Hello, I'm having an issue where streaming replication just randomly stops working. I haven't been able to find anything in the logs which point to an issue, but the Postgres process shows a waiting status on the slave: postgres 5639 0.1 24.3 3428264 2970236 ? Ss Aug14 1:54 postgres: startup process recovering 0001053D003F waiting postgres 5642 0.0 21.4 3428356 2613252 ? Ss Aug14 0:30 postgres: writer process postgres 5659 0.0 0.0 177524 788 ?Ss Aug14 0:03 postgres: stats collector process postgres 7159 1.2 0.1 3451360 18352 ? Ss Aug14 17:31 postgres: wal receiver process streaming 549/216B3730 The replication works great for days, but randomly seems to lock up and replication halts. I verified that the two databases were out of sync with a query on both of them. Has anyone experienced this issue before? Here are some relevant config settings: Master: wal_level = hot_standby checkpoint_segments = 32 checkpoint_completion_target = 0.9 archive_mode = on archive_command = 'rsync -a %p foo@foo:/var/lib/pgsql/9.1/wals/%f /dev/null' max_wal_senders = 2 wal_keep_segments = 32 Slave: wal_level = hot_standby checkpoint_segments = 32 #checkpoint_completion_target = 0.5 hot_standby = on max_standby_archive_delay = -1 max_standby_streaming_delay = -1 #wal_receiver_status_interval = 10s #hot_standby_feedback = off Thank you for any help you can provide! Andrew -- ~ L. 
Friedman netll...@gmail.com LlamaLand https://netllama.linux-sxs.org
Re: [GENERAL] Streaming Replication Randomly Locking Up
Are you certain that there are no relevant errors in the database logs (on both master and slave)? Also, are you sure that you didn't misconfigure logging such that errors wouldn't appear? On Thu, Aug 15, 2013 at 11:45 AM, Andrew Berman rexx...@gmail.com wrote: Hi Lonni, Yes, I am using PG 9.1.9. Yes, 1 slave syncing from the master. CentOS 6.4. I don't see any network or hardware issues (e.g. NIC) but will look more into this. They are communicating on a private network and switch. I forgot to mention that after I restart the slave, everything syncs right back up and all is working again, so if it is a network issue, the replication is just stopping after some hiccup instead of retrying and resuming when things are back up. Thanks!
Re: [GENERAL] Streaming Replication Randomly Locking Up
I'd suggest enhancing your logging to include time/datestamps for every entry, and also the client hostname. That will help to rule in/out those 'unexpected EOF' errors. On Thu, Aug 15, 2013 at 12:22 PM, Andrew Berman rexx...@gmail.com wrote: The only thing I see that is a possibility for the issue is in the slave log: LOG: unexpected EOF on client connection LOG: could not receive data from client: Connection reset by peer I don't know if that's related or not as it could just be somebody running a query. The log file does seem to be riddled with these but the replication failures don't happen constantly. As far as I know I'm not swallowing any errors. The logging is all set as the default: log_destination = 'stderr' logging_collector = on #client_min_messages = notice #log_min_messages = warning #log_min_error_statement = error #log_min_duration_statement = -1 #log_checkpoints = off #log_connections = off #log_disconnections = off #log_error_verbosity = default I'm going to have a look at the NICs to make sure there's no issue there. Thanks again for your help! On Thu, Aug 15, 2013 at 11:51 AM, Lonni J Friedman netll...@gmail.com wrote: Are you certain that there are no relevant errors in the database logs (on both master and slave)? Also, are you sure that you didn't misconfigure logging such that errors wouldn't appear? On Thu, Aug 15, 2013 at 11:45 AM, Andrew Berman rexx...@gmail.com wrote: Hi Lonni, Yes, I am using PG 9.1.9. Yes, 1 slave syncing from the master. CentOS 6.4. I don't see any network or hardware issues (e.g. NIC) but will look more into this. They are communicating on a private network and switch. I forgot to mention that after I restart the slave, everything syncs right back up and all is working again, so if it is a network issue, the replication is just stopping after some hiccup instead of retrying and resuming when things are back up. Thanks! On Thu, Aug 15, 2013 at 11:32 AM, Lonni J Friedman netll...@gmail.com wrote: I've never seen this happen. 
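When the lock-up happens, a snapshot of the master's view of the walsender (a generic 9.1-era check, not quoted from the thread) can show whether the slave is still connected and where it stalled:

```sql
-- Run on the master; one row per connected standby. Compare these
-- locations against pg_current_xlog_location() to measure the lag;
-- a missing row means the walsender connection itself dropped:
SELECT client_addr, state, sent_location, write_location,
       flush_location, replay_location
FROM pg_stat_replication;
```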
[GENERAL] postgres FDW cost estimation options unrecognized in 9.3-beta1
Greetings, I have a postgresql-9.3-beta1 cluster set up (from the yum.postgresql.org RPMs), where I'm experimenting with the postgres FDW extension. The documentation ( http://www.postgresql.org/docs/9.3/static/postgres-fdw.html ) references three Cost Estimation Options which can be set for a foreign table or a foreign server. However, when I attempt to set them, I always get an error that the option is not found:

###
nightly=# show SERVER_VERSION ;
 server_version
----------------
 9.3beta1

nightly=# \des+
                                      List of foreign servers
   Name    |   Owner   | Foreign-data wrapper | Access privileges | Type | Version |                    FDW Options                    | Description
-----------+-----------+----------------------+-------------------+------+---------+---------------------------------------------------+-------------
 cuda_db10 | lfriedman | postgres_fdw         |                   |      |         | (host 'cuda-db10', dbname 'nightly', port '5432') |
(1 row)

nightly=# ALTER SERVER cuda_db10 OPTIONS (SET use_remote_estimate 'true') ;
ERROR:  option "use_remote_estimate" not found
###

Am I doing something wrong, or is this a bug? thanks
Re: [GENERAL] postgres FDW cost estimation options unrecognized in 9.3-beta1
On Fri, Jul 26, 2013 at 3:28 PM, Tom Lane t...@sss.pgh.pa.us wrote: Lonni J Friedman netll...@gmail.com writes: nightly=# ALTER SERVER cuda_db10 OPTIONS (SET use_remote_estimate 'true') ; ERROR: option use_remote_estimate not found Am I doing something wrong, or is this a bug? [ experiments... ] You need to say ADD, not SET, to add a new option to the list. SET might more appropriately be spelled REPLACE, because it requires that the object already have a defined value for the option, which will be replaced. Our documentation appears not to disclose this fine point, but a look at the SQL-MED standard says it's operating per spec. The standard also says that ADD is an error if the option is already defined, which is a bit more defensible, but still not exactly what I'd call user-friendly. And the error we issue for that case is pretty misleading too: regression=# ALTER SERVER cuda_db10 OPTIONS (use_remote_estimate 'true') ; ALTER SERVER regression=# ALTER SERVER cuda_db10 OPTIONS (use_remote_estimate 'false') ; ERROR: option use_remote_estimate provided more than once I think we could do with both more documentation, and better error messages for these cases. In the SET-where-you-should-use-ADD case, perhaps ERROR: option use_remote_estimate has not been set HINT: Use ADD not SET to define an option that wasn't already set. In the ADD-where-you-should-use-SET case, perhaps ERROR: option use_remote_estimate is already set HINT: Use SET not ADD to change an option's value. The provided more than once wording would be appropriate if the same option is specified more than once in the command text, but I'm not sure that it's worth the trouble to detect that case. Thoughts, better wordings? Thanks Tom, I've confirmed that using ADD was the solution. I think your suggested updated ERROR HINT text is an excellent improvement. It definitely would have given me the clue I was missing earlier. 
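Tom's distinction can be summarized in one session (server and option names reused from the example above):

```sql
-- ADD defines an option that isn't set yet on the object
-- (a bare option name with no keyword also means ADD):
ALTER SERVER cuda_db10 OPTIONS (ADD use_remote_estimate 'true');

-- SET replaces the value of an option that is already defined;
-- it errors out if the option has never been added:
ALTER SERVER cuda_db10 OPTIONS (SET use_remote_estimate 'false');

-- DROP removes the option entirely, after which SET would fail again:
ALTER SERVER cuda_db10 OPTIONS (DROP use_remote_estimate);
```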
[GENERAL] upgrading from 9.3-beta1 to 9.3-beta2 requires dump & reload?
Greetings, I just got around to upgrading from 9.3-beta1 to 9.3-beta2, and was surprised to see that the server was refusing to start. In the log, I'm seeing: 2013-07-24 13:41:47 PDT [7083]: [1-1] db=,user= FATAL: database files are incompatible with server 2013-07-24 13:41:47 PDT [7083]: [2-1] db=,user= DETAIL: The database cluster was initialized with CATALOG_VERSION_NO 201305061, but the server was compiled with CATALOG_VERSION_NO 201306121. 2013-07-24 13:41:47 PDT [7083]: [3-1] db=,user= HINT: It looks like you need to initdb. I'm using the RPMs from yum.postgresql.org on RHEL6. Is this expected, intentional behavior? Do I really need to dump & reload to upgrade between beta releases of 9.3, or is there some more efficient way? thanks
Re: [GENERAL] upgrading from 9.3-beta1 to 9.3-beta2 requires dump & reload?
On Wed, Jul 24, 2013 at 2:05 PM, Tom Lane t...@sss.pgh.pa.us wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: Lonni J Friedman escribió: I'm using the RPMs from yum.postgresql.org on RHEL6. Is this expected, intentional behavior? Do I really need to dump & reload to upgrade between beta releases of 9.3, or is there some more efficient way? We try to avoid forcing initdb between beta versions, but it's not guaranteed. You should be able to use pg_upgrade, also. Unfortunately, the RPMs probably won't be very helpful for using pg_upgrade, since there's no convenient way to get beta1 and beta2 postmaster executables installed at the same time (unless Devrim foresaw this case and packaged things differently than I did for Red Hat ;-)). Sounds like I'm out of luck. Thanks anyway.
Re: [GENERAL] Standby stopped working after PANIC: WAL contains references to invalid pages
Looks like some kind of data corruption. Question is whether it came from the master, or was created by the standby. If you re-seed the standby with a full (base) backup, does the problem go away? On Sat, Jun 22, 2013 at 12:43 PM, Dan Kogan d...@iqtell.com wrote: Hello, Today our standby instance stopped working with this error in the log: 2013-06-22 16:27:32 UTC [8367]: [247-1] [] WARNING: page 158130 of relation pg_tblspc/16447/PG_9.2_201204301/16448/39154429 is uninitialized 2013-06-22 16:27:32 UTC [8367]: [248-1] [] CONTEXT: xlog redo vacuum: rel 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129 2013-06-22 16:27:32 UTC [8367]: [249-1] [] PANIC: WAL contains references to invalid pages 2013-06-22 16:27:32 UTC [8367]: [250-1] [] CONTEXT: xlog redo vacuum: rel 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129 2013-06-22 16:27:32 UTC [8366]: [3-1] [] LOG: startup process (PID 8367) was terminated by signal 6: Aborted 2013-06-22 16:27:32 UTC [8366]: [4-1] [] LOG: terminating any other active server processes After re-start the same exact error occurred. We thought that maybe we hit this bug - http://postgresql.1045698.n5.nabble.com/Completely-broken-replica-after-PANIC-WAL-contains-references-to-invalid-pages-td5750072.html. However, there is nothing in our log about sub-transactions, so it didn't seem the same to us. Any advice on how to further debug this so we can avoid this in the future is appreciated. 
Environment: AWS, High I/O instance (hi1.4xlarge), 60GB RAM

Software and settings: PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2, 64-bit

archive_command               rsync -a %p slave:/var/lib/postgresql/replication_load/%f
archive_mode                  on
autovacuum_freeze_max_age     10
autovacuum_max_workers        6
checkpoint_completion_target  0.9
checkpoint_segments           128
checkpoint_timeout            30min
default_text_search_config    pg_catalog.english
hot_standby                   on
lc_messages                   en_US.UTF-8
lc_monetary                   en_US.UTF-8
lc_numeric                    en_US.UTF-8
lc_time                       en_US.UTF-8
listen_addresses              *
log_checkpoints               on
log_destination               stderr
log_line_prefix               %t [%p]: [%l-1] [%h]
log_min_duration_statement    -1
log_min_error_statement       error
log_min_messages              error
log_timezone                  UTC
maintenance_work_mem          1GB
max_connections               1200
max_standby_streaming_delay   90s
max_wal_senders               5
port                          5432
random_page_cost              2
seq_page_cost                 1
shared_buffers                4GB
ssl                           off
ssl_cert_file                 /etc/ssl/certs/ssl-cert-snakeoil.pem
ssl_key_file                  /etc/ssl/private/ssl-cert-snakeoil.key
synchronous_commit            off
TimeZone                      UTC
wal_keep_segments             128
wal_level                     hot_standby
work_mem                      8MB

root@ip-10-148-131-236:~# /usr/local/pgsql/bin/pg_controldata /usr/local/pgsql/data
pg_control version number:            922
Catalog version number:               201204301
Database system identifier:           5838668587531239413
Database cluster state:               in archive recovery
pg_control last modified:             Sat 22 Jun 2013 06:13:07 PM UTC
Latest checkpoint location:           2250/18CA0790
Prior checkpoint location:            2250/18CA0790
Latest checkpoint's REDO location:    224F/E127B078
Latest checkpoint's TimeLineID:       2
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          1/2018629527
Latest checkpoint's NextOID:          43086248
Latest checkpoint's NextMultiXactId:  7088726
Latest checkpoint's NextMultiOffset:  20617234
Latest checkpoint's oldestXID:        1690316999
Latest checkpoint's oldestXID's DB:   16448
Latest checkpoint's oldestActiveXID:  2018629527
Time of latest checkpoint:            Sat 22 Jun 2013 03:24:05 PM UTC
Minimum recovery ending location:     2251/5EA631F0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
Current wal_level setting:            hot_standby
Current max_connections setting:      1200
Current max_prepared_xacts setting:   0
Current max_locks_per_xact setting:   64
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
root@ip-10-148-131-236:~#

Thanks again. Dan
Re: [GENERAL] Standby stopped working after PANIC: WAL contains references to invalid pages
Assuming that you still have $PGDATA from the broken instance (such that you can reproduce the crash again), there might be a way to debug it further. I'd guess that something like bad RAM or storage could cause an index to get corrupted in this fashion, but the fact that you're using AWS makes that less likely. Someone far more knowledgeable than I will need to provide guidance on how to debug this though. On Sat, Jun 22, 2013 at 4:17 PM, Dan Kogan d...@iqtell.com wrote: Re-seeding the standby with a full base backup does seem to make the error go away. The standby started, caught up and has been working for about 2 hours. The file in the error message was an index. We rebuilt it just in case. Is there any way to debug the issue at this point? -Original Message- From: Lonni J Friedman [mailto:netll...@gmail.com] Sent: Saturday, June 22, 2013 4:11 PM To: Dan Kogan Cc: pgsql-general@postgresql.org Subject: Re: [GENERAL] Standby stopped working after PANIC: WAL contains references to invalid pages Looks like some kind of data corruption. Question is whether it came from the master, or was created by the standby. If you re-seed the standby with a full (base) backup, does the problem go away? 
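For reference, re-seeding a 9.2 standby amounts to stopping it, setting the broken data directory aside (so it stays available for debugging, as suggested above), and streaming a fresh base backup from the master. Paths, the hostname, and the replication role below are placeholders:

```shell
# On the standby (assumed data directory; adjust to your layout):
pg_ctl -D /var/lib/postgresql/9.2/main stop -m fast
mv /var/lib/postgresql/9.2/main /var/lib/postgresql/9.2/main.broken

# Stream a fresh copy from the master, including the WAL needed to start:
pg_basebackup -h master.example.com -U replicator \
    -D /var/lib/postgresql/9.2/main -X stream -P

# recovery.conf is not part of the master's data directory, so put the
# standby's copy back before starting:
cp /var/lib/postgresql/recovery.conf.saved /var/lib/postgresql/9.2/main/recovery.conf
pg_ctl -D /var/lib/postgresql/9.2/main start
```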
[GENERAL] 9.3-beta postgres-fdw COPY error
Greetings, I'm trying to test out the new postgres-fdw support in postgresql-9.3 (beta) in preparation for an upgrade from 9.2 later this year. So far everything is working ok; however, one problem I'm encountering is with the COPY command. When I run it against a foreign table (which is also in a 9.3 instance), it fails: COPY my_foreigntbl (id,testname) TO '/tmp/testlist_aid' (DELIMITER ','); ERROR: cannot copy from foreign table "my_foreigntbl" Is this expected behavior or a bug? thanks!
Re: [GENERAL] 9.3-beta postgres-fdw COPY error
I was afraid someone would say that. Is this a limitation that might be removed in the future (like 9.4), or is there a technical reason why it's not possible to do a COPY against a foreign table? On Fri, Jun 21, 2013 at 10:52 AM, Adrian Klaver adrian.kla...@gmail.com wrote: On 06/21/2013 10:39 AM, Lonni J Friedman wrote: Greetings, I'm trying to test out the new postgres-fdw support in postgresql-9.3 (beta) in preparation for an upgrade from 9.2 later this year. So far everything is working ok; however, one problem I'm encountering is with the COPY command. When I run it against a foreign table (which is also in a 9.3 instance), it fails: COPY my_foreigntbl (id,testname) TO '/tmp/testlist_aid' (DELIMITER ','); ERROR: cannot copy from foreign table "my_foreigntbl" Is this expected behavior or a bug? Expected I believe: http://www.postgresql.org/docs/9.3/static/postgres-fdw.html Now you need only SELECT from a foreign table to access the data stored in its underlying remote table. You can also modify the remote table using INSERT, UPDATE, or DELETE. thanks! -- Adrian Klaver adrian.kla...@gmail.com -- ~ L. Friedman netll...@gmail.com LlamaLand https://netllama.linux-sxs.org
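One workaround worth noting (not raised in the thread itself): COPY accepts an arbitrary query, and a SELECT is exactly what the FDW supports, so wrapping the foreign table in a query sidesteps the restriction:

```sql
-- COPY <foreign table> TO ... is rejected in 9.3, but COPY (query) TO
-- runs the SELECT through the FDW and copies its result set:
COPY (SELECT id, testname FROM my_foreigntbl)
TO '/tmp/testlist_aid' (DELIMITER ',');
```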
[GENERAL] how to reference variables in pgbench custom scripts?
I'm attempting to write a custom pgbench script (called via the -f option), with a variable set at the top with: \setrandom aid 100 50875000 However, I can't quite figure out how to reference the new aid variable. The documentation simply states that a variable is referenced with a colon in front of its name (:aid). However, I can't figure out how to make this work if the variable is embedded between other characters. For example, I want to do something like this: CREATE INDEX nppsmoke_tmp_:aid_idx ON nppsmoke_tmp_:aid (testname); However, when I run pgbench, it fails: Client 0 aborted in state 4: ERROR: syntax error at or near : LINE 1: CREATE INDEX nppsmoke_tmp_:aid_idx ON nppsmoke_tmp_29954053 ... Is there a default delimiter, such as bash's curly brackets (${aid})? thanks
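There is no ${...}-style delimiter in pgbench; :aid_idx simply parses as a reference to a different variable named aid_idx. One possible workaround (a sketch, assuming pgbench's textual substitution of :aid reaches inside the statement body, which held for the 9.x-era pgbench) is to let the server splice the identifier itself via format() and EXECUTE:

```sql
\setrandom aid 100 50875000
DO $$
BEGIN
  -- :aid is replaced textually by pgbench before the statement is sent,
  -- so format() receives a plain integer to splice into both identifiers:
  EXECUTE format('CREATE INDEX nppsmoke_tmp_%s_idx ON nppsmoke_tmp_%s (testname)',
                 :aid, :aid);
END $$;
```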
Re: [GENERAL] pg_upgrade link mode
On Thu, May 16, 2013 at 11:03 AM, Igor Neyman iney...@perceptron.com wrote: From: pgsql-general-ow...@postgresql.org [mailto:pgsql-general-ow...@postgresql.org] On Behalf Of AI Rumman Sent: Thursday, May 16, 2013 1:56 PM To: Fabio Rueda Carrascosa Cc: pgsql-general Subject: Re: [GENERAL] pg_upgrade link mode I always think it's a bit risky to use link instead of copying. However, I'd suggest trying pg_upgrade's --check option first. -- Why? Do you have specific experience, when link mode caused any problems? Could you share? I assume what he's referring to is that if the upgrade gets partially completed and fails for any reason, then you have a broken mess, with no simple rollback path. Since your database is only about 1GB in size, it shouldn't take very long to run a base backup before doing the upgrade. You can send that backup over the network to a remote system, so that you have a fallback solution if the upgrade fails.
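A minimal sketch of that fallback plan (paths, hostnames, and the database name are assumptions, matching a typical RPM layout for a 9.2-to-9.3 link-mode upgrade):

```shell
# 1. Take a logical backup small enough to ship off-host (~1GB here)
#    and copy it to another machine as the rollback of last resort:
pg_dump -Fc -f /backups/mydb.pre_upgrade.dump mydb
scp /backups/mydb.pre_upgrade.dump backuphost:/backups/

# 2. Dry-run the upgrade; --check reports incompatibilities without
#    modifying either cluster:
pg_upgrade --check --link \
    -b /usr/pgsql-9.2/bin -B /usr/pgsql-9.3/bin \
    -d /var/lib/pgsql/9.2/data -D /var/lib/pgsql/9.3/data

# 3. Only run the real link-mode upgrade (same command without --check)
#    once the dry run passes.
```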
Re: [GENERAL] pg_basebackup, requested WAL has already been removed
It's definitely not a bug. You need to set/increase wal_keep_segments to a value that ensures that they aren't recycled faster than the time required to complete the base backup (plus some buffer). On Fri, May 10, 2013 at 9:48 AM, Sergey Koposov kopo...@ast.cam.ac.uk wrote: Hi, I've recently started to use pg_basebackup --xlog-method=stream to backup my multi-Tb database. Before, I did the backup when there was not much activity in the DB and it went perfectly fine, but today I've started the backup and it failed twice, almost at the same time as the CREATE INDEX (and another time CLUSTER) commands were finished. Here: postgres@cappc118:/mnt/backup/wsdb_130510$ pg_basebackup --xlog-method=stream --progress --verbose --pg transaction log start point: 23AE/BD003E70 pg_basebackup: starting background WAL receiver pg_basebackup: unexpected termination of replication stream: FATAL: requested WAL segment 000123B100FE has already been removed 4819820/16816887078 kB (4%), 0/1 tablespace (/mnt/backup/wsdb_130510/base/1) And the logs from around that time contained: some_user:wsdb:2013-05-10 14:35:41 BST:10587LOG: duration: 40128.163 ms statement: CREATE INDEX usno_clean_q3c_idx ON usno_clean (q3c_ang2ipix(ra,dec)); ::2013-05-10 14:35:43 BST:25529LOG: checkpoints are occurring too frequently (8 seconds apart) ::2013-05-10 14:35:43 BST:25529HINT: Consider increasing the configuration parameter checkpoint_segments. ::2013-05-10 14:35:51 BST:25529LOG: checkpoints are occurring too frequently (8 seconds apart) ::2013-05-10 14:35:51 BST:25529HINT: Consider increasing the configuration parameter checkpoint_segments. 
postgres:[unknown]:2013-05-10 14:35:55 BST:8177FATAL: requested WAL segment 000123B100FE has already been removed some_user:wsdb:2013-05-10 14:36:59 BST:10599LOG: duration: 78378.194 ms statement: CLUSTER usno_clean_q3c_idx ON usno_clean; On the previous occasion when it happened the CREATE INDEX() was being executed: some_user:wsdb:2013-05-10 09:17:20 BST:3300LOG: duration: 67.680 ms statement: SELECT name FROM (SELECT pg_catalog.lower(name) AS name FROM pg_catalog.pg_settings UNION ALL SELECT 'session authorization' UNION ALL SELECT 'all') ss WHERE substring(name,1,4)='rand' LIMIT 1000 ::2013-05-10 09:22:47 BST:25529LOG: checkpoints are occurring too frequently (18 seconds apart) ::2013-05-10 09:22:47 BST:25529HINT: Consider increasing the configuration parameter checkpoint_segments. postgres:[unknown]:2013-05-10 09:22:49 BST:27659FATAL: requested WAL segment 000123990040 has already been removed some_user:wsdb:2013-05-10 09:22:57 BST:3236LOG: duration: 542955.262 ms statement: CREATE INDEX xmatch_temp_usnoid_idx ON xmatch_temp (usno_id); The configuration: PG 9.2.4, Debian 7.0, amd64 shared_buffers = 10GB work_mem = 1GB maintenance_work_mem = 1GB effective_io_concurrency = 5 synchronous_commit = off checkpoint_segments = 32 max_wal_senders = 2 effective_cache_size = 30GB autovacuum_max_workers = 3 wal_level=archive archive_mode = off Does it look like a bug or am I missing something? Thanks, Sergey -- ~ L. Friedman netll...@gmail.com LlamaLand https://netllama.linux-sxs.org
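The suggested fix is a postgresql.conf change on the master; the value below is an illustrative guess and should be sized from the backup duration and the WAL generation rate (each segment is 16MB in 9.2):

```
# postgresql.conf on the master being backed up:
wal_keep_segments = 512   # retain 512 * 16MB = 8GB of WAL for walsenders
max_wal_senders   = 3     # pg_basebackup -X stream uses two connections,
                          # so leave at least one spare slot
```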
Re: [GENERAL] pg_basebackup, requested WAL has already been removed
That's a good point. Then I dunno, perhaps it is a bug, but I'd be surprised if this wasn't working, as it's not really a corner case that could be missed in testing, as long as all the options were exercised. Hopefully someone else can weigh in. On Fri, May 10, 2013 at 10:00 AM, Sergey Koposov kopo...@ast.cam.ac.uk wrote: On Fri, 10 May 2013, Lonni J Friedman wrote: It's definitely not a bug. You need to set/increase wal_keep_segments to a value that ensures that they aren't recycled faster than the time required to complete the base backup (plus some buffer). But I thought that wal_keep_segments is not needed for the streaming regime (--xlog-method=stream). And the documentation http://www.postgresql.org/docs/9.2/static/app-pgbasebackup.html only mentions wal_keep_segments when talking about --xlog-method=fetch. On Fri, May 10, 2013 at 9:48 AM, Sergey Koposov kopo...@ast.cam.ac.uk wrote: Hi, I've recently started to use pg_basebackup --xlog-method=stream to backup my multi-Tb database. Before, I did the backup when there was not much activity in the DB and it went perfectly fine, but today I've started the backup and it failed twice, almost at the same time as the CREATE INDEX (and another time CLUSTER) commands were finished. 
Here: postgres@cappc118:/mnt/backup/wsdb_130510$ pg_basebackup --xlog-method=stream --progress --verbose --pg transaction log start point: 23AE/BD003E70 pg_basebackup: starting background WAL receiver pg_basebackup: unexpected termination of replication stream: FATAL: requested WAL segment 000123B100FE has already been removed 4819820/16816887078 kB (4%), 0/1 tablespace (/mnt/backup/wsdb_130510/base/1) And the logs from around that time contained: some_user:wsdb:2013-05-10 14:35:41 BST:10587LOG: duration: 40128.163 ms statement: CREATE INDEX usno_cle an_q3c_idx ON usno_clean (q3c_ang2ipix(ra,dec)); ::2013-05-10 14:35:43 BST:25529LOG: checkpoints are occurring too frequently (8 seconds apart) ::2013-05-10 14:35:43 BST:25529HINT: Consider increasing the configuration parameter checkpoint_segmen ts. ::2013-05-10 14:35:51 BST:25529LOG: checkpoints are occurring too frequently (8 seconds apart) ::2013-05-10 14:35:51 BST:25529HINT: Consider increasing the configuration parameter checkpoint_segmen ts. postgres:[unknown]:2013-05-10 14:35:55 BST:8177FATAL: requested WAL segment 000123B100FE has already been removed some_user:wsdb:2013-05-10 14:36:59 BST:10599LOG: duration: 78378.194 ms statement: CLUSTER usno_clean_q3c_idx ON usno_clean; One the previous occasion when it happened the CREATE INDEX() was being executed: some_user:wsdb:2013-05-10 09:17:20 BST:3300LOG: duration: 67.680 ms statement: SELECT name FROM (SELECT pg_catalog.lower(name) AS name FROM pg_catalog.pg_settings UNION ALL SELECT 'session authorization' UNION ALL SELECT 'all') ss WHERE substring(name,1,4)='rand' LIMIT 1000 ::2013-05-10 09:22:47 BST:25529LOG: checkpoints are occurring too frequently (18 seconds apart) ::2013-05-10 09:22:47 BST:25529HINT: Consider increasing the configuration parameter checkpoint_segments. 
postgres:[unknown]:2013-05-10 09:22:49 BST:27659FATAL: requested WAL segment 000123990040 has already been removed some_user:wsdb:2013-05-10 09:22:57 BST:3236LOG: duration: 542955.262 ms statement: CREATE INDEX xmatch_temp_usnoid_idx ON xmatch_temp (usno_id); The configuration (PG 9.2.4, Debian 7.0, amd64): shared_buffers = 10GB work_mem = 1GB maintenance_work_mem = 1GB effective_io_concurrency = 5 synchronous_commit = off checkpoint_segments = 32 max_wal_senders = 2 effective_cache_size = 30GB autovacuum_max_workers = 3 wal_level = archive archive_mode = off Does it look like a bug or am I missing something? Thanks, Sergey -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general -- ~ L. Friedman netll...@gmail.com LlamaLand https://netllama.linux-sxs.org
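For reference, the fix suggested upthread is a one-line change on the master. A minimal postgresql.conf sketch; the value is illustrative and depends on how much WAL is generated while the base backup runs:

```
# postgresql.conf on the master -- value is illustrative, size it to your WAL volume
wal_keep_segments = 256   # retain ~4 GB of WAL (256 x 16 MB segments)
```

With streaming (-X stream / --xlog-method=stream) the WAL receiver normally keeps up, but heavy WAL churn from CREATE INDEX/CLUSTER can still outrun it, so a safety margin here is cheap insurance.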
Re: [GENERAL] Deploying PostgreSQL on CentOS with SSD and Hardware RAID
On Fri, May 10, 2013 at 10:20 AM, Merlin Moncure mmonc...@gmail.com wrote: On Fri, May 10, 2013 at 12:03 PM, David Boreham david_l...@boreham.org wrote: On 5/10/2013 10:21 AM, Merlin Moncure wrote: As it turns out, the list of flash drives that are suitable for database use is surprisingly small. The s3700 I noted upthread seems to be specifically built with databases in mind and is likely the best choice for new deployments. The older Intel 320 is also a good choice. I think that's pretty much it until you get into expensive pci-e based gear. This may have been a typo: did you mean the Intel 710 series rather than the 320? While the 320 has the supercap, it isn't specified for high write endurance. Definitely usable for a database, and a better choice than most of the alternatives, but I'd have listed the 710 ahead of the 320. It wasn't a typo. The 320 is perfectly fine, although it will wear out faster -- so it fills a niche for low write intensity applications. I find the s3700 to be superior to the 710 in just about every way (although you're right -- it is suitable for database use). There's also the 520 series, which has better performance than the 320 series (which is EOL now).
Re: [GENERAL] Deploying PostgreSQL on CentOS with SSD and Hardware RAID
On Fri, May 10, 2013 at 11:23 AM, Steven Schlansker ste...@likeness.com wrote: On May 10, 2013, at 7:14 AM, Matt Brock m...@mattbrock.co.uk wrote: Hello. We're intending to deploy PostgreSQL on Linux with SSD drives which would be in a RAID 1 configuration with Hardware RAID. My first question is essentially: are there any issues we need to be aware of when running PostgreSQL 9 on CentOS 6 on a server with SSD drives in a Hardware RAID 1 configuration? Will there be any compatibility problems (seems unlikely)? Should we consider alternative configurations as being more effective for getting better use out of the hardware? The second question is: are there any SSD-specific issues to be aware of when tuning PostgreSQL to make the best use of this hardware and software? A couple of things I noticed with a similar-ish setup: * Some forms of RAID / LVM break the kernel's automatic disk tuning mechanism. In particular, there is a rotational tunable that often does not get set right. You might end up tweaking read ahead and friends as well. http://www.mjmwired.net/kernel/Documentation/block/queue-sysfs.txt#112 * The default Postgres configuration is awful for a SSD backed database. You really need to futz with checkpoints to get acceptable throughput. The PostgreSQL 9.0 High Performance book is fantastic and is what I used to great success. * The default Linux virtual memory configuration is awful for this configuration. Briefly, it will accept a ton of incoming data, and then go through an awful stall as soon as it calls fsync() to write all that data to disk. We had multi-second delays all the way through to the application because of this. We had to change the zone_reclaim_mode and the dirty buffer limits. http://www.postgresql.org/message-id/500616cb.3070...@2ndquadrant.com I am not sure that these numbers will end up being anywhere near what works for you, but these are my notes from tuning a 4xMLC SSD RAID-10. 
I haven't proven that this is optimal, but it was way better than the defaults. We ended up with the following list of changes: * Change IO scheduler to noop * Mount DB volume with nobarrier, noatime * Turn blockdev readahead to 16MiB * Turn sdb's rotational tuneable to 0 PostgreSQL configuration changes: synchronous_commit = off effective_io_concurrency = 4 checkpoint_segments = 1024 checkpoint_timeout = 10min checkpoint_warning = 8min shared_buffers = 32gb temp_buffers = 128mb work_mem = 512mb maintenance_work_mem = 1gb Linux sysctls: vm.swappiness = 0 vm.zone_reclaim_mode = 0 vm.dirty_bytes = 134217728 vm.dirty_background_bytes = 1048576 Can you provide more details about your setup, including: * What kind of filesystem are you using? * Linux distro and/or kernel version * hardware (RAM, CPU cores etc) * database usage patterns (% writes, growth, etc) thanks -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
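The block-device changes listed above can be applied with a few commands at boot. A sketch, assuming the data volume is /dev/sdb as in the poster's notes (device name and values are from their setup, not a recommendation):

```shell
# Illustrative only -- sdb is the poster's device name; run as root.
# Use the noop I/O scheduler; SSDs don't benefit from elevator reordering.
echo noop > /sys/block/sdb/queue/scheduler
# Tell the kernel the device is non-rotational (the RAID layer often hides this).
echo 0 > /sys/block/sdb/queue/rotational
# 16 MiB readahead, expressed in 512-byte sectors: 16*1024*1024/512 = 32768
blockdev --setra 32768 /dev/sdb
```

These settings do not survive a reboot, so they typically go in rc.local or a udev rule.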
Re: [GENERAL] Replication terminated due to PANIC
If it's really index corruption, then you should be able to fix it by reindexing. However, that doesn't explain what caused the corruption. Perhaps your hardware is bad in some way? On Wed, Apr 24, 2013 at 10:46 PM, Adarsh Sharma eddy.ada...@gmail.com wrote: Thanks Sergey for such a quick response, but I don't think this is a patch problem, because we have other DB servers also running fine on the same version, and the message is also different: host= PANIC: _bt_restore_page: cannot add item to page And the whole day replication works fine, but at midnight when the log rotates it shows the below msg: 2013-04-24 00:00:00 UTC [26989]: [4945032-1] user= db= host= LOG: checkpoint starting: time 2013-04-24 00:00:00 UTC [26989]: [4945033-1] user= db= host= ERROR: could not open file global/14078: No such file or directory 2013-04-24 00:00:00 UTC [26989]: [4945034-1] user= db= host= CONTEXT: writing block 0 of relation global/14078 2013-04-24 00:00:00 UTC [26989]: [4945035-1] user= db= host= WARNING: could not write block 0 of global/14078 2013-04-24 00:00:00 UTC [26989]: [4945036-1] user= db= host= DETAIL: Multiple failures --- write error might be permanent. Looks like some index corruption. Thanks On Thu, Apr 25, 2013 at 8:14 AM, Sergey Konoplev gray...@gmail.com wrote: On Wed, Apr 24, 2013 at 5:05 PM, Adarsh Sharma eddy.ada...@gmail.com wrote: I have a Postgresql 9.2 instance running on a CentOS 6.3 box. Yesterday I set up a hot standby by using pg_basebackup. Today I got the below alert from the standby box: [1] (from line 412,723) 2013-04-24 23:07:18 UTC [13445]: [6-1] user= db= host= PANIC: _bt_restore_page: cannot add item to page When I checked, the replication had terminated due to slave DB shutdown. From the logs I can see the below messages: I am not sure that it is your situation but take a look at this thread: http://www.postgresql.org/message-id/CAL_0b1t=WuM6roO8dki=w8dhh8p8whhohbpjreymmqurocn...@mail.gmail.com There is a patch by Andres Freund at the end of the discussion.
Three weeks have passed after I installed the patched version and it looks like the patch fixed my issue. 2013-04-24 23:17:16 UTC [26989]: [5360083-1] user= db= host= ERROR: could not open file global/14078: No such file or directory 2013-04-24 23:17:16 UTC [26989]: [5360084-1] user= db= host= CONTEXT: writing block 0 of relation global/14078 2013-04-24 23:17:16 UTC [26989]: [5360085-1] user= db= host= WARNING: could not write block 0 of global/14078 2013-04-24 23:17:16 UTC [26989]: [5360086-1] user= db= host= DETAIL: Multiple failures --- write error might be permanent. I checked in global directory of master, the directory 14078 doesn't exist. Anyone has faced above issue ? Thanks -- Kind regards, Sergey Konoplev Database and Software Consultant Profile: http://www.linkedin.com/in/grayhemp Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979 Skype: gray-hemp Jabber: gray...@gmail.com -- ~ L. Friedmannetll...@gmail.com LlamaLand https://netllama.linux-sxs.org -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
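If the damage really is confined to an index, rebuilding it is usually sufficient. A sketch; the index and table names here are hypothetical placeholders for whatever object the errors point at:

```sql
-- Rebuild a single suspect index (takes an exclusive lock on its table)
REINDEX INDEX suspect_idx;

-- Or rebuild every index on the affected table in one go
REINDEX TABLE affected_table;
```

Note this only repairs the index structure; if the underlying heap or hardware is bad, the corruption will come back.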
Re: [GENERAL] corrupted item pointer in streaming based replication
You should figure out what base/16384/114846.39 corresponds to inside the database. If you're super lucky it's something unimportant and/or something that can be recreated easily (like an index). If it's something important, then your only option is to try to drop the object and restore it from the last known good backup. On Wed, Apr 3, 2013 at 1:02 PM, Jigar Shah js...@pandora.com wrote: Hi, Postgres version = 9.1.2 OS = debian(6.0.7) fsync = on full_page_writes = on Setup = Primary and streaming replication based secondary A few days ago we had a situation where our Primary started to throw the error messages below, indicating corruption in the database. It crashed sometimes and showed a panic message in the logs 2013-03-25 07:30:39.545 PDT PANIC: corrupted item pointer: offset = 0, size = 0 2013-03-25 07:30:39.704 PDT LOG: server process (PID 8715) was terminated by signal 6: Aborted 2013-03-25 07:30:39.704 PDT LOG: terminating any other active server processes Days before it started to crash, it showed the below error messages in the logs. [d: u:postgres p:2498 7] ERROR: could not access status of transaction 837550133 DETAIL: Could not open file pg_clog/031E: No such file or directory. [u:postgres p:2498 9] [d: u:radio p:31917 242] ERROR: could not open file base/16384/114846.39 (target block 360448000): No such file or directory [d: u:radio p:31917 243] On top of that, our secondaries have now crashed, would not start up, and showed the error messages below in the pg logs.
2013-03-27 11:00:47.281 PDT LOG: recovery restart point at 161A/17108AA8 2013-03-27 11:00:47.281 PDT DETAIL: last completed transaction was at log time 2013-03-27 11:00:47.241236-07 2013-03-27 11:00:47.520 PDT LOG: restartpoint starting: xlog 2013-03-27 11:07:51.348 PDT FATAL: corrupted item pointer: offset = 0, size = 0 2013-03-27 11:07:51.348 PDT CONTEXT: xlog redo split_l: rel 1663/16384/115085 left 4256959, right 5861610, next 5044459, level 0, firstright 192 2013-03-27 11:07:51.716 PDT LOG: startup process (PID 5959) exited with exit code 1 2013-03-27 11:07:51.716 PDT LOG: terminating any other active server processes At this point we have a running but corrupt primary and crashed secondary that wont startup. I am wondering what are our options at this point. Can we do something to fix this? How can we recover from corruption. Thanks for help in advance. Regards Jigar -- ~ L. Friedmannetll...@gmail.com LlamaLand https://netllama.linux-sxs.org -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
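The path in that error encodes tablespace/database/relfilenode, so the object behind base/16384/114846 can be looked up from inside the database whose OID is 16384. A sketch:

```sql
-- Run while connected to the database with OID 16384 (see pg_database.oid).
-- relkind 'i' would mean the damaged file is "only" an index and can be rebuilt.
SELECT n.nspname, c.relname, c.relkind
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relfilenode = 114846;
```

The ".39" suffix is just the 39th 1 GB segment of that relation's file, so this is a large table or index.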
Re: [GENERAL] Streaming replication slave crash
Looks like you've got some form of corruption: page 1441792 of relation base/63229/63370 does not exist The question is whether it was corrupted on the master and then replicated to the slave, or if it was corrupted on the slave. I'd guess that the pg_dump tried to read from that page and barfed. It would be interesting to re-run the pg_dump to see if this crash can be replicated. If so, does it also replicate if you run pg_dump against the master? If not, then the corruption is isolated to the slave, and you might have a hardware problem which is causing the data to get corrupted. On Fri, Mar 29, 2013 at 9:19 AM, Quentin Hartman qhart...@direwolfdigital.com wrote: Yesterday morning, one of my streaming replication slaves running 9.2.3 crashed with the following in the log file: 2013-03-28 12:49:30 GMT WARNING: page 1441792 of relation base/63229/63370 does not exist 2013-03-28 12:49:30 GMT CONTEXT: xlog redo delete: index 1663/63229/109956; iblk 303, heap 1663/63229/63370; 2013-03-28 12:49:30 GMT PANIC: WAL contains references to invalid pages 2013-03-28 12:49:30 GMT CONTEXT: xlog redo delete: index 1663/63229/109956; iblk 303, heap 1663/63229/63370; 2013-03-28 12:49:31 GMT LOG: startup process (PID 22941) was terminated by signal 6: Aborted 2013-03-28 12:49:31 GMT LOG: terminating any other active server processes 2013-03-28 12:49:31 GMT WARNING: terminating connection because of crash of another server process 2013-03-28 12:49:31 GMT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2013-03-28 12:49:31 GMT HINT: In a moment you should be able to reconnect to the database and repeat your command.
2013-03-28 12:57:44 GMT LOG: database system was interrupted while in recovery at log time 2013-03-28 12:37:42 GMT 2013-03-28 12:57:44 GMT HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target. 2013-03-28 12:57:44 GMT LOG: entering standby mode 2013-03-28 12:57:44 GMT LOG: redo starts at 19/2367CE30 2013-03-28 12:57:44 GMT LOG: incomplete startup packet 2013-03-28 12:57:44 GMT LOG: consistent recovery state reached at 19/241835B0 2013-03-28 12:57:44 GMT LOG: database system is ready to accept read only connections 2013-03-28 12:57:44 GMT LOG: invalid record length at 19/2419EE38 2013-03-28 12:57:44 GMT LOG: streaming replication successfully connected to primary As you can see I was able to restart it and it picked up and synchronized right away, but this crash still concerns me. The DB has about 75GB of data in it, and it is almost entirely write traffic. It's essentially a log aggregator. I believe it was doing a pg_dump backup at the time of the crash. It has hot_standby_feedback on to allow that process to complete. Any insights into this, or advice on figuring out the root of it would be appreciated. So far all the things I've found like this are bugs that should be fixed in this version, or the internet equivalent of a shrug. Thanks! QH -- ~ L. Friedmannetll...@gmail.com LlamaLand https://netllama.linux-sxs.org -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] replication behind high lag
On Mon, Mar 25, 2013 at 12:37 PM, AI Rumman rumman...@gmail.com wrote: Hi, I have two 9.2 databases running with hot_standby replication. Today when I was checking, I found that replication has not been working since Mar 1st. There was a large database restored in master on that day and I believe after that the lag went higher. SELECT pg_xlog_location_diff(pg_current_xlog_location(), '0/0') AS offset 431326108320 SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(), '0/0') AS receive, pg_xlog_location_diff(pg_last_xlog_replay_location(), '0/0') AS replay receive|replay --+-- 245987541312 | 245987534032 (1 row) I checked the pg_xlog in both the servers. In the slave the last xlog file is -rw--- 1 postgres postgres 16777216 Mar 1 06:02 00010039007F In the master, the first xlog file is -rw--- 1 postgres postgres 16777216 Mar 1 04:45 00010039005E Is there any way I could sync the slave quickly? generate a new base backup, and seed the slave with it.
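Reseeding the standby amounts to stopping it, clearing its data directory, and taking a fresh copy from the master. A sketch using 9.2's pg_basebackup; the host, user, and path are hypothetical:

```shell
# On the standby, with postgres stopped and the old data directory cleared.
# -X stream pulls the WAL generated during the copy over a second connection,
# so the 16 MB (16777216-byte) segments can't be recycled out from under us.
pg_basebackup -h master-host -U replicator \
    -D /var/lib/postgresql/9.2/main -X stream -P
```

After the copy finishes, restore recovery.conf on the standby and start it; it should connect and stream from the current position.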
Re: [GENERAL] replication behind high lag
On Mon, Mar 25, 2013 at 12:43 PM, AI Rumman rumman...@gmail.com wrote: On Mon, Mar 25, 2013 at 3:40 PM, Lonni J Friedman netll...@gmail.com wrote: On Mon, Mar 25, 2013 at 12:37 PM, AI Rumman rumman...@gmail.com wrote: Hi, I have two 9.2 databases running with hot_standby replication. Today when I was checking, I found that replication has not been working since Mar 1st. There was a large database restored in master on that day and I believe after that the lag went higher. SELECT pg_xlog_location_diff(pg_current_xlog_location(), '0/0') AS offset 431326108320 SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(), '0/0') AS receive, pg_xlog_location_diff(pg_last_xlog_replay_location(), '0/0') AS replay receive|replay --+-- 245987541312 | 245987534032 (1 row) I checked the pg_xlog in both the server. In Slave the last xlog file -rw--- 1 postgres postgres 16777216 Mar 1 06:02 00010039007F In Master, the first xlog file is -rw--- 1 postgres postgres 16777216 Mar 1 04:45 00010039005E Is there any way I could sync the slave in quick process? generate a new base backup, and seed the slave with it. OK. I am getting these error in slave: LOG: invalid contrecord length 284 in log file 57, segment 127, offset 0 What is the actual reason? Corruption? What were you doing when you saw the error? -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] replication behind high lag
On Mon, Mar 25, 2013 at 12:55 PM, AI Rumman rumman...@gmail.com wrote: On Mon, Mar 25, 2013 at 3:52 PM, Lonni J Friedman netll...@gmail.com wrote: On Mon, Mar 25, 2013 at 12:43 PM, AI Rumman rumman...@gmail.com wrote: On Mon, Mar 25, 2013 at 3:40 PM, Lonni J Friedman netll...@gmail.com wrote: On Mon, Mar 25, 2013 at 12:37 PM, AI Rumman rumman...@gmail.com wrote: Hi, I have two 9.2 databases running with hot_standby replication. Today when I was checking, I found that replication has not been working since Mar 1st. There was a large database restored in master on that day and I believe after that the lag went higher. SELECT pg_xlog_location_diff(pg_current_xlog_location(), '0/0') AS offset 431326108320 SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(), '0/0') AS receive, pg_xlog_location_diff(pg_last_xlog_replay_location(), '0/0') AS replay receive|replay --+-- 245987541312 | 245987534032 (1 row) I checked the pg_xlog in both the server. In Slave the last xlog file -rw--- 1 postgres postgres 16777216 Mar 1 06:02 00010039007F In Master, the first xlog file is -rw--- 1 postgres postgres 16777216 Mar 1 04:45 00010039005E Is there any way I could sync the slave in quick process? generate a new base backup, and seed the slave with it. OK. I am getting these error in slave: LOG: invalid contrecord length 284 in log file 57, segment 127, offset 0 What is the actual reason? Corruption? What were you doing when you saw the error? I did not have enough idea about these stuffs. I got the database now and saw the error. Is there any way to recover from this state. The master database is a large database of 500 GB. generate a new base backup, and seed the slave with it. if the error persists, then i'd guess that your master is corrupted, and then you've got huge problems. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] replication behind high lag
On Mon, Mar 25, 2013 at 1:23 PM, AI Rumman rumman...@gmail.com wrote: On Mon, Mar 25, 2013 at 4:03 PM, AI Rumman rumman...@gmail.com wrote: On Mon, Mar 25, 2013 at 4:00 PM, Lonni J Friedman netll...@gmail.com wrote: On Mon, Mar 25, 2013 at 12:55 PM, AI Rumman rumman...@gmail.com wrote: On Mon, Mar 25, 2013 at 3:52 PM, Lonni J Friedman netll...@gmail.com wrote: On Mon, Mar 25, 2013 at 12:43 PM, AI Rumman rumman...@gmail.com wrote: On Mon, Mar 25, 2013 at 3:40 PM, Lonni J Friedman netll...@gmail.com wrote: On Mon, Mar 25, 2013 at 12:37 PM, AI Rumman rumman...@gmail.com wrote: Hi, I have two 9.2 databases running with hot_standby replication. Today when I was checking, I found that replication has not been working since Mar 1st. There was a large database restored in master on that day and I believe after that the lag went higher. SELECT pg_xlog_location_diff(pg_current_xlog_location(), '0/0') AS offset 431326108320 SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(), '0/0') AS receive, pg_xlog_location_diff(pg_last_xlog_replay_location(), '0/0') AS replay receive|replay --+-- 245987541312 | 245987534032 (1 row) I checked the pg_xlog in both the server. In Slave the last xlog file -rw--- 1 postgres postgres 16777216 Mar 1 06:02 00010039007F In Master, the first xlog file is -rw--- 1 postgres postgres 16777216 Mar 1 04:45 00010039005E Is there any way I could sync the slave in quick process? generate a new base backup, and seed the slave with it. OK. I am getting these error in slave: LOG: invalid contrecord length 284 in log file 57, segment 127, offset 0 What is the actual reason? Corruption? What were you doing when you saw the error? I did not have enough idea about these stuffs. I got the database now and saw the error. Is there any way to recover from this state. The master database is a large database of 500 GB. generate a new base backup, and seed the slave with it. 
If the error persists, then I'd guess that your master is corrupted, and then you've got huge problems. Master is running fine right now, showing only a warning: WARNING: archive_mode enabled, yet archive_command is not set Do you think the master could be corrupted? Hi, I got the info that there was a master db restart on Feb 27th. Could this be the reason for this error? Restarting the database cleanly should never cause corruption. Again, you need to create a new base backup, and seed the slave with it. If the problem persists, then the master is likely corrupted.
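Rather than comparing pg_xlog file names by hand as above, the lag can be read directly on the master. A sketch for 9.2 (these column and function names changed to pg_wal_* in later releases):

```sql
-- On the master: per-standby lag in bytes. An empty result means no standby
-- is connected at all, which is itself worth alerting on.
SELECT client_addr, state,
       pg_xlog_location_diff(pg_current_xlog_location(), sent_location)   AS send_lag_bytes,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS replay_lag_bytes
FROM pg_stat_replication;
```

Polling this periodically would have caught the Mar 1st breakage the day it happened instead of weeks later.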
Re: [GENERAL] UNLOGGED TEMPORARY tables?
I'm pretty sure that unlogged tables and temp tables are two separate, distinct features, with no overlap in functionality. It would be nice if it were possible to create an unlogged temp table. On Sun, Mar 24, 2013 at 1:32 PM, aasat satri...@veranet.pl wrote: I tested write speed to temporary and unlogged tables and noticed that unlogged tables were much faster Postgres 9.2.2 Write speed Temporary 14.5k/s UNLOGGED 50k/s Before the test I was convinced that temporary tables in postgres >= 9.1 are unlogged
Re: [GENERAL] UNLOGGED TEMPORARY tables?
On Mon, Mar 25, 2013 at 4:49 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Tue, Mar 26, 2013 at 8:26 AM, Lonni J Friedman netll...@gmail.com wrote: I'm pretty sure that unlogged tables and temp tables are two separate distinct features, with no overlap in functionality. It would be nice if it was possible to create an unlogged temp table. Temporary tables are a subtype of unlogged tables, as temporary tables are not WAL-logged. This article from Robert Haas will give a good summary of such differences: http://rhaas.blogspot.jp/2010/05/global-temporary-and-unlogged-tables.html Thanks, that's good to know. The official docs don't really make it clear that temp tables are unlogged.
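The distinction discussed above can be seen directly: neither table below is WAL-logged, but only the temporary one is session-private and dropped at session end. There is no combined "unlogged temporary" form because TEMP already implies no WAL. A sketch:

```sql
-- Visible to all sessions; survives restart but is truncated after a crash
CREATE UNLOGGED TABLE t_unlogged (id int, val numeric);

-- Private to this session; disappears entirely at disconnect
CREATE TEMPORARY TABLE t_temp (id int, val numeric);
```

The speed gap in the benchmark upthread likely comes from other temp-table overheads (per-session catalog entries, local buffers), not from WAL.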
Re: [GENERAL] Replication stopped on 9.0.2 after making change to conf file
It sounds like all you did was set up the slave from scratch with a fresh base backup, without understanding or debugging what caused everything to break. Clearly whatever was wrong on March 5 is still wrong, and nothing has been fixed. The first step in debugging this problem is to look at and/or post the log content (from both the master and slave) from the time when this stopped working (March 5). On Fri, Mar 8, 2013 at 11:06 PM, akp geek akpg...@gmail.com wrote: What I noticed is... our streaming stopped on March 5th. I restarted the process today. However, the new data is syncing correctly. But the data for the dates between the 5th and today is not syncing. Is there something wrong that I did? Here is what I did. 1. created base backup 2. Took the data directory on to the slave 3. stopped the base backup on master 4. started the db on slave Appreciate your help. Regards On Sat, Mar 9, 2013 at 12:52 AM, akp geek akpg...@gmail.com wrote: Thanks a lot. I started the replication. It became very slow. It is taking a long time to sync the master's data onto the slave. Is there a way to find what's causing the issue? Regards On Fri, Mar 8, 2013 at 12:06 PM, John Laing john.la...@gmail.com wrote: I'm not sure about the existence of any standard scripts, but we have a pair of checks running periodically on the backup server. This shouldn't return anything: tail -3 /var/log/postgresql/postgresql-9.1-main.log | grep FATAL And this should return something: ps -u postgres -o cmd | grep postgres: wal receiver process streaming These have worked very reliably for many months. -John On Fri, Mar 8, 2013 at 11:53 AM, akp geek akpg...@gmail.com wrote: I got it fixed. What I did was $ psql -c SELECT pg_start_backup('label', true) $ rsync -a ${PGDATA}/ standby:/srv/pgsql/standby/ --exclude postmaster.pid $ psql -c SELECT pg_stop_backup() It took a while to catch up the data. One question I have: are there any scripts to monitor the status of the replication?
so that I can be a little proactive Regards On Thu, Mar 7, 2013 at 9:25 PM, Scott Marlowe scott.marl...@gmail.com wrote: On Thu, Mar 7, 2013 at 5:28 PM, akp geek akpg...@gmail.com wrote: Hi all - We recently made a change on our primary database: default_text_search_config = 'pg_catalog.simple'. After that, the replication stopped. Can you please help me figure out how to fix the issue? I am sure I made the change on the slave also. How can I start the replication and catch up the data? Thanks for your time. What are you seeing in your slony and/or postgresql logs, if anything?
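Beyond grepping logs and ps output as suggested above, the standby itself can be polled. A sketch for 9.0+ (these function names became pg_last_wal_* in later releases):

```sql
-- On the standby: confirm it is still in recovery, then see how far it has gotten.
-- If the two locations stop advancing while the master keeps writing, replication
-- is stalled and worth investigating.
SELECT pg_is_in_recovery();
SELECT pg_last_xlog_receive_location() AS received,
       pg_last_xlog_replay_location()  AS replayed;
```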
Re: [GENERAL] postgres 9.0.2 replicated database is crashing
That process merely sets up a new server; it doesn't start streaming unless the server has been configured correctly. You state that the slave crashed after two hours. How did you make this determination? All you seem to be doing is setting up the slave from scratch repeatedly, and assuming that it will magically just work, rather than understanding or debugging why it's not working. Where is the log output from your servers that shows what is really transpiring? On Sat, Mar 9, 2013 at 6:51 AM, akp geek akpg...@gmail.com wrote: Hi all - I am in desperate need of your help. The replication/streaming stopped working on March 5th. I followed the following procedure to restart the streaming. After running it for a couple of hours, the database crashes on the slave. This is on our production server. Thanks for your help. Steps that I followed 1. SELECT pg_start_backup('label', true); 2. I created a tar of the data. 3. Untarred the data on the slave. 4. deleted the pid file on the slave 5. created recovery.conf 6. SELECT pg_stop_backup(); on the primary 7. started postgres on the slave 8. After 2 hours, the slave server crashes Please help. Regards
Re: [GENERAL] postgres 9.0.2 replicated database is crashing
On Sat, Mar 9, 2013 at 1:51 PM, akp geek akpg...@gmail.com wrote: thank you. As you mentioned, I understood that I am starting the streaming from scratch, which is not what I wanted to do. Here is what I am planning to do. Our replication process has been down since March 5th. 1. Is it OK to get all WALs from March 5th till now to the standby pg_xlog 2. take pg_basebackup 3. export the data dir 4. stop backup 5. restart the standby. Based on my understanding it should work, because when the standby started, it was not able to find the files, as I had deleted them from the slave when I tried to fix the issue Clearly it didn't work, because you've been having issues for the past 4 days, and you've already tried this approach unsuccessfully. Based on the log snippets below, it looks like you have multiple problems. First, your slave is definitely missing WAL segments. You should increase the number of WAL segments that are archived on the master and ensure that your base backup includes the WAL segments (by generating it with the -x option). However, that's the least of your problems at the moment, because it looks like the master is what is crashing.
0 2013-03-09 04:56:08 GMT LOG: entering standby mode cp: cannot access /backup/9.0.4/archive/0001010600E1 0 2013-03-09 04:56:08 GMT LOG: redo starts at 106/E120 cp: cannot access /backup/9.0.4/archive/0001010600E2 cp: cannot access /backup/9.0.4/archive/0001010600E3 cp: cannot access /backup/9.0.4/archive/0001010600E4 0 2013-03-09 04:56:10 GMT LOG: consistent recovery state reached at 106/E45AD4A8 0 2013-03-09 04:56:10 GMT LOG: database system is ready to accept read only connections cp: cannot access /backup/9.0.4/archive/0001010600E5 cp: cannot access /backup/9.0.4/archive/0001010600E6 cp: cannot access /backup/9.0.4/archive/0001010600E7 cp: cannot access /backup/9.0.4/archive/0001010600E8 cp: cannot access /backup/9.0.4/archive/0001010600E9 cp: cannot access /backup/9.0.4/archive/0001010600EA 0 2013-03-09 04:56:12 GMT LOG: invalid record length at 106/EA10B8C0 cp: cannot access /backup/9.0.4/archive/0001010600EA 0 2013-03-09 04:56:12 GMT LOG: streaming replication successfully connected to primary [unknown] 0 2013-03-09 04:57:00 GMT [unknown]LOG: connection received: host=[local] [unknown] 0 2013-03-09 04:57:03 GMT [unknown]LOG: connection received: host=[local] [unknown] 10.155.253.43(51257) 0 SELECT 2013-03-09 07:07:18 GMT prodLog: duration: 6316.649 ms [unknown] 10.155.253.43(51257) 0 idle 2013-03-09 07:47:53 GMT prodLog: disconnection: session time: 0:41:06.529 user=postgres database=fprod host= 10.155.253.43 port=51257 0 2013-03-09 07:55:48 GMT LOG: restartpoint starting: time 0 2013-03-09 08:25:47 GMT LOG: restartpoint complete: wrote 19419 buffers (7.4%); write=1799.792 s, sync=0.066 s, total=1799.867 s 0 2013-03-09 08:25:47 GMT LOG: recovery restart point at 107/FB01B238 0 2013-03-09 08:25:47 GMT DETAIL: last completed transaction was at log time 2013-03-09 08:25:41.85776+00 0 2013-03-09 08:55:48 GMT LOG: restartpoint starting: time psql [local] 08P01 idle 2013-03-09 09:10:52 GMT prodLog: unexpected EOF on client connection psql [local] 0 idle 2013-03-09 
09:10:52 GMT prodLog: disconnection: session time: 2:15:06.351 user=postgres database=fprod host=[local] XX000 2013-03-09 09:23:46 GMT FATAL: failed to add item to index page in 100924/100937/1225845 XX000 2013-03-09 09:23:46 GMT CONTEXT: xlog redo Insert item, node: 100924/100937/1225845 blkno: 72666 offset: 234 nitem: 1 isdata: F isleaf F isdelete F updateBlkno:36483 0 2013-03-09 09:23:46 GMT LOG: startup process (PID 3880) exited with exit code 1 0 2013-03-09 09:23:46 GMT LOG: terminating any other active server processes The real problem is here: XX000 2013-03-09 09:23:46 GMT FATAL: failed to add item to index page in 100924/100937/1225845 XX000 2013-03-09 09:23:46 GMT CONTEXT: xlog redo Insert item, node: 100924/100937/1225845 blkno: 72666 offset: 234 nitem: 1 isdata: F isleaf F isdelete F updateBlkno:36483 That looks like something is badly broken (maybe data corruption)? Most of the google hits on that error are associated with GIST indexes. Are you using GIST indexes? Are you really (still) using 9.0.4 ? I hope you realize that there are known data corruption bugs in that version, and that version is super old at this point. You really need to update to 9.0.12. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
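For context on the `cp: cannot access /backup/9.0.4/archive/...` lines above: those come from the standby's restore_command probing the archive for segments that are no longer (or not yet) there, which is normal noise before it falls back to streaming. A minimal recovery.conf sketch matching that setup; the host and user are illustrative, the archive path is taken from the logs:

```
# recovery.conf on the standby (9.0.x-style)
standby_mode = 'on'
primary_conninfo = 'host=master-host user=replicator'
restore_command = 'cp /backup/9.0.4/archive/%f %p'
```

The FATAL during redo, by contrast, is not noise; it means replay itself failed, which points at the data being replicated rather than the replication setup.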
Re: [GENERAL] postgres 9.0.2 replicated database is crashing
On Sat, Mar 9, 2013 at 4:05 PM, akp geek akpg...@gmail.com wrote:

Appreciate your findings. Taking your points and doing things now. We can upgrade. Which version is more stable, 9.2 or 9.1?

They're two entirely different branches. They should both be equally stable. However, if you're looking for the most straightforward path, I'd recommend going to 9.0.12. Also be sure to read the release notes first.

We use GIST indexes quite a bit, and we use GIS also. I recently compiled postgres 9.2. Regards

On Sat, Mar 9, 2013 at 5:09 PM, Lonni J Friedman netll...@gmail.com wrote:
On Sat, Mar 9, 2013 at 1:51 PM, akp geek akpg...@gmail.com wrote:

Thank you. As you mentioned, I understood that I am starting the streaming from scratch, which is not what I wanted to do. Here is what I am planning to do. Our replication process has been down since March 5th.

1. Is it OK to get all WALs from March 5th till now onto the standby's pg_xlog?
2. Take pg_basebackup
3. Export the data dir
4. Stop backup
5. Restart the standby

Based on my understanding it should work, because when the standby started, it was not able to find the files, as I had deleted them from the slave when I tried to fix the issue.

Clearly it didn't work, because you've been having issues for the past 4 days and you've already tried this approach unsuccessfully. Based on the log snippets below, it looks like you have multiple problems. First, your slave is definitely missing WAL segments. You should increase the number of WAL segments that are archived on the master, and ensure that your base backup includes the WAL segments (by generating it with the -x option). However, that's the least of your problems at the moment, because it looks like the master is what is crashing.
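The advice above (retain more WAL on the master, and include WAL in the base backup with -x) can be sketched as follows; the host name, path, and segment count are hypothetical and would need to be sized to your environment:

```
# postgresql.conf on the master: keep extra WAL segments around so a
# lagging standby can still fetch them (each segment is 16 MB; the
# value below is an assumption, not a recommendation for this cluster)
wal_keep_segments = 256
```

```shell
# Generate the base backup with WAL included (-x), so the standby can
# reach a consistent state even if archive copies are missing
pg_basebackup -h master.example.com -U postgres -D /tmp/basebackup -x -Ft -v
```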
Re: [GENERAL] broke postgres, how to fix??
Did you shut down the 'old' postgres before copying these files? Did you (re)configure the 'new' postgres to set its $PGDATA directory to the location of the 'new' files?

On Fri, Feb 22, 2013 at 3:46 PM, JD Wong jdmsw...@gmail.com wrote:

I tried copying postgres over to a new directory. It was working until I deleted a bunch of databases from the old postgres. Lo and behold, this somehow broke the new copy too. Now when I start with pg_ctl:

2013-02-22 18:36:13 EST DETAIL: The database subdirectory base/1066060 is missing.
2013-02-22 18:36:25 EST FATAL: database wormmine does not exist
2013-02-22 18:36:25 EST DETAIL: The database subdirectory base/1027296 is missing.
2013-02-22 18:37:13 EST FATAL: database wormmine-dev does not exist

and it won't start... How can I fix this? Re-creating these databases with the old postgres didn't work... Also, why did this happen? I created a new data directory for the new postgres; this should be impossible. I have no ideas left, can anyone help? Thanks in advance,
Re: [GENERAL] broke postgres, how to fix??
On Tue, Feb 26, 2013 at 4:02 PM, JD Wong jdmsw...@gmail.com wrote:

Hi Adrian, yes I completely copied the config-file and data directories over. Lonnie, I don't remember. I might not have shut down the old postgres; yes, I set PGDATA accordingly.

That's guaranteed to break everything badly.
Re: [GENERAL] broke postgres, how to fix??
On Tue, Feb 26, 2013 at 4:10 PM, JD Wong jdmsw...@gmail.com wrote:

Hi Adrian,

That's guaranteed to break everything badly.

Even if I read-only style copied the files? Do you mind elaborating on why this happens? (or point me to relevant documentation)

What is read-only style, and how does postgres know about this?
http://www.postgresql.org/docs/9.2/static/backup-file.html

Thanks, -JD

On Tue, Feb 26, 2013 at 7:04 PM, Lonni J Friedman netll...@gmail.com wrote:
On Tue, Feb 26, 2013 at 4:02 PM, JD Wong jdmsw...@gmail.com wrote:
Hi Adrian, yes I completely copied the config-file and data directories over. Lonnie, I don't remember. I might not have shut down the old postgres; yes, I set PGDATA accordingly.
That's guaranteed to break everything badly.

--
~ L. Friedman netll...@gmail.com
LlamaLand https://netllama.linux-sxs.org
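For reference, the safe file-level copy procedure described on the page linked above comes down to stopping the server before anything is copied. A sketch with hypothetical paths:

```shell
# The cluster must be cleanly shut down before the data directory is
# copied; copying files while postgres is running yields an
# inconsistent, unusable copy.
pg_ctl -D /old/pgdata stop -m fast

# Copy everything, preserving permissions and symlinks
rsync -a /old/pgdata/ /new/pgdata/

# Start the new server against the copy
pg_ctl -D /new/pgdata start
```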
[GENERAL] special procedure required when running pg_basebackup from a standby?
Greetings,

I'm running postgres-9.2.2 in a Linux-x86_64 cluster with 1 master and several hot standby servers. Since upgrading to 9.2.2 from 9.1.x a few months ago, I switched from generating a base backup on the master to generating it on a dedicated slave/standby (to reduce the load on the master). The command that I've always used to generate the base backup is:

pg_basebackup -v -D /tmp/bb0 -x -Ft -U postgres

However, I've noticed that whenever I use the base backup generated from the standby to create a new standby server, many of the indexes are corrupted. This was never the case when I was generating the basebackup directly from the master. Now I see errors similar to the following when running queries against the tables that own the indexes:

INDEX debugger_2013_01_dacode_idx contains unexpected zero page at block 12
HINT: Please REINDEX it.
INDEX smoke32on64tests_2013_01_suiteid_idx contains unexpected zero page at block 111
HINT: Please REINDEX it.

I've confirmed that the errors/corruption don't exist on the server that is generating the base backup (I can successfully run the same SQL query which fails on the new standby). Also, reindexing the index does fix the problem. So it seems that I'm potentially misunderstanding some part of the process. My setup process is to simply untar the basebackup in the $PGDATA directory, and copy over all the WAL logs into $PGDATA/pg_xlog.

thanks for any pointers.
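As a stopgap on an affected standby, the corrupted indexes can be rebuilt as the HINT suggests. A sketch; the index name is taken from the errors above, and the table name is an assumption inferred from the index name:

```sql
-- Rebuild one index named in the error message
REINDEX INDEX debugger_2013_01_dacode_idx;

-- Or rebuild every index on the owning table at once
-- (hypothetical table name, inferred from the index name)
REINDEX TABLE debugger_2013_01;
```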
[GENERAL] data corruption when using base backups generated from hot standby
Greetings,

I'm running postgres-9.2.2 in a Linux-x86_64 cluster with 1 master and several hot standby servers. Since upgrading to 9.2.2 from 9.1.x a few months ago, I switched from generating a base backup on the master to generating it on a dedicated slave/standby (to reduce the load on the master). The command that I've always used to generate the base backup is:

pg_basebackup -v -D /tmp/bb0 -x -Ft -U postgres

However, I've noticed that whenever I use the base backup generated from the standby to create a new standby server, many of the indexes are corrupted. This was never the case when I was generating the basebackup directly from the master. Now I see errors similar to the following when running queries against the tables that own the indexes:

INDEX debugger_2013_01_dacode_idx contains unexpected zero page at block 12
HINT: Please REINDEX it.
INDEX smoke32on64tests_2013_01_suiteid_idx contains unexpected zero page at block 111
HINT: Please REINDEX it.

I've confirmed that the errors/corruption don't exist on the server that is generating the base backup (I can successfully run the same SQL query which fails on the new standby). Also, reindexing the index does fix the problem. So it seems that I'm potentially misunderstanding some part of the process. My setup process is to simply untar the basebackup in the $PGDATA directory, and copy over all the WAL logs into $PGDATA/pg_xlog.

thanks for any pointers.
Re: [GENERAL] pgpool2 load balancing not working
On Fri, Jan 4, 2013 at 3:42 PM, Greg Donald gdon...@gmail.com wrote:

Sorry if this is the wrong list, but I've been stuck for a couple days now. I tried pgpool-general but that list appears to not like me. I'm not getting any posts and my post hasn't shown up in the archives.

Specifically which address are you sending to? I'm on the official list, and it seems to work fine for me.

I have a Python/Django app that will require database load balancing at some point in the near future. In the meantime I'm trying to learn to implement pgpool on a local virtual machine setup. I have 4 Ubuntu 12.04 VMs:

192.168.1.80 - pool, pgpool2 installed and accessible
192.168.1.81 - db1 master
192.168.1.82 - db2 slave
192.168.1.83 - db3 slave

I have pgpool-II version 3.1.1 and my database servers are running PostgreSQL 9.1. I have my app's db connection pointed to 192.168.1.80: and it works fine. The problem is when I use Apache ab to throw some load at it, none of the SELECT queries appear to be balanced. All the load goes to my db1 master. Also very concerning is the load on the pool server itself; it is really high compared to db1, maybe an average of 8-10 times higher. Meanwhile my db2 and db3 servers have a load of nearly zero; they appear to only be replicating from db1, which isn't very load intensive for my tests with ab.

ab -n 300 -c 4 -C 'sessionid=80a5fd3b6bb59051515e734326735f80' http://192.168.1.17/contacts/

That drives the load on my pool server up to about 2.3. Load on db1 is about 0.4 and load on db2 and db3 is nearly zero. Can someone take a look at my pgpool.conf and see what I'm doing wrong? http://pastebin.com/raw.php?i=wzBc0aSp

Nothing is jumping out at me as blatantly wrong, although it seems kinda weird that each of your database servers is listening on a different port #.
I'm starting to think maybe it has something to do with Django wrapping every request in a transaction by default, but when the transaction only has SELECTs, shouldn't that be load balanced just fine? Makes my stomach hurt to think I may have to turn off auto-commit and manually commit transactions all throughout my code :( Still hoping it's a pgpool setup issue, since it's my first time setting it up and all.

I've never done anything with Django, but this seems like a good possibility: the transactions are causing pgpool to get confused and assume that every query requires write access. What might be more useful is for you to post your actual pgpool log somewhere, as that might contain a clue about what is going wrong.
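For comparison, the pgpool-II settings that control read/write splitting in a streaming-replication setup look roughly like this. This is a sketch from memory of the pgpool-II 3.1-era parameter names, not a diagnosis of the poster's pastebin config:

```
# pgpool.conf fragment (parameter names are an assumption)
load_balance_mode = on            # distribute SELECTs across backends
master_slave_mode = on            # route writes to the primary only
master_slave_sub_mode = 'stream'  # backends use streaming replication
backend_weight0 = 1               # relative share of read traffic per node
backend_weight1 = 1
backend_weight2 = 1
```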
Re: [GENERAL] Postgresql logfilename and times in GMT - not EST
On Tue, Dec 4, 2012 at 1:59 PM, Bryan Montgomery mo...@english.net wrote:

We have a test 9.2.0 db running on openSuse 12.2. When I SELECT now() I get the correct timezone and date back (-5 hours). When I do date at the OS prompt, I get the right timezone back. I changed postgresql.conf to have timezone = 'EST' and restarted postgres. However, the log file is still 5 hours ahead. What gives? Not the end of the world, but a bit annoying.

You need to set log_timezone. This is a new 'feature' in 9.2 that annoyed me as well. I assume that there was a good use case for this.
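A minimal sketch of the two settings in postgresql.conf; note that they are independent, which is why changing only timezone left the log timestamps unchanged:

```
# postgresql.conf
timezone     = 'EST'   # default display timezone for sessions (now(), etc.)
log_timezone = 'EST'   # timezone used for timestamps in the server log
```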
Re: [GENERAL] Postgresql logfilename and times in GMT - not EST
On Tue, Dec 4, 2012 at 2:42 PM, Tom Lane t...@sss.pgh.pa.us wrote:
Lonni J Friedman netll...@gmail.com writes:
On Tue, Dec 4, 2012 at 1:59 PM, Bryan Montgomery mo...@english.net wrote:

I changed postgresql.conf to have timezone = 'EST' and restarted postgres. However, the log file is still 5 hours ahead. What gives? Not the end of the world, but a bit annoying.

You need to set log_timezone. This is a new 'feature' in 9.2 that annoyed me as well. I assume that there was a good use case for this.

New? log_timezone has been around since 8.3, and it seems like a good idea to me --- what if you have N sessions, each with its own active timezone setting? Timestamps in the log would be an unreadable mishmash if there weren't a separate log_timezone setting. What did change in 9.2 is that initdb sets values for timezone and log_timezone in postgresql.conf, so it's the initdb environment that will determine what you get in the absence of any manual action. Before that it was the postmaster's environment.

Sorry, I meant new in that its impact changed in 9.2, such that it needs to be set explicitly to avoid getting UTC by default, whereas in the past that wasn't required.
Re: [GENERAL] Table with million rows - and PostgreSQL 9.1 is not using the index
I'm no expert on this, but it will likely be more helpful to others if you include the table description with all the indices.

On Tue, Dec 4, 2012 at 8:44 PM, Edson Richter edsonrich...@hotmail.com wrote:

I've a table with 110 rows, with streets. I'm making a partial search using zip code, and PostgreSQL is ignoring my ZIP index. I'm sure I'm making some mistake, but I can't see where. The query is:

SELECT t2.ID, t2.CEP, t2.COMPLEMENTO, t2.NOME, t2.NOMESEMACENTOS, t2.TIPO, t2.BAIRRO_ID
FROM LOCALIDADE t0, LOGRADOURO t2, BAIRRO t1
WHERE t2.CEP LIKE '81630160%'
  AND ((t1.ID = t2.BAIRRO_ID) AND (t0.ID = t1.LOCALIDADE_ID))
ORDER BY t0.NOME;

(for reference, BAIRRO = town, LOCALIDADE = city, LOGRADOURO = street)

Here is the result of EXPLAIN ANALYZE:

Sort  (cost=11938.72..11938.74 rows=91 width=93)
  Sort Key: t0.nome
  ->  Nested Loop  (cost=0.00..11938.42 rows=91 width=93)
        ->  Nested Loop  (cost=0.00..11935.19 rows=91 width=85)
              ->  Seq Scan on logradouro t2  (cost=0.00..11634.42 rows=91 width=81)
                    Filter: ((cep)::text ~~ '81630160%'::text)
              ->  Index Scan using pkbairro on bairro t1  (cost=0.00..3.30 rows=1 width=8)
                    Index Cond: (id = t2.bairro_id)
        ->  Index Scan using pklocalidade on localidade t0  (cost=0.00..0.03 rows=1 width=16)
              Index Cond: ((id)::text = (t1.localidade_id)::text)

I've a few tweaks in postgresql.conf:

shared_buffers = 2GB
temp_buffers = 32MB
max_prepared_transactions = 50
work_mem = 32MB
maintenance_work_mem = 16MB
max_stack_depth = 4MB
max_files_per_process = 15000
random_page_cost = 2.0
cpu_tuple_cost = 0.001
cpu_index_tuple_cost = 0.0005
cpu_operator_cost = 0.00025
effective_cache_size = 512MB

Everything else is default configuration. This machine is an Intel Quad 3.1GHz with 8 threads, 8GB of RAM and 8GB of swap, running CentOS 6.3 64-bit. The machine is free almost all the time.
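One common reason for this plan (not confirmed anywhere in the thread) is that under a non-C locale a plain btree index cannot satisfy a LIKE 'prefix%' predicate, so the planner falls back to a sequential scan. A sketch using the table and column names from the query above; the index name is hypothetical:

```sql
-- A pattern-ops index makes prefix LIKE searches indexable even
-- under a non-C locale.
CREATE INDEX logradouro_cep_pattern_idx
    ON logradouro (cep varchar_pattern_ops);

-- Alternatively, the predicate can be rewritten as a range scan that
-- a plain btree index can serve:
--   WHERE cep >= '81630160' AND cep < '81630161'
```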
Re: [GENERAL] Quick estimate of num of rows table size
On Mon, Nov 5, 2012 at 2:02 PM, Thalis Kalfigkopoulos tkalf...@gmail.com wrote:

Hi all, I read somewhere that the following query gives a quick estimate of the # of rows in a table regardless of the table's size (which would matter in a simple SELECT count(*)):

SELECT (CASE WHEN reltuples > 0
             THEN pg_relation_size('mytable')/(8192*relpages/reltuples)
             ELSE 0
        END)::bigint AS estimated_row_count
FROM pg_class
WHERE oid = 'mytable'::regclass;

If relpages and reltuples are recorded accurately each time VACUUM is run, wouldn't it be the same to just grab the value of reltuples directly, like:

SELECT reltuples FROM pg_class WHERE oid='mytable'::regclass;

In the same manner, are pg_relation_size('mytable') and 8192*relpages the same? I ran both assumptions against a freshly VACUUMed table and they seem correct.

This doesn't seem to work for me. I get an estimated row_count of 0 on a table that I know has millions of rows.
Re: [GENERAL] Quick estimate of num of rows table size
On Mon, Nov 5, 2012 at 3:56 PM, Thalis Kalfigkopoulos tkalf...@gmail.com wrote:
On Mon, Nov 5, 2012 at 7:14 PM, Lonni J Friedman netll...@gmail.com wrote:
On Mon, Nov 5, 2012 at 2:02 PM, Thalis Kalfigkopoulos tkalf...@gmail.com wrote:

Hi all, I read somewhere that the following query gives a quick estimate of the # of rows in a table regardless of the table's size (which would matter in a simple SELECT count(*)):

SELECT (CASE WHEN reltuples > 0
             THEN pg_relation_size('mytable')/(8192*relpages/reltuples)
             ELSE 0
        END)::bigint AS estimated_row_count
FROM pg_class
WHERE oid = 'mytable'::regclass;

If relpages and reltuples are recorded accurately each time VACUUM is run, wouldn't it be the same to just grab the value of reltuples directly, like:

SELECT reltuples FROM pg_class WHERE oid='mytable'::regclass;

In the same manner, are pg_relation_size('mytable') and 8192*relpages the same? I ran both assumptions against a freshly VACUUMed table and they seem correct.

This doesn't seem to work for me. I get an estimated row_count of 0 on a table that I know has millions of rows.

Which one doesn't work exactly? The larger query? Are you on a 9.x?

doh, sorry. The first/larger one doesn't work. As it turns out, the 2nd actually does work well. I'm on 9.1.x.
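A guarded variant of the larger query avoids silently returning 0 when the statistics are stale. This is a sketch, with 'mytable' as a placeholder:

```sql
-- Scale reltuples by the current physical size so the estimate tracks
-- growth since the last VACUUM/ANALYZE. NULLIF turns relpages = 0 into
-- NULL, so an unanalyzed table yields NULL rather than a misleading 0.
SELECT (reltuples / NULLIF(relpages, 0)
        * (pg_relation_size('mytable') / 8192.0))::bigint AS estimated_rows
FROM pg_class
WHERE oid = 'mytable'::regclass;
```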
Re: [GENERAL] Hot Standby Not So Hot Anymore
On Mon, Nov 5, 2012 at 7:40 PM, Ian Harding harding@gmail.com wrote:

I had a 9.0.8 hot standby setup, one master, two slaves, working great. Then I tried to re-initialize by making a base backup, the way I've done it many times before, but for some reason I can't get the standby to accept connections. I copied the postgresql.conf and recovery.conf out of the way, cleaned the data directory and extracted the backup files, then replaced the conf files. Everything works perfectly, but I keep getting:

FATAL: the database system is starting up

I know I'm an idiot and that I did something wrong, but I can't find it.

[root@db03 data]# grep standby postgresql.conf
wal_level = hot_standby            # minimal, archive, or hot_standby
hot_standby = on                   # on allows queries during recovery
max_standby_archive_delay = -1     # max delay before canceling queries
max_standby_streaming_delay = -1   # max delay before canceling queries
[root@db03 data]# grep standby recovery.conf
# Specifies whether to start the server as a standby. In streaming replication,
standby_mode = 'on'
# Specifies a connection string which is used for the standby server to connect
[root@db03 data]# ps aux | grep stream
postgres 62127 0.1 0.0 34554356 3724 ? Ss 16:22 0:16 postgres: wal receiver process streaming BA7/12B87818
[root@db03 data]# ps aux | grep startup
postgres 62122 0.5 0.9 34545900 1223708 ? Ss 16:22 1:03 postgres: startup process recovering 00010BA70012

Any suggestions appreciated!!!

What's your log have to say?
Re: [GENERAL] Hot Standby Not So Hot Anymore
On Mon, Nov 5, 2012 at 7:49 PM, Ian Harding harding@gmail.com wrote: On Mon, Nov 5, 2012 at 7:46 PM, Lonni J Friedman netll...@gmail.com wrote: On Mon, Nov 5, 2012 at 7:40 PM, Ian Harding harding@gmail.com wrote: I had a 9.0.8 hot standby setup, one master, two slaves, working great. Then, I tried to re-initialize by making a base backup, the way I've done it many times before, but for some reason I can't get the standby to accept connections. I copied the postgresql.conf and recorvery.conf out of the way, cleaned the data directory and extracted the backup files, then replaced the conf files. Everything works perfectly, but I keep getting :FATAL: the database system is starting up I know I'm an idiot and that I did something wrong but I can't find it. [root@db03 data]# grep standby postgresql.conf wal_level = hot_standby# minimal, archive, or hot_standby hot_standby = on# on allows queries during recovery max_standby_archive_delay = -1# max delay before canceling queries max_standby_streaming_delay = -1# max delay before canceling queries [root@db03 data]# grep standby recovery.conf # Specifies whether to start the server as a standby. In streaming replication, standby_mode = 'on' # Specifies a connection string which is used for the standby server to connect [root@db03 data]# ps aux | grep stream postgres 62127 0.1 0.0 34554356 3724 ? Ss 16:22 0:16 postgres: wal receiver process streaming BA7/12B87818 [root@db03 data]# ps aux | grep startup postgres 62122 0.5 0.9 34545900 1223708 ?Ss 16:22 1:03 postgres: startup process recovering 00010BA70012 Any suggestions appreciated!!! What's your log have to say? It says everything is happy as normal... 
2012-11-05 16:22:38.744 PST - :LOG: database system was shut down in recovery at 2012-11-05 16:22:33 PST
2012-11-05 16:22:38.745 PST - :LOG: entering standby mode
2012-11-05 16:22:38.746 PST - :LOG: redo starts at BA5/F96F86A8
2012-11-05 16:22:38.762 PST - postgres :FATAL: the database system is starting up
2012-11-05 16:22:39.764 PST - postgres :FATAL: the database system is starting up
2012-11-05 16:22:40.766 PST - postgres :FATAL: the database system is starting up
2012-11-05 16:22:41.200 PST - :LOG: invalid record length at BA6/6DCBA48
2012-11-05 16:22:41.206 PST - :LOG: streaming replication successfully connected to primary

This is after I stopped and restarted... after that, just more failed connection attempts.

Are you sure that it's not still recovering? In other words, when you look at the 'ps' output, do the 'wal receiver process' and 'recovering' entries change?
Re: [GENERAL] Hot Standby Not So Hot Anymore
On Mon, Nov 5, 2012 at 8:13 PM, Ian Harding harding@gmail.com wrote: On Mon, Nov 5, 2012 at 7:57 PM, Lonni J Friedman netll...@gmail.com wrote: On Mon, Nov 5, 2012 at 7:49 PM, Ian Harding harding@gmail.com wrote: On Mon, Nov 5, 2012 at 7:46 PM, Lonni J Friedman netll...@gmail.com wrote: On Mon, Nov 5, 2012 at 7:40 PM, Ian Harding harding@gmail.com wrote: I had a 9.0.8 hot standby setup, one master, two slaves, working great. Then, I tried to re-initialize by making a base backup, the way I've done it many times before, but for some reason I can't get the standby to accept connections. I copied the postgresql.conf and recorvery.conf out of the way, cleaned the data directory and extracted the backup files, then replaced the conf files. Everything works perfectly, but I keep getting :FATAL: the database system is starting up I know I'm an idiot and that I did something wrong but I can't find it. [root@db03 data]# grep standby postgresql.conf wal_level = hot_standby# minimal, archive, or hot_standby hot_standby = on# on allows queries during recovery max_standby_archive_delay = -1# max delay before canceling queries max_standby_streaming_delay = -1# max delay before canceling queries [root@db03 data]# grep standby recovery.conf # Specifies whether to start the server as a standby. In streaming replication, standby_mode = 'on' # Specifies a connection string which is used for the standby server to connect [root@db03 data]# ps aux | grep stream postgres 62127 0.1 0.0 34554356 3724 ? Ss 16:22 0:16 postgres: wal receiver process streaming BA7/12B87818 [root@db03 data]# ps aux | grep startup postgres 62122 0.5 0.9 34545900 1223708 ?Ss 16:22 1:03 postgres: startup process recovering 00010BA70012 Any suggestions appreciated!!! What's your log have to say? It says everything is happy as normal... 
2012-11-05 16:22:38.744 PST - :LOG: database system was shut down in recovery at 2012-11-05 16:22:33 PST
2012-11-05 16:22:38.745 PST - :LOG: entering standby mode
2012-11-05 16:22:38.746 PST - :LOG: redo starts at BA5/F96F86A8
2012-11-05 16:22:38.762 PST - postgres :FATAL: the database system is starting up
2012-11-05 16:22:39.764 PST - postgres :FATAL: the database system is starting up
2012-11-05 16:22:40.766 PST - postgres :FATAL: the database system is starting up
2012-11-05 16:22:41.200 PST - :LOG: invalid record length at BA6/6DCBA48
2012-11-05 16:22:41.206 PST - :LOG: streaming replication successfully connected to primary

This is after I stopped and restarted... after that, just more failed connection attempts.

Are you sure that it's not still recovering? In other words, when you look at the 'ps' output, do the 'wal receiver process' and 'recovering' entries change?

Yeah, it's still recovering. In normal operation it keeps right on recovering; it's just that it used to accept read-only connections...

Is the 2nd slave also experiencing this problem? If not, is it further ahead than the bad one?
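The same check can be done from SQL on the standby rather than by watching ps; these monitoring functions should exist on a 9.0-era server (the exact version each was introduced in is stated from memory):

```sql
-- On the standby: confirm it is still in recovery, and compare the WAL
-- position received from the master with the position replayed so far.
-- A growing gap between the two means replay is falling behind.
SELECT pg_is_in_recovery(),
       pg_last_xlog_receive_location(),
       pg_last_xlog_replay_location();
```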
Re: [GENERAL] Hot Standby Not So Hot Anymore
On Mon, Nov 5, 2012 at 8:31 PM, Ian Harding harding@gmail.com wrote: On Mon, Nov 5, 2012 at 8:15 PM, Lonni J Friedman netll...@gmail.com wrote: On Mon, Nov 5, 2012 at 8:13 PM, Ian Harding harding@gmail.com wrote: On Mon, Nov 5, 2012 at 7:57 PM, Lonni J Friedman netll...@gmail.com wrote: On Mon, Nov 5, 2012 at 7:49 PM, Ian Harding harding@gmail.com wrote: On Mon, Nov 5, 2012 at 7:46 PM, Lonni J Friedman netll...@gmail.com wrote: On Mon, Nov 5, 2012 at 7:40 PM, Ian Harding harding@gmail.com wrote: I had a 9.0.8 hot standby setup, one master, two slaves, working great. Then, I tried to re-initialize by making a base backup, the way I've done it many times before, but for some reason I can't get the standby to accept connections. I copied the postgresql.conf and recorvery.conf out of the way, cleaned the data directory and extracted the backup files, then replaced the conf files. Everything works perfectly, but I keep getting :FATAL: the database system is starting up I know I'm an idiot and that I did something wrong but I can't find it. [root@db03 data]# grep standby postgresql.conf wal_level = hot_standby# minimal, archive, or hot_standby hot_standby = on# on allows queries during recovery max_standby_archive_delay = -1# max delay before canceling queries max_standby_streaming_delay = -1# max delay before canceling queries [root@db03 data]# grep standby recovery.conf # Specifies whether to start the server as a standby. In streaming replication, standby_mode = 'on' # Specifies a connection string which is used for the standby server to connect [root@db03 data]# ps aux | grep stream postgres 62127 0.1 0.0 34554356 3724 ? Ss 16:22 0:16 postgres: wal receiver process streaming BA7/12B87818 [root@db03 data]# ps aux | grep startup postgres 62122 0.5 0.9 34545900 1223708 ?Ss 16:22 1:03 postgres: startup process recovering 00010BA70012 Any suggestions appreciated!!! What's your log have to say? It says everything is happy as normal... 
2012-11-05 16:22:38.744 PST - :LOG: database system was shut down in recovery at 2012-11-05 16:22:33 PST
2012-11-05 16:22:38.745 PST - :LOG: entering standby mode
2012-11-05 16:22:38.746 PST - :LOG: redo starts at BA5/F96F86A8
2012-11-05 16:22:38.762 PST - postgres :FATAL: the database system is starting up
2012-11-05 16:22:39.764 PST - postgres :FATAL: the database system is starting up
2012-11-05 16:22:40.766 PST - postgres :FATAL: the database system is starting up
2012-11-05 16:22:41.200 PST - :LOG: invalid record length at BA6/6DCBA48
2012-11-05 16:22:41.206 PST - :LOG: streaming replication successfully connected to primary

This is after I stopped and restarted... after that, just more failed connection attempts.

Are you sure that it's not still recovering? In other words, when you look at the 'ps' output, do the 'wal receiver process' and 'recovering' entries change?

Yeah, it's still recovering. In normal operation it keeps right on recovering; it's just that it used to accept read-only connections...

Is the 2nd slave also experiencing this problem? If not, is it further ahead than the bad one?

Everything looks fine (forgot to cc the list...)

[root@db0 ~]# ps aux | grep stream
postgres 45267 0.0 0.0 34546456 3036 ? Ss 16:22 0:14 postgres: wal sender process postgres 192.168.4.4(55925) streaming BA7/5FC1BFD8
postgres 54247 0.0 0.0 34546456 3036 ? Ss 18:15 0:07 postgres: wal sender process postgres 192.168.4.3(57482) streaming BA7/5FC1BFD8

Right, but what I meant was: are both slaves experiencing the same problem, or just one? Did you use the same base backup on both of them, or did you generate a different one for each slave?
Re: [GENERAL] dropdb breaks replication?
On Wed, Oct 31, 2012 at 10:32 AM, Edson Richter edsonrich...@hotmail.com wrote:

I've two PostgreSQL 9.1.6 servers running on Linux CentOS 5.8 64-bit. They are replicated asynchronously. Yesterday I dropped a database of 20GB, and then replication broke, requiring me to manually synchronize both servers again. Is it expected that dropdb (or, perhaps, createdb) breaks existing replication between servers?

How did you determine that replication was broken, and how did you manually synchronize the servers? Are you certain that replication was working prior to dropping the database?
Re: [GENERAL] dropdb breaks replication?
On Wed, Oct 31, 2012 at 11:01 AM, Edson Richter edsonrich...@hotmail.com wrote: On 31/10/2012 15:39, Lonni J Friedman wrote: On Wed, Oct 31, 2012 at 10:32 AM, Edson Richter edsonrich...@hotmail.com wrote: I've two PostgreSQL 9.1.6 running on Linux CentOS 5.8 64bit. They are replicated asynchronously. Yesterday, I've dropped a database of 20Gb, and then replication has broken, requiring me to manually synchronize both servers again. It is expected that dropdb (or, perhaps, createdb) break existing replication between servers? How did you determine that replication was broken, and how did you manually synchronize the servers? Are you certain that replication was working prior to dropping the database? I'm sure replication was running. I usually keep two windows open in both servers, running In master: watch -n 2 ps aux | egrep sender In slave: watch -n 2 ps aux | egrep receiver At the point the dropdb command has been executed, both disappeared from my radar. Also, in the log there is the following error: LOG: replicação em fluxo conectou-se com sucesso ao servidor principal FATAL: não pôde receber dados do fluxo do WAL: FATAL: segmento do WAL solicitado 0001000100BE já foi removido May the cause not having enough segments (currently 80) for dropdb command? Is dropdb logged in transaction log page-by-page excluded? I can't read Portuguese, but I think the gist of the error is that the WAL segment was already removed before the slave could consume it. I'm guessing that you aren't keeping enough of them, and dropping the database generated a huge volume which flushed out the old ones before they could get consumed by your slave.
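The usual ways to keep a burst of WAL (such as a big DROP DATABASE can generate) from being recycled before a slave consumes it are a larger wal_keep_segments, or a WAL archive the slave can fall back to. A sketch with illustrative values only (the path and numbers are examples, not recommendations):

```
# postgresql.conf on the master
wal_keep_segments = 1000                   # retain more finished segments in pg_xlog
archive_mode = on                          # and/or archive segments somewhere durable
archive_command = 'cp %p /wal_archive/%f'  # placeholder path; point at your own archive
```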
Re: [GENERAL] 9.1 to 9.2 requires a dump/reload?
pg_upgrade has worked fine for several releases. I believe that the only time when pg_upgrade isn't a viable option is for some types of GiST indices. On Mon, Oct 22, 2012 at 2:55 PM, Nikolas Everett nik9...@gmail.com wrote: I was just looking at http://www.postgresql.org/docs/devel/static/release-9-2.html and it mentioned that a dump/reload cycle was required to upgrade from a previous release. I just got done telling some of my coworkers that PG had been bitten by this enough times that they were done with it. Am I wrong? Is this normal? I see that pg_upgrade is an option. Having never used it, how long should I expect pg_upgrade to take? Obviously we'll measure it in our environment, but it'd be nice to have a ballpark figure. Nik -- ~ L. Friedman netll...@gmail.com LlamaLand https://netllama.linux-sxs.org
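For a ballpark: pg_upgrade in --link mode avoids copying the data files at all, so even large clusters typically convert in minutes rather than the hours a dump/reload takes. A hedged sketch (the binary and data paths below are invented; check pg_upgrade --help for your packaging's layout):

```shell
# run as the postgres user, with both old and new servers shut down
pg_upgrade \
  -b /usr/pgsql-9.1/bin  -B /usr/pgsql-9.2/bin \
  -d /var/lib/pgsql/9.1/data -D /var/lib/pgsql/9.2/data \
  --link    # hard-link data files into the new cluster instead of copying them
```

Note that with --link the old cluster can't be started again once the new one has been, so keep a backup if you may need to roll back.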
Re: [GENERAL] Strategies/Best Practises Handling Large Tables
On Fri, Oct 12, 2012 at 7:44 AM, Chitra Creta chitracr...@gmail.com wrote: Hi, I currently have a table that is growing very quickly - i.e. 7 million records in 5 days. This table acts as a placeholder for statistics, and hence the records are merely inserted and never updated or deleted. Many queries are run on this table to obtain trend analysis. However, these queries are now starting to take a very long time (hours) to execute due to the size of the table. I have put indexes on this table, to no significant benefit. Some of the other strategies I have thought of: 1. Purge old data 2. Reindex 3. Partition 4. Creation of daily, monthly, yearly summary tables that contain aggregated data specific to the statistics required Does anyone know what is the best practice to handle this situation? I would appreciate knowledge sharing on the pros and cons of the above, or if there are any other strategies that I could put in place. Partitioning is probably your best solution. 3 and 4 sound like variations on the same thing. Before you go that route, you should make sure that your bottleneck is really a result of the massive amount of data, and not some other problem. Are you sure that the indices you created are being used, and that you have all the indices that you need for your queries? Look at the query plan output from EXPLAIN, and/or post here if you're unsure. Reindexing shouldn't make a difference unless something is wrong with the indices that you already have in place. Purging old data is only a good solution if you do not need the data, and never will need the data.
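For reference, the 9.x-era recipe for option 3 (inheritance-based partitioning) looks roughly like this; the table and column names are made up for illustration:

```sql
CREATE TABLE stats (ts timestamptz NOT NULL, value bigint);

-- one child per month, with a CHECK constraint the planner can use
-- (with constraint_exclusion = partition) to skip irrelevant partitions
CREATE TABLE stats_2012_10 (
    CHECK (ts >= DATE '2012-10-01' AND ts < DATE '2012-11-01')
) INHERITS (stats);
CREATE INDEX stats_2012_10_ts_idx ON stats_2012_10 (ts);

-- route inserts on the parent into the right child
CREATE OR REPLACE FUNCTION stats_insert_trigger() RETURNS trigger AS $$
BEGIN
    IF NEW.ts >= DATE '2012-10-01' AND NEW.ts < DATE '2012-11-01' THEN
        INSERT INTO stats_2012_10 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'no partition for timestamp %', NEW.ts;
    END IF;
    RETURN NULL;  -- the row was redirected; don't also insert into the parent
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER stats_insert BEFORE INSERT ON stats
    FOR EACH ROW EXECUTE PROCEDURE stats_insert_trigger();
```

A side benefit is that old months can then be removed with a cheap DROP TABLE instead of a bulk DELETE, which also covers option 1.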
Re: [GENERAL] strange hot_standby behaviour
On Mon, Oct 1, 2012 at 7:28 AM, pfote pf...@ypsilon.net wrote: Hi, I had a very strange effect on the weekend that smells like a bug, so I'd like to share it. Setup: machine A: 16 CPU Cores (modern), 128GB RAM, nice 6-drive SAS Raid-10 machines B, C: 8 Cores (substantially older than A), 48GB Ram, some scsi Raid, substantially slower than A The workload is about 80% - 90% SELECTs with heavy sorting and grouping, the remaining are INSERTs/UPDATEs/DELETEs. So in the original setup A is the master, B and C are hot standbys that process some of the SELECTs, but by far the most processing is done on the master (A). pg version is 9.0.6. CPU utilization is about 80% on the master and between 90-100% on the standbys, so it was decided to upgrade to the latest 9.2 to profit from the latest performance enhancements. So B gets upgraded to 9.2.1-1.pgdg60+1 (from pgapt.debian.org) and becomes master, then A becomes a hot_standby slave that takes all the SELECTs (and C becomes another hot_standby). In the beginning everything works as expected, CPU utilization drops from 80% to about 50-60%, selects run faster, everything looks smoother (some queries drop from 5s to 1s due to 9.2's index-only-scan feature). It's Friday, everyone is happy. About 16 hours later, Saturday morning around 6:00, A suddenly goes wild and has a CPU utilization of 100% without a change in the workload, out of the blue. Queries that used to take 1s suddenly take 5-10s, explain analyze plans of these queries haven't changed a bit though. Switching the workload off causes the server to become idle. (while I'm writing this I realize we haven't tried to restart A). Instead, $boss decides to switch back to the original setup, so B gets dropped, A becomes master and gets 100% of the workload (all SELECTs/INSERTs/UPDATEs/DELETEs), and everything becomes just like Friday, CPU usage drops to 50-60%, everything runs smoothly. I'm not sure yet if this is replication related or a 9.2.1 problem. Any ideas?
This could be just about anything. Which OS are you running? Did you check any logs when everything went crazy?
Re: [GENERAL] initial sync of multiple streaming slaves simultaneously
On Wed, Sep 19, 2012 at 8:59 AM, Mike Roest mike.ro...@replicon.com wrote: Hey Everyone, We currently have a 9.1.5 postgres cluster running using streaming replication. We have 3 nodes right now 2 - local that are setup with pacemaker for a HA master/slave set failover cluster 1 - remote as a DR. Currently we're syncing with the pretty standard routine clear local datadir pg_start_backup sync datadir with fast-archiver (https://github.com/replicon/fast-archiver) pg_stop_backup start slave We use the streaming replication with wal_keep_segments set to 1000 to get the required WAL files to the slaves. With this procedure we can currently only sync one of the slaves at a time if we fail over, as when the second machine goes to start the sync, it errors out because trying to run pg_start_backup fails. Specifically what is the error?
Re: [GENERAL] initial sync of multiple streaming slaves simultaneously
Just curious, is there a reason why you can't use pg_basebackup? On Wed, Sep 19, 2012 at 12:27 PM, Mike Roest mike.ro...@replicon.com wrote: Is there any hidden issue with this that we haven't seen. Or does anyone have suggestions as to an alternate procedure that will allow 2 slaves to sync concurrently. With some more testing I've done today I seem to have found an issue with this procedure. When the slave starts up after the sync, it reaches what it thinks is a consistent recovery point very fast based on the pg_stop_backup, e.g. (from the recover script):
2012-09-19 12:15:02: pgsql_start start
2012-09-19 12:15:31: pg_start_backup
2012-09-19 12:15:31: -
2012-09-19 12:15:31: 61/3020
2012-09-19 12:15:31: (1 row)
2012-09-19 12:15:31:
2012-09-19 12:15:32: NOTICE: pg_stop_backup complete, all required WAL segments have been archived
2012-09-19 12:15:32: pg_stop_backup
2012-09-19 12:15:32:
2012-09-19 12:15:32: 61/30D8
2012-09-19 12:15:32: (1 row)
2012-09-19 12:15:32:
While the sync was running (but after the pg_stop_backup) I pushed a bunch of traffic against the master server.
Which got me to a current xlog location of postgres=# select pg_current_xlog_location(); pg_current_xlog_location -- 61/6834C450 (1 row) The startup of the slave after the sync completed:
2012-09-19 12:42:49.976 MDT [18791]: [1-1] LOG: database system was interrupted; last known up at 2012-09-19 12:15:31 MDT
2012-09-19 12:42:49.976 MDT [18791]: [2-1] LOG: creating missing WAL directory pg_xlog/archive_status
2012-09-19 12:42:50.143 MDT [18791]: [3-1] LOG: entering standby mode
2012-09-19 12:42:50.173 MDT [18792]: [1-1] LOG: streaming replication successfully connected to primary
2012-09-19 12:42:50.487 MDT [18791]: [4-1] LOG: redo starts at 61/3020
2012-09-19 12:42:50.495 MDT [18791]: [5-1] LOG: consistent recovery state reached at 61/3100
2012-09-19 12:42:50.495 MDT [18767]: [2-1] LOG: database system is ready to accept read only connections
It shows the DB reached a consistent state as of 61/3100 which is well behind the current location of the master (and the data files that were synced over to the slave). And monitoring the server showed the expected slave delay that disappeared as the slave pulled and recovered from the WAL files that got generated after the pg_stop_backup. But based on this it looks like this procedure would end up with an indeterminate amount of time (based on how much traffic the master processed while the slave was syncing) that the slave couldn't be trusted for fail over or querying, as the server is up and running but is not actually in a consistent state.
Thinking it through, the more complicated script version of the 2 server recovery (where first past the post runs start_backup or stop_backup) would also have this issue (although our failover slave would always be the one running stop backup as it syncs faster, so at least it would be always consistent, but the DR would still have the problem)
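To make the earlier pg_basebackup suggestion concrete: on 9.1 it drives the start/stop-backup dance itself and, unlike a manual pg_start_backup(), concurrent runs against the same master are fine. A sketch (hostname, role name, and path are invented):

```shell
# -P: progress reporting; -x: include the WAL needed so the backup is
# consistent on its own (the 9.1-era flag; later releases use -X)
pg_basebackup -h master.example.com -U repluser -D /var/lib/pgsql/data -P -x
```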
Re: [GENERAL] File system level backup
On Thu, Jul 26, 2012 at 3:39 AM, Manoj Agarwal m...@ockham.be wrote: Hi, I have two virtual machines with two different versions of Postgresql. One machine contains Postgres 7.4.19 and another has Postgres 8.4.3. I also have other instances of these two virtual machines. I need to transfer the database from one machine to other. Both machines contain a database with the same name, for example: testdb, but with different data/values, but the structure is exactly the same. I want to do the following:
· Take file system level backup from first machine for Postgres database ‘testdb’ and restore it in another instance that has SAME version of Postgres. i.e. backup the database ‘testdb’ from Postgres 7.4.19 and restore it on another virtual machine with same Postgres version 7.4.19.
· Take file system level backup from first machine for Postgres database ‘testdb’ and restore it in another instance that has DIFFERENT version of Postgres. i.e. backup the database testdb from Postgres 7.4.19 and restore it to another virtual machine with different Postgres version 8.4.3.
I can achieve it with pg_dump and pg_restore, but not with file level backups. The data directory is /var/lib/pgsql/data that contains base directory amongst others, that contains directories for each database referred to by their OIDs. If I replace the complete data directory of one machine from the instance of another machine (with same or different Postgres version), it is failing to identify OIDs for that database. It should be possible in Postgres to swap two data directories in two different virtual machines without requiring pg_dump and pg_restore. With me, it doesn’t work in both the cases mentioned above. In first case, it gives an error of missing OID for the database. In second case, it is giving version incompatibility issue. Is there a way in Postgres to do file system level backup?
The objective is to push /var/lib/pgsql/data directory across different virtual machines, without the need to backup and restore Postgres database with ‘pg_dump’ and ‘pg_restore’ commands. Any help will be highly appreciated. You definitely cannot take a filesystem level backup from one version and throw it into a different version (perhaps minor versions, but not 7.x vs. 8.x). This is basically what pg_upgrade was created to solve (however with a different purpose). But pg_upgrade definitely doesn't support 7.x, and I'm not even sure that it supports 8.x. In fact, I don't even know that 7.x is a supported version of postgresql in any context any longer. As for the issue of moving a filesystem level backup between identical versions, I believe that should work (although I have no clue if there were bugs preventing this in a version as old as 7.x). Can you provide exact details and commands of what you're trying to do, and the exact errors you're seeing?
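Between identical versions, the supported form of a file-level copy is the whole cluster directory, taken with the server cleanly shut down; single-database directories under base/ cannot be transplanted by themselves, because their OIDs only mean something to their own cluster's catalogs. A sketch (the paths and init script are examples):

```shell
/etc/init.d/postgresql stop        # the source server must be stopped cleanly
# copy the ENTIRE data directory, not one database's subdirectory
rsync -a /var/lib/pgsql/data/ target:/var/lib/pgsql/data/
ssh target /etc/init.d/postgresql start
```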
Re: [GENERAL] insert binary data into a table column with psql
On Tue, Jul 24, 2012 at 2:22 PM, John R Pierce pie...@hogranch.com wrote: On 07/24/12 1:28 PM, jkells wrote: from psql I have tried several ways including creating a function to read a file without any success but basically I want to do something like the following from a bash shell psql connection information -c insert into x (ID, load_date, image) values ($PID,clock_timestamp()::timestamp(0), copy from '/tmp/$FN' with binary); Any help would be greatly appreciated use a proper programming language that can read files and insert BYTEA data. shell + psql just won't cut it. I'd suggest perl or python or java or Actually, that's not true. It's definitely possible to INSERT data into bytea using just psql. The trick is to sub-SELECT the data with the bytea_import function in the INSERT. So something like this: insert into x (ID, load_date, image) values ($PID, clock_timestamp()::timestamp(0), (SELECT bytea_import('/tmp/$FN')));
Re: [GENERAL] insert binary data into a table column with psql
On Tue, Jul 24, 2012 at 7:16 PM, Tom Lane t...@sss.pgh.pa.us wrote: jtkells jtke...@verizon.net writes: Thanks much for your reply, that does the trick quite nicely. But, I just came to the realization that this only works if you are running the client and the file both reside on the database server. I thought that I would be able to do this from a remote server where the client was running, picking up a local file and sending it into a remote database table. Unless I am missing something, I can't. I can create a temp table on this server, upload the file to a similar table then replicate it to the targeted server. Lots of work that could be easily done with a programming language (as someone else posted) but that's what I have to work with for now and I have no direct access (i.e. sftp) to the database server Maybe use psql's \lo_import command to suck the data into a large object on the server, and then use loread() to insert it into the target table? (And don't forget to drop the large object after.) Pretty grotty but I think it might be the only solution with the currently available tools. btw, does bytea_import actually exist? It's not in the core server for sure. Doh. No, it's not standard, I found it here: http://dba.stackexchange.com/questions/1742/how-to-insert-file-data-into-a-postgresql-bytea-column/2962#2962
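For completeness, the non-core bytea_import from that stackexchange answer is roughly the following plpgsql (reproduced from memory, so treat it as a sketch). It uses the large-object machinery server-side, so the file still has to live on the database server, which is exactly the limitation discussed above:

```sql
CREATE OR REPLACE FUNCTION bytea_import(p_path text, p_result OUT bytea) AS $$
DECLARE
    l_oid oid;
    r record;
BEGIN
    p_result := '';
    SELECT lo_import(p_path) INTO l_oid;        -- server-side read into a large object
    FOR r IN (SELECT data FROM pg_largeobject
              WHERE loid = l_oid ORDER BY pageno) LOOP
        p_result := p_result || r.data;         -- concatenate the LO pages into one bytea
    END LOOP;
    PERFORM lo_unlink(l_oid);                   -- clean up the temporary large object
END;
$$ LANGUAGE plpgsql;
```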
Re: [GENERAL] big database resulting in small dump
On Fri, Jul 20, 2012 at 11:05 AM, Ilya Ivanov f...@ngs.ru wrote: I have an 8.4 database (installed on ubuntu 10.04 x86_64). It holds Zabbix database. The database on disk takes 10Gb. SQL dump takes only 2Gb. I've gone through http://archives.postgresql.org/pgsql-general/2008-08/msg00316.php and got some hints. Naturally, the biggest table is history (the second biggest is history_uint. Together they make about 95% of total size). I've tried to perform CLUSTER on it, but it seemed to be taking forever (3 hours and still not completed). So I cancelled it and went with database drop and restore. It resulted in database taking up 6.4Gb instead of 10Gb. This is a good improvement, but still isn't quite what I expect. I would appreciate some clarification. It's not entirely clear what behavior you expect here. Assuming that you're referring to running pg_dump, then you should just about never expect the size of the resulting dump to be equal to the amount of disk space the database server files consume on disk. For example, when I pg_dump a database that consumes about 290GB of disk, the resulting dump is about 1.3GB. This is normal expected behavior.
Re: [GENERAL] big database resulting in small dump
On Fri, Jul 20, 2012 at 11:23 AM, Tom Lane t...@sss.pgh.pa.us wrote: Lonni J Friedman netll...@gmail.com writes: On Fri, Jul 20, 2012 at 11:05 AM, Ilya Ivanov f...@ngs.ru wrote: I have an 8.4 database (installed on ubuntu 10.04 x86_64). It holds Zabbix database. The database on disk takes 10Gb. SQL dump takes only 2Gb. It's not entirely clear what behavior you expect here. Assuming that you're referring to running pg_dump, then you should just about never expect the size of the resulting dump to be equal to the amount of disk space the database server files consume on disk. For example, when I pg_dump a database that consumes about 290GB of disk, the resulting dump is about 1.3GB. This is normal expected behavior. The fine manual says someplace that databases are commonly about 5X the size of a plain-text dump, which is right in line with Ilya's results. Lonni's DB sounds a bit bloated :-(, though maybe he's got an atypically large set of indexes. I do have a lot of indices. Also, I'm using a lot of partitions, so there are a relatively large number of tables.
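To see where the on-disk space actually goes (indexes, TOAST, and internal free space never appear in a plain dump), queries along these lines help on 8.4-era servers:

```sql
SELECT pg_size_pretty(pg_database_size(current_database()));

-- per-table breakdown: heap vs. everything else (indexes + TOAST)
SELECT relname,
       pg_size_pretty(pg_relation_size(oid))                                AS heap,
       pg_size_pretty(pg_total_relation_size(oid) - pg_relation_size(oid))  AS indexes_and_toast
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;
```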
Re: [GENERAL] efficiency of wildcards at both ends
On Wed, Jun 20, 2012 at 10:10 AM, Sam Z J sammyjiang...@gmail.com wrote: Hi all I'm curious how wildcards at both ends are implemented, e.g. LIKE '%str%' How efficient is it if that's the only search criteria against a large table? how much does indexing the column help and roughly how much more space is needed for the index? if the answers are too long, please point me to the relevant text =D My limited understanding is that any time you need to resort to using wildcards, indices are never used, and you're falling back to using the inefficient table scan.
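That matches what EXPLAIN shows for a plain btree index: a leading wildcard defeats it, and the planner falls back to a sequential scan. A hypothetical illustration (the table and index names are invented):

```sql
CREATE TABLE docs (body text);
CREATE INDEX docs_body_idx ON docs (body);
EXPLAIN SELECT * FROM docs WHERE body LIKE '%str%';
-- expect a Seq Scan with Filter: (body ~~ '%str%'::text)
```

(For completeness: the contrib pg_trgm module's trigram GIN/GiST operator classes can index such double-wildcard patterns on 9.1 and later, at a noticeable index-size cost.)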
Re: [GENERAL] pg_basebackup blocking all queries
I'm still plagued by this. Immediately before starting a basebackup the load on my server is 1.00 or less. Within a few minutes of starting the basebackup, the load climbs steadily to 30+ and anything trying to write to the database just sits for minutes at a time, with overall performance on any query (read or write) being horrible (seconds to minutes). As soon as the basebackup completes, perf returns to normal (and the load drops back down to 1.00 or less). How can I debug what's wrong? On Tue, May 22, 2012 at 3:20 PM, Lonni J Friedman netll...@gmail.com wrote: Thanks for your reply. Unfortunately, those queries don't shed any light on the problem. The first two return 0 rows, and the third just returns 12 rows all associated with the query itself, rather than anything else. Any other suggestions? On Tue, May 22, 2012 at 2:56 PM, Scott Marlowe scott.marl...@gmail.com wrote: Do the queries here help? http://wiki.postgresql.org/wiki/Lock_Monitoring On Tue, May 22, 2012 at 12:42 PM, Lonni J Friedman netll...@gmail.com wrote: Greetings, I have a 4 server postgresql-9.1.3 cluster (one master doing streaming replication to 3 hot standby servers). All of them are running Fedora-16-x86_64. Last Friday I upgraded the entire cluster from Fedora-15 with postgresql-9.0.6 to Fedora-16 with postgresql-9.1.3. I'm finding that I cannot run pg_basebackup at all, or it blocks all SQL queries from running until pg_basebackup has completed (and the load on the box just takes off to over 75.00). By blocks I mean that any query that is submitted just hangs and does not return at all until pg_basebackup has stopped. I'm assuming that this isn't expected behavior, so I'm rather confused on what is going on. The command that I'm issuing is: pg_basebackup -v -D /mnt/backups/backups/tmp0 -x -Ft -U postgres Can someone provide some guidance on how to debug this? thanks!
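When everything stalls like this, a snapshot of pg_stat_activity taken during the basebackup is usually the quickest clue (these are the 9.1 column names; procpid/current_query became pid/query in 9.2):

```sql
SELECT procpid, waiting, query_start, current_query
FROM pg_stat_activity
WHERE current_query <> '<IDLE>'
ORDER BY query_start;
-- many rows with waiting = true points at locking;
-- none waiting suggests I/O saturation from the backup itself instead
```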
Re: [GENERAL] autovacuum running for a long time on a new table with 1 row
On Fri, Jun 1, 2012 at 10:34 AM, Tom Lane t...@sss.pgh.pa.us wrote: Lonni J Friedman netll...@gmail.com writes: Running 9.1.3 on Linux-x86_64. I'm seeing autovacuum running for the past 6 hours on a newly created table that only has 1 row of data in it. This table did exist previously, but was dropped and recreated. I'm not sure if that might explain this behavior. When I strace the autovacuum process, I see the following scrolling by non-stop (with no changes to the file referenced):
select(0, NULL, NULL, NULL, {0, 21000}) = 0 (Timeout)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
This seems to have been noticed and fixed in HEAD: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b4e0741727685443657b55932da0c06f028fbc00 I wonder whether that should've been back-patched. Thanks for your reply. I won't even pretend to understand what that fix does. Is this behavior something that is blatantly broken, or harmless, or somewhere in between? Should I expect autovacuum to eventually complete successfully when it stumbles into this scenario? In the meantime, though, it sure looks like you've got a lot more than one row in there. Perhaps you did umpteen zillion updates on that one row? Before dropping and recreating the table, yes it had millions of rows, and millions of updates. But since then, all I did was insert a single row, and watched autovacuum wedge itself in that seemingly infinite loop. I ended up doing a 'kill -2' on the autovacuum PID that was misbehaving, disabled autovacuuming the table, and went about what I needed to get done as an interim solution.
Re: [GENERAL] autovacuum running for a long time on a new table with 1 row
On Fri, Jun 1, 2012 at 10:54 AM, Tom Lane t...@sss.pgh.pa.us wrote: Lonni J Friedman netll...@gmail.com writes: On Fri, Jun 1, 2012 at 10:34 AM, Tom Lane t...@sss.pgh.pa.us wrote: This seems to have been noticed and fixed in HEAD: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b4e0741727685443657b55932da0c06f028fbc00 I wonder whether that should've been back-patched. Thanks for your reply. I won't even pretend to understand what that fix does. Is this behavior something that is blatantly broken, or harmless, or somewhere in between? Should I expect autovacuum to eventually complete successfully when it stumbles into this scenario? Well, the problem with the original code was that it would recheck the visibility map's file size anytime somebody tried to check a bit beyond the end of the map. If the map isn't there (which is not an error case) this would result in a useless open() attempt for each table page scanned by vacuum. So ordinarily I would say that yes you could expect autovac to complete eventually. However ... Before dropping and recreating the table, yes it had millions of rows, and millions of updates. But since then, all I did was insert a single row, and watched autovacuum wedge itself in that seemingly infinite loop. I ended up doing a 'kill -2' on the autovacuum PID that was misbehaving, disabled autovacuuming the table, and went about what I needed to get done as an interim solution. ... if you really did drop and recreate the table, then at this point it should only have a single page, I would think. It might be worth checking the actual file size. pg_relation_size('tablename') is probably the quickest way. Unfortunately, I've since inserted several million rows into this table, so I'm guessing it's too late now to check the size?
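For reference, the size check Tom suggests is just the following (the table name is a placeholder):

```sql
SELECT pg_relation_size('mytable') AS bytes,
       pg_size_pretty(pg_relation_size('mytable'));
-- a freshly recreated one-row table should be around a single 8192-byte page
```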
[GENERAL] autovacuum running for a long time on a new table with 1 row
Running 9.1.3 on Linux-x86_64. I'm seeing autovacuum running for the past 6 hours on a newly created table that only has 1 row of data in it. This table did exist previously, but was dropped and recreated. I'm not sure if that might explain this behavior. When I strace the autovacuum process, I see the following scrolling by non-stop (with no changes to the file referenced):
select(0, NULL, NULL, NULL, {0, 21000}) = 0 (Timeout)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
select(0, NULL, NULL, NULL, {0, 21000}) = 0 (Timeout)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
open("base/16412/214803_vm", O_RDWR) = -1 ENOENT (No such file or directory)
Is this normal/expected?
thanks
Re: [GENERAL] Re: significant performance hit whenever autovacuum runs after upgrading from 9.0 -> 9.1
On Thu, May 24, 2012 at 12:57 PM, Tom Lane t...@sss.pgh.pa.us wrote: Lonni J Friedman netll...@gmail.com writes: On Thu, May 24, 2012 at 12:34 PM, Tom Lane t...@sss.pgh.pa.us wrote: Can you correlate the performance hit with any specific part of autovacuum? In particular, I'm wondering if it matters whether vacuum is cleaning tables or indexes --- it alternates between the two, and the access patterns are a bit different. You could probably watch what the autovac process is doing with strace to see what it's accessing. Is there something specific I should be looking for in the strace output, or is this just a matter of correlating PID and FD to pg_class.relfilenode ? Nah, just match up the files it touches with pg_class.relfilenode. Seems to be slower across the board regardless of what is being vacuumed (tables or indices). For example, I have a relatively small table (currently 395 rows, rarely has any inserts or deletes). Vacuuming just the table (not indices) takes on average 4s with the khugepaged defragmenter on, and less than 1s with it off. Vacuuming just the indices takes on average 2s with the khugepaged defragmenter on, and less than 1s with it off. The much larger tables (thousands to millions of rows), I see similar performance (although they take even longer to complete, with or without the khugepaged defragmenter enabled).
Re: [GENERAL] Re: significant performance hit whenever autovacuum runs after upgrading from 9.0 -> 9.1
On Wed, May 23, 2012 at 2:45 PM, Gavin Flower gavinflo...@archidevsys.co.nz wrote: On 24/05/12 08:18, Lonni J Friedman wrote: On Wed, May 23, 2012 at 12:36 PM, Gavin Flower gavinflo...@archidevsys.co.nz wrote: On 24/05/12 05:09, Lonni J Friedman wrote: On Wed, May 23, 2012 at 9:37 AM, Tom Lane t...@sss.pgh.pa.us wrote: Lonni J Friedman netll...@gmail.com writes: After banging my head on the wall for a long time, I happened to notice that khugepaged was consuming 100% CPU every time autovacuum was running. I did: echo madvise > /sys/kernel/mm/transparent_hugepage/defrag and immediately the entire problem went away. Fascinating. In hindsight, sure. Before that, it was 2 days of horror. So this looks like a nasty Fedora16 kernel bug to me, or maybe postgresql and Fedora16's default kernel settings are just not compatible? I agree, kernel bug. What kernel version are you using exactly? I'm using the stock 3.3.5-2.fc16.x86_64 kernel that is in Fedora updates. Is anyone else using Fedora16 and PostgreSQL-9.1? I use an F16 box daily, but can't claim to have done major performance testing with it. Can you put together a summary of your nondefault Postgres settings? I wonder whether it only kicks in for a certain size of shared memory for instance. Oh yea, I'm quite certain that this is somehow related to my setup, and not a generic problem with all F16/pgsql systems. For starters, this problem isn't happening on any of the 3 standby systems, which are all otherwise identical to the master in every respect. Also when we had done some testing (prior to the upgrades), we never ran into any of these problems. However our test environment was on smaller scale hardware, with a much smaller number of clients (and overall load).
Here are the non default settings in postgresql.conf:
wal_level = hot_standby
archive_mode = on
archive_timeout = 61
max_wal_senders = 10
wal_keep_segments = 5000
hot_standby = on
log_autovacuum_min_duration = 2500
autovacuum_max_workers = 4
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.7
effective_cache_size = 88GB
work_mem = 576MB
wal_buffers = 16MB
checkpoint_segments = 64
shared_buffers = 8GB
max_connections = 350
Let me know if you have any other questions. I'd be happy to provide as much information as possible if it can aid in fixing this bug. I think they will need details of things like: RAM, number/type of processors, number/type of disks, disk controllers, and any other hardware specs that might be relevant, etc. At the very least: total RAM and number of spindles. 16 core Xeon X5550 2.67GHz, 128GB RAM. $PGDATA sits on a RAID5 array comprised of 3 SATA disks. It's Linux's md software RAID. How does this compare to your other machines running the same, or similar, databases? However, you do say that the other machines are identical - but are the other machines different in any aspect that might prove significant? Also anything else running on the box. Nothing else; it's dedicated exclusively to postgresql. Plus transaction load pattern - over time and read/write ratios. I'm not sure how I'd obtain this data; however, the patterns didn't change since the upgrade. If someone can point me in the right direction, I can at least obtain this data as it's generated currently. Type/nature of queries? I need some clarification on specifically what you're asking for here. The complexity, structure, and features of the queries. Do you have lots of sub queries, and ORDER BYs? Also the number of tables accessed in a query. This is heading into the territory where others will be better placed to advise you as to what might be relevant! No, not lots of subqueries or ORDERing, and most queries only touch a single table.
However, I'm honestly not sure that I'm following where you're going with this. The problem isn't triggered by explicit queries. I can disable all external access, and simply wait for autovacuum to kick off, and the box starts to die.
Re: [GENERAL] Re: significant performance hit whenever autovacuum runs after upgrading from 9.0 -> 9.1
On Thu, May 24, 2012 at 12:34 PM, Tom Lane t...@sss.pgh.pa.us wrote: Lonni J Friedman netll...@gmail.com writes: No, not lots of subqueries or ORDERing, and most queries only touch a single table. However, I'm honestly not sure that I'm following where you're going with this. The problem isn't triggered by explicit queries. I can disable all external access, and simply wait for autovacuum to kick off, and the box starts to die. Can you correlate the performance hit with any specific part of autovacuum? In particular, I'm wondering if it matters whether vacuum is cleaning tables or indexes --- it alternates between the two, and the access patterns are a bit different. You could probably watch what the autovac process is doing with strace to see what it's accessing. Is there something specific I should be looking for in the strace output, or is this just a matter of correlating PID and FD to pg_class.relfilenode?
Re: [GENERAL] Re: significant performance hit whenever autovacuum runs after upgrading from 9.0 -> 9.1
Thanks for your reply. On Tue, May 22, 2012 at 7:19 PM, Andy Colson a...@squeakycode.net wrote: On Mon, May 21, 2012 at 2:05 PM, Lonni J Friedman netll...@gmail.com wrote: Greetings, When I got in this morning, I found an autovacuum process that had been running since just before the load spiked. Autovacuum might need to set the freeze bit the very first time it runs. I recall hearing advice about running a 'vacuum freeze' after you insert a huge amount of data. And I recall pg_upgrade doesn't write stats, so did you analyze your database? Yes, I ran a 'vacuum analyze' for all of the databases' tables immediately following completion of pg_upgrade. Or, maybe it's not vacuum... maybe some of your sql statements are planning differently and running really badly. Can you check some? Can you log slow queries? Have you checked the status of your raid? Maybe you lost a drive and it's in recovery and you have very slow IO? I checked that initially, but the array is fine. After banging my head on the wall for a long time, I happened to notice that khugepaged was consuming 100% CPU every time autovacuum was running. I did: echo madvise > /sys/kernel/mm/transparent_hugepage/defrag and immediately the entire problem went away. Load dropped within minutes from 35.00 to 1.00, and has remained under 4.00 for the past 18 hours. Prior to disabling defrag, I never saw the load below 10.00 for more than a few seconds at a time. So this looks like a nasty Fedora16 kernel bug to me, or maybe postgresql and Fedora16's default kernel settings are just not compatible? Is anyone else using Fedora16 and PostgreSQL-9.1?
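For anyone wanting to reproduce the check above, a small sketch of inspecting and changing the transparent hugepage defrag policy. The sysfs path is the one from the thread; writing it requires root, so the script guards for that rather than assuming privileges:

```shell
# Show the active THP defrag policy (the current choice appears in brackets),
# then restrict defragmentation to madvise'd regions, as described above.
THP=/sys/kernel/mm/transparent_hugepage/defrag
if [ -w "$THP" ]; then
    cat "$THP"
    echo madvise > "$THP"
    cat "$THP"
else
    echo "cannot write $THP here (needs root and a THP-enabled kernel)"
fi
```

Note that this setting does not survive a reboot; on the Fedora releases discussed here it would need to be reapplied from an init/boot script to be permanent.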
Re: [GENERAL] Re: significant performance hit whenever autovacuum runs after upgrading from 9.0 -> 9.1
On Wed, May 23, 2012 at 9:37 AM, Tom Lane t...@sss.pgh.pa.us wrote: Lonni J Friedman netll...@gmail.com writes: After banging my head on the wall for a long time, I happened to notice that khugepaged was consuming 100% CPU every time autovacuum was running. I did: echo madvise > /sys/kernel/mm/transparent_hugepage/defrag and immediately the entire problem went away. Fascinating. In hindsight, sure. Before that, it was 2 days of horror. So this looks like a nasty Fedora16 kernel bug to me, or maybe postgresql and Fedora16's default kernel settings are just not compatible? I agree, kernel bug. What kernel version are you using exactly? I'm using the stock 3.3.5-2.fc16.x86_64 kernel that is in Fedora updates. Is anyone else using Fedora16 and PostgreSQL-9.1? I use an F16 box daily, but can't claim to have done major performance testing with it. Can you put together a summary of your nondefault Postgres settings? I wonder whether it only kicks in for a certain size of shared memory, for instance. Oh yea, I'm quite certain that this is somehow related to my setup, and not a generic problem with all F16/pgsql systems. For starters, this problem isn't happening on any of the 3 standby systems, which are all otherwise identical to the master in every respect. Also, when we had done some testing (prior to the upgrades), we never ran into any of these problems. However, our test environment was on smaller scale hardware, with a much smaller number of clients (and overall load). Here are the non default settings in postgresql.conf:
wal_level = hot_standby
archive_mode = on
archive_timeout = 61
max_wal_senders = 10
wal_keep_segments = 5000
hot_standby = on
log_autovacuum_min_duration = 2500
autovacuum_max_workers = 4
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.7
effective_cache_size = 88GB
work_mem = 576MB
wal_buffers = 16MB
checkpoint_segments = 64
shared_buffers = 8GB
max_connections = 350
Let me know if you have any other questions.
I'd be happy to provide as much information as possible if it can aid in fixing this bug.
Re: [GENERAL] Re: significant performance hit whenever autovacuum runs after upgrading from 9.0 -> 9.1
On Wed, May 23, 2012 at 12:36 PM, Gavin Flower gavinflo...@archidevsys.co.nz wrote: On 24/05/12 05:09, Lonni J Friedman wrote: On Wed, May 23, 2012 at 9:37 AM, Tom Lane t...@sss.pgh.pa.us wrote: Lonni J Friedman netll...@gmail.com writes: After banging my head on the wall for a long time, I happened to notice that khugepaged was consuming 100% CPU every time autovacuum was running. I did: echo madvise > /sys/kernel/mm/transparent_hugepage/defrag and immediately the entire problem went away. Fascinating. In hindsight, sure. Before that, it was 2 days of horror. So this looks like a nasty Fedora16 kernel bug to me, or maybe postgresql and Fedora16's default kernel settings are just not compatible? I agree, kernel bug. What kernel version are you using exactly? I'm using the stock 3.3.5-2.fc16.x86_64 kernel that is in Fedora updates. Is anyone else using Fedora16 and PostgreSQL-9.1? I use an F16 box daily, but can't claim to have done major performance testing with it. Can you put together a summary of your nondefault Postgres settings? I wonder whether it only kicks in for a certain size of shared memory, for instance. Oh yea, I'm quite certain that this is somehow related to my setup, and not a generic problem with all F16/pgsql systems. For starters, this problem isn't happening on any of the 3 standby systems, which are all otherwise identical to the master in every respect. Also, when we had done some testing (prior to the upgrades), we never ran into any of these problems. However, our test environment was on smaller scale hardware, with a much smaller number of clients (and overall load).
Here are the non default settings in postgresql.conf:
wal_level = hot_standby
archive_mode = on
archive_timeout = 61
max_wal_senders = 10
wal_keep_segments = 5000
hot_standby = on
log_autovacuum_min_duration = 2500
autovacuum_max_workers = 4
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.7
effective_cache_size = 88GB
work_mem = 576MB
wal_buffers = 16MB
checkpoint_segments = 64
shared_buffers = 8GB
max_connections = 350
Let me know if you have any other questions. I'd be happy to provide as much information as possible if it can aid in fixing this bug. I think they will need details of things like: RAM, number/type of processors, number/type of disks, disk controllers, and any other hardware specs that might be relevant, etc. At the very least: total RAM and number of spindles. 16 core Xeon X5550 2.67GHz, 128GB RAM. $PGDATA sits on a RAID5 array comprised of 3 SATA disks. It's Linux's md software RAID. Also anything else running on the box. Nothing else; it's dedicated exclusively to postgresql. Plus transaction load pattern - over time and read/write ratios. I'm not sure how I'd obtain this data; however, the patterns didn't change since the upgrade. If someone can point me in the right direction, I can at least obtain this data as it's generated currently. Type/nature of queries? I need some clarification on specifically what you're asking for here. Size of heavily accessed tables and their indexes? There are several rather large tables (90 million+ rows), but most others are under 1M rows. However, most tables are accessed/written to with equal frequency.
Re: [GENERAL] Re: significant performance hit whenever autovacuum runs after upgrading from 9.0 -> 9.1
On Wed, May 23, 2012 at 3:33 PM, Tom Lane t...@sss.pgh.pa.us wrote: Gavin Flower gavinflo...@archidevsys.co.nz writes: 16 core Xeon X5550 2.67GHz, 128GB RAM. $PGDATA sits on a RAID5 array comprised of 3 SATA disks. It's Linux's md software RAID. How does this compare to your other machines running the same, or similar, databases? However, you do say that the other machines are identical - but are the other machines different in any aspect that might prove significant? I think Lonni said that the other machines are just running standby clusters, which would mean they aren't running autovacuum as such, merely applying any WAL it produces. So that could be plenty enough to explain a difference in kernel-visible behavior. Yes, that is correct.
[GENERAL] Re: significant performance hit whenever autovacuum runs after upgrading from 9.0 -> 9.1
No one has any ideas or suggestions, or even questions? If someone needs more information, I'd be happy to provide it. This problem is absolutely killing me. On Mon, May 21, 2012 at 2:05 PM, Lonni J Friedman netll...@gmail.com wrote: Greetings, I have a 4 server postgresql-9.1.3 cluster (one master doing streaming replication to 3 hot standby servers). All of them are running Fedora-16-x86_64. Last Friday I upgraded the entire cluster from Fedora-15 with postgresql-9.0.6 to Fedora-16 with postgresql-9.1.3. I made no changes to postgresql.conf following the upgrade. I used pg_upgrade on the master to upgrade it, followed by blowing away $PGDATA on all the standbys and rsyncing them fresh from the master. All of the servers have 128GB RAM, and at least 16 CPU cores. Everything appeared to be working fine until last night when the load on the master suddenly took off, and hovered at around 30.00 ever since. Prior to the load spike, the load was hovering around 2.00 (which is actually lower than it was averaging prior to the upgrade when it was often around 4.00). When I got in this morning, I found an autovacuum process that had been running since just before the load spiked, and the pg_dump cronjob that started shortly after the load spike (and normally completes in about 20 minutes for all the databases) was still running, and hadn't finished the first of the 6 databases. I ended up killing the pg_dump process altogether in the hope that it might unblock whatever was causing the high load. Unfortunately that didn't help, and the load continued to run high. I proceeded to check dmesg, /var/log/messages and the postgresql server log (all on the master), but I didn't spot anything out of the ordinary, definitely nothing that pointed to a potential explanation for all of the high load. I inspected what the autovacuum process was doing, and determined that it was chewing away on the largest table (nearly 98 million rows) in the largest database. 
It was making very slow progress; at least I believe that was the case, as when I attached strace to the process, the seek addresses were changing in a random fashion. Here are the current autovacuum settings:
autovacuum                      | on
autovacuum_analyze_scale_factor | 0.1
autovacuum_analyze_threshold    | 50
autovacuum_freeze_max_age       | 2
autovacuum_max_workers          | 4
autovacuum_naptime              | 1min
autovacuum_vacuum_cost_delay    | 20ms
autovacuum_vacuum_cost_limit    | -1
autovacuum_vacuum_scale_factor  | 0.2
autovacuum_vacuum_threshold     | 50
Did something significant change in 9.1 that would impact autovacuum behavior? I'm at a complete loss on how to debug this, since I'm using the exact same settings now as prior to the upgrade. thanks
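As a side note on the settings above: autovacuum decides to vacuum a table roughly when its dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples (the formula from the PostgreSQL documentation). A quick sketch with the poster's values shows why the ~98 million row table accumulates a lot of dead rows before autovacuum touches it:

```python
# Approximate dead-tuple count needed before autovacuum vacuums a table,
# using the docs' formula and the settings quoted in the thread above.
def autovacuum_trigger(reltuples, threshold=50, scale_factor=0.2):
    return threshold + scale_factor * reltuples

# The ~98 million row table from the thread needs nearly 20M dead tuples:
print(int(autovacuum_trigger(98_000_000)))  # -> 19600050
```

So a single autovacuum pass over that table legitimately has an enormous amount of work to do, which is why the khugepaged interaction was so visible during it.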
[GENERAL] pg_basebackup blocking all queries
Greetings, I have a 4 server postgresql-9.1.3 cluster (one master doing streaming replication to 3 hot standby servers). All of them are running Fedora-16-x86_64. Last Friday I upgraded the entire cluster from Fedora-15 with postgresql-9.0.6 to Fedora-16 with postgresql-9.1.3. I'm finding that I cannot run pg_basebackup at all without it blocking all SQL queries until pg_basebackup has completed (and the load on the box just takes off to over 75.00). By blocks I mean that any query that is submitted just hangs, and does not return at all until pg_basebackup has stopped. I'm assuming that this isn't expected behavior, so I'm rather confused about what is going on. The command that I'm issuing is: pg_basebackup -v -D /mnt/backups/backups/tmp0 -x -Ft -U postgres Can someone provide some guidance on how to debug this? thanks!
Re: [GENERAL] pg_basebackup blocking all queries
Thanks for your reply. Unfortunately, those queries don't shed any light on the problem. The first two return 0 rows, and the third just returns 12 rows, all associated with the query itself rather than anything else. Any other suggestions? On Tue, May 22, 2012 at 2:56 PM, Scott Marlowe scott.marl...@gmail.com wrote: Do the queries here help? http://wiki.postgresql.org/wiki/Lock_Monitoring On Tue, May 22, 2012 at 12:42 PM, Lonni J Friedman netll...@gmail.com wrote: Greetings, I have a 4 server postgresql-9.1.3 cluster (one master doing streaming replication to 3 hot standby servers). All of them are running Fedora-16-x86_64. Last Friday I upgraded the entire cluster from Fedora-15 with postgresql-9.0.6 to Fedora-16 with postgresql-9.1.3. I'm finding that I cannot run pg_basebackup at all without it blocking all SQL queries until pg_basebackup has completed (and the load on the box just takes off to over 75.00). By blocks I mean that any query that is submitted just hangs, and does not return at all until pg_basebackup has stopped. I'm assuming that this isn't expected behavior, so I'm rather confused about what is going on. The command that I'm issuing is: pg_basebackup -v -D /mnt/backups/backups/tmp0 -x -Ft -U postgres Can someone provide some guidance on how to debug this? thanks!
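The wiki page's query essentially joins pg_locks against itself to pair ungranted lock requests with the sessions holding conflicting locks. A simplified 9.1-era version (pg_stat_activity still had procpid and current_query back then) looks roughly like this:

```sql
-- Who is waiting on whom: ungranted locks matched against granted ones
-- on the same object. Simplified from the wiki's lock-monitoring query.
SELECT waiting.pid       AS waiting_pid,
       holder.pid        AS holding_pid,
       act.current_query AS waiting_query
FROM pg_locks waiting
JOIN pg_locks holder
  ON  holder.locktype = waiting.locktype
  AND holder.database IS NOT DISTINCT FROM waiting.database
  AND holder.relation IS NOT DISTINCT FROM waiting.relation
  AND holder.pid <> waiting.pid
  AND holder.granted
JOIN pg_stat_activity act ON act.procpid = waiting.pid
WHERE NOT waiting.granted;
```

Its returning no rows here is consistent with the hangs not being heavyweight-lock waits at all, which is what makes the behavior so puzzling.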
[GENERAL] significant performance hit whenever autovacuum runs after upgrading from 9.0 -> 9.1
Greetings, I have a 4 server postgresql-9.1.3 cluster (one master doing streaming replication to 3 hot standby servers). All of them are running Fedora-16-x86_64. Last Friday I upgraded the entire cluster from Fedora-15 with postgresql-9.0.6 to Fedora-16 with postgresql-9.1.3. I made no changes to postgresql.conf following the upgrade. I used pg_upgrade on the master to upgrade it, followed by blowing away $PGDATA on all the standbys and rsyncing them fresh from the master. All of the servers have 128GB RAM, and at least 16 CPU cores. Everything appeared to be working fine until last night when the load on the master suddenly took off, and hovered at around 30.00 ever since. Prior to the load spike, the load was hovering around 2.00 (which is actually lower than it was averaging prior to the upgrade when it was often around 4.00). When I got in this morning, I found an autovacuum process that had been running since just before the load spiked, and the pg_dump cronjob that started shortly after the load spike (and normally completes in about 20 minutes for all the databases) was still running, and hadn't finished the first of the 6 databases. I ended up killing the pg_dump process altogether in the hope that it might unblock whatever was causing the high load. Unfortunately that didn't help, and the load continued to run high. I proceeded to check dmesg, /var/log/messages and the postgresql server log (all on the master), but I didn't spot anything out of the ordinary, definitely nothing that pointed to a potential explanation for all of the high load. I inspected what the autovacuum process was doing, and determined that it was chewing away on the largest table (nearly 98 million rows) in the largest database. It was making very slow progress, at least I believe that was the case, as when I attached strace to the process, the seek addresses were changing in a random fashion. 
Here are the current autovacuum settings:
autovacuum                      | on
autovacuum_analyze_scale_factor | 0.1
autovacuum_analyze_threshold    | 50
autovacuum_freeze_max_age       | 2
autovacuum_max_workers          | 4
autovacuum_naptime              | 1min
autovacuum_vacuum_cost_delay    | 20ms
autovacuum_vacuum_cost_limit    | -1
autovacuum_vacuum_scale_factor  | 0.2
autovacuum_vacuum_threshold     | 50
Did something significant change in 9.1 that would impact autovacuum behavior? I'm at a complete loss on how to debug this, since I'm using the exact same settings now as prior to the upgrade. thanks
[GENERAL] problems after restoring from a pg_basebackup
Greetings, I'm running postgresql-9.1.3 on a Linux-x86_64 (Fedora16, if it matters) system. I noticed the existence of pg_basebackup starting in 9.1, and figured I'd try it out and see if it would simplify our backup management processes.
$ pg_basebackup -P -v -D /tmp/backup -x -Ft -z -U postgres
xlog start point: C6/6420
135733616/135733616 kB (100%), 1/1 tablespace
xlog end point: C6/64A0
pg_basebackup: base backup completed
So after running through this, I tried to use (restore) the backup that was generated. While everything appears to be working ok from a functional perspective, in the server log I saw the following:
##
LOG: creating missing WAL directory pg_xlog/archive_status
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at C6/6678
LOG: could not open file pg_xlog/000100C60067 (log file 198, segment 103): No such file or directory
LOG: redo done at C6/66A0
FATAL: the database system is starting up
LOG: autovacuum launcher started
LOG: database system is ready to accept connections
#
Just to be clear, here's what I did after pg_basebackup had completed successfully:
0) shutdown postgresql gracefully, and verified that it was fully shutdown
1) moved $PGDATA to $PGDATA.old
2) created $PGDATA as postgres user
3) extracted the basebackup tarball as postgres user:
   cd $PGDATA
   tar xzvpf /tmp/backup/base.tar.gz
4) started postgresql up
I would have expected that I wouldn't have gotten the 'not properly shutdown' warning, or the 'could not open file' warning, by following this process. Am I doing something wrong? thanks
Re: [GENERAL] pg_basebackup issues
On Fri, Apr 20, 2012 at 12:31 PM, Magnus Hagander mag...@hagander.net wrote: On Fri, Apr 20, 2012 at 19:51, Lonni J Friedman netll...@gmail.com wrote: Anyway, lesson learned, I need to either invoke pg_basebackup as the same user that runs the database (or is specified with the -U parameter?), or write the backup somewhere outside of the directory structure that is being backed up. I eventually also found the following entries in the postgresql server log: FATAL: could not open directory ./backups: Permission denied FATAL: archive member backups/base.tar.gz too large for tar format What concerns me is the 2nd fatal error. The tarball that pg_basebackup created before erroring out is about 12GB: 12393094165 base.tar.gz Are you actually storing your backup files *inside* the data directory? You really shouldn't do that; you're creating a cyclic dependency where each new backup will include the old one inside it... You should store the resulting backup file somewhere outside the data directory. yea, in hindsight that was silly. i just saw a convenient 'backups' subdirectory and figured that'd be a nice logical place. I wasn't aware of any 12GB file size limit for tar, so this is a bit of a mystery to me. Regardless, I'd be happy to try some other archiving strategy, but the man page for pg_basebackup suggests that there are only two formats: tar, and basically just copying the filesystem. If I copied the filesystem, I'd still have to find some way to archive the files for easy management (copying elsewhere, etc). Has anyone come up with a good strategy on how to deal with it? The max file size of a single file inside a standard tar file is 8GB, see e.g. http://en.wikipedia.org/wiki/Tar_(file_format). I think there are extensions that let you store bigger files, but since PostgreSQL will never create files that big it's not implemented in the basebackup system.
Because again, the root of your problem seems to be that you are trying to store the resulting backup inside the data directory. You're right, that was indeed the issue. I've redone the process using a location external to $PGDATA, and it completed successfully:
$ pg_basebackup -P -v -D /tmp/backup -x -Ft -z -U postgres
xlog start point: C6/6420
135733616/135733616 kB (100%), 1/1 tablespace
xlog end point: C6/64A0
pg_basebackup: base backup completed
So after running through this, I tried to use (restore) the backup that was generated. While everything appears to be working ok from a functional perspective, in the server log I saw the following:
##
LOG: creating missing WAL directory pg_xlog/archive_status
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at C6/6678
LOG: could not open file pg_xlog/000100C60067 (log file 198, segment 103): No such file or directory
LOG: redo done at C6/66A0
FATAL: the database system is starting up
LOG: autovacuum launcher started
LOG: database system is ready to accept connections
#
Just to be clear, here's what I did after pg_basebackup had completed successfully:
0) shutdown postgresql gracefully, and verified that it was fully shutdown
1) moved $PGDATA to $PGDATA.old
2) created $PGDATA as postgres user
3) extracted the basebackup tarball as postgres user:
   cd $PGDATA
   tar xzvpf /tmp/backup/base.tar.gz
4) started postgresql up
I would have expected that I wouldn't have gotten the 'not properly shutdown' warning, or the 'could not open file' warning, by following this process. Am I doing something wrong? thanks
[GENERAL] pg_basebackup issues
Greetings, I'm running postgresql-9.1.3 on a Linux-x86_64 (Fedora16, if it matters) system. I noticed the existence of pg_basebackup starting in 9.1, and figured I'd try it out and see if it would simplify our backup management processes. I setup a test system (same OS and postgresql version as production) with a fairly recent snapshot of our production database, invoked it, and saw the following output:
##
# pg_basebackup -P -v -D backups -Ft -z -U postgres
135717206/135717230 kB (100%), 1/1 tablespace
pg_basebackup: could not get WAL end position from server
##
I wasn't sure what that error meant, so after googling a bit, it turns out that it really means that there were one or more files not owned by the postgres user (see http://serverfault.com/questions/312205/pg-basebackup-could-not-get-wal-end-position-from-server ). Sure enough, the file that wasn't owned by the postgres user was the backup tarball that pg_basebackup was creating, since I had been running it as root. That error is rather cryptic, and it would be helpful if it was improved to suggest the real cause of the failure. Anyway, lesson learned, I need to either invoke pg_basebackup as the same user that runs the database (or is specified with the -U parameter?), or write the backup somewhere outside of the directory structure that is being backed up. I eventually also found the following entries in the postgresql server log:
FATAL: could not open directory ./backups: Permission denied
FATAL: archive member backups/base.tar.gz too large for tar format
What concerns me is the 2nd fatal error. The tarball that pg_basebackup created before erroring out is about 12GB: 12393094165 base.tar.gz I wasn't aware of any 12GB file size limit for tar, so this is a bit of a mystery to me. Regardless, I'd be happy to try some other archiving strategy, but the man page for pg_basebackup suggests that there are only two formats: tar, and basically just copying the filesystem.
If I copied the filesystem, I'd still have to find some way to archive the files for easy management (copying elsewhere, etc). Has anyone come up with a good strategy on how to deal with it? thanks
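Given Magnus's explanation that the 8GB limit is a property of the ustar archive member format, one workaround sketch (my suggestion, not something proposed in the thread) is to take a plain-format base backup and build the archive yourself with GNU tar, whose default format permits large members. The script is guarded so it only runs the backup when explicitly asked, since it obviously requires a live server:

```shell
# Sketch: avoid the ustar per-member size limit by taking a plain-format
# base backup and archiving it manually. Paths and -U follow the thread.
RUN_BACKUP=${RUN_BACKUP:-no}   # set RUN_BACKUP=yes on the actual server
if [ "$RUN_BACKUP" = "yes" ]; then
    pg_basebackup -P -v -D /mnt/backups/base -Fp -x -U postgres
    tar --format=gnu -czf /mnt/backups/base.tar.gz -C /mnt/backups base
else
    echo "dry run: set RUN_BACKUP=yes to take the backup"
fi
```

The key point is that the backup destination lives outside $PGDATA, so the cyclic-dependency problem from the thread cannot recur.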
Re: [GENERAL] pg_upgrade + streaming replication ?
On Tue, Mar 20, 2012 at 11:46 AM, Bruce Momjian br...@momjian.us wrote: On Mon, Mar 19, 2012 at 03:07:02PM -0700, Jeff Davis wrote: On Mon, 2012-03-19 at 15:30 -0400, Bruce Momjian wrote: On Thu, Mar 01, 2012 at 02:01:31PM -0800, Lonni J Friedman wrote: I've got a 3 node cluster (1 master/2 slaves) running 9.0.x with streaming replication. I'm in the planning stages of upgrading to 9.1.x, and am looking into the most efficient way to do the upgrade with the goal of minimizing downtime risk. After googling, the only discussion that I've found of using pg_upgrade with a streaming replication setup seems to be this (nearly) year old thread: http://web.archiveorange.com/archive/v/9FNVlDWGQtpyWVL54jlK In summary, there is no way to use both pg_upgrade and streaming replication simultaneously. I'd have to either use pg_upgrade and then effectively rebuild/redeploy the slaves, or not use pg_upgrade, and reimport all of the data. Is that still the latest status, or are there other options? You can shut down all three servers, run pg_upgrade on all of them, then restart them as 9.1 servers. After running pg_upgrade on each server individually, they will have different system IDs, and potentially different on-disk representation of the catalogs, right? So how can you resume streaming without rebuilding the slaves? Oh, wow, I never thought of the fact that the system tables will be different? I guess you could assume the pg_dump restore is going to create things exactly the same on all the systems, but I never tested that. Do the system id's have to match? That would be a problem because you are initdb'ing on each server. OK, crazy idea, but I wonder if you could initdb on the master, then copy that to the slaves, then run pg_upgrade on each of them. Obviously this needs some testing. Wouldn't it be easier to just pg_upgrade the master, then setup the slaves from scratch (with rsync, etc)? 
It certainly wouldn't be any more work to do it that way (although still a lot more work than simply running pg_upgrade on all servers).
Re: [GENERAL] pg_upgrade + streaming replication ?
On Mon, Mar 19, 2012 at 12:30 PM, Bruce Momjian br...@momjian.us wrote: On Thu, Mar 01, 2012 at 02:01:31PM -0800, Lonni J Friedman wrote: I've got a 3 node cluster (1 master/2 slaves) running 9.0.x with streaming replication. I'm in the planning stages of upgrading to 9.1.x, and am looking into the most efficient way to do the upgrade with the goal of minimizing downtime risk. After googling, the only discussion that I've found of using pg_upgrade with a streaming replication setup seems to be this (nearly) year old thread: http://web.archiveorange.com/archive/v/9FNVlDWGQtpyWVL54jlK In summary, there is no way to use both pg_upgrade and streaming replication simultaneously. I'd have to either use pg_upgrade and then effectively rebuild/redeploy the slaves, or not use pg_upgrade, and reimport all of the data. Is that still the latest status, or are there other options? You can shut down all three servers, run pg_upgrade on all of them, then restart them as 9.1 servers. Thanks for your reply. This is very good news.
Re: [GENERAL] how to measure wal_buffer usage
On Fri, Mar 16, 2012 at 2:45 AM, Albe Laurenz laurenz.a...@wien.gv.at wrote: Lonni J Friedman wrote: After reading this interesting article on shared_buffers and wal_buffers: http://rhaas.blogspot.com/2012/03/tuning-sharedbuffers-and-walbuffers.html it got me wondering if my settings were ideal. Is there some way to measure wal_buffer usage in real time, so that I could simply monitor it for some period of time, and then come up with a way of determining if the current setting is sufficient? I tried googling, but every reference that I've found simply defaults to the trial and error approach to performance tuning. You can use the contrib module pg_buffercache to inspect the shared buffers. If almost all your shared buffers have a high use count (4 or 5), shared_buffers may be too small. If not, consider reducing shared_buffers. pg_buffercache only reports on the buffer_cache; it does not report any data on the wal_cache. It's probably better to start with a moderate value and tune upwards. You can also look at pg_statio_all_tables and pg_statio_all_indexes and calculate the buffer hit ratio. If that is low, that's also an indication that shared_buffers is too small. Isn't this also specific to the buffer_cache rather than the wal_cache? You should distinguish between tables and indexes: it is usually more important that indexes are cached. Try to observe these things over time, for example by taking snapshots every n minutes and storing the results in a table. Yours, Laurenz Albe -- ~ L. Friedman netll...@gmail.com LlamaLand https://netllama.linux-sxs.org
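To make Laurenz's pg_buffercache suggestion concrete (and, as Lonni notes, this covers shared_buffers only, not wal_buffers), the usage-count distribution can be pulled with something like:

```sql
-- Distribution of buffer usage counts across shared_buffers; a large
-- fraction at usagecount 4-5 suggests shared_buffers may be too small.
-- Requires the pg_buffercache contrib module to be installed.
SELECT usagecount, count(*) AS buffers
FROM pg_buffercache
GROUP BY usagecount
ORDER BY usagecount;
```

Taking snapshots of this output every few minutes, as Laurenz suggests, turns a single noisy reading into a usable trend.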
[GENERAL] how to measure wal_buffer usage
After reading this interesting article on shared_buffers and wal_buffers: http://rhaas.blogspot.com/2012/03/tuning-sharedbuffers-and-walbuffers.html it got me wondering if my settings were ideal. Is there some way to measure wal_buffer usage in real time, so that I could simply monitor it for some period of time, and then come up with a way of determining if the current setting is sufficient? I tried googling, but every reference that I've found simply defaults to the trial-and-error approach to performance tuning.
[GENERAL] pg_upgrade + streaming replication ?
I've got a 3 node cluster (1 master/2 slaves) running 9.0.x with streaming replication. I'm in the planning stages of upgrading to 9.1.x, and am looking into the most efficient way to do the upgrade with the goal of minimizing downtime and risk. After googling, the only discussion that I've found of using pg_upgrade with a streaming replication setup seems to be this (nearly) year-old thread: http://web.archiveorange.com/archive/v/9FNVlDWGQtpyWVL54jlK

In summary, there is no way to use both pg_upgrade and streaming replication simultaneously. I'd have to either use pg_upgrade and then effectively rebuild/redeploy the slaves, or not use pg_upgrade and reimport all of the data. Is that still the latest status, or are there other options?

thanks
[GENERAL] returning rows from an implicit JOIN where results either exist in both tables OR only one table
Greetings, I've got a PostgreSQL-9.0.x database that manages an automated testing environment. There are a bunch of tables that contain assorted static data (OS versions, test names, etc.) named 'buildlist' and 'osversmap'. However, there are also two tables whose data changes often.

The first is a 'pending' table, which is effectively a test queue: pending tests are self-selected by the test systems, and then deleted when the test run has completed. The second is a 'results' table, which contains the test results as they are produced (in progress and completed). The records in the pending table have a one-to-many relationship with the records in the results table (each row in pending can have 0 or more rows in results). For example, if no test systems have self-assigned a pending row, then there will be zero associated rows in results; once a pending row is assigned, the number of rows in results will increase for that pending row. An added catch is that I always want only the newest results-table row associated with each pending-table row.

What I need to do is query the 'pending' table for pending tests, and then also get a 'logurl' from the results table that corresponds to each pending-table row. All of this is rather similar to this problem, except that I have the added burden of the two additional tables with the static data (buildlist and osversmap): http://stackoverflow.com/questions/3343857/php-sql-using-only-one-query-select-rows-from-two-tables-if-data-is-in-both-ta

I'm stumbling over how to integrate those two tables with static data into the query.
The following query works fine as long as there's at least one row in the 'results' table that corresponds to each row in the pending table (however, it doesn't return anything for rows that exist in 'pending' but not yet in 'results'):

SELECT pending.cl, pending.id, pending.buildid, pending.build_type,
       pending.active, pending.submittracker, pending.os, pending.arch,
       pending.osversion, pending.branch, pending.comment,
       osversmap.osname, buildlist.buildname, results.logurl
FROM pending, osversmap, buildlist, results
WHERE pending.buildid = buildlist.id
  AND pending.os = osversmap.os
  AND pending.osversion = osversmap.osversion
  AND pending.owner = '$owner'
  AND pending.completed = 'f'
  AND results.hostname = pending.active
  AND results.submittracker = pending.submittracker
  AND pending.cl = results.cl
  AND results.current_status != 'PASSED'
  AND results.current_status NOT LIKE '%FAILED'
ORDER BY pending.submittracker, pending.branch, pending.os, pending.arch

thanks in advance!
Re: [GENERAL] returning rows from an implicit JOIN where results either exist in both tables OR only one table
On Thu, Dec 1, 2011 at 1:57 PM, David Johnston pol...@yahoo.com wrote:

 -----Original Message-----
 From: pgsql-general-ow...@postgresql.org [mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Lonni J Friedman
 Sent: Thursday, December 01, 2011 4:13 PM
 To: pgsql-general
 Subject: [GENERAL] returning rows from an implicit JOIN where results either exist in both tables OR only one table

 I'm stumbling over how to integrate those two tables with static data into the query. The following query works fine as long as there's at least one row in the 'results' table that corresponds to each row in the pending table (however, it doesn't return anything for rows that exist in 'pending' but not yet in 'results'):

 Implicit JOINs are ALWAYS INNER JOINs. Since you want to use an OUTER JOIN, you must be explicit. I'm not going to try to figure out specifically what you need, but from your quick description (all pending and results where available) you need to do something like:

 pending LEFT OUTER JOIN results ON (pending.active = results.hostname AND pending.submittracker = results.submittracker AND pending.cl = results.cl)

 Then, for conditions dependent upon the results (NULL-able) relation, you need to make sure you explicitly allow for the missing rows:

 ( results.current_status IS NULL OR ( your existing results conditions ) )

 http://www.postgresql.org/docs/9.0/interactive/queries-table-expressions.html#QUERIES-FROM

Thanks for your reply and input.
I ended up putting together the following query, which does what I need:

SELECT pending.cl, pending.id, pending.buildid, pending.build_type,
       pending.active, pending.submittracker, pending.os, pending.arch,
       pending.osversion, pending.branch, pending.comment,
       osversmap.osname, buildlist.buildname, results.logurl
FROM pending
JOIN osversmap ON ( pending.os = osversmap.os
                AND pending.osversion = osversmap.osversion )
JOIN buildlist ON ( pending.buildid = buildlist.id )
LEFT OUTER JOIN results ON ( pending.active = results.hostname
                         AND pending.submittracker = results.submittracker
                         AND pending.cl = results.cl
                         AND results.current_status != 'PASSED'
                         AND results.current_status NOT LIKE '%FAILED' )
WHERE pending.owner = '$owner'
  AND pending.completed = 'f'
ORDER BY pending.submittracker, pending.branch, pending.os, pending.arch
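One loose end from the original post: the requirement to keep only the newest results row per pending row isn't addressed in this thread. A sketch using PostgreSQL's DISTINCT ON, assuming results has a timestamp column (hypothetically named 'tstamp' here, since the real column name isn't shown in the thread):

```sql
-- For each (hostname, submittracker, cl) key, keep only the newest results row.
-- 'tstamp' is a hypothetical timestamp column on results.
SELECT DISTINCT ON (hostname, submittracker, cl)
       hostname, submittracker, cl, current_status, logurl
FROM results
ORDER BY hostname, submittracker, cl, tstamp DESC;
```

Wrapped in a subquery, this can replace the bare results table on the right side of the LEFT OUTER JOIN, so each pending row picks up at most one (the newest) results row.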
[GENERAL] autovacuum stuck on a table for 18+ hours, consuming lots of CPU time
Greetings, I'm running PostgreSQL-9.0.4 on a Linux-x86_64 cluster with 1 master and two streaming replication slaves. Since late yesterday, the load on the server has been noticeably higher (5.00+) than normal (generally under 1.00). I investigated, and found that for the past ~18 hours there's one autovacuum process that has been running without making any obvious progress:

select procpid,query_start,current_query from pg_stat_activity where current_query LIKE 'autovacuum%' ;

 procpid |          query_start          |              current_query
---------+-------------------------------+------------------------------------------
   30188 | 2011-11-21 22:42:26.426315-08 | autovacuum: VACUUM public.nppsmoketests
(1 row)

select c.oid,c.relname,l.pid,l.mode,l.granted from pg_class c join pg_locks l on c.oid=l.relation where l.pid='30188' order by l.pid;

  oid  |              relname               |  pid  |           mode           | granted
-------+------------------------------------+-------+--------------------------+---------
 72112 | nppsmoketests                      | 30188 | ShareUpdateExclusiveLock | t
 72617 | nppsmoketests_pkey                 | 30188 | RowExclusiveLock         | t
 72619 | nppsmoketests_bug_idx              | 30188 | RowExclusiveLock         | t
 72620 | nppsmoketests_active_idx           | 30188 | RowExclusiveLock         | t
 72621 | nppsmoketests_arch_idx             | 30188 | RowExclusiveLock         | t
 72622 | nppsmoketests_branch_idx           | 30188 | RowExclusiveLock         | t
 72623 | nppsmoketests_current_status_idx   | 30188 | RowExclusiveLock         | t
 72624 | nppsmoketests_build_type_idx       | 30188 | RowExclusiveLock         | t
 72625 | nppsmoketests_gpu_idx              | 30188 | RowExclusiveLock         | t
 72626 | nppsmoketests_os_idx               | 30188 | RowExclusiveLock         | t
 72627 | nppsmoketests_owner_idx            | 30188 | RowExclusiveLock         | t
 72628 | nppsmoketests_regressioncl_idx     | 30188 | RowExclusiveLock         | t
 72629 | nppsmoketests_subtest_idx          | 30188 | RowExclusiveLock         | t
 72630 | nppsmoketests_suiteid_idx          | 30188 | RowExclusiveLock         | t
 72631 | nppsmoketests_suiteid_testname_idx | 30188 | RowExclusiveLock         | t
 72632 | nppsmoketests_testcl_idx           | 30188 | RowExclusiveLock         | t
 72633 | nppsmoketests_testname_idx         | 30188 | RowExclusiveLock         | t
 80749 | nppsmoketests_osversion_idx        | 30188 | RowExclusiveLock         | t
(18 rows)

When I strace PID
30188, I see tons of this scrolling past quickly, but I'm not really sure what it means beyond a 'Timeout' not looking good:

select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
lseek(95, 753901568, SEEK_SET) = 753901568
read(95, "\202\1\0\0\260\315\250\245\1\0\0\0\220\0\360\20\360\37\4 \0\0\0\0p\237\0\1\360\236\0\1"..., 8192) = 8192
lseek(95, 753917952, SEEK_SET) = 753917952
read(95, "\202\1\0\0 N\253\245\1\0\0\0\220\0\360\20\360\37\4 \0\0\0\0p\237\0\1\360\236\0\1"..., 8192) = 8192
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
lseek(95, 768606208, SEEK_SET) = 768606208
read(95, "\204\1\0\0h!~\233\1\0\0\0\230\0\360\20\360\37\4 \0\0\0\0x\237\360\0\0\237\360\0"..., 8192) = 8192
lseek(95, 753934336, SEEK_SET) = 753934336
read(95, "\202\1\0\0 \275\245\1\0\0\0\220\0\360\20\360\37\4 \0\0\0\0p\237\0\1\360\236\0\1"..., 8192) = 8192
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
read(95, "\202\1\0\0\10\33\276\245\1\0\0\0\220\0\360\20\360\37\4 \0\0\0\0p\237\0\1\360\236\0\1"..., 8192) = 8192
lseek(95, 753958912, SEEK_SET) = 753958912
read(95, "\202\1\0\0x\317\307\245\1\0\0\0\220\0\360\20\360\37\4 \0\0\0\0p\237\0\1\360\236\0\1"..., 8192) = 8192
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
lseek(95, 768614400, SEEK_SET) = 768614400

An old thread suggests that this is a stuck spinlock: http://archives.postgresql.org/pgsql-performance/2009-05/msg00455.php

I'm using the defaults for all the *vacuum* options in postgresql.conf, except for:

log_autovacuum_min_duration = 2500

At this point, I'm not sure what I can safely do to either terminate this autovacuum process or kick it into making progress again?

thanks
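One quick sanity check in this situation (a general technique, not something prescribed in the thread) is to dump all of the cost-based vacuum delay and autovacuum settings in one go, rather than SHOWing them individually, since a large cost delay makes vacuum deliberately slow:

```sql
-- Show all cost-based vacuum delay/limit settings plus autovacuum knobs.
SELECT name, setting, unit
FROM pg_settings
WHERE name LIKE '%vacuum_cost%'
   OR name LIKE 'autovacuum%'
ORDER BY name;
```

Note that autovacuum_vacuum_cost_delay defaults to a nonzero value even when vacuum_cost_delay itself is 0.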
Re: [GENERAL] autovacuum stuck on a table for 18+ hours, consuming lots of CPU time
On Tue, Nov 22, 2011 at 6:57 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Lonni J Friedman netll...@gmail.com writes:
 When I strace PID 30188, I see tons of this scrolling past quickly, but I'm not really sure what it means beyond a 'Timeout' not looking good:

 select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
 lseek(95, 753901568, SEEK_SET) = 753901568
 read(95, "\202\1\0\0\260\315\250\245\1\0\0\0\220\0\360\20\360\37\4 \0\0\0\0p\237\0\1\360\236\0\1"..., 8192) = 8192
 lseek(95, 753917952, SEEK_SET) = 753917952
 read(95, "\202\1\0\0 N\253\245\1\0\0\0\220\0\360\20\360\37\4 \0\0\0\0p\237\0\1\360\236\0\1"..., 8192) = 8192
 select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)

 I'm betting the selects are implementing vacuum_cost_delay, and that the reason this is taking forever is that you have that cranked up to an unreasonably high value. There's no evidence of looping in what you showed us, because the seek addresses are changing.

Thanks for your prompt reply. I was pretty sure that I was using the default, but just to confirm, I ran 'SHOW vacuum_cost_delay;' and it returned 0. Any other suggestions? Is there no way to safely kill off this autovacuum process that doesn't involve shutting down and restarting the database?

thanks
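For the record (this is general PostgreSQL functionality, not a suggestion made in the thread): a running autovacuum worker can be interrupted without restarting the server via pg_cancel_backend(), which sends SIGINT to that backend. The table remains valid afterward, and autovacuum will simply pick it up again later; in 9.0 this requires superuser privileges:

```sql
-- Cancel the query currently running in backend 30188 (the autovacuum worker).
SELECT pg_cancel_backend(30188);
```

This cancels only the current operation; it does not disable autovacuum for the table, so the worker will eventually retry.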