Hi Tom,
The ERROR happened again. After several days of investigation and testing, I
can now reproduce the ERROR in my test environment. The reason the ERROR used
to happen only in a certain time window is that our report batch jobs run in
that window, and those jobs make the TOAST area grow. I can reproduce the
ERROR with the following set-up and testing procedure.
Set-up:
=====
(1) make sure the configuration parameter log_autovacuum_min_duration = 0, so
that every autovacuum run is logged
(2) in a 9.1.7 database cluster, create a database called "TEST_LOCK".
(3) In the database "TEST_LOCK", create one table called LOCK with one row in
it:
create table lock (id bigint, constraint lock_pk primary key (id));
insert into lock values (1);
(4) in the same database cluster, create another database called "TEST".
(5) in the database "TEST", create one table test with the following columns:
create table test (
    id bigint not null,
    txop char(1),
    messageblob bytea,
    constraint test_pk primary key (id)
);
Pre-testing set-up:
===========
(1) use psql to connect to TEST_LOCK, issue the following SQL, and leave the
session open during the test:
begin;
select * from lock where id = 1 for update;
(2) use another psql to connect to TEST_LOCK and issue the same SQL as in (1).
This session will block behind session (1).
(3) this simulates our production environment, where our middle-tier service
uses the TEST_LOCK.lock table for master/slave fail-over. We run the
middle-tier service on several servers; when each instance starts up, it tries
to lock that ROW, and the server that successfully holds the LOCK becomes the
master.
Testing set-up:
=========
Part 1 - traffic that consumes the TOAST area of the table "test" in the TEST
database
(1) use a program to INSERT INTO test rows with messageblob longer than 1MB
and txop = 'A'; this ensures the TOAST area of the test table is used. Then
update the row's txop to 'D'. Then delete the same row. Each of these
operations runs in its own transaction (a sketch of one cycle appears after
Part 2 below).
(2) this simulates the batch report traffic we see from 00:15 - 00:45 every
night.
Part 2 - traffic that does not consume the TOAST area of the table "test"
(1) use a program to INSERT INTO test rows with messageblob of about 3000
bytes and txop = 'A'. Then update the row's txop to 'D'. Then delete the same
row. Each of these operations runs in its own transaction (the same sketch
below applies, with a smaller payload).
(2) this simulates the background traffic we see during the whole day.
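For illustration, here is a minimal sketch of one Part 1 cycle, assuming psql
runs in autocommit mode so that each statement is its own transaction. The id
value and the md5-based payload are placeholders, not our actual driver
program; the pseudo-random bytes just make sure the value is too large and too
incompressible to stay inline:
-- build ~1.1MB of effectively incompressible bytes, forcing the value
-- out of line into the TOAST area
insert into test (id, txop, messageblob)
select 1, 'A', string_agg(decode(md5(i::text), 'hex'), ''::bytea)
from generate_series(1, 70000) i;
-- mark the row processed, then remove it (each in its own transaction)
update test set txop = 'D' where id = 1;
delete from test where id = 1;
A Part 2 cycle is the same with generate_series(1, 188), i.e. about 3000
bytes. Run the cycle in a tight loop, varying id, until the conditions below
hold.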
For part 1 and part 2, the traffic must be high enough that all of the
following hold:
(1) the database log shows autovacuum of the table TEST.public.test running
every minute
(2) the database log shows autovacuum of the table's TOAST area running every
minute
(3) from the output of ANALYZE VERBOSE on test, the number of pages keeps
increasing, even with autovacuum running on the table
(4) similarly, when querying the size of the TOAST area, its size keeps
increasing, even with autovacuum running on the TOAST area every minute
(5) clustering the table test cannot reduce the table size or the TOAST area
size either
(6) at any time, the number of live rows is only a few while the number of
dead rows is high (example queries for checks (3)-(6) are sketched below)
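The checks above can be done with standard catalog queries; a sketch
(pg_size_pretty, pg_relation_size and pg_stat_user_tables all exist in 9.1):
-- locate the TOAST relation of the table test and report both sizes
select reltoastrelid::regclass as toast_table,
       pg_size_pretty(pg_relation_size('test')) as table_size,
       pg_size_pretty(pg_relation_size(reltoastrelid)) as toast_size
from pg_class
where relname = 'test' and relkind = 'r';
-- live vs. dead row counts for check (6)
select n_live_tup, n_dead_tup
from pg_stat_user_tables
where relname = 'test';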
Testing procedures:
============
(1) use psql to connect to TEST and frequently analyze the table so as to
monitor its size. Run SQL to check the size of the table's TOAST area as well
(for example, the queries sketched above).
(2) tail the database log and check that the automatic vacuum of the table and
its TOAST area happens every minute
(3) Wait until the size of the table is at least 400 pages and the TOAST area
is at least 20MB.
(4) wait for the database log to emit the line
automatic vacuum of table "TEST.pg_toast.pg_toast_xxxx": index scans: ...
(pg_toast_xxxx is the TOAST area of the table test), then IMMEDIATELY issue
the CLUSTER command in the psql session:
cluster test using test_pk;
Testing results:
=========
(1) If the timing is right, I occasionally get the ERROR (about once in ten
attempts):
ERROR: missing chunk number 0 for toast value 13113341 in pg_toast_13084520
(2) Once I get the ERROR, subsequent CLUSTER commands fail with the same ERROR
until the next autovacuum of the TOAST area completes.
(3) I tested PostgreSQL 9.1.7 and 9.1.3 and can reproduce the ERROR in both,
so this is not a 9.1.7-specific bug.
Thanks a lot for your help.
Pius
________________________________________
From: Tom Lane [[email protected]]
Sent: Tuesday, January 22, 2013 8:14 PM
To: Pius Chan
Cc: [email protected]
Subject: Re: [BUGS] BUG #7819: missing chunk number 0 for toast value 1235919
in pg_toast_35328
Pius Chan <[email protected]> writes:
> Hi Tom,
> Yes, we start seeing this ERROR after upgrade to 9.1.7. The ERROR is from the
> "cluster jms_messages" command. Last night, the ERROR happened three times:
> (1) at 00:20:01
> ERROR: missing chunk number 0 for toast value 1239124 in pg_toast_35328
> (2) at 00:25:01
> ERROR: missing chunk number 0 for toast value 1239124 in pg_toast_35328
> (3) at 00:35:01
> ERROR: missing chunk number 0 for toast value 1241022 in pg_toast_35328
> The "cluster jms_messages" runs every 5 minutes. However, so far, it seems
> that the ERROR happens at about mid-night and 35328 is the toast area of the
> "jms_message" table:
So what else is this application doing around midnight? In particular,
it seems like something must've happened between 00:15 and 00:20 to
create the problem with OID 1239124, and then something else happened
between 00:25 and 00:30 to get rid of it. And then a similar sequence
happened between 00:30 and 00:40 involving OID 1241022. Most likely the
trigger events are application actions against jms_messages, but we have
no info about what. (It's also possible that autovacuums of
jms_messages are involved, so you might want to crank up your logging
enough so you can see those in the postmaster log.)
I had first thought that this might have something to do with the
toast-value-processing changes we made in CLUSTER a year or so ago,
but a look in the commit logs confirms that those patches were in 9.1.3.
So I have no idea what might've broken between 9.1.3 and 9.1.7. We
need data, or even better an actual test case.
regards, tom lane