Hi Tom,
The ERROR happened again. After several days of investigation and testing, I
can now reproduce the ERROR in my test environment. The reason the ERROR used
to happen only in a certain time window is that our report batch jobs run in
that window, and those jobs make the TOAST area grow. I can reproduce the
ERROR with the following set-up and testing procedure.
Set-up:
=====
(1) make sure the configuration parameter log_autovacuum_min_duration = 0, so
that every autovacuum run is logged
(2) in a 9.1.7 database cluster, create a database called "TEST_LOCK".
(3) In the database "TEST_LOCK", create one table called LOCK with one row in
it:
create table lock (id bigint, constraint lock_pk primary key (id));
insert into lock values (1);
(4) in the same database cluster, create another database called "TEST".
(5) in the database "TEST", create one table test with the following columns:
create table test (
    id bigint not null,
    txop char(1),
    messageblob bytea,
    constraint test_pk primary key (id)
);
Pre-testing set-up:
===========
(1) use psql to connect to TEST_LOCK, issue the following SQL, and leave the
session open during the test:
begin;
select * from lock where id = 1 for update;
(2) use another psql to connect to TEST_LOCK and issue the same SQL as in (1).
This session will block behind session (1).
(3) this simulates our production environment, where our middle-tier service
uses the TEST_LOCK.lock table for master/slave fail-over. We run the
middle-tier service on several servers; when each instance starts up, it tries
to lock that ROW, and the server that successfully holds the LOCK becomes the
master.
Testing set-up:
=========
Part 1 - traffic that consumes the TOAST area of the table "test" in the TEST
database
(1) use a program to INSERT INTO test rows with messageblob longer than 1MB
and txop = 'A'; this ensures the TOAST area of the test table is used. Then
update the row's txop to 'D'. Then delete the same row. Each of these
operations runs in its own transaction (a sketch of one cycle appears after
Part 2 below).
(2) this simulates the batch report traffic we see from 00:15 - 00:45 every
night.
Part 2 - traffic that does not consume the TOAST area of the table "test"
(1) use a program to INSERT INTO test rows with messageblob of about 3000
bytes and txop = 'A'. Then update the row's txop to 'D'. Then delete the same
row. Each of these operations runs in its own transaction (the same sketch
below applies, with a smaller payload).
(2) this simulates the background traffic we see during the whole day.
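For illustration, here is a minimal sketch of one Part 1 cycle, assuming psql
runs in autocommit mode so that each statement is its own transaction. The id
value and the md5-based payload are placeholders, not our actual driver
program; the pseudo-random bytes just make sure the value is too large and too
incompressible to stay inline:
-- build ~1.1MB of effectively incompressible bytes, forcing the value
-- out of line into the TOAST area
insert into test (id, txop, messageblob)
select 1, 'A', string_agg(decode(md5(i::text), 'hex'), ''::bytea)
from generate_series(1, 70000) i;
-- mark the row processed, then remove it (each in its own transaction)
update test set txop = 'D' where id = 1;
delete from test where id = 1;
A Part 2 cycle is the same with generate_series(1, 188), i.e. about 3000
bytes. Run the cycle in a tight loop, varying id, until the conditions below
hold.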
For part 1 and part 2, the traffic must be high enough that all of the
following hold:
(1) the database log shows autovacuum of the table TEST.public.test running
every minute
(2) the database log shows autovacuum of the table's TOAST area running every
minute
(3) from the output of ANALYZE VERBOSE on test, the number of pages keeps
increasing, even with autovacuum running on the table
(4) similarly, when querying the size of the TOAST area, its size keeps
increasing, even with autovacuum running on the TOAST area every minute
(5) clustering the table test cannot reduce the table size or the TOAST area
size either
(6) at any time, the number of live rows is only a few while the number of
dead rows is high (example queries for checks (3)-(6) are sketched below)
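The checks above can be done with standard catalog queries; a sketch
(pg_size_pretty, pg_relation_size and pg_stat_user_tables all exist in 9.1):
-- locate the TOAST relation of the table test and report both sizes
select reltoastrelid::regclass as toast_table,
       pg_size_pretty(pg_relation_size('test')) as table_size,
       pg_size_pretty(pg_relation_size(reltoastrelid)) as toast_size
from pg_class
where relname = 'test' and relkind = 'r';
-- live vs. dead row counts for check (6)
select n_live_tup, n_dead_tup
from pg_stat_user_tables
where relname = 'test';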
Testing procedures:
============
(1) use psql to connect to TEST and frequently analyze the table so as to
monitor its size. Run SQL to check the size of the table's TOAST area as well
(for example, the queries sketched above).
(2) tail the database log and check that the automatic vacuum of the table and
its TOAST area happens every minute
(3) Wait until the size of the table is at least 400 pages and the TOAST area
is at least 20MB.
(4) wait for the database log to emit the line
automatic vacuum of table "TEST.pg_toast.pg_toast_xxxx": index scans: ...
(pg_toast_xxxx is the TOAST area of the table test), then IMMEDIATELY issue
the CLUSTER command in the psql session:
cluster test using test_pk;
Testing results:
=========
(1) If the timing is right, I occasionally get the ERROR (about once in ten
attempts):
ERROR: missing chunk number 0 for toast value 13113341 in pg_toast_13084520
(2) Once I get the ERROR, subsequent CLUSTER commands fail with the same ERROR
until the next autovacuum of the TOAST area completes.
(3) I tested PostgreSQL 9.1.7 and 9.1.3 and can reproduce the ERROR in both,
so this is not a 9.1.7-specific bug.
Thanks a lot for your help.
Pius
________________________________________
From: Tom Lane [[email protected]]
Sent: Tuesday, January 22, 2013 8:14 PM
To: Pius Chan
Cc: [email protected]
Subject: Re: [BUGS] BUG #7819: missing chunk number 0 for toast value 1235919
in pg_toast_35328
Pius Chan <[email protected]> writes:
> Hi Tom,
> Yes, we start seeing this ERROR after upgrade to 9.1.7. The ERROR is from the
> "cluster jms_messages" command. Last night, the ERROR happened three times:
> (1) at 00:20:01
> ERROR: missing chunk number 0 for toast value 1239124 in pg_toast_35328
> (2) at 00:25:01
> ERROR: missing chunk number 0 for toast value 1239124 in pg_toast_35328
> (3) at 00:35:01
> ERROR: missing chunk number 0 for toast value 1241022 in pg_toast_35328
> The "cluster jms_messages" runs every 5 minutes. However, so far, it seems
> that the ERROR happens at about mid-night and 35328 is the toast area of the
> "jms_message" table:
So what else is this application doing around midnight? In particular,
it seems like something must've happened between 00:15 and 00:20 to
create the problem with OID 1239124, and then something else happened
between 00:25 and 00:30 to get rid of it. And then a similar sequence
happened between 00:30 and 00:40 involving OID 1241022. Most likely the
trigger events are application actions against jms_messages, but we have
no info about what. (It's also possible that autovacuums of
jms_messages are involved, so you might want to crank up your logging
enough so you can see those in the postmaster log.)
I had first thought that this might have something to do with the
toast-value-processing changes we made in CLUSTER a year or so ago,
but a look in the commit logs confirms that those patches were in 9.1.3.
So I have no idea what might've broken between 9.1.3 and 9.1.7. We
need data, or even better an actual test case.
regards, tom lane