Hi, I too have been experiencing this with a busy PostgreSQL instance.
I have been following the updates to the 9.4 branch hoping a fix will appear, but sadly no luck yet. I have manually reproduced the issue on 9.4.4, 9.4.10 and 9.4.12.

My reproduction steps are:

BEGIN;

CREATE TABLE x (
    id       BIGSERIAL PRIMARY KEY,
    payload1 VARCHAR,
    payload2 VARCHAR,
    payload3 VARCHAR,
    payload4 BIGINT,
    payload5 BIGINT
);

-- Repeat until 2,000,000 rows are inserted, using random values of
-- varying length/size to force random TOAST usage:
INSERT INTO x (id, payload1, payload2, payload3, payload4, payload5)
VALUES (...);

COMMIT;

VACUUM (ANALYZE, FULL);

BEGIN;

-- Repeat until all 2,000,000 rows are updated, again with random
-- values of varying length/size to force random TOAST usage:
UPDATE x SET payload1 = ..., payload2 = ..., payload3 = ...,
             payload4 = ..., payload5 = ...;

COMMIT;

VACUUM (ANALYZE, FULL);

The second VACUUM raises an ERROR identical to the one you report below ("unexpected chunk number n (expected n) for toast value ..."). However, it may take up to ten attempts to reproduce it.
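For concreteness, the insert step I run looks roughly like the following. This is only a sketch: my real loader is an external script that inserts in smaller batches, and the exact value distributions differ, but the idea is text values of wildly varying length so that only some rows are pushed out to TOAST:

INSERT INTO x (id, payload1, payload2, payload3, payload4, payload5)
SELECT gs,
       -- illustrative generators: roughly 32 to 9,600 characters,
       -- straddling the ~2 kB TOAST threshold
       repeat(md5(random()::text), 1 + (random() * 300)::int),
       repeat(md5(random()::text), 1 + (random() * 300)::int),
       repeat(md5(random()::text), 1 + (random() * 300)::int),
       (random() * 1000000000)::bigint,
       (random() * 1000000000)::bigint
FROM generate_series(1, 2000000) AS gs;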
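The update pass is the same idea, again collapsed into a single statement here, whereas in practice I update in id-range batches inside the transaction:

UPDATE x
SET payload1 = repeat(md5(random()::text), 1 + (random() * 300)::int),
    payload2 = repeat(md5(random()::text), 1 + (random() * 300)::int),
    payload3 = repeat(md5(random()::text), 1 + (random() * 300)::int),
    payload4 = (random() * 1000000000)::bigint,
    payload5 = (random() * 1000000000)::bigint;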
Out of interest, are you using any tablespaces other than pg_default? I can only reproduce the issue when using separately mounted tablespaces.

I have been investigating this quite extensively, and everything I can find on the web suggests data corruption. However, running the following DO block reports no errors, and I can dump the database without issue:

DO $$
DECLARE
    curid INT := 0;
    vcontent RECORD;
    badid BIGINT;
    var1_sub VARCHAR;
    var2_sub VARCHAR;
    var3_sub VARCHAR;
    var4_sub VARCHAR;
    var5_sub VARCHAR;
BEGIN
    FOR badid IN SELECT id FROM x LOOP
        curid := curid + 1;
        IF curid % 100000 = 0 THEN
            RAISE NOTICE '% rows inspected', curid;
        END IF;
        BEGIN
            SELECT * INTO vcontent FROM x WHERE id = badid;
            -- SUBSTR forces each value to be fully detoasted
            var1_sub := SUBSTR(vcontent.payload1, 2000, 5000);
            var2_sub := SUBSTR(vcontent.payload2, 2000, 5000);
            var3_sub := SUBSTR(vcontent.payload3, 2000, 5000);
            var4_sub := SUBSTR(vcontent.payload4::VARCHAR, 2000, 5000);
            var5_sub := SUBSTR(vcontent.payload5::VARCHAR, 2000, 5000);
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'Data for id % is corrupt', badid;
            CONTINUE;
        END;
    END LOOP;
END;
$$;

I also found the following report of what looks like the same problem:

https://www.postgresql.org/message-id/20161201165505.4360.28...@wrigleys.postgresql.org

Best wishes,

Harry

> On 7 Jun 2017, at 15:22, Achilleas Mantzios <ach...@matrix.gatewaynet.com> wrote:
>
> On 07/06/2017 16:33, ADSJ (Adam Sjøgren) wrote:
>> Our database has started reporting errors like this:
>>
>>   2017-05-31 13:48:10 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 14242189 in pg_toast_10919630
>>   ...
>>   2017-06-01 11:06:56 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 19573520 in pg_toast_10919630
>>
>> (157 times, for different toast values, same pg_toast_nnn). pg_toast_10919630 corresponds to a table with around 168 million rows.
>>
>> These went away, but the next day we got similar errors from another table:
>>
>>   2017-06-02 05:59:50 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 47060150 in pg_toast_10920100
>>   ...
>>   2017-06-02 06:14:54 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 47226455 in pg_toast_10920100
>>
>> (Only 4 this time.) pg_toast_10920100 corresponds to a table which holds around 320 million rows (these are our two large tables).
>>
>> The next day we got 6 such errors and the day after 10 such errors. On June 5th we got 94, yesterday we got 111, of which one looked a little different:
>>
>>   2017-06-06 17:32:21 CEST ERROR: unexpected chunk size 1996 (expected 1585) in final chunk 0 for toast value 114925100 in pg_toast_10920100
>>
>> and today the logs have 65 lines, ending with these:
>>
>>   2017-06-07 14:49:53 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 131114834 in pg_toast_10920100
>>   2017-06-07 14:53:41 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 131149566 in pg_toast_10920100
>
> First try to find which tables those toast relations refer to:
>
>   select 10919630::regclass, 10920100::regclass;
>
> Are those critical tables? Can you restore them somehow?
>
> Also you may consider:
>
>   REINDEX TABLE pg_toast.pg_toast_10920100;
>   REINDEX TABLE pg_toast.pg_toast_10919630;
>   REINDEX TABLE <name of table 10920100>;
>   REINDEX TABLE <name of table 10919630>;
>
> Also VACUUM the above tables.
>
> You might want to write a function which iterates over the damaged table's rows in order to identify the damaged row(s), and then do some good update to create a new version.
>
>> The database is 10 TB on disk (SSDs) and runs on a 48-core server with 3 TB RAM on Ubuntu 14.04 (Linux 3.18.13).
>>
>> We are updating rows in the database a lot/continuously.
>>
>> There are no apparent indications of hardware errors (like ECC) in dmesg, nor any error messages logged by the LSI MegaRAID controller, as far as I can tell.
>>
>> We are running PostgreSQL 9.3.14 currently.
>>
>> The only thing I could see in the release notes since 9.3.14 that might be related is this:
>>
>>   "* Avoid very-low-probability data corruption due to testing tuple visibility without holding buffer lock (Thomas Munro, Peter Geoghegan, Tom Lane)"
>>
>> Although reading more about it, it doesn't sound like it would exhibit the symptoms we see?
>>
>> We have recently increased the load (to around twice the number of cores), though, which made me think we could be triggering corner cases we haven't hit before.
>>
>> We will be upgrading to PostgreSQL 9.3.17 during the weekend, but I'd like to hear if anyone has seen something like this, or has some ideas of how to investigate/what the cause might be.
>>
>> Best regards,
>>
>>   Adam
>
> --
> Achilleas Mantzios
> IT DEV Lead
> IT DEPT
> Dynacom Tankers Mgmt