On Tue, Nov 19, 2019 at 04:49:10PM -0600, Justin Pryzby wrote:
> On Wed, Nov 20, 2019 at 09:26:53AM +1300, Thomas Munro wrote:
> > Perhaps we should not panic if we failed to open (not fsync) the file,
> > but it's not the root problem here which is that somehow we thought we
> > should be fsyncing a file that had apparently been removed already
> > (due to CLUSTER, VACUUM FULL, DROP, rewriting ALTER etc).
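For what it's worth, the behaviour being suggested there, as I understand it (skip the file only when the open() itself fails with ENOENT, but keep treating a failed fsync() of a file we did manage to open as fatal), would look roughly like the sketch below.  This is only an illustration of the idea, not the actual md.c/sync.c code, and the function name is made up:

/* Sketch only, not the real checkpointer code: skip files that are
 * already gone (DROP / CLUSTER / rewriting ALTER), but keep a failed
 * fsync() of a file we could open fatal, since the kernel may have
 * dropped the dirty pages after reporting the error. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void
sync_file_if_present(const char *path)
{
	int		fd = open(path, O_RDWR);

	if (fd < 0)
	{
		if (errno == ENOENT)
		{
			/* Already unlinked: arguably nothing left to sync. */
			fprintf(stderr, "skipping fsync of missing file \"%s\"\n", path);
			return;
		}
		fprintf(stderr, "could not open \"%s\": %s\n", path, strerror(errno));
		abort();		/* stand-in for PANIC */
	}

	if (fsync(fd) != 0)
	{
		fprintf(stderr, "could not fsync file \"%s\": %s\n", path, strerror(errno));
		abort();		/* stand-in for PANIC */
	}

	(void) close(fd);
}

int
main(int argc, char **argv)
{
	if (argc > 1)
		sync_file_if_present(argv[1]);
	return 0;
}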
Note, the ALTER was (I think) building an index in a parallel process:
2019-11-15 22:16:08.752-05 |  5976 | terminating connection because of crash of another server process
2019-11-15 22:16:08.751-05 | 27384 | checkpointer process (PID 27388) was terminated by signal 6: Aborted
2019-11-15 22:16:08.751-05 | 27384 | terminating any other active server processes
2019-11-15 22:16:07.098-05 | 27388 | could not fsync file "base/16491/1731839470.2": No such file or directory
2019-11-15 22:15:59.592-05 | 19860 | duration: 220283.831 ms  statement: ALTER TABLE child.eric_enodeb_cell_201811 ALTER pmradiothpvolulscell TYPE integer USING pmradiothpvolulscell::integer
2019-11-15 22:15:59.459-05 | 19860 | temporary file: path "base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/1.0", size 5144576
2019-11-15 22:15:59.458-05 | 19860 | temporary file: path "base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/2.0", size 6463488
2019-11-15 22:15:59.456-05 | 19860 | temporary file: path "base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/0.0", size 4612096
FYI, that table is *currently* (5 days later):

ts=# \dti+ child.eric_enodeb_cell_201811*
 Schema |               Name               | Type  |   Owner   |          Table           |  Size   | Description
--------+----------------------------------+-------+-----------+--------------------------+---------+-------------
 child  | eric_enodeb_cell_201811          | table | telsasoft |                          | 2595 MB |
 child  | eric_enodeb_cell_201811_idx      | index | telsasoft | eric_enodeb_cell_201811  | 120 kB  |
 child  | eric_enodeb_cell_201811_site_idx | index | telsasoft | eric_enodeb_cell_201811  | 16 MB   |
I don't know whether that table is likely to be the one with relfilenode 1731839470
(though it certainly wasn't its index), or whether that relfilenode belonged to a
table (or index) from an earlier ALTER.  I tentatively think we wouldn't have had
any other tables being dropped, partitions being pruned, or maintenance commands
running at the time.
Checkpoint logs for good measure:
2019-11-15 22:18:26.168-05 | 10388 | checkpoint complete: wrote 2915 buffers (3.0%); 0 WAL file(s) added, 0 removed, 18 recycled; write=30.022 s, sync=0.472 s, total=32.140 s; sync files=107, longest=0.364 s, average=0.004 s; distance=297471 kB, estimate=297471 kB
2019-11-15 22:17:54.028-05 | 10388 | checkpoint starting: time
2019-11-15 22:16:53.753-05 | 10104 | checkpoint complete: wrote 98275 buffers (100.0%); 0 WAL file(s) added, 0 removed, 43 recycled; write=11.040 s, sync=0.675 s, total=11.833 s; sync files=84, longest=0.335 s, average=0.008 s; distance=698932 kB, estimate=698932 kB
2019-11-15 22:16:41.921-05 | 10104 | checkpoint starting: end-of-recovery immediate
2019-11-15 22:16:08.751-05 | 27384 | checkpointer process (PID 27388) was terminated by signal 6: Aborted
2019-11-15 22:15:33.03-05  | 27388 | checkpoint starting: time
2019-11-15 22:15:03.62-05  | 27388 | checkpoint complete: wrote 5436 buffers (5.5%); 0 WAL file(s) added, 0 removed, 45 recycled; write=28.938 s, sync=0.355 s, total=29.711 s; sync files=22, longest=0.174 s, average=0.016 s; distance=740237 kB, estimate=740237 kB
2019-11-15 22:14:33.908-05 | 27388 | checkpoint starting: time
I was trying to reproduce what was happening:
set -x
psql postgres -txc "DROP TABLE IF EXISTS t" \
	-c "CREATE TABLE t(i int unique); INSERT INTO t SELECT generate_series(1,999999)"
echo "begin;SELECT pg_export_snapshot(); SELECT pg_sleep(9)" |psql postgres -At >/tmp/snapshot &
sleep 3
snap=`sed "1{/BEGIN/d}; q" /tmp/snapshot`
PGOPTIONS='-cclient_min_messages=debug' psql postgres -txc "ALTER TABLE t ALTER i TYPE bigint" -c CHECKPOINT
pg_dump -d postgres -t t --snap="$snap" |head -44
Under v12, with or without the CHECKPOINT command, it fails with:

|pg_dump: error: query failed: ERROR: cache lookup failed for index 0

But under v9.5.2 (the version I happened to find quickly), without the CHECKPOINT it instead fails like:

|pg_dump: [archiver (db)] query failed: ERROR: cache lookup failed for index 16391

With the CHECKPOINT command, 9.5.2 works, but I don't see why the CHECKPOINT should be needed, why the two versions behave differently, or whether any of this is related to the crash.