On Tue, Nov 19, 2019 at 04:49:10PM -0600, Justin Pryzby wrote:
> On Wed, Nov 20, 2019 at 09:26:53AM +1300, Thomas Munro wrote:
> > Perhaps we should not panic if we failed to open (not fsync) the file,
> > but it's not the root problem here which is that somehow we thought we
> > should be fsyncing a file that had apparently been removed already
> > (due to CLUSTER, VACUUM FULL, DROP, rewriting ALTER etc).
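For what it's worth, the behaviour being suggested there, as I understand it (skip the file only when the open() itself fails with ENOENT, but keep treating a failed fsync() of a file we did manage to open as fatal), would look roughly like the sketch below.  This is only an illustration of the idea, not the actual md.c/sync.c code, and the function name is made up:

/* Sketch only, not the real checkpointer code: skip files that are
 * already gone (DROP / CLUSTER / rewriting ALTER), but keep a failed
 * fsync() of a file we could open fatal, since the kernel may have
 * dropped the dirty pages after reporting the error. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void
sync_file_if_present(const char *path)
{
	int		fd = open(path, O_RDWR);

	if (fd < 0)
	{
		if (errno == ENOENT)
		{
			/* Already unlinked: arguably nothing left to sync. */
			fprintf(stderr, "skipping fsync of missing file \"%s\"\n", path);
			return;
		}
		fprintf(stderr, "could not open \"%s\": %s\n", path, strerror(errno));
		abort();		/* stand-in for PANIC */
	}

	if (fsync(fd) != 0)
	{
		fprintf(stderr, "could not fsync file \"%s\": %s\n", path, strerror(errno));
		abort();		/* stand-in for PANIC */
	}

	(void) close(fd);
}

int
main(int argc, char **argv)
{
	if (argc > 1)
		sync_file_if_present(argv[1]);
	return 0;
}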
Note, the ALTER was (I think) building an index in a parallel process:
2019-11-15 22:16:08.752-05 |  5976 | terminating connection because of crash of another server process
2019-11-15 22:16:08.751-05 | 27384 | checkpointer process (PID 27388) was terminated by signal 6: Aborted
2019-11-15 22:16:08.751-05 | 27384 | terminating any other active server processes
2019-11-15 22:16:07.098-05 | 27388 | could not fsync file "base/16491/1731839470.2": No such file or directory
2019-11-15 22:15:59.592-05 | 19860 | duration: 220283.831 ms  statement: ALTER TABLE child.eric_enodeb_cell_201811 ALTER pmradiothpvolulscell TYPE integer USING pmradiothpvolulscell::integer
2019-11-15 22:15:59.459-05 | 19860 | temporary file: path "base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/1.0", size 5144576
2019-11-15 22:15:59.458-05 | 19860 | temporary file: path "base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/2.0", size 6463488
2019-11-15 22:15:59.456-05 | 19860 | temporary file: path "base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/0.0", size 4612096
FYI, that table is *currently* (5 days later):

ts=# \dti+ child.eric_enodeb_cell_201811*
 Schema |               Name               | Type  |   Owner   |          Table           |  Size   | Description
--------+----------------------------------+-------+-----------+--------------------------+---------+-------------
 child  | eric_enodeb_cell_201811          | table | telsasoft |                          | 2595 MB |
 child  | eric_enodeb_cell_201811_idx      | index | telsasoft | eric_enodeb_cell_201811  | 120 kB  |
 child  | eric_enodeb_cell_201811_site_idx | index | telsasoft | eric_enodeb_cell_201811  | 16 MB   |
I don't know whether that table is likely to be the one with relfilenode 1731839470
(though it certainly wasn't its index), or whether that relfilenode belonged to a
table (or index) from an earlier ALTER.  I tentatively think we wouldn't have had
any other tables being dropped, partitions being pruned, or maintenance commands
running at the time.
Checkpoint logs for good measure:
2019-11-15 22:18:26.168-05 | 10388 | checkpoint complete: wrote 2915 buffers (3.0%); 0 WAL file(s) added, 0 removed, 18 recycled; write=30.022 s, sync=0.472 s, total=32.140 s; sync files=107, longest=0.364 s, average=0.004 s; distance=297471 kB, estimate=297471 kB
2019-11-15 22:17:54.028-05 | 10388 | checkpoint starting: time
2019-11-15 22:16:53.753-05 | 10104 | checkpoint complete: wrote 98275 buffers (100.0%); 0 WAL file(s) added, 0 removed, 43 recycled; write=11.040 s, sync=0.675 s, total=11.833 s; sync files=84, longest=0.335 s, average=0.008 s; distance=698932 kB, estimate=698932 kB
2019-11-15 22:16:41.921-05 | 10104 | checkpoint starting: end-of-recovery immediate
2019-11-15 22:16:08.751-05 | 27384 | checkpointer process (PID 27388) was terminated by signal 6: Aborted
2019-11-15 22:15:33.03-05  | 27388 | checkpoint starting: time
2019-11-15 22:15:03.62-05  | 27388 | checkpoint complete: wrote 5436 buffers (5.5%); 0 WAL file(s) added, 0 removed, 45 recycled; write=28.938 s, sync=0.355 s, total=29.711 s; sync files=22, longest=0.174 s, average=0.016 s; distance=740237 kB, estimate=740237 kB
2019-11-15 22:14:33.908-05 | 27388 | checkpoint starting: time
I was trying to reproduce what was happening:
set -x
psql postgres -txc "DROP TABLE IF EXISTS t" \
	-c "CREATE TABLE t(i int unique); INSERT INTO t SELECT generate_series(1,999999)"
echo "begin;SELECT pg_export_snapshot(); SELECT pg_sleep(9)" |psql postgres -At >/tmp/snapshot &
sleep 3
snap=`sed "1{/BEGIN/d}; q" /tmp/snapshot`
PGOPTIONS='-cclient_min_messages=debug' psql postgres -txc "ALTER TABLE t ALTER i TYPE bigint" -c CHECKPOINT
pg_dump -d postgres -t t --snap="$snap" |head -44
Under v12, with or without the CHECKPOINT command, it fails with:

|pg_dump: error: query failed: ERROR: cache lookup failed for index 0

But under v9.5.2 (the version I happened to find quickly), without the CHECKPOINT it instead fails like:

|pg_dump: [archiver (db)] query failed: ERROR: cache lookup failed for index 16391

With the CHECKPOINT command, 9.5.2 works, but I don't see why the CHECKPOINT should be needed, why the two versions behave differently, or whether any of this is related to the crash.