I'm a little late getting back to this but still have no solution.
I 100% agree that updating postgres is best. The usual excuses for not updating 
apply. It will be done when it is allowed to be done. It remains 11.4 for now. 
I read through the changelogs up through 11.18 and didn't see anything 
obviously related. The underlying cause could be pretty involved and something 
I wouldn't recognize.

Thank you, Laurenz Albe, for reminding me of an important clue. I had 
inadvertently executed a VACUUM FREEZE on a table that pg_stat_activity 
indicated was currently being autovacuumed. The manual vacuum succeeded while 
the previous autovacuum was still showing as active in pg_stat_activity. 
A manual vacuum always completes and never stalls, but it also often does not 
appear in pg_stat_progress_vacuum unless it runs for a while.
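
For reference, this is roughly how I've been cross-checking the workers in
pg_stat_activity against pg_stat_progress_vacuum - just a sketch, with
whatever columns happen to be useful:

    SELECT a.pid,
           a.state,
           a.backend_start,
           a.query,
           p.relid::regclass AS vacuumed_table,
           p.phase
    FROM pg_stat_activity a
    LEFT JOIN pg_stat_progress_vacuum p ON p.pid = a.pid
    WHERE a.backend_type = 'autovacuum worker'
       OR a.query ILIKE 'vacuum%';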

It appears the process completes the vacuum but does not register that fact. 
The relfrozenxid of the main table is what would be expected, but the 
associated TOAST table was still very old. After cancelling all pending 
vacuums of the table, a manual VACUUM completes in a few seconds, and both 
the main relation and its TOAST table are updated as expected, with the last 
vacuum time updated in pg_stat_all_tables. pg_stat_progress_vacuum never had 
any entry. Autovacuum and autovacuum analyze both get hung.
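
This is roughly the query I use to compare the main relation against its
TOAST table (the table name is just a placeholder):

    SELECT c.oid::regclass     AS main_table,
           age(c.relfrozenxid) AS main_xid_age,
           t.oid::regclass     AS toast_table,
           age(t.relfrozenxid) AS toast_xid_age
    FROM pg_class c
    LEFT JOIN pg_class t ON t.oid = c.reltoastrelid
    WHERE c.relname = 'some_table';  -- placeholder name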

I often see the same table listed multiple times in pg_stat_activity with 
different pids and:
state: active
backend_type: autovacuum worker
wait_event_type: blank
state_change: 10-15 seconds after backend_start - about how long it takes to 
manually run vacuum on most tables.
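
The duplicates show up with a query along these lines (a sketch, not the
exact statement I run):

    SELECT pid,
           query,
           state,
           wait_event_type,
           wait_event,
           backend_start,
           state_change
    FROM pg_stat_activity
    WHERE backend_type = 'autovacuum worker'
    ORDER BY query, backend_start;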

What exactly does autovacuum rely on in the stats file? I ran strace on a hung 
autovacuum process and saw repeated reads of /run/postgresql/db_16384.stat 
(on tmpfs). The file is 740 MB, which is about the same as on other similar 
installations I've reviewed. I'm lacking overall experience here, though.

One final oddity:
I ran a query for the oldest relfrozenxid and redirected the output to a file. 
The query took around a minute. A few seconds after it finished, I queried for 
'active' backends in pg_stat_activity and the oldest-relfrozenxid query was 
still listed. A few seconds later it had cleared.
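
For completeness, what I mean by "oldest relfrozenxid" is a query of roughly
this shape (the LIMIT is only for illustration):

    SELECT c.oid::regclass     AS relation,
           age(c.relfrozenxid) AS xid_age
    FROM pg_class c
    WHERE c.relkind IN ('r', 'm', 't')
    ORDER BY age(c.relfrozenxid) DESC
    LIMIT 50;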

Can a corrupted stats file prevent autovac from reading/writing? 


> I scripted a vacuum loop using the oldest table list. It's extremely slow
> but it was making better progress than autovacuum was.
>
> Using ps I see that there were as many worker processes as defined with
> autovacuum_max_workers but pg_stat_activity consistently showed 19. I
> killed the script thinking there might be a conflict. I saw no difference
> after 30 minutes so restarted the script.

I am not sure what exactly you are doing here, but you should know that
there can only be one VACUUM process per table.  If there is already an
anti-wraparound autovacuum running on the table, a manual VACUUM will
simply be blocked until the autovacuum worker is done.

> Never saw anything in pg_stat_progress_vacuum.

Now that would be weird, except if VACUUM cannot get the required lock on the 
table.
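
(Noting for my own follow-up: a query like this should show whether a vacuum
backend is waiting on a lock and which pids are blocking it;
pg_blocking_pids() has been available since 9.6.)

    SELECT a.pid,
           a.query,
           a.wait_event_type,
           a.wait_event,
           pg_blocking_pids(a.pid) AS blocked_by
    FROM pg_stat_activity a
    WHERE a.backend_type = 'autovacuum worker'
       OR a.query ILIKE 'vacuum%';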

> vacuum settings:
>                 name                 |  setting
> -------------------------------------+-----------
>  autovacuum                          | on
>  autovacuum_freeze_max_age           | 200000000
>  autovacuum_max_workers              | 40
>  autovacuum_naptime                  | 4
>  autovacuum_vacuum_cost_delay        | 0
>  autovacuum_vacuum_cost_limit        | 5000
>  autovacuum_work_mem                 | -1
>  vacuum_freeze_min_age               | 50000000
>  vacuum_freeze_table_age             | 150000000
>
> I'm now thinking that autovacuum getting hung up is what caused the issue
> to begin with. I see nothing but the successful vacuums from the script
> and my own fat-fingered commands in the postgres logs (set at info).

Sorry about the reply formatting. I tried using Outlook Web in Edge. Maybe
that was a mistake.
Thanks,
Senor
