We're tracking down an issue that we've seen in two separate installations so 
far: at the very end of a vacuum, the vacuum operation starts using *very* 
high levels of CPU and (sometimes) I/O, often to the point that the system 
becomes unable to service other requests.  We've seen this on versions 15, 16, 
and 17 so far.

The common data points are:

1. The table being vacuumed is large (>250 million rows, often more than 10 
billion rows).
2. The table has a relatively high churn rate.
3. The number of rows updated or deleted before that particular vacuum cycle is 
very high.

Everything seems to point to the vacuum free space map operation, since it 
would have a lot of work to do in that particular situation, it happens at just 
the right place in the vacuum cycle, and its resource consumption is not 
throttled the way the regular vacuum operation is.

Assuming this analysis is correct, our current proposal as a temporary fix is 
to increase the frequency of autovacuum on those tables, so that the free space 
map vacuum operation has less to do and is less likely to overwhelm the system.  
In the longer term, is it worth considering implementing a cost delay inside of 
the free space map update operations?
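
To give a rough sense of what "more frequent" means at these table sizes, here 
is a small sketch of the documented autovacuum trigger formula (base threshold 
plus scale factor times the number of tuples).  The 50 and 0.2 are the stock 
defaults; the 0.01 scale factor is only an illustrative value, not something we 
have tested, and the function name is made up for this example.  On the real 
system the per-table value would be set with something like ALTER TABLE ... SET 
(autovacuum_vacuum_scale_factor = ...).

```python
# Sketch of the autovacuum trigger point, per the documented formula:
#   dead tuples needed = autovacuum_vacuum_threshold
#                        + autovacuum_vacuum_scale_factor * reltuples
# Defaults (50 and 0.2) are the stock settings; 0.01 below is an
# illustrative value only, not a tested recommendation.

def autovacuum_trigger(reltuples, scale_factor=0.2, base_threshold=50):
    """Return the dead-tuple count at which autovacuum will vacuum the table."""
    return base_threshold + round(scale_factor * reltuples)


if __name__ == "__main__":
    for rows in (250_000_000, 10_000_000_000):
        default = autovacuum_trigger(rows)
        tuned = autovacuum_trigger(rows, scale_factor=0.01)
        print(f"{rows:>14,} rows: default trigger {default:,} dead tuples, "
              f"with scale factor 0.01: {tuned:,}")
```

With the defaults, a 10 billion row table accumulates roughly 2 billion dead 
tuples before autovacuum starts, which fits the third data point above.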
