Hi hackers,
Recently we found a table that was slowly but consistently increasing in
size. The table has a low fillfactor set and is updated very frequently. As
expected, almost all updates are HOT updates, but some of the non-HOT
updates would always use a new page rather than reuse an existing empty
page. This led to a steady growth in table size (and a steady growth in the
number of empty pages in the table).
I've managed to create a very simple example that reproduces the problem
(the original problem occurred on 12.4, but I've tested this example on the
latest master). It only occurs for updates where the new tuple is larger
than what "fillfactor" would normally allow. In real life only a very small
fraction of the updates to a table would be this large, of course, but in
this example every update is.
Create a table with a low fillfactor and insert one row into it. Note that,
in this case, the row we're inserting is by itself larger than the "max
fill factor space" (quick math below).
create table t1 (a int primary key, b text) with (fillfactor=10);
insert into t1 select 1, (select string_agg('1', '') from
generate_series(1,1000)); -- 1000-byte text field
vacuum t1;
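(Some quick math, assuming 8 kB pages: fillfactor=10 means inserts aim to
leave 90% of each page free, so only ~820 bytes per page are meant to hold
data, while this tuple takes ~1032 bytes on disk: a 24-byte tuple header, a
4-byte int, and ~1004 bytes of varlena text.)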
postgres=# select * from pg_freespace('t1');
blkno | avail
-------+-------
0 | 7104
(1 row)
This looks alright - there's 1 page, and the available space is indeed
roughly 1000 bytes less than the full page, because of our tuple and the
page header.
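(Rough math again: 8192 - 24 for the page header - 4 for the line pointer -
~1032 for the tuple leaves ~7132 bytes, which the FSM rounds down to its
32-byte granularity: 7104.)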
Now, in a different backend, start a long-running query; its snapshot will
keep the old tuple versions from being pruned while we update:
select pg_sleep(600); -- sleep 600 seconds, long enough to do some updates
Then, in the original backend, update the tuple 7 times.
-- execute this 7 times
update t1 set b=(select string_agg((random()*9)::int::text, '') from
generate_series(1,1000)) where a=1;
Cancel the pg_sleep call, then execute:
vacuum t1; -- removes the dead tuples and updates the FSM
postgres=# select * from pg_freespace('t1');
blkno | avail
-------+-------
0 | 8128
1 | 7104
(2 rows)
This still looks OK. There's an extra page, because a total of 8 tuple
versions needed to be kept alive for the pg_sleep query. These didn't fit
on one page, so a new page was created.
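(8 versions of a ~1032-byte tuple, each with its 4-byte line pointer, is
~8300 bytes - more than the ~8168 bytes a page offers after its 24-byte
header.)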
Now, repeat it (the pg_sleep, update 7 times, cancel the pg_sleep and vacuum).
postgres=# select * from pg_freespace('t1');
blkno | avail
-------+-------
0 | 8128
1 | 8128
2 | 7104
(3 rows)
This does not look good anymore. The tuple was on page 1, so at first there
were several HOT updates on page 1. Then, when page 1 was full, it needed
to find another page for the new tuple version. It did not consider the
now-empty page 0, but instead created a new page 2.
Repeating this process would create a new page each time, never reusing the
empty old pages.
The reason it does not consider page 0 is this piece of code in the
function RelationGetBufferForTuple in hio.c:
    /* Compute desired extra freespace due to fillfactor option */
    saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                   HEAP_DEFAULT_FILLFACTOR);
    ...
    if (len + saveFreeSpace > MaxHeapTupleSize)
    {
        /* can't fit, don't bother asking FSM */
        targetBlock = InvalidBlockNumber;
        use_fsm = false;
    }
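In our example: saveFreeSpace = 90% of 8192 = 7372 bytes, and len is ~1032,
so len + saveFreeSpace = ~8404, which exceeds MaxHeapTupleSize (~8160 on a
standard 8 kB build). So every non-HOT update of this tuple skips the FSM
and extends the relation instead.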
The problem here is twofold. First, for any non-HOT update of a tuple
that's larger than what fillfactor would allow, the FSM will not be used;
instead a new page will be chosen. This seems to rely on the false
assumption that every existing page has at least one tuple on it.
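One possible direction (just a sketch of the idea, not a tested patch; the
targetFreeSpace variable is something I made up) would be to cap the FSM
request instead of bypassing the FSM:

    Size        targetFreeSpace;

    if (len + saveFreeSpace > MaxHeapTupleSize)
    {
        /*
         * Tuple is too large to honor fillfactor anyway. Instead of
         * giving up on the FSM, ask it for the largest request a page
         * could ever satisfy.
         */
        targetFreeSpace = MaxHeapTupleSize;
    }
    else
        targetFreeSpace = len + saveFreeSpace;

    /* ... later pass targetFreeSpace to GetPageWithFreeSpace() ... */

On its own that wouldn't be enough though, because of the second problem
below.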
Second, and this is a bit trickier: even if we asked the FSM for a page
with "MaxHeapTupleSize" of free space, it wouldn't find anything. This is
because the FSM tracks free space excluding any unused line pointers. In
this example, if we look at block 0:
postgres=# select * from page_header(get_raw_page('t1', 0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/16D75A0 |        0 |     5 |    52 |  8192 |    8192 |     8192 |       4 |         0
(1 row)
postgres=# select * from heap_page_items(get_raw_page('t1', 0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+--------
  1 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  2 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  3 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  4 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  5 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  6 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  7 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
(7 rows)
There are 7 line pointers on this page, consuming 28 bytes. Together with
the 24-byte page header, that gives lower = 52. However, all line pointers
are unused, so the page really is empty. The FSM does not see the page as
empty though, as it only looks at "upper - lower".
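The free space the FSM records comes down to roughly this (paraphrased from
PageGetFreeSpace in src/backend/storage/page/bufpage.c - check the actual
source):

    /*
     * Free space is just the hole between pd_lower and pd_upper, minus
     * room for one more line pointer. Unused line pointers below
     * pd_lower are never counted back.
     */
    space = (int) ((PageHeader) page)->pd_upper -
            (int) ((PageHeader) page)->pd_lower;
    if (space < (int) sizeof(ItemIdData))
        return 0;
    return (Size) (space - sizeof(ItemIdData));

For block 0 that gives 8192 - 52 - 4 = 8136 bytes (recorded as 8128 after
rounding down to the FSM's 32-byte granularity), while MaxHeapTupleSize is
roughly 8160. So a page with a few leftover line pointers, like our 7 here,
can never satisfy a MaxHeapTupleSize request again.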
When asking the FSM for slightly less space (MaxHeapTupleSize - 50, for
example), it does find the free pages. I've confirmed that with such a hack
the table stops growing indefinitely. However, the number 50 is obviously
arbitrary, as the required slack depends on the number of unused line
pointers on a page, so that's not a proper way to fix things.
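For completeness, the hack was essentially a one-line change along these
lines (placement paraphrased from memory; the -50 is the arbitrary part):

    /* leave some arbitrary slack for the unused line pointers */
    targetBlock = GetPageWithFreeSpace(relation, MaxHeapTupleSize - 50);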
In any case, the behavior feels like a bug to me, but I don't know what the
best way would be to fix it. Thoughts?
-Floris