Hi hackers,

Recently we found a table that was slowly, but consistently, increasing in size. 
The table has a low fillfactor set and is updated very frequently. As expected, 
almost all updates are HOT updates, but the occasional non-HOT update always 
allocated a new page, rather than reusing an existing empty page. This led to a 
steady growth in table size (and a steady growth in the number of empty pages 
in the table).

I've managed to create a very simple example that reproduces the problem (the 
original problem occurred on 12.4, but I've tested this example on latest 
master). It only occurs for updates where the new tuple is larger than what the 
fillfactor would normally allow. In real life this would of course be only a 
very small portion of the updates to a table, but in this example every update 
will be that large.

Create a table with a low fillfactor and insert one row into it. Note that, in 
this case, the row we're inserting is by itself already larger than the portion 
of the page (10%) that the fillfactor allows to be filled.

create table t1 (a int primary key, b text) with (fillfactor=10);
insert into t1
  select 1, (select string_agg('1', '') from generate_series(1,1000)); -- 1000 byte text field
vacuum t1;

postgres=# select * from pg_freespace('t1');
blkno | avail
-------+-------
     0 |  7104
(1 row)

This looks alright - there's 1 page, and the available space is indeed roughly 
1000 bytes less than the 8192-byte page, accounting for our tuple and the page 
header.
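
For reference, the numbers add up as follows (assuming the default 8 kB block 
size; the FSM rounds everything down to 32-byte categories, i.e. BLCKSZ/256):

   8192 bytes page size
  -  24 bytes page header
  -   4 bytes line pointer
  -1032 bytes heap tuple (24-byte tuple header + 4-byte int + 1004-byte text varlena)
  =7132 bytes actually free, reported as 7104 after rounding down to a multiple of 32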

Now, in a different backend, start a long-running query.

select pg_sleep(600); -- sleep 600 seconds, giving us enough time to do some updates in the meantime
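
The sleep itself is not essential, by the way; any long-lived snapshot that 
prevents vacuum from removing the old tuple versions will do. An open 
repeatable read transaction, for example, works just as well (roll it back 
afterwards instead of cancelling the sleep):

begin transaction isolation level repeatable read;
select count(*) from t1; -- takes the snapshot; leave the transaction open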

Then, in the original backend, update the tuple 7 times.

-- execute this 7 times
update t1 set b = (select string_agg((random()*9)::int::text, '')
                   from generate_series(1,1000)) where a = 1;
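
For convenience, the seven updates can also be wrapped in a loop; this should 
behave the same for the purpose of this example, as each iteration still 
produces a new tuple version:

do $$
begin
    for i in 1..7 loop
        update t1 set b = (select string_agg((random()*9)::int::text, '')
                           from generate_series(1,1000)) where a = 1;
    end loop;
end $$;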

Cancel the pg_sleep call.
Then execute

vacuum t1; -- cleans rows and updates the fsm
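
Using vacuum (verbose) t1; here also shows how many dead row versions were 
removed, if you want to double-check that the old versions were reclaimable by 
now.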

postgres=# select * from pg_freespace('t1');
blkno | avail
-------+-------
     0 |  8128
     1 |  7104
(2 rows)

This still looks OK. There's an extra page, because a total of 8 tuple versions 
needed to be kept alive for the pg_sleep query. These didn't fit on one page, 
so a new page was created.
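
As a sanity check, pageinspect (also used further below) can show the row 
versions that accumulate before each vacuum; lp_flags 1 = normal, 2 = redirect, 
3 = dead, 0 = unused:

select lp, lp_flags, lp_len, t_ctid
from heap_page_items(get_raw_page('t1', 0));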

Now, repeat it (the pg_sleep, update 7 times, cancel the pg_sleep and vacuum).

postgres=# select * from pg_freespace('t1');
blkno | avail
-------+-------
     0 |  8128
     1 |  8128
     2 |  7104
(3 rows)

This does not look good anymore. The tuple was on page 1, so at first there 
were several HOT updates on page 1. Then, when page 1 was full, a new home for 
the tuple had to be found. Page 0 was not considered; instead a new page 2 was 
created.

Repeating this process would create a new page each time, never reusing the 
empty old pages.
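
A quick way to watch the growth between iterations:

select pg_relation_size('t1') / current_setting('block_size')::int as pages;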

The reason it does not consider page 0 is this piece of code in 
RelationGetBufferForTuple() in hio.c:

     /* Compute desired extra freespace due to fillfactor option */
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
...
     if (len + saveFreeSpace > MaxHeapTupleSize)
     {
           /* can't fit, don't bother asking FSM */
           targetBlock = InvalidBlockNumber;
           use_fsm = false;
     }
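
With the numbers from this example (8 kB block size): saveFreeSpace = 
BLCKSZ * (100 - fillfactor) / 100 = 7372 bytes, len is about 1032 bytes, and 
MaxHeapTupleSize = 8192 - 24 = 8168 bytes. So len + saveFreeSpace = 8404 > 
8168, and the FSM is never consulted for these updates.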

The problem here is twofold: first, for any non-HOT update of a tuple that's 
larger than what the fillfactor allows, the FSM is never consulted; a new page 
is chosen instead.
This seems to rely on the false assumption that every existing page has at 
least one tuple on it - a completely empty page could still hold the tuple, 
since len itself can never exceed MaxHeapTupleSize.
Secondly, and this is a bit trickier: even if we did ask the FSM to come up 
with a free page with MaxHeapTupleSize of free space, it wouldn't find 
anything. This is because the FSM tracks free space excluding any unused line 
pointers. In this example, if we look at block 0:

postgres=# select * from page_header(get_raw_page('t1', 0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/16D75A0 |        0 |     5 |    52 |  8192 |    8192 |     8192 |       4 |         0
(1 row)

postgres=# select * from heap_page_items(get_raw_page('t1', 0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+--------
  1 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  2 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  3 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  4 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  5 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  6 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  7 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
(7 rows)

There are 7 line pointers on this page, consuming 28 bytes. Together with the 
24-byte page header, that gives lower=52. However, all line pointers are 
unused, so the page really is empty. The FSM does not see the page as empty 
though, as it only looks at "upper - lower".
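
This is consistent with the pg_freespace output above: upper - lower = 8192 - 
52 = 8140 bytes, which the FSM rounds down to 8128 (its categories are 32 bytes 
apart). A request for MaxHeapTupleSize = 8168 bytes can therefore never be 
satisfied by this page, even though it is effectively empty.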

When asking the FSM for slightly less space (MaxHeapTupleSize - 50, for 
example), it does find the free pages. I've confirmed that with such a hack the 
table no longer grows indefinitely. However, the number 50 is obviously rather 
arbitrary, as the real overhead depends on the number of unused line pointers 
on a page, so that's not a proper way to fix things.
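
For what it's worth, the first half could perhaps be addressed by clamping the 
request instead of skipping the FSM entirely (just a sketch, untested beyond 
the hack above):

     /* Sketch: the tuple is known to fit on a completely empty page
      * (len <= MaxHeapTupleSize), so never ask the FSM for more than that. */
     targetFreeSpace = Min(len + saveFreeSpace, MaxHeapTupleSize);
     targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);

On its own that doesn't help here though, because of the second problem: as 
long as the FSM entry is derived from "upper - lower", an empty page with 
leftover line pointers still reports 8128 bytes, less than the 8168 requested.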

In any case, the behavior feels like a bug to me, but I don't know what the 
best way would be to fix it. Thoughts?

-Floris
