Hi again,
Thanks very much for the replies last week. We’ve been continuing to
investigate this problem, and I thought I’d share an update on where we are.
To recap: looking at our backup from 2025-06-26 via pageinspect, we
have btree index rows which point either to non-existent heap TIDs, or
to heap TIDs whose data does not correspond to the index row. In fact,
it looks like we have entire index pages which point only to
non-existent heap TIDs.
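For concreteness, this is roughly the kind of inspection we've been
doing (the page number and heap TID below are illustrative, and the
table name is inferred from the index name):

    -- pageinspect ships with contrib
    CREATE EXTENSION IF NOT EXISTS pageinspect;

    -- List the item pointers on one leaf page of the index; the ctid
    -- column is the heap TID each index tuple points at.
    -- (The 'dead' column is present on recent Postgres versions.)
    SELECT itemoffset, ctid, dead
    FROM bt_page_items('state_groups_state_type_idx', 4242);

    -- Check whether the heap actually holds a tuple at one of those
    -- TIDs; for the corrupted entries this returns zero rows.
    SELECT * FROM state_groups_state WHERE ctid = '(123456,7)';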
(I previously said that these index rows were marked as ‘dead’ in the
backup. We now suspect this is an artifact of the restore process: we
believe they are live in the backup, but were marked as dead during the
restore.)
Empirically, and surprisingly to us, when a SELECT traverses an index
entry that points to a non-existent TID, the entry is quietly ignored
rather than raising an error.
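A minimal sketch of what we mean, using a state group value from our
data (the SETs are only there to force a plain index scan with heap
fetches; column names are from our schema):

    -- Force the planner onto a plain index scan, so each index entry's
    -- heap TID is actually fetched from the heap.
    SET enable_seqscan = off;
    SET enable_indexonlyscan = off;

    -- Index entries whose heap TIDs do not resolve to a live tuple
    -- contribute no rows and raise no error; they are simply skipped.
    SELECT * FROM state_groups_state WHERE state_group = 353864583;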
We therefore suspect that this index corruption has been present for
some time (possibly years); more recently those non-existent heap TIDs
have been recycled, and that is when the effects of the problem became
visible.
As far as we can tell, the corruption affects only one index on one
table, and only a specific region of that index/table: it appears to
affect only rows which would have been inserted between 2018 and
January 2021. At least 1 billion rows appear to be affected (the table
as a whole has 29 billion rows).
One thing that surprised us is that `amcheck` didn't find any sign of
the corruption. We're not completely sure whether this is because we
are holding it wrong, or because this kind of damage is simply out of
scope for `amcheck`. Any advice on this, or suggestions for other
tooling we could use to check the consistency of our other indexes,
would be much appreciated.
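For reference, the in-database equivalent of the `pg_amcheck`
invocation quoted below is roughly this (sketched from memory). Our
reading of the docs is that `heapallindexed` only verifies the
heap-to-index direction, i.e. that every heap tuple has a matching
index entry, so extra index entries pointing at non-existent TIDs would
presumably not be caught; corrections welcome if we've misunderstood:

    CREATE EXTENSION IF NOT EXISTS amcheck;

    -- Structural invariants plus the heapallindexed pass; note the
    -- latter checks that every heap tuple appears in the index, not
    -- that every index entry points at a valid heap tuple.
    SELECT bt_index_check('state_groups_state_type_idx'::regclass,
                          heapallindexed => true);

    -- Stricter parent/child relationship checks (takes heavier locks).
    SELECT bt_index_parent_check('state_groups_state_type_idx'::regclass,
                                 heapallindexed => true);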
We’re still very interested in trying to understand the root cause of
the corruption, mostly to confirm that it’s not an ongoing problem.
Thanks Tom for the suggestion of
https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4934d3875.
We agree with your assessment that this is unlikely. For one thing, it
looks like that bug could only conceivably cause this corruption if it
affected an UPDATE query, and we're reasonably sure we never run
UPDATEs on that table. (The table is mostly append-only. We do
sometimes run cleanup/compression jobs which amount to large numbers of
interleaved DELETEs and INSERTs, but no UPDATEs.)
Back in 2021, we were running Postgres 10.11. We’ve taken a pass through
the release notes since then to see if we can find any likely-looking
bugs. We found the one that causes BRIN index corruption (this is not a
BRIN index), and the one that causes CREATE INDEX CONCURRENTLY to end
up with too *few* entries (ours has the opposite problem), but no
particularly likely candidate. Any other suggestions would be welcome
here.
At the moment, a historical hardware-level problem seems like it might
be the most likely culprit, though we are a bit mystified about how any
hardware failure could have caused such widespread damage to a single
index, whilst apparently leaving the rest of the database intact.
Any thoughts or suggestions are very much appreciated.
Thanks,
Erik
On 04/07/2025 15:59, Erik Johnston wrote:
> On Fri, 4 Jul 2025, 15:38 Ron Johnson, <ronljohnso...@gmail.com> wrote:
>> On Fri, Jul 4, 2025 at 9:49 AM Erik Johnston <er...@element.io> wrote:
>>> Hi, a quick update:
>>> - We have discovered that the corruption was present from before
>>>   the libicu update.
>>> - We ran `pg_amcheck --index state_groups_state_type_idx
>>>   --heapallindexed matrix`, which returned nothing.
>>> - We believe that means (and it matches what we see when sampling)
>>>   that the index has gained extra entries, i.e. that for a given
>>>   state group it returns all the relevant rows in the table /plus/
>>>   extra rows.
>>> We are also seeing old state groups starting to point at rows that
>>> have only just been inserted. For example, querying for 353864583
>>> on the primary returns that row plus four rows that were inserted
>>> today, but on the backup from last week an index-only scan for
>>> 353864583 returns only one row. This makes it feel like the
>>> corruption is ongoing? Nothing should have modified that state
>>> group in the interim (they are generally immutable).
>>> This naively feels like, when inserting a new row, we sometimes add
>>> it to the index twice: once pointing from the correct state group
>>> to the new row, and once from an old state group to the new row?
>> Are checksums enabled in the instance?
> Alas not.
> We've also now found that the index on the backup does in fact point
> to those ctids after all, but they are marked as dead. So at some
> point between then and when we inserted the new row at that ctid
> today, those entries were marked undead.
--
Copyright © 2025 Element - All rights reserved. The Element name, logo
and device are registered trademarks of New Vector Ltd. Registered number:
10873661. Registered in England and Wales. Registered address: 10 Queen
Street Place, London, United Kingdom, EC4R 1AG.
This message is intended
for the addressee only and may contain private and confidential information
or material which may be privileged. If this message has come to you in
error please delete it immediately and do not copy it or show it to any
other person.