Hello All,

While stress testing WARM, I encountered a data consistency problem. It
happens when an index is built concurrently. I initially thought it was a
WARM-induced bug, and it took me a significant amount of investigation to
finally conclude that it's a bug in master. In fact, we tested all the way
back to the 9.2 release and the bug exists in all releases.

In my test setup, I keep a UNIQUE index on an "aid" column and then have
many other "aid[1-4]" columns, on which indexes are built and dropped. The
script then updates a randomly selected row using any of the aid[1-4]
columns. These columns are themselves updated during the tests to force
WARM updates, but within a narrow range of non-overlapping values, so we
can always find the unique row using some kind of range condition. The
actual reproduction case is much simpler; I'll explain it below.

In the reproduction scenario (attached), it happens fairly quickly, maybe
after 3-10 rounds of DROP/CREATE. You may need to adjust the -j value if it
does not happen for you even after a few rounds. Just for the record, I
haven't checked past reports to see whether the bug explains any of the
index corruption issues we might have received.

To give a short summary of how CIC works with HOT, it consists of three
phases:
1. Create the index entry in the catalog, mark it not ready for inserts,
commit the transaction, start a new transaction and then wait for all
potential lock holders on the table to end.
2. Take an MVCC snapshot, build the index, mark the index ready for
inserts, commit and then once again wait for potential lock holders on the
table to end.
3. Do a validation phase where we look at MVCC-visible tuples and add
missing entries to the index. We only look at the root line pointer of each
tuple while deciding whether a particular tuple already exists in the index
or not. IOW, we look at HOT chains and not at tuples themselves.
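To make the root-line-pointer premise above concrete, here is a toy model
(illustrative only, not actual PostgreSQL code; the names Tuple and
fetch_via_index are mine): an index entry references only the root of a
HOT chain, and a scan walks the chain to the visible version. This only
stays correct because a HOT update is allowed only when no indexed column
changes, so every version in the chain carries the same indexed values.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a HOT chain: each heap tuple may point to its
 * HOT-updated successor.  The index stores one entry per chain,
 * referencing the root, so a scan must follow the chain to reach
 * the currently visible tuple. */
typedef struct Tuple
{
    int aid1;               /* indexed column value */
    int visible;            /* 1 if this version is visible */
    struct Tuple *next;     /* HOT successor, or NULL at chain end */
} Tuple;

/* Simulate an index fetch: start at the root line pointer and
 * walk the chain until a visible version is found. */
static Tuple *
fetch_via_index(Tuple *root)
{
    Tuple *t = root;
    while (t != NULL && !t->visible)
        t = t->next;
    return t;
}
```

Because the single index entry serves the whole chain, a chain that
illegally changes an indexed column (a "broken" HOT chain) silently makes
the index wrong, which is exactly the failure described below.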

The interesting case is phases 1 and 2. Phase 2 works on the premise that
every transaction started after we finish waiting for all existing
transactions will definitely see the new index. All these new transactions
then look at the columns used by the new index and consider them while
deciding the HOT-safety of updates. Now, for reasons I don't fully
understand, there exists a path where a transaction starts after CIC has
waited and taken its snapshot at the start of the second phase, but still
fails to see the new index. Or maybe it sees the index, but fails to update
its rd_indexattr list to take the columns of the new index into account.
That leads to wrong decisions and we end up with a broken HOT chain,
something the CIC algorithm is not prepared to handle. This in turn leaves
the new index missing entries for the updated tuples, i.e. the index may
have aid1 = 10 while the value in the heap is aid1 = 11.
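A minimal sketch of how a stale indexed-attribute set leads to the wrong
HOT-safety decision (again a toy model; hot_safe and the bitmasks are my
own names, not the actual heap_update code):

```c
#include <assert.h>

/* Illustrative bitmask of columns; bit i set means column i is
 * covered by some index (the role rd_indexattr plays). */
typedef unsigned int AttrSet;

/* An update may be done as HOT only if no indexed column changed. */
static int
hot_safe(AttrSet indexed_cols, AttrSet changed_cols)
{
    return (indexed_cols & changed_cols) == 0;
}
```

With a stale attribute set that lacks the newly indexed aid1 column, an
update changing aid1 is wrongly judged HOT-safe, producing a HOT chain
whose versions differ on an indexed column; with the fresh set the same
update correctly becomes a regular (non-HOT) update.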

Based on my investigation so far and the evidence I've collected, what
probably happens is that when a relcache invalidation arrives at the new
transaction, it recomputes rd_indexattr based on the old, cached
rd_indexlist. So rd_indexattr does not include the new columns. Later
rd_indexlist is updated, but rd_indexattr remains what it was.

There is definite evidence of this sequence of events (collected after
sprinkling the code with elog messages), but it's not yet clear to me why
it affects only a small percentage of transactions. One possibility is
that it depends on at what stage the invalidation is received and processed
by the backend. I find it a bit confusing that even though rd_indexattr is
tightly coupled to rd_indexlist, we don't invalidate or recompute them
together. Shouldn't we do that? For example, in the attached patch we
invalidate rd_indexattr whenever we recompute rd_indexlist (i.e. if we see
that rd_indexvalid is 0). This patch fixes the problem for me (I've only
tried master), but I am not sure if it is the correct fix.
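The idea behind the patch can be sketched with a toy cache (not
relcache.c itself; RelCache, get_index_attrs and the catalog_* globals
are hypothetical stand-ins): whenever the index list is found invalid and
rebuilt, the derived attribute set is discarded too, so it can never
survive a recomputation of the list it was derived from.

```c
#include <assert.h>

/* Toy relcache entry: indexattr is derived from indexlist, so the
 * two must be kept in sync. */
typedef struct RelCache
{
    int indexlist_valid;
    unsigned int indexlist;    /* bitmask of indexes on the relation */
    int indexattr_valid;
    unsigned int indexattr;    /* bitmask of indexed columns */
} RelCache;

/* "True" state in the catalogs, read when a cache field is rebuilt. */
static unsigned int catalog_indexlist;
static unsigned int catalog_indexattr;

static unsigned int
get_index_attrs(RelCache *rel)
{
    if (!rel->indexlist_valid)
    {
        rel->indexlist = catalog_indexlist;
        rel->indexlist_valid = 1;
        /* The fix: the cached attribute set was derived from the old
         * index list, so force it to be recomputed as well. */
        rel->indexattr_valid = 0;
    }
    if (!rel->indexattr_valid)
    {
        rel->indexattr = catalog_indexattr;
        rel->indexattr_valid = 1;
    }
    return rel->indexattr;
}
```

Without the line clearing indexattr_valid, a backend that had already
rebuilt the attribute set would keep the stale value even after picking up
the new index list, which matches the sequence of events described above.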

Finally, here is the reproduction scenario. Run "cic_bug.sh" and observe
the output. The script runs a continuous UPDATE on the table while running
DROP INDEX and CREATE INDEX CONCURRENTLY on the aid1 column in a loop. As
soon as the index is built, we run a validation query such that rows are
fetched via an index-only scan (IOS) on the new index and compared against
the rows obtained via a SeqScan on the table. If things are OK, the rows
obtained via both methods should match, but every so often you'll see a row
or two differ. I must say that IOS itself is not the problem; I had to set
things up that way to ensure that the index access does not have to go to
the heap. Otherwise it would just return whatever is in the heap and things
would look fine. Maybe some index consistency check utility would be handy
for checking the corruption, but I think the query does the job for now.

Some additional server logs are attached too. I've filtered them to include
only three sessions: one (PID 73119) which is running CIC, a second (PID
73090) which processed the relcache invalidation correctly, and a third
(PID 73091) which got it wrong. Specifically, look at line #77 where the
snapshot is obtained for the second phase:

2017-01-29 10:44:31.238 UTC [73119] [xid=0]LOG:  CIC (pgb_a_aid1) building
first phase with snapshot (4670891:4670893)
At line 108, you'll see a new transaction with XID 4670908 still using the
old attribute list:

2017-01-29 10:44:31.889 UTC [73091] [xid=4670908]LOG:  heap_update
(cic_tab_accounts), hot_attrs ((b 9))

The other session, which I included for comparison, sees the new index and
recomputes its rd_indexattr in time, and hence its update works as
expected.


 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment: cic_bug.tar.gz
Description: GNU Zip compressed data

Attachment: relcache_issue_2.log
Description: Binary data

Attachment: invalidate_indexattr.patch
Description: Binary data

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)