On Sun, Mar 4, 2018 at 5:40 PM, Thomas Munro <[email protected]> wrote:
> Could shm_mq_detach_internal() need a pg_write_barrier() before it
> writes mq_detached = true, to make sure that anyone who observes that
> can also see the most recent increase of mq_bytes_written?
I can reproduce both failure modes (missing tuples and "lost contact") in the regression database with the attached Python script on my Mac. It takes a few minutes and seems to happen sooner when my machine is also doing other stuff (playing debugging music...). I can reproduce it at 34db06ef9a1d7f36391c64293bf1e0ce44a33915 "shm_mq: Reduce spinlock usage." but (at least so far) not at the preceding commit.

I can fix it with the following patch, which writes XXX out to the log where it would otherwise miss a final message sent just before detaching with sufficiently bad timing/memory ordering. This patch isn't my proposed fix, it's just a demonstration of what's busted. There could be a better way to structure things than this.

-- 
Thomas Munro
http://www.enterprisedb.com
fix.patch
Description: Binary data
import psycopg2

conn = psycopg2.connect("dbname=regression")
cursor = conn.cursor()

# Force a cheap parallel sequential scan plan with several workers.
cursor.execute("""
set enable_seqscan to on;
set enable_indexscan to off;
set enable_hashjoin to off;
set enable_mergejoin to off;
set enable_material to off;
set parallel_setup_cost=0;
set parallel_tuple_cost=0;
set min_parallel_table_scan_size=0;
set max_parallel_workers_per_gather=4;
alter table tenk2 set (parallel_workers = 0);
""")

# Run the query repeatedly; report any run that loses tuples.
for i in range(10000):
    cursor.execute("select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0")
    count, = cursor.fetchone()
    if count != 98000:
        print("count = %d, after %d tests" % (count, i))