Hi,

On 2018-08-11 01:55:43 +0200, Tomas Vondra wrote:
> On 08/10/2018 11:59 PM, Tomas Vondra wrote:
> >
> > ...
> >
> > I suspect there's some other ingredient, e.g. some manipulation with the
> > subscription. Or maybe it's not needed at all and I'm just imagining things.
> >
>
> Indeed, the manipulation with the subscription seems to be the key here.
> I pretty reliably get the 'could not read block' error when doing this:
>
> 1) start the insert pgbench
>
>   pgbench -n -c 4 -T 300 -p 5433 -f insert.sql test
>
> 2) start the vacuum full pgbench
>
>   pgbench -n -f vacuum.sql -T 300 -p 5433 test
>
> 3) try to create a subscription, but with small amount of conflicting
> data so that the sync fails like this:
>
>   LOG:  logical replication table synchronization worker for
>   subscription "s", table "t" has started
>   ERROR:  duplicate key value violates unique constraint "t_pkey"
>   DETAIL:  Key (a)=(5997542) already exists.
>   CONTEXT:  COPY t, line 1
>   LOG:  worker process: logical replication worker for subscription
>   16458 sync 16397 (PID 31983) exited with exit code 1
>
> 4) At this point the insert pgbench (at least some clients) should have
> failed with the error. If not, rinse and repeat.
>
> This kinda explains why I've been seeing the error only occasionally,
> because it only happened when I forgotten to clean the table on the
> subscriber while recreating the subscription.

I'll try to reproduce this. If you're also looking, I suspect a good
first hint would be to just change the ERROR into a PANIC and look at
the backtrace from the generated core file.

To the point that I wonder if we shouldn't just change the ERROR into a
PANIC on master (but not REL_11_STABLE), so the buildfarm gives us
feedback. I don't think the problem can fundamentally be related to
subscriptions, given the error occurs before any subscriptions are
created in the schedule.

Greetings,

Andres Freund