On Sun, Feb 4, 2018 at 10:47 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Fri, Feb 2, 2018 at 2:11 PM, amul sul <sula...@gmail.com> wrote:
>> On Fri, Jan 26, 2018 at 11:58 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> [....]
>>> I think you can manually (via debugger) hit this by using
>>> PUBLICATION/SUBSCRIPTION syntax for logical replication.  I think what
>>> you need to do is in node-1, create a partitioned table and subscribe
>>> it on node-2.  Now, perform an Update on node-1, then stop the logical
>>> replication worker before it calls heap_lock_tuple.  Now, in node-2,
>>> update the same row such that it moves the row.  Now, continue the
>>> logical replication worker.  I think it should hit your new code, if
>>> not then we need to think of some other way.
>>>
>>
>> I am able to hit the changed code using the above steps.  Thanks a lot
>> for the step-by-step guide, I really needed that.
>>
>> One strange behavior I found in the logical replication, which is
>> reproducible without the attached patch as well -- when I updated on
>> node-2 while keeping a breakpoint before the heap_lock_tuple call in the
>> replication worker, I could see a duplicate row inserted on node-2,
>> see this:
>>
> ..
>>
>> I am thinking of reporting this in a separate thread, but I am not sure
>> whether this is already known behaviour or not.
>>
>
> I think it is worth to discuss this behavior in a separate thread.
> However, if possible, try to reproduce it without partitioning and
> then report it.
>

Logical replication behavior for a normal table is as expected; this
happens only with a partitioned table.  I will start a new thread for
this on -hackers.
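For readers who want to follow along, the reproduction described above can
be sketched roughly as below.  All object names (t, t_p1, pub, sub, the
column values) are placeholders I made up for illustration, not taken from
the original report; on these PostgreSQL versions the publication has to
name the leaf partitions rather than the partitioned parent:

```sql
-- node-1 (publisher): partitioned table with two leaf partitions
CREATE TABLE t (a int, b text) PARTITION BY RANGE (a);
CREATE TABLE t_p1 PARTITION OF t FOR VALUES FROM (1) TO (100);
CREATE TABLE t_p2 PARTITION OF t FOR VALUES FROM (100) TO (200);
INSERT INTO t VALUES (10, 'x');
CREATE PUBLICATION pub FOR TABLE t_p1, t_p2;

-- node-2 (subscriber): same table definitions, then
CREATE SUBSCRIPTION sub
    CONNECTION 'host=node1 dbname=postgres'
    PUBLICATION pub;

-- Step 1, on node-1: an update that will be replicated to node-2
UPDATE t SET b = 'y' WHERE a = 10;

-- Step 2: attach a debugger to node-2's logical replication apply
-- worker and stop it before it calls heap_lock_tuple.

-- Step 3, on node-2: concurrently update the same row so that it
-- moves to the other partition
UPDATE t SET a = 150 WHERE a = 10;

-- Step 4: let the apply worker continue; it should now hit the
-- row-movement code path added by the patch.
```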
>>
>> Updated patch attached -- correct changes in execReplication.c.
>>
>
> Your changes look correct to me.
>
> I wonder what will be the behavior of this patch with
> wal_consistency_checking [1].  I think it will generate a failure as
> there is nothing in WAL to replay it.  Can you once try it?  If we see
> a failure with wal consistency checker, then we need to think whether
> (a) we want to deal with it by logging this information, or (b) do we
> want to mask it or (c) something else?
>
> [1] - https://www.postgresql.org/docs/devel/static/runtime-config-developer.html
>

Yes, you are correct; the standby stopped with the following error:

FATAL:  inconsistent page found, rel 1663/13260/16390, forknum 0, blkno 0
CONTEXT:  WAL redo at 0/3002510 for Heap/DELETE: off 6 KEYS_UPDATED
LOG:  startup process (PID 22791) exited with exit code 1
LOG:  terminating any other active server processes
LOG:  database system is shut down

I have tested a warm-standby replication setup using the attached script.
Without the wal_consistency_checking setting it works fine, and data is
replicated from master to standby as expected.  If that guarantee is
enough, then I think we could mask this error out of the WAL consistency
check for such a deleted tuple (i.e. option (b) that you suggested).
Thoughts?
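For anyone repeating the test: wal_consistency_checking is a developer
option set in postgresql.conf, and it must be enabled on the master when
the WAL is generated so that full-page images are logged for redo-time
comparison on the standby.  A minimal configuration sketch (the rm name
here limits checking to heap records, which is what the failing
Heap/DELETE record above falls under):

```
# postgresql.conf on the master; requires a restart/reload to take effect
wal_consistency_checking = 'heap'    # or 'all' to check every resource manager
```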
test.sh
Description: Bourne shell script