On Thu, Jan 5, 2023 at 5:03 PM houzj.f...@fujitsu.com
<houzj.f...@fujitsu.com> wrote:
>
> On Thursday, January 5, 2023 4:22 PM Dilip Kumar <dilipbal...@gmail.com> wrote:
> >
>
> Thanks for reporting the problem.
>
> After analyzing the behavior, I think it's a bug on the publisher side which
> is not directly related to parallel apply.
>
> I think the root cause is that we didn't try to send a stream end (stream
> abort) message to the subscriber for the crashed transaction which was
> streamed before.
> The behavior is that, after restarting, the publisher will start to decode the
> transaction that aborted due to the crash, and when it tries to stream the
> first change of that transaction, it will send a stream start message but then
> it realizes that the transaction was aborted, so it will enter the PG_CATCH
> block of ReorderBufferProcessTXN() and call ReorderBufferResetTXN(), which
> sends the stream stop message. And in this case, there would be a parallel
> apply worker started on the subscriber waiting for a stream end message which
> will never come.

I suspected it but didn't analyze this.

> I think the same behavior happens in non-parallel mode, which will cause
> a stream file to be left on the subscriber that will not be cleaned up until
> the apply worker is restarted.
> To fix it, I think we need to send a stream abort message when we are cleaning
> up the crashed transaction on the publisher (e.g., in ReorderBufferAbortOld()).
> And here is a tiny patch which changes the same. I have confirmed that the bug
> is fixed and all regression tests pass.
>
> What do you think?
> I will start a new thread and try to write a testcase if possible
> after reaching a consensus.

I think your analysis looks correct and we can raise this in a new thread.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com