On Wed, 27 Nov 2019 22:11:34 +0100 Lukas Straub <lukasstra...@web.de> wrote:
> On Fri, 22 Nov 2019 09:46:46 +0000
> "Dr. David Alan Gilbert" <dgilb...@redhat.com> wrote:
>
> > * Lukas Straub (lukasstra...@web.de) wrote:
> > > Hello Everyone,
> > > These patches introduce a resource agent for use with the Pacemaker
> > > CRM and a high-level test utilizing it for testing qemu COLO.
> > >
> > > The resource agent manages qemu COLO including continuous replication.
> > >
> > > Currently the second test case (where the peer qemu is frozen) fails
> > > on primary failover, because qemu hangs while removing the
> > > replication-related block nodes.
> > > Note that this also happens in a real-world test when cutting power
> > > to the peer host, so this needs to be fixed.
> >
> > Do you understand why that happens? Is it trying to finish a
> > read/write to the dead partner?
> >
> > Dave
>
> I haven't looked into it too closely yet, but it's often hanging in
> bdrv_flush() while removing the replication blockdev, and of course that's
> probably because the nbd client waits for a reply. So I tried with the
> workaround below, which will actively kill the TCP connection, and with it
> the test passes, though I haven't tested it in the real world yet.
>

In the real cluster, qemu sometimes even hangs while connecting to QMP
(after a remote poweroff), but I currently don't have the time to look
into it.

Still, a failing test is better than no test. Could we mark this test as
known-bad and fix the issue later?

How should I mark it as known-bad: by tag, or with a warning in the log?

Regards,
Lukas Straub