Re: [PATCH 0/4] colo: Introduce resource agent and high-level test

Lukas Straub Wed, 18 Dec 2019 01:28:36 -0800

On Wed, 27 Nov 2019 22:11:34 +0100
Lukas Straub <lukasstra...@web.de> wrote:


> On Fri, 22 Nov 2019 09:46:46 +0000
> "Dr. David Alan Gilbert" <dgilb...@redhat.com> wrote:
>
> > * Lukas Straub (lukasstra...@web.de) wrote:
> > > Hello Everyone,
> > > These patches introduce a resource agent for use with the Pacemaker CRM and a
> > > high-level test utilizing it for testing qemu COLO.
> > >
> > > The resource agent manages qemu COLO including continuous replication.
> > >
> > > Currently the second test case (where the peer qemu is frozen) fails on primary
> > > failover, because qemu hangs while removing the replication related block nodes.
> > > Note that this also happens in real world test when cutting power to the peer
> > > host, so this needs to be fixed.
> >
> > Do you understand why that happens? Is it trying to finish a
> > read/write to the dead partner?
> >
> > Dave
>
> I haven't looked into it too closely yet, but it's often hanging in bdrv_flush()
> while removing the replication blockdev, and of course that's probably because the
> nbd client waits for a reply. So I tried the workaround below, which
> actively kills the TCP connection, and with it the test passes, though I haven't
> tested it in the real world yet.
>

In the real cluster, qemu sometimes even hangs while connecting to QMP (after a
remote poweroff). But I currently don't have the time to look into it.

Still, a failing test is better than no test. Could we mark this test as known-bad
and fix this issue later? How should I mark it as known-bad? With a tag? Or with a
warning in the log?

Regards,
Lukas Straub
