On Wed, Apr 25, 2018 at 5:57 PM, Martin Perina <[email protected]> wrote:
>
>
> On Tue, Apr 24, 2018 at 3:28 PM, Dan Kenigsberg <[email protected]> wrote:
>>
>> On Tue, Apr 24, 2018 at 4:17 PM, Ravi Shankar Nori <[email protected]>
>> wrote:
>> >
>> >
>> > On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <[email protected]>
>> > wrote:
>> >>
>> >> Ravi's patch is in, but a similar problem remains, and the test cannot
>> >> be put back into its place.
>> >>
>> >> It seems that while Vdsm was taken down, a couple of getCapsAsync
>> >> requests queued up. At one point, the host resumed its connection,
>> >> before the requests have been cleared of the queue. After the host is
>> >> up, the following tests resume, and at a pseudorandom point in time,
>> >> an old getCapsAsync request times out and kills our connection.
>> >>
>> >> I believe that as long as ANY request is on flight, the monitoring
>> >> lock should not be released, and the host should not be declared as
>> >> up.
>> >>
>> >>
>> >
>> >
>> > Hi Dan,
>> >
>> > Can I have the link to the job on jenkins so I can look at the logs
>>
>> We disabled a network test that started failing after getCapsAsync was
>> merged.
>> Please own its re-introduction to OST: https://gerrit.ovirt.org/#/c/90264/
>>
>> Its most recent failure
>> http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
>> has been discussed by Alona and Piotr over IRC.
>
>
> So https://bugzilla.redhat.com/1571768 was created to cover this issue
> discovered during Alona's and Piotr's conversation. But after further
> discussion we have found out that this issue is not related to non-blocking
> thread changes in engine 4.2 and this behavior exists from beginning of
> vdsm-jsonrpc-java. Ravi will continue verify the fix for BZ1571768 along
> with other locking changes he already posted to see if they will help
> network OST to succeed.
>
> But the fix for BZ1571768 is too dangerous for 4.2.3, let's try to fix that
> on master and let's see if it doesn't introduce any regressions. If not,
> then we can backport to 4.2.4.

I sense that there is a regression in connection management that
coincided with the introduction of async monitoring. I am not alone:
Gal Ben Haim was reluctant to take our test back. Do you think that it
is now safe to take it in https://gerrit.ovirt.org/#/c/90264/ ?

I'd appreciate your support there. I don't want any test to be skipped
without a very good reason.
