On Mon, Jan 18, 2021 at 11:19 AM Marcin Sobczyk <[email protected]> wrote:
>
> On 1/18/21 9:58 AM, Yedidyah Bar David wrote:
> > On Mon, Jan 18, 2021 at 10:53 AM Martin Perina <[email protected]> wrote:
> >>
> >> On Mon, Jan 18, 2021 at 9:08 AM Yedidyah Bar David <[email protected]> wrote:
> >>> On Sun, Jan 17, 2021 at 3:11 PM Yedidyah Bar David <[email protected]> wrote:
> >>>> On Thu, Jan 14, 2021 at 1:41 PM Yedidyah Bar David <[email protected]> wrote:
> >>>>> On Thu, Jan 14, 2021 at 8:35 AM Yedidyah Bar David <[email protected]> wrote:
> >>>>>> On Wed, Jan 13, 2021 at 5:34 PM Yedidyah Bar David <[email protected]> wrote:
> >>>>>>> On Wed, Jan 13, 2021 at 2:48 PM Yedidyah Bar David <[email protected]> wrote:
> >>>>>>>> On Wed, Jan 13, 2021 at 1:57 PM Marcin Sobczyk <[email protected]> wrote:
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> my guess is it's selinux-related.
> >>>>>>>>>
> >>>>>>>>> Unfortunately I can't find any meaningful errors in audit.log in a
> >>>>>>>>> scenario where host deployment fails.
> >>>>>>>>> However switching selinux to permissive mode before adding hosts
> >>>>>>>>> makes the problem go away, so it's probably not an error somewhere
> >>>>>>>>> in logic.
> >>>>>>>> It's getting weirder: Under strace, it succeeds:
> >>>>>>>>
> >>>>>>>> https://gerrit.ovirt.org/c/ovirt-system-tests/+/112948
> >>>>>>>>
> >>>>>>>> (Can't see the actual log, as I didn't add '-A', so it was
> >>>>>>>> overwritten on restart...)
> >>>>>>> After updating it to use '-A' it indeed shows that it worked:
> >>>>>>>
> >>>>>>> 43664 14:16:55.997639 access("/etc/pki/ovirt-engine/requests", W_OK <unfinished ...>
> >>>>>>> 43664 14:16:55.997695 <... access resumed>) = 0
> >>>>>>>
> >>>>>>> Weird.
> >>>>>>>
> >>>>>>> Now ran in parallel 'ci test' for this patch and another one from
> >>>>>>> master, for comparison:
> >>>>>> Again, the same:
> >>>>>>
> >>>>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/14916/
> >>>>>> With strace, passed,
> >>>>>>
> >>>>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1883/
> >>>>>> Without strace, failed.
> >>>>>>
> >>>>>> Last nightly run that passed [1] used:
> >>>>>>
> >>>>>> ost-images-el8-host-installed-1-202101100446.x86_64
> >>>>>> ovirt-engine-appliance-4.4-20210109182828.1.el8.x86_64
> >>>>>>
> >>>>>> Trying now with these - not sure it's possible to put specific
> >>>>>> versions inside automation/*packages, let's see:
> >>>>>>
> >>>>>> https://gerrit.ovirt.org/c/ovirt-system-tests/+/112977
> >>>>> Indeed, with a fixed ost-images and removing updates, it passes.
> >>>>> network suite failed, but he-basic passed:
> >>>>>
> >>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/14920/artifact/ci_build_summary.html
> >>>>>
> >>>>> So I am quite certain this is an OS issue. Not sure how we do not see
> >>>>> this in basic-suite.
> >>>>> Perhaps it's related to nested-kvm, or to load/slowness caused by that?
> >>>>> Weird.
> >>>>>
> >>>>> When this fails, we do not collect all of the engine's /var/log, only
> >>>>> messages and ovirt-engine/ .
> >>>>> So it's not easy to get a list of the packages that were updated.
> >>>>>
> >>>>> Pushed now:
> >>>>>
> >>>>> https://github.com/oVirt/ovirt-ansible-collection/pull/202
> >>>>>
> >>>>> to get all of engine's /var/log, and ran manual HE job with it:
> >>>>>
> >>>>> https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/7680/
> >>>> This one I accidentally ran with the wrong repo, then ran another one
> >>>> with the correct repo [1],
> >>>> But:
> >>>>
> >>>> 1. The repo wasn't used.
> >>>> Emailed about this in a separate thread: "manual
> >>>> job does not use custom repo"
> >>>>
> >>>> 2. It passed! Being what seems like a heisenbug, I understand why when
> >>>> you run it under strace it works differently. But even if you just
> >>>> intend to collect more logs it also causes it to behave
> >>>> differently? :-) This does not mean that "problem solved" - latest
> >>>> nightly run [2] did fail with the same error.
> >>> Status:
> >>>
> >>> 1. he-basic-suite is still failing.
> >>>
> >>> 2. Patch to collect all of /var/log from the engine merged.
> >>>
> >>> Dana, can you please update? Did you have any progress?
> >>>
> >>> IMO it's an OS bug. If Marcin says it's an selinux issue, I do not argue
> >>> :-).
> >>> So, how do we continue?
> >>
> >> Switching to CentOS Stream development/testing is a big effort, I'm not
> >> sure we can do this and still deliver all the RFEs/bugs planned for 4.4.5
> >> ...
> +1
> > IMO we should now revert appliance and node to CentOS 8.3, and then
> > continue the discussion.
> > Having he-basic-suite broken for a week is too much.
> +1 The testing infrastructure for Stream is here, but if it doesn't work
> yet then let's stick to the plan and focus on 8.3.

Just to conclude the original issue - a workaround was found, the root cause
is still under investigation. Commented on the bugs (oVirt and Stream) with
details.

Best regards,
--
Didi
_______________________________________________
Devel mailing list -- [email protected]
List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/7HU6SONUCBHPFZR5DB74TDD6OBINZNHE/
