On Tue, Mar 16, 2021 at 9:07 PM Yedidyah Bar David <[email protected]> wrote:

> On Tue, Mar 16, 2021 at 1:11 PM Yedidyah Bar David <[email protected]>
> wrote:
> >
> > On Tue, Mar 16, 2021 at 12:50 PM Dana Elfassy <[email protected]>
> wrote:
> > >
> > > About [5] https://bugzilla.redhat.com/show_bug.cgi?id=1917707: I have
> > > found that when executing the playbook's HTTP request, the response is
> > > sometimes missing some events. We handle the events in a while loop,
> > > where each iteration continues from the lastEventId. Since not all
> > > events were listed, the lastEventId (which is returned as the total
> > > number of events) is smaller than the actual eventId that was handled.
> > > Thus, in the next iteration the last events are handled again. This
> > > seems to come from Ansible Runner, and I'm investigating that area now.
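[Editor's note: the replay mechanism described above can be sketched as a small toy model. The names (`poll`, `missing`) are hypothetical and only illustrate the described behavior: the server reports lastEventId as the *count* of events listed, so when a response silently drops events, that count falls behind the highest event id already handled, and the next poll replays events.]

```python
# Hypothetical sketch of the event-handling loop described above,
# not the real playbook code.

def poll(events_on_server, from_id, missing=()):
    """Return (batch, last_event_id).  The batch is every event with
    id > from_id that is not dropped from this response; last_event_id
    is reported as the total number of events listed (a count, not
    the max id), as described in the bug."""
    batch = [e for e in events_on_server if e > from_id and e not in missing]
    last_event_id = from_id + len(batch)
    return batch, last_event_id

handled = []
server_events = [1, 2, 3, 4, 5, 6]

# First response drops event 3, so only five events are listed and
# lastEventId comes back as 5 -- smaller than event 6, already handled.
batch, last_id = poll(server_events, 0, missing={3})
handled.extend(batch)           # [1, 2, 4, 5, 6]

# The next iteration continues from lastEventId=5 and replays event 6.
batch, last_id = poll(server_events, last_id)
handled.extend(batch)           # event 6 is handled a second time
```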
> >
> > Thanks
> >
> > > I'm not sure if it's the same issue as you get with the logs, though,
> > > as I'm not familiar with this area in the OSTs. Is the creation of the
> > > 2 logs done by different threads?
> >
> > I have no idea either.
> >
> > >
> > > Regarding the initial error (' directory is not empty.non-zero return
> > > code') I'm not sure, I will have to look more into it. Martin, do you
> > > have any idea?
> >
> > I think it's quite obvious that the reason for this is that we run two
> > ovirt-log-collector processes in parallel. Why we (sometimes) do this
> > is an open question, but unrelated to ovirt-log-collector (or the
> > specific error message it emits).
>
> I now did this:
>
> 1. Wrote a trivial dummy module [1] to test this
>
> 2. Ran it on an already finished basic-suite env with lagofy/run_tc [2].
>
> As you can see, if you pipe [2] through grep -Ev
> 'do_something_2.*something 2|do_something_1.*something 1', in some
> cases do_something_1 actually ran the ansible code of do_something_2.
> I currently think it's some kind of race condition in ansible-runner,
> causing it to misbehave when run in parallel, as we do in
> test_metrics_and_log_collector. I haven't yet tried to check its code
> or add debugging to it to find the exact location.
>
> 2021-03-16 20:54:18+0200,681 INFO    [test_003_01_dummy]
> do_something_1: Tue Mar 16 19:54:18 CET 2021
> lago-basic-suite-master-engine something 2 (test_003_01_dummy:39)
> 2021-03-16 20:55:41+0200,155 INFO    [test_003_01_dummy]
> do_something_1: Tue Mar 16 19:55:40 CET 2021
> lago-basic-suite-master-engine something 2 (test_003_01_dummy:39)
> 2021-03-16 20:56:56+0200,196 INFO    [test_003_01_dummy]
> do_something_1: Tue Mar 16 19:56:55 CET 2021
> lago-basic-suite-master-engine something 2 (test_003_01_dummy:39)
> 2021-03-16 20:58:19+0200,665 INFO    [test_003_01_dummy]
> do_something_1: Tue Mar 16 19:58:19 CET 2021
> lago-basic-suite-master-engine something 2 (test_003_01_dummy:39)
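[Editor's note: the suspected race can be modeled deterministically with two threads sharing one scratch location for the play they are about to execute. This is a hypothetical toy, not ansible-runner's actual code: the shared dict stands in for a shared artifacts/private data directory. If runner 2 writes its play between runner 1's write and read, runner 1 executes runner 2's code, matching the do_something_1/do_something_2 mixup above.]

```python
# Deterministic toy model of the suspected parallel-invocation race.
import threading

shared = {}                  # stands in for a shared scratch location
wrote_1 = threading.Event()
wrote_2 = threading.Event()
executed = {}

def runner(name, play, wait_for, signal):
    shared["play"] = play    # write our play to the shared location
    signal.set()
    if wait_for:
        wait_for.wait()      # forced interleaving: the other runner writes now
    executed[name] = shared["play"]   # ...so we pick up *its* play

t1 = threading.Thread(target=runner,
                      args=("do_something_1", "something 1", wrote_2, wrote_1))
t2 = threading.Thread(target=runner,
                      args=("do_something_2", "something 2", None, wrote_2))
t1.start()
wrote_1.wait()               # let runner 1 write first, then start runner 2
t2.start()
t1.join(); t2.join()

# do_something_1 ended up running do_something_2's play:
print(executed["do_something_1"])   # -> something 2
```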
>

For now, pushed:

https://gerrit.ovirt.org/c/ovirt-system-tests/+/113906
test_metrics_and_log_collector: Run serially

Best regards,


>
> [1]
> https://github.com/didib/ovirt-system-tests/commit/6a5327976eaf5bc27e352723bf5c17dab4e60fa4
>
> [2] https://gist.github.com/didib/fe1b3bc28b8364cf6976d3d93e6cd9b6
>
> >
> > >
> > > On Mon, Mar 15, 2021 at 9:59 AM Yedidyah Bar David <[email protected]>
> wrote:
> > >>
> > >> On Mon, Mar 15, 2021 at 7:55 AM Yedidyah Bar David <[email protected]>
> wrote:
> > >> >
> > >> > Hi all,
> > >> >
> > >> > This started a few days ago [1] and randomly happens since then:
> > >> >
> > >> > E           DEBUG: Configuration:
> > >> > E           DEBUG: command: collect
> > >> > E           DEBUG: Traceback (most recent call last):
> > >> > E           DEBUG:   File "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", line 2067, in <module>
> > >> > E           DEBUG:     '%s directory is not empty.' % (conf["local_tmp_dir"])
> > >> > E           DEBUG: Exception: /dev/shm/log directory is not empty.ERROR: /dev/shm/log directory is not empty.non-zero return code
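[Editor's note: the failing check can be sketched as follows, heavily simplified from the traceback above (ovirt_log_collector/__main__.py, line 2067). The helper name and the dummy file are hypothetical; the point is that the collector refuses a local_tmp_dir that already has content, so two collectors sharing one directory cannot both succeed.]

```python
# Minimal sketch of why two parallel runs collide on one fixed
# local_tmp_dir (as /dev/shm/log in the failure above).
import os
import tempfile

def start_collector(local_tmp_dir):
    os.makedirs(local_tmp_dir, exist_ok=True)
    if os.listdir(local_tmp_dir):
        # the line the traceback points at
        raise Exception('%s directory is not empty.' % local_tmp_dir)
    # ... the real tool would now collect logs into local_tmp_dir ...
    open(os.path.join(local_tmp_dir, "collected.log"), "w").close()

base = tempfile.mkdtemp()
shared = os.path.join(base, "log")   # one fixed dir shared by both runs

start_collector(shared)       # first of the two parallel runs: fine
try:
    start_collector(shared)   # second run hits "directory is not empty"
except Exception as e:
    print(e)
```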
> > >> >
> > >> > Michal tried to fix this by using a random directory but it still
> fails [2]:
> > >> >
> > >> > DEBUG: command: collect
> > >> > DEBUG: Traceback (most recent call last):
> > >> > DEBUG:   File "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", line 2067, in <module>
> > >> > DEBUG:     '%s directory is not empty.' % (conf["local_tmp_dir"])
> > >> > DEBUG: Exception: /dev/shm/kaN7uY directory is not empty.ERROR: /dev/shm/kaN7uY directory is not empty.non-zero return code
> > >> >
> > >> > Since I suppose that the randomness of mktemp is good enough, it must
> > >> > be something else. Also, the last successful run before [1] used the
> > >> > same OST git commit (same code), so I do not think it's something in
> > >> > OST's code.
> > >> >
> > >> > Any idea?
> > >> >
> > >> > I think I'll push a patch to create and use the directory right before
> > >> > calling ovirt-log-collector, which is probably better in other ways.
> > >>
> > >> My patch [1] still fails, with a somewhat different error message, but
> > >> this made me check further, and while I still do not understand, I have
> > >> this to add:
> > >>
> > >> In the failing runs, ovirt-log-collector is called *twice* in parallel.
> > >> E.g. in [2] (the check-patch of [1]):
> > >>
> > >> Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59099]:
> > >> ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p
> > >> /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors
> > >> --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf
> > >> _uses_shell=True warn=True stdin_add_newline=True
> > >> strip_empty_ends=True argv=None chdir=None executable=None
> > >> creates=None removes=None stdin=None
> > >> Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59124]:
> > >> ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p
> > >> /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors
> > >> --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf
> > >> _uses_shell=True warn=True stdin_add_newline=True
> > >> strip_empty_ends=True argv=None chdir=None executable=None
> > >> creates=None removes=None stdin=None
> > >>
> > >> It also generates two logs, which you can check/compare.
> > >>
> > >> It's the same for previous ones, e.g. latest nightly [3][4]:
> > >>
> > >> Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59343]:
> > >> ansible-command Invoked with _raw_params=ovirt-log-collector --verbose
> > >> --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf
> > >> _uses_shell=True warn=True stdin_add_newline=True
> > >> strip_empty_ends=True argv=None chdir=None executable=None
> > >> creates=None removes=None stdin=None
> > >> Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]:
> > >> SELinux is preventing /usr/lib/systemd/systemd from unlink access on
> > >> the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root. For
> > >> complete SELinux messages run: sealert -l
> > >> d03a8655-9430-4fcf-9892-3b4df1939899
> > >> Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]:
> > >> SELinux is preventing /usr/lib/systemd/systemd from unlink access on
> > >> the sock_file
> ansible-ssh-lago-basic-suite-master-host-1-22-root.#012#012*****
> > >>  Plugin catchall (100. confidence) suggests
> > >> **************************#012#012If you believe that systemd should
> > >> be allowed unlink access on the
> > >> ansible-ssh-lago-basic-suite-master-host-1-22-root sock_file by
> > >> default.#012Then you should report this as a bug.#012You can generate
> > >> a local policy module to allow this access.#012Do#012allow this access
> > >> for now by executing:#012# ausearch -c 'systemd' --raw | audit2allow
> > >> -M my-systemd#012# semodule -X 300 -i my-systemd.pp#012
> > >> Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59361]:
> > >> ansible-command Invoked with _raw_params=ovirt-log-collector --verbose
> > >> --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf
> > >> _uses_shell=True warn=True stdin_add_newline=True
> > >> strip_empty_ends=True argv=None chdir=None executable=None
> > >> creates=None removes=None stdin=None
> > >>
> > >> Any idea what might have caused this to start happening? Perhaps
> > >> a bug in ansible, or ansible-runner? It reminds me of [5].
> > >> Adding Dana and Martin.
> > >>
> > >> I think [5] is quite a serious bug, btw; it should be a 4.4.5 blocker.
> > >>
> > >> Best regards,
> > >>
> > >> [1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/113875
> > >>
> > >> [2]
> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/15980/artifact/check-patch.basic_suite_master.el8.x86_64/test_logs/lago-basic-suite-master-engine/var/log/messages/*view*
> > >>
> > >> [3]
> https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/959/
> > >>
> > >> [4]
> https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/959/artifact/exported-artifacts/test_logs/lago-basic-suite-master-engine/var/log/messages/*view*
> > >>
> > >> [5] https://bugzilla.redhat.com/show_bug.cgi?id=1917707
> > >>
> > >> >
> > >> > Best regards,
> > >> >
> > >> > [1]
> https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/949/
> > >> >
> > >> > [2]
> https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/959/
> > >> >
> > >> >
> > >> > --
> > >> > Didi
> > >>
> > >>
> > >>
> > >> --
> > >> Didi
> > >>
> >
> >
> > --
> > Didi
>
>
>
> --
> Didi
>


-- 
Didi
_______________________________________________
Infra mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/[email protected]/message/3MOHOQJ6U5XR4QLXNCQTP6HBZ6XFNE4P/
