On Wed, Mar 17, 2021 at 4:48 PM Martin Perina <[email protected]> wrote:
>
> On Wed, Mar 17, 2021 at 3:24 PM Michal Skrivanek <[email protected]> wrote:
>>
>> On 17. 3. 2021, at 13:53, Dana Elfassy <[email protected]> wrote:
>>
>> Adding +Marcin Sobczyk
>>
>> On Mon, Mar 15, 2021 at 9:59 AM Yedidyah Bar David <[email protected]> wrote:
>>>
>>> On Mon, Mar 15, 2021 at 7:55 AM Yedidyah Bar David <[email protected]> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > This started a few days ago [1] and has been happening randomly since then:
>>> >
>>> > E DEBUG: Configuration:
>>> > E DEBUG: command: collect
>>> > E DEBUG: Traceback (most recent call last):
>>> > E DEBUG:   File "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", line 2067, in <module>
>>> > E DEBUG:     '%s directory is not empty.' % (conf["local_tmp_dir"])
>>> > E DEBUG: Exception: /dev/shm/log directory is not empty.ERROR: /dev/shm/log directory is not empty.non-zero return code
>>> >
>>> > Michal tried to fix this by using a random directory, but it still fails [2]:
>>> >
>>> > DEBUG: command: collect
>>> > DEBUG: Traceback (most recent call last):
>>> > DEBUG:   File "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", line 2067, in <module>
>>> > DEBUG:     '%s directory is not empty.' % (conf["local_tmp_dir"])
>>> > DEBUG: Exception: /dev/shm/kaN7uY directory is not empty.ERROR: /dev/shm/kaN7uY directory is not empty.non-zero return code
>>> >
>>> > Since I suppose that the randomness of mktemp is good enough, it must be something else. Also, the last successful run before [1] used the same OST git commit (same code), so I do not think it's something in OST's code.
>>> >
>>> > Any idea?
>>> >
>>> > I think I'll push a patch to create and use the directory right before calling ovirt-log-collector, which is probably better in other ways.
>>>
>>> My patch [1] still fails, with a somewhat different error message, but this made me check further, and while I still do not understand it, I have this to add:
>>>
>>> In the failing runs, ovirt-log-collector is called *twice*, in parallel, e.g. in [2] (the check-patch run of [1]):
>>>
>>> Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59099]: ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
>>> Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59124]: ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
>>>
>>> It also generates two logs, which you can check/compare.
>>>
>>> It's the same for previous ones, e.g.
>>> the latest nightly [3][4]:
>>>
>>> Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59343]: ansible-command Invoked with _raw_params=ovirt-log-collector --verbose --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
>>> Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: SELinux is preventing /usr/lib/systemd/systemd from unlink access on the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root. For complete SELinux messages run: sealert -l d03a8655-9430-4fcf-9892-3b4df1939899
>>> Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: SELinux is preventing /usr/lib/systemd/systemd from unlink access on the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root.#012#012***** Plugin catchall (100. confidence) suggests **************************#012#012If you believe that systemd should be allowed unlink access on the ansible-ssh-lago-basic-suite-master-host-1-22-root sock_file by default.#012Then you should report this as a bug.#012You can generate a local policy module to allow this access.#012Do#012allow this access for now by executing:#012# ausearch -c 'systemd' --raw | audit2allow -M my-systemd#012# semodule -X 300 -i my-systemd.pp#012
>>> Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59361]: ansible-command Invoked with _raw_params=ovirt-log-collector --verbose --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
>>>
>>> Any idea what might have caused this to start happening? Perhaps a bug in ansible, or ansible-runner? It reminds me of [5]. Adding Dana and Martin.
>>>
>>> I think [5] is quite a serious bug, btw, and should be a 4.4.5 blocker.
>>
>> It's from January and there are no comments there. Dana, any update?
>> It does look serious, but perhaps it is not really hit in real-world scenarios?
>
> It's reproducible outside OST; Dana has been investigating it intensively for the last 2 weeks,
> but so far we haven't been able to find where the events are multiplied.
> But AFAIK this issue is visible only with ansible-runner-service, specifically in the part where
> the engine fetches events from ARS and creates audit_log records from them.
> It doesn't mean that playbook tasks are executed multiple times!
Are you sure? Please check also bz 1930565. I closed it as a duplicate, but if you are certain that tasks are never executed multiple times, then we should reopen it. I looked at the attached logs, and they were hard to follow (due to the duplication), but at least I didn't notice any bug there around imgbased.

> OST is using ansible-runner directly, so it doesn't seem to me like the same
> issue

Indeed. That's also what I currently think myself. It seems like a bug in OST. I still haven't managed to find the root cause - I am pushing my debugging changes to [1], for anyone who wants to follow, and:

1. run_tc basic-suite-master/test-scenarios/test_003_01_dummy.py > ~/run_tc/run-tc-$(date +%Y-%m-%d-%H%M%S).log 2>&1

2. cat $(ls -t1 run-tc* | head -1) | grep -Ev 'do_something_2.*something 2|do_something_1.*something 1' | grep -v 'Start'

3. Pick a line from the output (there are always a few, among 100 runs), check its timestamp, then look at pytest.log and search for (currently):
$timestamp|something..|_run_ansible_runner|_run_ansible_runner.*before run|_run_ansible_runner.*after run|Starting functools.partial|.*test_dummy: Starting iteration.*|ModuleArgsMapper

Hope to update tomorrow. For now, we can merge the workaround patch to run them serially, to unbreak the basic suite at this point.

[1] https://github.com/didib/ovirt-system-tests/commits/ansible-logs

> So [5] is not a blocker, it's just really annoying if you need to work with
> audit_log records
>
>>>
>>> Best regards,
>>>
>>> [1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/113875
>>>
>>> [2] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/15980/artifact/check-patch.basic_suite_master.el8.x86_64/test_logs/lago-basic-suite-master-engine/var/log/messages/*view*
>>>
>>> [3] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/959/
>>>
>>> [4] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/959/artifact/exported-artifacts/test_logs/lago-basic-suite-master-engine/var/log/messages/*view*
>>>
>>> [5] https://bugzilla.redhat.com/show_bug.cgi?id=1917707
>>>
>>> > Best regards,
>>> >
>>> > [1] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/949/
>>> >
>>> > [2] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/959/
>>> >
>>> > --
>>> > Didi
>>>
>>> --
>>> Didi
>
> --
> Martin Perina
> Manager, Software Engineering
> Red Hat Czech s.r.o.
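P.S. To clarify what I mean above by "run them serially": the sketch below is only an illustration of the idea - the helper name, the ssh invocation and the host list are made up for this example and are not the actual OST code. Instead of letting the two log-collection commands be fired concurrently, we simply run them one after the other, so the two overlapping ansible-command invocations seen in /var/log/messages cannot happen:

    # Illustration only - names below are invented for this example,
    # they are not the real OST helpers.
    import subprocess

    ENGINE = "lago-basic-suite-master-engine"  # hypothetical inventory name

    def collect_logs(host):
        # One ovirt-log-collector run, with its own fresh temp dir, matching
        # the ansible-command invocation quoted from /var/log/messages above.
        cmd = (
            'lctmp=$(mktemp -d -p /dev/shm); '
            'ovirt-log-collector --verbose --batch --no-hypervisors '
            '--local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf'
        )
        subprocess.run(["ssh", "root@" + host, cmd], check=True)

    # Workaround: a plain sequential loop instead of submitting the calls to a
    # thread pool (or similar), so two collections can never run in parallel.
    for host in [ENGINE]:
        collect_logs(host)

The real patch of course goes through the existing ansible wrappers; the point is only the ordering, not the mechanism.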
--
Didi
_______________________________________________
Devel mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/C26THI6NNZS77I24X2HC2RBYYIEYAT5X/
