On Tue, Jul 24, 2018 at 10:53 AM Andrej Krejcir <[email protected]> wrote:
>
> On Mon, 23 Jul 2018 at 15:03, Martin Perina <[email protected]> wrote:
>
>>
>> On Mon, Jul 23, 2018 at 1:32 PM, Dafna Ron <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> the issue seems to be that host-1 stopped responding and I can see some
>>> fluentd errors which we should look at.
>>>
>>> Jira opened to track this issue:
>>> https://ovirt-jira.atlassian.net/browse/OVIRT-2363
>>>
>>> Martin, I also added you to the Jira - can you please have a look?
>>>
>>> error from node-1 messages log:
>>> Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:14 -0400 [warn]: detached forwarding server 'lago-basic-suite-master-engine' host="lago-basic-suite-master-engine" port=24224 phi=16.275347714068506
>>> Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: ["lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
>>> Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:14 -0400 fluent.warn: {"host":"lago-basic-suite-master-engine","port":24224,"phi":16.275347714068506,"message":"detached forwarding server 'lago-basic-suite-master-engine' host=\"lago-basic-suite-master-engine\" port=24224 phi=16.275347714068506"}
>>> Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:15 -0400 [warn]: detached forwarding server 'lago-basic-suite-master-engine' host="lago-basic-suite-master-engine" port=24224 phi=16.70444149784817
>>> Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: ["lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
>>> Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:15 -0400 fluent.warn: {"host":"lago-basic-suite-master-engine","port":24224,"phi":16.70444149784817,"message":"detached forwarding server 'lago-basic-suite-master-engine' host=\"lago-basic-suite-master-engine\" port=24224 phi=16.70444149784817"}
>>> Jul 23 05:09:23 lago-basic-suite-master-host-1 python: ansible-command Invoked with warn=False executable=None _uses_shell=False _raw_params=systemctl is-active 'collectd' removes=None argv=None creates=None chdir=None stdin=None
>>> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: New session 29 of user root.
>>> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Started Session 29 of user root.
>>> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Starting Session 29 of user root.
>>> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: Removed session 29.
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: failed to flush the buffer. error_class="RuntimeError" error="no nodes are available" plugin_id="object:151a620"
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: retry count exceededs limit.
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/out_forward.rb:222:in `write_objects'
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:490:in `write'
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:354:in `write_chunk'
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:333:in `pop'
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:342:in `try_flush'
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:149:in `run'
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [error]: throwing away old logs.
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 fluent.warn: {"error_class":"RuntimeError","error":"no nodes are available","plugin_id":"object:151a620","message":"failed to flush the buffer. error_class=\"RuntimeError\" error=\"no nodes are available\" plugin_id=\"object:151a620\""}
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 fluent.warn: {"message":"retry count exceededs limit."}
>>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 fluent.error: {"message":"throwing away old logs."}
>>>
>>> Thanks.
>>> Dafna
>>>
>>
>> Hi,
>>
>> I can see in vdsm.log that it received a kill signal:
>>
>> 2018-07-23 05:24:26,735-0400 INFO (MainThread) [vds] Received signal 15, shutting down (vdsmd:68)
>>
>> And in /var/log/messages I found that mom was killed:
>>
>> Jul 23 05:24:16 lago-basic-suite-master-host-1 systemd: Stopping MOM instance configured for VDSM purposes...
>>
>> ...
>>
>> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service stop-sigterm timed out. Killing.
>> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service: main process exited, code=killed, status=9/KILL
>> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopped MOM instance configured for VDSM purposes.
>> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Unit mom-vdsm.service entered failed state.
>> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service failed.
>>
>> So Didi/Shirly/Martin, can the fluentd error be related to the mom
>> shutdown? And could this be the cause of the VDSM shutdown?
>>
> Hi,
>
> Mom is not related to fluentd, and a mom shutdown should not cause a vdsm
> shutdown.
>
> The service dependency between vdsmd and mom-vdsm is weak (using
> Wants=mom-vdsm.service).
>
> Looking at /var/log/messages, both the mom-vdsm and vdsmd services were
> restarted:
>
> Jul 23 05:24:16 lago-basic-suite-master-host-1 systemd: Stopping MOM instance configured for VDSM purposes...
> ...
> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopped MOM instance configured for VDSM purposes.
>
> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Unit mom-vdsm.service entered failed state.
> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service failed.
> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopping Virtual Desktop Server Manager...
> ...
> Jul 23 05:24:29 lago-basic-suite-master-host-1 systemd: Stopped Virtual Desktop Server Manager.
> ...
> Jul 23 05:25:26 lago-basic-suite-master-host-1 systemd: Starting Virtual Desktop Server Manager...
> ...
> Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Started Virtual Desktop Server Manager.
> Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Started MOM instance configured for VDSM purposes.
> Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Starting MOM instance configured for VDSM purposes...
> ...
> Jul 23 05:25:34 lago-basic-suite-master-host-1 systemd: Started MOM instance configured for VDSM purposes.
> Jul 23 05:25:34 lago-basic-suite-master-host-1 systemd: Starting MOM instance configured for VDSM purposes...
>
> The error in 008_basic_ui_sanity.py.junit.xml probably means that the
> docker executable was not found on the machine running the test. Can it
> be the cause of the failure?
>
> <error type="exceptions.OSError"
> message="[Errno 2] No such file or directory
> -------------------- >> begin captured stdout << ---------------------
> executing shell: docker ps
> --------------------- >> end captured stdout << ---------------
>
> File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
>   testMethod()
> File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
>   self.test(*self.arg)
> File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 129, in wrapped_test
>   test()
> File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/008_basic_ui_sanity.py", line 169, in start_grid
>   _docker_cleanup()
> File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/008_basic_ui_sanity.py", line 136, in _docker_cleanup
>   _shell(["docker", "ps"])
> File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/008_basic_ui_sanity.py", line 119, in _shell
>   stderr=subprocess.PIPE)
> File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
>   errread, errwrite)
> File "/usr/lib64/python2.7/subprocess.py", line 1327, in _execute_child
>   raise child_exception
> [Errno 2] No such file or directory
>

Yep, looks like docker isn't installed. And yes, that would fail it. Any
recent changes? I know Gal is working on some containerization of this [1],
but I don't know what's been merged.

[1] Change I5af15dce: Adjust UI test to run inside STDCI container |
https://gerrit.ovirt.org/#/c/93074/

>
> Andrej
>
>>>
>>> On Mon, Jul 23, 2018 at 10:31 AM, oVirt Jenkins <[email protected]> wrote:
>>>
>>>> Change 92882,9 (ovirt-engine) is probably the reason behind recent system test
>>>> failures in the "ovirt-master" change queue and needs to be fixed.
>>>>
>>>> This change had been removed from the testing queue. Artifacts build from this
>>>> change will not be released until it is fixed.
>>>>
>>>> For further details about the change see:
>>>> https://gerrit.ovirt.org/#/c/92882/9
>>>>
>>>> For failed test results see:
>>>> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/8764/
>>>> _______________________________________________
>>>> Infra mailing list -- [email protected]
>>>> To unsubscribe send an email to [email protected]
>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>>>> List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/6LYYXSGM4LQSRVSYY3IJEIE64LW27TJM/
>>>>
>>>
>>
>> --
>> Martin Perina
>> Associate Manager, Software Engineering
>> Red Hat Czech s.r.o.

--
GREG SHEREMETA
SENIOR SOFTWARE ENGINEER - TEAM LEAD - RHV UX
Red Hat NA <https://www.redhat.com/>
[email protected] IRC: gshereme
<https://red.ht/sig>
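
For reference, the bare "[Errno 2] No such file or directory" in the traceback is what subprocess raises when the executable itself (here docker) cannot be found. A minimal sketch of a pre-flight check that would make this failure mode explicit is shown here. The names require_executable, run_checked and MissingExecutableError are hypothetical and are not taken from ovirt-system-tests; the sketch also assumes Python 3.3+ for shutil.which, whereas the traceback comes from Python 2.7.

```python
import shutil
import subprocess


class MissingExecutableError(RuntimeError):
    """Raised when a required command cannot be found on PATH."""


def require_executable(name):
    """Return the resolved path of `name`, or raise a descriptive error."""
    path = shutil.which(name)
    if path is None:
        raise MissingExecutableError(
            "required executable %r was not found on PATH" % name
        )
    return path


def run_checked(args):
    """Run a command after verifying that its executable exists.

    Hypothetical stand-in for the `_shell` helper seen in the traceback;
    the real helper's behaviour is not known from this thread.
    """
    require_executable(args[0])
    proc = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return proc.returncode, proc.stdout, proc.stderr
```

With a check like this, a call such as run_checked(["docker", "ps"]) on a machine without docker would fail with a message naming the missing command instead of a generic OSError, which should make the cause easier to spot in the junit report.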
