On Mon, 23 Jul 2018 at 15:03, Martin Perina <[email protected]> wrote:
>
>
> On Mon, Jul 23, 2018 at 1:32 PM, Dafna Ron <[email protected]> wrote:
>
>> Hi,
>>
>> the issue seems to be that host-1 stopped responding and I can see some
>> fluentd errors which we should look at.
>>
>> Jira opened to track this issue:
>> https://ovirt-jira.atlassian.net/browse/OVIRT-2363
>>
>> Martin, I also added you to the Jira - can you please have a look?
>>
>> error from node-1 messages log:
>> Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:14 -0400 [warn]: detached forwarding server
>> 'lago-basic-suite-master-engine' host="lago-basic-suite-master-engine"
>> port=24224 phi=16.275347714068506
>> Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd:
>> ["lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
>> "lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
>> "lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
>> Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:14 -0400 fluent.warn:
>> {"host":"lago-basic-suite-master-engine","port":24224,"phi":16.275347714068506,"message":"detached
>> forwarding server 'lago-basic-suite-master-engine'
>> host=\"lago-basic-suite-master-engine\" port=24224 phi=16.275347714068506"}
>> Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:15 -0400 [warn]: detached forwarding server
>> 'lago-basic-suite-master-engine' host="lago-basic-suite-master-engine"
>> port=24224 phi=16.70444149784817
>> Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd:
>> ["lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
>> "lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
>> "lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
>> Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:15 -0400 fluent.warn:
>> {"host":"lago-basic-suite-master-engine","port":24224,"phi":16.70444149784817,"message":"detached
>> forwarding server 'lago-basic-suite-master-engine'
>> host=\"lago-basic-suite-master-engine\" port=24224 phi=16.70444149784817"}
>> Jul 23 05:09:23 lago-basic-suite-master-host-1 python: ansible-command
>> Invoked with warn=False executable=None _uses_shell=False
>> _raw_params=systemctl is-active 'collectd' removes=None argv=None
>> creates=None chdir=None stdin=None
>> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: New
>> session 29 of user root.
>> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Started Session
>> 29 of user root.
>> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Starting Session
>> 29 of user root.
>> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: Removed
>> session 29.
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 [warn]: failed to flush the buffer.
>> error_class="RuntimeError" error="no nodes are available"
>> plugin_id="object:151a620"
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 [warn]: retry count exceededs limit.
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 [warn]:
>> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/out_forward.rb:222:in
>> `write_objects'
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 [warn]:
>> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:490:in `write'
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 [warn]:
>> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:354:in
>> `write_chunk'
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 [warn]:
>> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:333:in `pop'
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 [warn]:
>> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:342:in `try_flush'
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 [warn]:
>> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:149:in `run'
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 [error]: throwing away old logs.
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 fluent.warn: {"error_class":"RuntimeError","error":"no nodes
>> are available","plugin_id":"object:151a620","message":"failed to flush the
>> buffer. error_class=\"RuntimeError\" error=\"no nodes are available\"
>> plugin_id=\"object:151a620\""}
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 fluent.warn: {"message":"retry count exceededs limit."}
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 fluent.error: {"message":"throwing away old logs."}
>>
>>
>>
>> Thanks.
>> Dafna
>>
>
> Hi,
>
> I can see in vdsm.log that it received a kill signal:
>
> 2018-07-23 05:24:26,735-0400 INFO (MainThread) [vds] Received signal 15,
> shutting down (vdsmd:68)
>
> And in /var/log/messages I found that mom was killed:
>
> Jul 23 05:24:16 lago-basic-suite-master-host-1 systemd: Stopping MOM
> instance configured for VDSM purposes...
>
> ...
>
> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service
> stop-sigterm timed out. Killing.
> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service:
> main process exited, code=killed, status=9/KILL
> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopped MOM
> instance configured for VDSM purposes.
> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Unit
> mom-vdsm.service entered failed state.
> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service
> failed.
>
> So Didi/Shirly/Martin can fluentd error be related to mom shutdown? And
> could this be a cause of VDSM shutdown?
>
>

Hi,

Mom is not related to fluentd, and a mom shutdown should not cause a vdsm
shutdown. The service dependency between vdsmd and mom-vdsm is weak (using
Wants=mom-vdsm.service).

Looking at /var/log/messages, both the mom-vdsm and vdsmd services were
restarted:

Jul 23 05:24:16 lago-basic-suite-master-host-1 systemd: Stopping MOM instance configured for VDSM purposes...
...
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopped MOM instance configured for VDSM purposes.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Unit mom-vdsm.service entered failed state.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service failed.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopping Virtual Desktop Server Manager...
...
Jul 23 05:24:29 lago-basic-suite-master-host-1 systemd: Stopped Virtual Desktop Server Manager.
...
Jul 23 05:25:26 lago-basic-suite-master-host-1 systemd: Starting Virtual Desktop Server Manager...
...
Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Started Virtual Desktop Server Manager.
Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Started MOM instance configured for VDSM purposes.
Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Starting MOM instance configured for VDSM purposes...
...
Jul 23 05:25:34 lago-basic-suite-master-host-1 systemd: Started MOM instance configured for VDSM purposes.
Jul 23 05:25:34 lago-basic-suite-master-host-1 systemd: Starting MOM instance configured for VDSM purposes...

The error in 008_basic_ui_sanity.py.junit.xml probably means that the docker
executable was not found on the machine running the test. Could that be the
cause of the failure?

<error type="exceptions.OSError" message="[Errno 2] No such file or directory
-------------------- >> begin captured stdout << ---------------------
executing shell: docker ps

--------------------- >> end captured stdout << ---------------
  File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
    testMethod()
  File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 129, in wrapped_test
    test()
  File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/008_basic_ui_sanity.py", line 169, in start_grid
    _docker_cleanup()
  File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/008_basic_ui_sanity.py", line 136, in _docker_cleanup
    _shell(["docker", "ps"])
  File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/008_basic_ui_sanity.py", line 119, in _shell
    stderr=subprocess.PIPE)
  File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.7/subprocess.py", line 1327, in _execute_child
    raise child_exception
[Errno 2] No such file or directory

Andrej
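
If the missing docker binary does turn out to be the cause, a pre-check in
the shell helper would make the failure easier to read than a bare
"[Errno 2]". Below is only a rough sketch of the idea, not the suite's actual
code: run_shell and MissingExecutableError are hypothetical names, and all I
know about the real _shell helper is that it calls subprocess.Popen with
stderr=subprocess.PIPE.

```python
import shutil
import subprocess


class MissingExecutableError(RuntimeError):
    """Raised when the command's executable cannot be found on PATH."""


def run_shell(args):
    """Run a command and return (returncode, stdout, stderr).

    Hypothetical stand-in for the suite's _shell helper; the real helper's
    behaviour beyond calling subprocess.Popen is not shown in this thread.
    """
    # shutil.which needs Python 3.3+. The traceback in this thread comes from
    # Python 2.7, where distutils.spawn.find_executable is the closest match.
    if shutil.which(args[0]) is None:
        raise MissingExecutableError(
            "executable %r not found on PATH" % args[0])

    process = subprocess.Popen(
        args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = process.communicate()
    return process.returncode, out, err
```

With something like this, run_shell(["docker", "ps"]) on a machine without
docker would fail with a message that names the missing executable instead
of the generic OSError from subprocess.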
>>
>>
>> On Mon, Jul 23, 2018 at 10:31 AM, oVirt Jenkins <[email protected]>
>> wrote:
>>
>>> Change 92882,9 (ovirt-engine) is probably the reason behind recent
>>> system test
>>> failures in the "ovirt-master" change queue and needs to be fixed.
>>>
>>> This change had been removed from the testing queue. Artifacts build
>>> from this
>>> change will not be released until it is fixed.
>>>
>>> For further details about the change see:
>>> https://gerrit.ovirt.org/#/c/92882/9
>>>
>>> For failed test results see:
>>> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/8764/
>>> _______________________________________________
>>> Infra mailing list -- [email protected]
>>> To unsubscribe send an email to [email protected]
>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>> oVirt Code of Conduct:
>>> https://www.ovirt.org/community/about/community-guidelines/
>>> List Archives:
>>> https://lists.ovirt.org/archives/list/[email protected]/message/6LYYXSGM4LQSRVSYY3IJEIE64LW27TJM/
>>>
>>
>>
>
>
> --
> Martin Perina
> Associate Manager, Software Engineering
> Red Hat Czech s.r.o.
> _______________________________________________
> Infra mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct:
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives:
> https://lists.ovirt.org/archives/list/[email protected]/message/KXBI2VR5TXH2FRBOS3ASV3YPOTJZ52RB/
>
