On Mon, Jul 23, 2018 at 1:32 PM, Dafna Ron <d...@redhat.com> wrote:

> Hi,
>
> the issue seems to be that host-1 stopped responding, and I can see some
> fluentd errors which we should look at.
>
> Jira opened to track this issue:
> https://ovirt-jira.atlassian.net/browse/OVIRT-2363
>
> Martin, I also added you to the Jira - can you please have a look?
>
> error from node-1 messages log:
> Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:14 -0400 [warn]: detached forwarding server 'lago-basic-suite-master-engine' host="lago-basic-suite-master-engine" port=24224 phi=16.275347714068506
> Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: ["lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
> Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:14 -0400 fluent.warn: {"host":"lago-basic-suite-master-engine","port":24224,"phi":16.275347714068506,"message":"detached forwarding server 'lago-basic-suite-master-engine' host=\"lago-basic-suite-master-engine\" port=24224 phi=16.275347714068506"}
> Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:15 -0400 [warn]: detached forwarding server 'lago-basic-suite-master-engine' host="lago-basic-suite-master-engine" port=24224 phi=16.70444149784817
> Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: ["lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
> Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:15 -0400 fluent.warn: {"host":"lago-basic-suite-master-engine","port":24224,"phi":16.70444149784817,"message":"detached forwarding server 'lago-basic-suite-master-engine' host=\"lago-basic-suite-master-engine\" port=24224 phi=16.70444149784817"}
> Jul 23 05:09:23 lago-basic-suite-master-host-1 python: ansible-command Invoked with warn=False executable=None _uses_shell=False _raw_params=systemctl is-active 'collectd' removes=None argv=None creates=None chdir=None stdin=None
> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: New session 29 of user root.
> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Started Session 29 of user root.
> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Starting Session 29 of user root.
> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: Removed session 29.
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: failed to flush the buffer. error_class="RuntimeError" error="no nodes are available" plugin_id="object:151a620"
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: retry count exceededs limit.
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/out_forward.rb:222:in `write_objects'
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:490:in `write'
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:354:in `write_chunk'
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:333:in `pop'
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:342:in `try_flush'
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:149:in `run'
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [error]: throwing away old logs.
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 fluent.warn: {"error_class":"RuntimeError","error":"no nodes are available","plugin_id":"object:151a620","message":"failed to flush the buffer. error_class=\"RuntimeError\" error=\"no nodes are available\" plugin_id=\"object:151a620\""}
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 fluent.warn: {"message":"retry count exceededs limit."}
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 fluent.error: {"message":"throwing away old logs."}
>
>
>
> Thanks.
> Dafna
>

Hi,

I can see in vdsm.log that it received a kill signal:

2018-07-23 05:24:26,735-0400 INFO  (MainThread) [vds] Received signal 15, shutting down (vdsmd:68)

And in /var/log/messages I found that mom was killed:

Jul 23 05:24:16 lago-basic-suite-master-host-1 systemd: Stopping MOM instance configured for VDSM purposes...

...

Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service stop-sigterm timed out. Killing.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service: main process exited, code=killed, status=9/KILL
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopped MOM instance configured for VDSM purposes.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Unit mom-vdsm.service entered failed state.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service failed.

So Didi/Shirly/Martin, could the fluentd errors be related to the mom shutdown?
And could that be the cause of the VDSM shutdown?
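
To help judge whether the two events are related, here is a rough sketch that
computes the gaps between the timestamps quoted in this thread. The constant
and function names are made up for illustration, and the year 2018 is taken
from the fluentd payload because the syslog prefix does not carry one.

```python
from datetime import datetime

# Timestamps copied from the /var/log/messages excerpts in this thread.
# Names below are illustrative only, not part of any oVirt tooling.
FLUENTD_GAVE_UP = "Jul 23 05:09:27"  # fluentd: "throwing away old logs."
MOM_STOP_BEGAN = "Jul 23 05:24:16"   # systemd: "Stopping MOM instance ..."
MOM_KILLED = "Jul 23 05:24:26"       # systemd: "stop-sigterm timed out. Killing."


def parse_syslog_ts(stamp, year=2018):
    """Parse a syslog-style 'Mon DD HH:MM:SS' prefix (year assumed)."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def seconds_between(earlier, later):
    """Return the whole number of seconds from `earlier` to `later`."""
    delta = parse_syslog_ts(later) - parse_syslog_ts(earlier)
    return int(delta.total_seconds())


gap_fluentd_to_mom = seconds_between(FLUENTD_GAVE_UP, MOM_STOP_BEGAN)
gap_stop_to_kill = seconds_between(MOM_STOP_BEGAN, MOM_KILLED)

print(f"fluentd gave up -> MOM stop requested: {gap_fluentd_to_mom} s")
print(f"MOM stop requested -> SIGKILL:         {gap_stop_to_kill} s")
```

With the values above, the fluentd failure precedes the MOM stop request by
almost 15 minutes, while the MOM kill and the VDSM signal 15 fall in the same
second (05:24:26). That only shows ordering, not causation.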


>
>
> On Mon, Jul 23, 2018 at 10:31 AM, oVirt Jenkins <jenk...@ovirt.org> wrote:
>
>> Change 92882,9 (ovirt-engine) is probably the reason behind recent system
>> test failures in the "ovirt-master" change queue and needs to be fixed.
>>
>> This change has been removed from the testing queue. Artifacts built from
>> this change will not be released until it is fixed.
>>
>> For further details about the change see:
>> https://gerrit.ovirt.org/#/c/92882/9
>>
>> For failed test results see:
>> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/8764/
>> _______________________________________________
>> Infra mailing list -- in...@ovirt.org
>> To unsubscribe send an email to infra-le...@ovirt.org
>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>> oVirt Code of Conduct:
>> https://www.ovirt.org/community/about/community-guidelines/
>> List Archives:
>> https://lists.ovirt.org/archives/list/in...@ovirt.org/message/6LYYXSGM4LQSRVSYY3IJEIE64LW27TJM/
>>
>
>


-- 
Martin Perina
Associate Manager, Software Engineering
Red Hat Czech s.r.o.
_______________________________________________
Devel mailing list -- devel@ovirt.org
To unsubscribe send an email to devel-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/devel@ovirt.org/message/KXBI2VR5TXH2FRBOS3ASV3YPOTJZ52RB/
