On Wed, Aug 30, 2017 at 8:43 PM, Dan Kenigsberg <[email protected]> wrote:
> On Wed, Aug 30, 2017 at 6:47 PM, Dominik Holler <[email protected]>
> wrote:
> > On Wed, 30 Aug 2017 13:31:38 +0200
> > Dominik Holler <[email protected]> wrote:
> >
> >> On Wed, 30 Aug 2017 14:18:49 +0300
> >> Dan Kenigsberg <[email protected]> wrote:
> >>
> >> > On Wed, Aug 30, 2017 at 1:40 PM, Barak Korren <[email protected]>
> >> > wrote:
> >> > > Test failed: [ 002_bootstrap.add_hosts ]
> >> > >
> >> > > Link to suspected patches:
> >> > >
> >> > > We suspect this is due to a change to LLDPAD in the upstream CentOS
> >> > > repos. We can't tell the exact point it was introduced because
> >> > > other ovirt-engine regressions introduced too much noise into the
> >> > > system.
> >> > >
> >> > > It also seems that the failure is not 100% reproducible, since we
> >> > > have runs that do not encounter it.
> >> > >
> >> > > Link to Job:
> >> > > http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151
> >> > >
> >> > > (This job checked a specific patch to ovirt-log-collector, but the
> >> > > failure seems unrelated and also happens on the vanilla 'tested'
> >> > > repo right now.)
> >> > >
> >> > > Link to all logs:
> >> > > VDSM logs seem to be most relevant here:
> >> > >
> >> > > http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/vdsm/
> >> > >
> >> > > Error snippet from log:
> >> > >
> >> > > - From VDSM logs (note it exits very quickly):
> >> > >
> >> > > <error>
> >> > >
> >> > > 2017-08-30 05:55:52,208-0400 INFO (MainThread) [vds] (PID: 4594) I am the actual vdsm 4.20.2-133.git0ce4485.el7.centos lago-basic-suite-master-host-0 (3.10.0-514.2.2.el7.x86_64) (vdsmd:148)
> >> > > 2017-08-30 05:55:52,209-0400 INFO (MainThread) [vds] VDSM will run with cpu affinity: frozenset([1]) (vdsmd:254)
> >> > > 2017-08-30 05:55:52,253-0400 INFO (MainThread) [storage.check] Starting check service (check:92)
> >> > > 2017-08-30 05:55:52,257-0400 INFO (MainThread) [storage.Dispatcher] Starting StorageDispatcher... (dispatcher:48)
> >> > > 2017-08-30 05:55:52,257-0400 INFO (check/loop) [storage.asyncevent] Starting <EventLoop running=True closed=False at 0x45317520> (asyncevent:125)
> >> > > 2017-08-30 05:55:52,288-0400 INFO (MainThread) [vdsm.api] START registerDomainStateChangeCallback(callbackFunc=<functools.partial object at 0x2b47b50>) from=internal, task_id=2cebe9ef-358e-47d7-81a6-8c4b54b9cd6d (api:46)
> >> > > 2017-08-30 05:55:52,288-0400 INFO (MainThread) [vdsm.api] FINISH registerDomainStateChangeCallback return=None from=internal, task_id=2cebe9ef-358e-47d7-81a6-8c4b54b9cd6d (api:52)
> >> > > 2017-08-30 05:55:52,288-0400 INFO (MainThread) [MOM] Preparing MOM interface (momIF:53)
> >> > > 2017-08-30 05:55:52,289-0400 INFO (MainThread) [MOM] Using named unix socket /var/run/vdsm/mom-vdsm.sock (momIF:62)
> >> > > 2017-08-30 05:55:52,289-0400 INFO (MainThread) [root] Unregistering all secrets (secret:91)
> >> > > 2017-08-30 05:55:52,307-0400 INFO (MainThread) [vds] Setting channels' timeout to 30 seconds. (vmchannels:221)
> >> > > 2017-08-30 05:55:52,309-0400 INFO (vmrecovery) [vds] recovery: completed in 0s (clientIF:516)
> >> > > 2017-08-30 05:55:52,310-0400 INFO (MainThread) [vds.MultiProtocolAcceptor] Listening at :::54321 (protocoldetector:196)
> >> > > 2017-08-30 05:55:52,496-0400 INFO (http) [vds] Server running (http:58)
> >> > > 2017-08-30 05:55:52,742-0400 INFO (periodic/1) [vdsm.api] START repoStats(domains=()) from=internal, task_id=85768015-8ecb-48e3-9307-f671bfc33c65 (api:46)
> >> > > 2017-08-30 05:55:52,743-0400 INFO (periodic/1) [vdsm.api] FINISH repoStats return={} from=internal, task_id=85768015-8ecb-48e3-9307-f671bfc33c65 (api:52)
> >> > > 2017-08-30 05:55:52,743-0400 WARN (periodic/1) [throttled] MOM not available. (throttledlog:103)
> >> > > 2017-08-30 05:55:52,744-0400 WARN (periodic/1) [throttled] MOM not available, KSM stats will be missing. (throttledlog:103)
> >> > > 2017-08-30 05:55:55,043-0400 INFO (MainThread) [vds] Received signal 15, shutting down (vdsmd:67)
> >> > > 2017-08-30 05:55:55,045-0400 INFO (MainThread) [jsonrpc.JsonRpcServer] Stopping JsonRPC Server (__init__:759)
> >> > > 2017-08-30 05:55:55,049-0400 INFO (MainThread) [vds] Stopping http server (http:79)
> >> > > 2017-08-30 05:55:55,049-0400 INFO (http) [vds] Server stopped (http:69)
> >> > > 2017-08-30 05:55:55,050-0400 INFO (MainThread) [root] Unregistering all secrets (secret:91)
> >> > > 2017-08-30 05:55:55,052-0400 INFO (MainThread) [vdsm.api] START prepareForShutdown(options=None) from=internal, task_id=ffde5caa-fa44-49ab-bdd1-df81519680a3 (api:46)
> >> > > 2017-08-30 05:55:55,089-0400 INFO (MainThread) [storage.Monitor] Shutting down domain monitors (monitor:222)
> >> > > 2017-08-30 05:55:55,090-0400 INFO (MainThread) [storage.check] Stopping check service (check:105)
> >> > > 2017-08-30 05:55:55,090-0400 INFO (check/loop) [storage.asyncevent] Stopping <EventLoop running=True closed=False at 0x45317520> (asyncevent:220)
> >> > > 2017-08-30 05:55:55,090-0400 INFO (MainThread) [vdsm.api] FINISH prepareForShutdown return=None from=internal, task_id=ffde5caa-fa44-49ab-bdd1-df81519680a3 (api:52)
> >> > > 2017-08-30 05:55:55,091-0400 INFO (MainThread) [vds] Stopping threads (vdsmd:159)
> >> > > 2017-08-30 05:55:55,091-0400 INFO (MainThread) [vds] Exiting (vdsmd:170)
> >> > >
> >> > > </error>
> >> > >
> >> > > - From SuperVDSM logs:
> >> > >
> >> > > <error>
> >> > >
> >> > > MainThread::ERROR::2017-08-30 05:55:59,476::initializer::53::root::(_lldp_init) Failed to enable LLDP on eth1
> >> > > Traceback (most recent call last):
> >> > >   File "/usr/lib/python2.7/site-packages/vdsm/network/initializer.py", line 51, in _lldp_init
> >> > >     Lldp.enable_lldp_on_iface(device)
> >> > >   File "/usr/lib/python2.7/site-packages/vdsm/network/lldp/lldpad.py", line 30, in enable_lldp_on_iface
> >> > >     lldptool.enable_lldp_on_iface(iface, rx_only)
> >> > >   File "/usr/lib/python2.7/site-packages/vdsm/network/lldpad/lldptool.py", line 46, in enable_lldp_on_iface
> >> > >     raise EnableLldpError(rc, out, err, iface)
> >> > > EnableLldpError: (1, "timeout\n'M00000001C3040000000c04eth1000badminStatus0002rx' command timed out.\n", '', 'eth1')
> >> > > MainThread::DEBUG::2017-08-30 05:55:59,477::cmdutils::133::root::(exec_cmd) /usr/sbin/lldptool get-lldp -i eth0 adminStatus (cwd None)
> >> > >
> >> > > </error>
> >> >
> >> > The LLDP errors are due to SELinux, to be fixed in CentOS-7.4.1.
> >> > They do not cause supervdsmd to stop, so they are not the reason
> >> > for vdsm's failure to start. Something ELSE is doing this.
> >> >
> >> > In
> >> > http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/messages/*view*/
> >> > we have
> >> >
> >> > Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm throttled WARN MOM not available.
> >> > Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm throttled WARN MOM not available, KSM stats will be missing.
> >> > Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Started Session 10 of user root.
> >> > Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd-logind: New session 10 of user root.
> >> > Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Starting Session 10 of user root.
> >> > Aug 30 05:55:53 lago-basic-suite-master-host-0 python: ansible-setup Invoked with filter=* gather_subset=['all'] fact_path=/etc/ansible/facts.d gather_timeout=10
> >> > Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-command Invoked with warn=True executable=None _uses_shell=False _raw_params=bash -c "rpm -qi vdsm | grep -oE 'Version\\s+:\\s+[0-9\\.]+' | awk '{print $3}'" removes=None creates=None chdir=None
> >> > Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-systemd Invoked with no_block=False name=libvirt-guests enabled=True daemon_reload=False state=started user=False masked=None
> >> > Aug 30 05:55:54 lago-basic-suite-master-host-0 systemd: Reloading.
> >> > Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.
> >> > Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped MOM instance configured for VDSM purposes.
> >> > Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopping Virtual Desktop Server Manager...
> >> > Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Starting Suspend Active Libvirt Guests...
> >> > Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Started Suspend Active Libvirt Guests.
> >> > Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: libvirt version: 2.0.0, package: 10.el7_3.9 (CentOS BuildSystem <http://bugs.centos.org>, 2017-05-25-20:52:28, c1bm.rdu2.centos.org)
> >> > Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: hostname: lago-basic-suite-master-host-0.lago.local
> >> > Aug 30 05:55:55 lago-basic-suite-master-host-0 vdsmd_init_common.sh: vdsm: Running run_final_hooks
> >> > Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: End of file while reading data: Input/output error
> >> > Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped Virtual Desktop Server Manager.
> >> >
> >> > suggesting that systemd stopped vdsmd since mom did not start on
> >> > time. We have an extremely ugly two-way dependency between the two
> >> > services. Each depends on the other one. I am guessing that
> >> > systemd's grace period for such a cycle has expired, and the couple
> >> > was taken down.
> >> >
> >> > Does anybody have a guess what could cause such slowness?
> >>
> >> Even if the LLDP errors do not have a negative impact on the
> >> functional behavior of supervdsmd, they do have an impact on the
> >> timing of supervdsmd's initialization.
> >>
> >> I will post a patch today which moves the initialization of LLDP into
> >> an extra thread, to make the initialization of LLDP asynchronous.
> >>
> >
> > I analyzed this issue further by running OST [1] while disabling [2] the
> > suspected initialization of LLDP in supervdsmd.
> > As expected, there are no log entries about enabling LLDP in [3],
> > but OST is still failing with the same symptoms.
> >
> > So the reason for this failure is still unknown.
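[Moving the LLDP initialization into an extra thread, as Dominik proposes above, means supervdsmd's startup path no longer waits for lldptool calls that may take seconds to time out. A minimal sketch of that idea only; the names `_lldp_init` and `init_lldp_async` here are illustrative stand-ins, not the actual vdsm code:]

```python
import threading

def _lldp_init(devices, enabled):
    # Stand-in for vdsm's per-device loop; the real code calls
    # Lldp.enable_lldp_on_iface(device), which can block for seconds
    # when lldpad misbehaves (e.g. under the SELinux issue above).
    for device in devices:
        enabled.append(device)

def init_lldp_async(devices, enabled):
    # Run the LLDP setup in a background thread so the main
    # initialization path is not delayed by lldptool timeouts.
    t = threading.Thread(target=_lldp_init, args=(devices, enabled))
    t.daemon = True  # never block service shutdown on this thread
    t.start()
    return t

enabled = []
t = init_lldp_async(['eth0', 'eth1'], enabled)
t.join(timeout=5)  # the service itself would not join at startup
print(enabled)  # → ['eth0', 'eth1']
```

[The trade-off of such a change is that LLDP may not yet be enabled when the first caller asks for neighbor data, which is why it helps rule LLDP out of the startup-timing question but does not by itself explain the failure.]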
> > > > > > [1] > > http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ > ovirt-system-tests_manual/1045/ > > > > [2] > > https://gerrit.ovirt.org/#/c/81232/1/lib/vdsm/network/initializer.py > > > > [3] > > http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ > ovirt-system-tests_manual/1045/artifact/exported- > artifacts/test_logs/basic-suite-master/post-002_ > bootstrap.py/lago-basic-suite-master-host-0/_var_log/vdsm/ > supervdsm.log/*view*/ > > > > So we're back in square one. > Another possible culprit may be ansible: Vdsm is stopped two seconds > after it logs to the host. > > Aug 30 11:26:24 lago-basic-suite-master-host-0 systemd: Starting > Session 10 of user root. > Aug 30 11:26:25 lago-basic-suite-master-host-0 python: ansible-setup > Invoked with filter=* gather_subset=['all'] > fact_path=/etc/ansible/facts.d gather_timeout=10 > Aug 30 11:26:25 lago-basic-suite-master-host-0 python: ansible-command > Invoked with warn=True executable=None _uses_shell=False > _raw_params=bash -c "rpm -qi vdsm | grep -oE > 'Version\\s+:\\s+[0-9\\.]+' | awk '{print $3}'" removes=None > creates=None chdir=None > Aug 30 11:26:26 lago-basic-suite-master-host-0 python: ansible-systemd > Invoked with no_block=False name=libvirt-guests enabled=True > daemon_reload=False state=started user=False masked=None > Aug 30 11:26:26 lago-basic-suite-master-host-0 systemd: Reloading. > Aug 30 11:26:26 lago-basic-suite-master-host-0 systemd: Cannot add > dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked. > Aug 30 11:26:26 lago-basic-suite-master-host-0 systemd: Stopped MOM > instance configured for VDSM purposes. > Aug 30 11:26:26 lago-basic-suite-master-host-0 systemd: Stopping > Virtual Desktop Server Manager... > > > could it be that it triggers a systemd-reload that makes systemd croak > on the vdsm-mom cycle? > We are not restarting VDSM within ovirt-host-deploy Ansible role, the VDSM restart is performed in host-deploy part same as in previous versions. 
Within ovirt-host-deploy-firewalld we only enable and restart firewalld service.
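[To make the vdsm/mom cycle Dan mentions concrete: if two units each pull the other in, a stop job on either one propagates to both, and a job recalculation, such as the one behind the `systemd: Reloading.` line that ansible-systemd triggers, has to resolve the cycle somehow. The fragments below are illustrative only, not the actual vdsmd or mom-vdsm unit files shipped by oVirt:]

```ini
# Illustrative fragment only -- not the shipped vdsmd.service
[Unit]
Description=Virtual Desktop Server Manager
Requires=mom-vdsm.service   ; vdsmd pulls in MOM ...

# Illustrative fragment only -- not the shipped mom-vdsm.service
[Unit]
Description=MOM instance configured for VDSM purposes
Requires=vdsmd.service      ; ... and MOM pulls in vdsmd
After=vdsmd.service
```

[With mutual `Requires=`, stopping either unit stops both, and an ordering cycle (e.g. `After=` pointing both ways) forces systemd to break the cycle by dropping jobs. The real dependencies on a host can be inspected with `systemctl list-dependencies vdsmd.service`.]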
_______________________________________________
Devel mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/devel
