On Tue, Mar 26, 2019 at 3:56 PM Sandro Bonazzola <sbona...@redhat.com> wrote:
> On Tue, Mar 26, 2019 at 2:42 PM Marcin Sobczyk <msobc...@redhat.com> wrote:
>
>> Ok, maybe I've found something. I was comparing two sets of log files -
>> one for a basic-suite-4.3 run that succeeded and one for a run that did not.
>>
>> The deployment procedure for host-1 is not complete in the unsuccessful
>> case. The last log entry is:
>>
>> 2019-03-22 17:54:49,973-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [6f2e097b] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host lago-basic-suite-4-3-host-1. Starting vdsm.
>>
>> In the successful case we can see:
>>
>> 2019-03-25 06:16:31,789-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host lago-basic-suite-4-3-host-1. Starting vdsm.
>> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: **%EventEnd STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vdsm.packages.Plugin._start (odeploycons.packages.vdsm.started)
>> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: **%EventStart STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)
>> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: ***L:INFO Starting ovirt-vmconsole-host-sshd
>> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] nextEvent: Log INFO Starting ovirt-vmconsole-host-sshd
>> 2019-03-25 06:16:38,261-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host lago-basic-suite-4-3-host-1. Starting ovirt-vmconsole-host-sshd.
>> 2019-03-25 06:16:39,479-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: **%EventEnd STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)
>> ...
>>
>> It looks like the ovirt-host-deploy plugin did not respond?
>>
>> The timing of:
>>
>> https://gerrit.ovirt.org/#/c/98762/
>>
>> roughly aligns with my patch, which was merged for vdsm-4.3 and was a
>> suspect:
>>
>> https://gerrit.ovirt.org/#/c/98748/
>>
>> Sandro, are you aware of any issues in the ovirt-host-deploy plugin that
>> might have caused these?
>
> I don't really see how a patch touching the automation configuration for
> STDCI v2 can be the cause of this, but please go ahead and blacklist the
> last build. It should differ from the previous one only by the timestamp
> of the rpm :-)
> If you want to verify that a specific version of an RPM works, you can use
> the manual job for it.
>
>> Dafna, can we temporarily blacklist the latest version of
>> ovirt-host-deploy to see if it fixes CQ?
>>
>> Marcin
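A minimal sketch of the comparison Marcin describes - pulling the host-deploy event lines for one host out of two engine.log files and diffing them. The paths and the filtering regex are illustrative, not part of any suite:

    import difflib
    import re

    # Keep only the event payload so timestamps don't pollute the diff.
    EVENT_RE = re.compile(r'(EVENT_ID: VDS_INSTALL_IN_PROGRESS|\*\*%Event(Start|End)).*')

    def deploy_events(path, host='lago-basic-suite-4-3-host-1'):
        events = []
        with open(path) as log:
            for line in log:
                if '(VdsDeploy)' not in line and host not in line:
                    continue
                match = EVENT_RE.search(line)
                if match:
                    events.append(match.group(0).strip())
        return events

    good = deploy_events('successful-run/engine.log')
    bad = deploy_events('failed-run/engine.log')
    print('\n'.join(difflib.unified_diff(good, bad, 'good', 'bad', lineterm='')))

On the logs quoted above, such a diff would end exactly at the "Starting vdsm" event, showing that the failed run never reached the EventEnd for the vdsm packages plugin.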
>> On 3/25/19 2:45 PM, Dafna Ron wrote:
>>
>> yes. I suspect this is the problem.
>>
>> In the host's yum log I can see the times the vdsm packages were installed:
>>
>> [dron@dron post-002_bootstrap.py]$ less lago-basic-suite-4-3-host-1/_var_log/yum.log |grep vdsm
>> Mar 25 07:48:49 Installed: vdsm-api-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:48:52 Installed: vdsm-yajsonrpc-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:49:00 Installed: vdsm-common-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:49:00 Installed: vdsm-hook-vhostmd-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:49:38 Installed: vdsm-network-4.30.11-23.git276b602.el7.x86_64
>> Mar 25 07:49:38 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:49:39 Installed: vdsm-client-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:49:39 Installed: vdsm-jsonrpc-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:49:39 Installed: vdsm-http-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:49:42 Installed: vdsm-hook-openstacknet-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:49:57 Installed: vdsm-hook-vmfex-dev-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:49:58 Installed: vdsm-4.30.11-23.git276b602.el7.x86_64
>> Mar 25 07:49:58 Installed: vdsm-hook-ethtool-options-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:49:58 Installed: vdsm-hook-fcoe-4.30.11-23.git276b602.el7.noarch
>>
>> and in the engine log you can see that the engine had already failed the
>> installation:
>>
>> 2019-03-25 07:48:33,920-04 INFO [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand internal: false. Entities affected : ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group CREATE_HOST with role type ADMIN
>>
>> On Mon, Mar 25, 2019 at 1:35 PM Dafna Ron <d...@redhat.com> wrote:
>>
>>> I am also adding didi.
>>> Should this not be logged in host-deploy?
>>>
>>> On Mon, Mar 25, 2019 at 1:33 PM Marcin Sobczyk <msobc...@redhat.com> wrote:
>>>
>>>> Indeed, in my failed job I can see that 'vdsm-python' is installed
>>>> *after* the engine's attempt to run 'vdsm-tool':
>>>>
>>>> lago-basic-suite-4-3-host-1/_var_log/yum.log
>>>> 403:Mar 22 17:54:12 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch
>>>>
>>>> lago-basic-suite-4-3-engine/_var_log/ovirt-engine/engine.log
>>>> 1580:2019-03-22 17:53:03,793-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Executing: '/usr/bin/vdsm-tool vdsm-id'
>>>> 1581:2019-03-22 17:53:04,043-04 WARN [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Failed to initiate vdsm-id request on host: Command returned failure code 127 during SSH session 'root@lago-basic-suite-4-3-host-1'
>>>>
>>>> On 3/25/19 2:22 PM, Dafna Ron wrote:
>>>>
>>>> and it's only failing on one host as well.
>>>>
>>>> On Mon, Mar 25, 2019 at 1:17 PM Marcin Sobczyk <msobc...@redhat.com> wrote:
>>>>
>>>>> Hmmm, that would suggest that the 'vdsm-tool' command is not found on the
>>>>> host. There was no work done around 'vdsm-tool's 'vdsm-id' recently.
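The ordering above can also be checked mechanically. A small sketch that parses the 'vdsm-python' install time out of yum.log and the engine's vdsm-tool attempt out of engine.log, then prints the gap - the paths are illustrative, and since yum.log carries no year, one is assumed:

    from datetime import datetime

    def yum_install_time(path, package='vdsm-python', year=2019):
        # yum.log lines look like "Mar 22 17:54:12 Installed: vdsm-python-...".
        with open(path) as log:
            for line in log:
                if 'Installed:' in line and package in line:
                    stamp = ' '.join(line.split()[:3])
                    return datetime.strptime('%d %s' % (year, stamp),
                                             '%Y %b %d %H:%M:%S')

    def vdsm_tool_attempt(path):
        # engine.log lines start with "2019-03-22 17:53:03,793-04 ...".
        with open(path) as log:
            for line in log:
                if "Executing: '/usr/bin/vdsm-tool" in line:
                    return datetime.strptime(line[:19], '%Y-%m-%d %H:%M:%S')

    installed = yum_install_time('lago-basic-suite-4-3-host-1/_var_log/yum.log')
    attempted = vdsm_tool_attempt(
        'lago-basic-suite-4-3-engine/_var_log/ovirt-engine/engine.log')
    if installed and attempted and attempted < installed:
        print('engine probed vdsm-tool %s before the package was installed'
              % (installed - attempted))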
>>>>> On 3/25/19 2:08 PM, Dafna Ron wrote:
>>>>>
>>>>> Actually, I can see in the engine log that it is trying to start the vdsm
>>>>> installation on the host:
>>>>>
>>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] Entered SsoRestApiAuthFilter
>>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter authenticating with sso
>>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter authenticating using BEARER header
>>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter successfully authenticated using BEARER header
>>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) [] Entered SsoRestApiNegotiationFilter
>>>>> 2019-03-25 07:48:33,124-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) [] SsoRestApiNegotiationFilter Not performing Negotiate Auth
>>>>> 2019-03-25 07:48:33,143-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getclusterforuserbyclustername(?, ?, ?)}]
>>>>> 2019-03-25 07:48:33,143-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetClusterForUserByClusterName] compiled
>>>>> 2019-03-25 07:48:33,145-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: runQuery, params: [GetClusterByName, NameQueryParameters:{refresh='false', filtered='false'}], timeElapsed: 7ms
>>>>> 2019-03-25 07:48:33,161-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [] Compiled stored procedure. Call string is [{call getvmstaticbyname(?)}]
>>>>> 2019-03-25 07:48:33,161-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [] SqlCall for procedure [GetVmStaticByName] compiled
>>>>> 2019-03-25 07:48:33,206-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getvdsbyname(?)}]
>>>>> 2019-03-25 07:48:33,206-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetVdsByName] compiled
>>>>> 2019-03-25 07:48:33,223-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getByName, params: [lago-basic-suite-4-3-host-1], timeElapsed: 20ms
>>>>> 2019-03-25 07:48:33,225-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getvdsbyhostname(?)}]
>>>>> 2019-03-25 07:48:33,225-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetVdsByHostName] compiled
>>>>> 2019-03-25 07:48:33,252-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getAllForHostname, params: [lago-basic-suite-4-3-host-1], timeElapsed: 28ms
>>>>> 2019-03-25 07:48:33,254-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getstorage_poolsbyclusterid(?)}]
>>>>> 2019-03-25 07:48:33,254-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [Getstorage_poolsByClusterId] compiled
>>>>> 2019-03-25 07:48:33,262-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connecting 'root@lago-basic-suite-4-3-host-1'
>>>>> 2019-03-25 07:48:33,461-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connected: 'root@lago-basic-suite-4-3-host-1'
>>>>> 2019-03-25 07:48:33,467-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticating: 'root@lago-basic-suite-4-3-host-1'
>>>>> 2019-03-25 07:48:33,599-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticated: 'root@lago-basic-suite-4-3-host-1'
>>>>> 2019-03-25 07:48:33,600-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Executing: '/usr/bin/vdsm-tool vdsm-id'
>>>>> 2019-03-25 07:48:33,902-04 WARN [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Failed to initiate vdsm-id request on host: Command returned failure code 127 during SSH session 'root@lago-basic-suite-4-3-host-1'
>>>>> 2019-03-25 07:48:33,920-04 INFO [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand internal: false. Entities affected : ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group CREATE_HOST with role type ADMIN
>>>>>
>>>>> On Mon, Mar 25, 2019 at 12:53 PM Dafna Ron <d...@redhat.com> wrote:
>>>>>
>>>>>> we have another failing patch with the same test:
>>>>>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/360/
>>>>>>
>>>>>> It's obviously not related to the patch, but something is causing these
>>>>>> failures randomly. From what I can see in the current failed job, you are
>>>>>> correct: the engine does not even try to deploy host-1.
>>>>>>
>>>>>> I can see host-1 is getting error 127 in lago for ntpd, and I can see
>>>>>> NetworkManager errors in the host's messages log.
>>>>>>
>>>>>> In the host's messages log I can see several messages that I suspect
>>>>>> may cause issues in the communication between the engine and the host:
>>>>>>
>>>>>> Mar 25 07:50:09 lago-basic-suite-4-3-host-1 sasldblistusers2: _sasldb_getkeyhandle has failed
>>>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
>>>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
>>>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
>>>>>>
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-UMJa2S is being created...
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-UMJa2S is being deleted...
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.7774] manager: (bondscan-UMJa2S): new Bond device (/org/freedesktop/NetworkManager/Devices/13)
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being created...
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option fail_over_mac: invalid value (3)
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option ad_select: invalid value (3)
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option lacp_rate: mode dependency failed, not supported in mode balance-rr(0)
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_validate: invalid value (7)
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.8002] manager: (bondscan-liwvMR): new Bond device (/org/freedesktop/NetworkManager/Devices/14)
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option primary_reselect: invalid value (3)
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_all_targets: invalid value (2)
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being deleted...
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being created...
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option fail_over_mac: invalid value (3)
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option ad_select: invalid value (3)
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option lacp_rate: mode dependency failed, not supported in mode active-backup(1)
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_validate: invalid value (7)
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.8429] manager: (bondscan-liwvMR): new Bond device (/org/freedesktop/NetworkManager/Devices/15)
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option primary_reselect: invalid value (3)
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_all_targets: invalid value (2)
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being deleted...
>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
>>>>>>
>>>>>> On Mon, Mar 25, 2019 at 11:24 AM Marcin Sobczyk <msobc...@redhat.com> wrote:
>>>>>>
>>>>>>> For the failed job, the engine didn't even try to deploy on host-1:
>>>>>>>
>>>>>>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/lago-basic-suite-4-3-engine/_var_log/ovirt-engine/host-deploy/
>>>>>>>
>>>>>>> Martin, do you know what could be the reason for that?
>>>>>>>
>>>>>>> I can see in the logs of both the successful and the unsuccessful
>>>>>>> basic-suite-4.3 runs that there is no 'ntpdate' on host-1:
>>>>>>>
>>>>>>> 2019-03-25 10:14:46,350::ssh.py::ssh::58::lago.ssh::DEBUG::Running d0c49b54 on lago-basic-suite-4-3-host-1: ntpdate -4 lago-basic-suite-4-3-engine
>>>>>>> 2019-03-25 10:14:46,383::ssh.py::ssh::81::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1 returned with 127
>>>>>>> 2019-03-25 10:14:46,384::ssh.py::ssh::96::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1 errors:
>>>>>>> bash: ntpdate: command not found
>>>>>>>
>>>>>>> On host-0 everything is ok:
>>>>>>>
>>>>>>> 2019-03-25 10:14:46,917::ssh.py::ssh::58::lago.ssh::DEBUG::Running d11b2a64 on lago-basic-suite-4-3-host-0: ntpdate -4 lago-basic-suite-4-3-engine
>>>>>>> 2019-03-25 10:14:53,088::ssh.py::ssh::81::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 returned with 0
>>>>>>> 2019-03-25 10:14:53,088::ssh.py::ssh::89::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 output:
>>>>>>> 25 Mar 06:14:53 ntpdate[6646]: adjust time server 192.168.202.2 offset 0.017150 sec
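Exit code 127 is the shell's "command not found", which matches the 'bash: ntpdate: command not found' error above (and the vdsm-tool failure earlier in the thread). A trivial pre-flight probe along these lines could make a missing tool explicit before the suite depends on it - the host names and the plain ssh invocation are assumptions here, not the lago API:

    import os
    import subprocess

    HOSTS = ['lago-basic-suite-4-3-host-0', 'lago-basic-suite-4-3-host-1']

    devnull = open(os.devnull, 'w')
    for host in HOSTS:
        # 'command -v' exits 0 when the tool exists and non-zero when it
        # does not; running the missing tool directly is what yields 127.
        rc = subprocess.call(['ssh', 'root@%s' % host, 'command -v ntpdate'],
                             stdout=devnull, stderr=devnull)
        print('%s: %s' % (host, 'ok' if rc == 0 else 'ntpdate missing'))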
>>>>>>> On 3/25/19 10:13 AM, Eyal Edri wrote:
>>>>>>>
>>>>>>> Still fails, now on a different component (ovirt-web-ui-extentions):
>>>>>>>
>>>>>>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/
>>>>>>>
>>>>>>> On Fri, Mar 22, 2019 at 3:59 PM Dan Kenigsberg <dan...@redhat.com> wrote:
>>>>>>>
>>>>>>>> On Fri, Mar 22, 2019 at 3:21 PM Marcin Sobczyk <msobc...@redhat.com> wrote:
>>>>>>>>
>>>>>>>>> Dafna,
>>>>>>>>>
>>>>>>>>> in 'verify_add_hosts' we specifically wait, with a timeout, for a
>>>>>>>>> single host to be up:
>>>>>>>>>
>>>>>>>>> up_hosts = hosts_service.list(search='datacenter={} AND status=up'.format(DC_NAME))
>>>>>>>>> if len(up_hosts):
>>>>>>>>>     return True
>>>>>>>>>
>>>>>>>>> The log files say that it took ~50 secs for one of the hosts to be
>>>>>>>>> up (seems reasonable) and no timeout is being reported.
>>>>>>>>> Just after running 'verify_add_hosts', we run 'add_master_storage_domain',
>>>>>>>>> which calls the '_hosts_in_dc' function. That function does the exact
>>>>>>>>> same check, but it fails:
>>>>>>>>>
>>>>>>>>> hosts = hosts_service.list(search='datacenter={} AND status=up'.format(dc_name))
>>>>>>>>> if hosts:
>>>>>>>>>     if random_host:
>>>>>>>>>         return random.choice(hosts)
>>>>>>>>
>>>>>>>> I don't think it is relevant to our current failure, but I consider
>>>>>>>> random_host=True a bad practice. As if we did not have enough moving
>>>>>>>> parts already, we are adding intentional randomness. Reproducibility is
>>>>>>>> far more important than coverage - particularly for a shared system test
>>>>>>>> like OST.
>>>>>>>>
>>>>>>>>>     else:
>>>>>>>>>         return sorted(hosts, key=lambda host: host.name)
>>>>>>>>> raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
>>>>>>>>>
>>>>>>>>> I'm also not able to reproduce this issue locally on my server.
>>>>>>>>> The investigation continues...
>>>>>>>>
>>>>>>>> I think that it would be fair to take the filtering by host state out
>>>>>>>> of Engine and into the test, where we can easily log the current status
>>>>>>>> of each host. Then we'd have a better understanding on the next failure.
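A minimal sketch of that idea, based on the snippets above - search for all hosts in the DC, log each one's status, and do the "up" filtering in the test. The ovirtsdk4 calls and helper names are assumed from context, so treat this as illustrative rather than a drop-in patch:

    import logging
    import ovirtsdk4.types as types

    LOGGER = logging.getLogger(__name__)

    def _hosts_in_dc(api, dc_name):
        hosts_service = api.system_service().hosts_service()
        # Drop the 'status=up' clause from the engine-side search so a
        # failure records what state every host was actually in.
        all_hosts = hosts_service.list(search='datacenter={}'.format(dc_name))
        up_hosts = []
        for host in all_hosts:
            LOGGER.info('host %s status: %s', host.name, host.status)
            if host.status == types.HostStatus.UP:
                up_hosts.append(host)
        if up_hosts:
            return sorted(up_hosts, key=lambda host: host.name)
        raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)

Returning the sorted list (rather than a random choice) also drops the intentional randomness Dan objects to.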
>>>>>>>>> On 3/22/19 1:17 PM, Marcin Sobczyk wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> sure, I'm on it - it's weird though, I ran the 4.3 basic suite for
>>>>>>>>> this patch manually and everything was ok.
>>>>>>>>>
>>>>>>>>> On 3/22/19 1:05 PM, Dafna Ron wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> We are failing branch 4.3 on the test:
>>>>>>>>> 002_bootstrap.add_master_storage_domain
>>>>>>>>>
>>>>>>>>> It seems that on one of the hosts vdsm is not starting;
>>>>>>>>> there is nothing in vdsm.log or in supervdsm.log.
>>>>>>>>>
>>>>>>>>> CQ identified this patch as the suspected root cause:
>>>>>>>>>
>>>>>>>>> https://gerrit.ovirt.org/#/c/98748/ - vdsm: client: Add support for flow id
>>>>>>>>>
>>>>>>>>> Milan, Marcin, can you please have a look?
>>>>>>>>>
>>>>>>>>> full logs:
>>>>>>>>>
>>>>>>>>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/326/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/
>>>>>>>>>
>>>>>>>>> The only error I can see is about the host not being up (which makes
>>>>>>>>> sense, as vdsm is not running):
>>>>>>>>>
>>>>>>>>> Stacktrace
>>>>>>>>>
>>>>>>>>> File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
>>>>>>>>>     testMethod()
>>>>>>>>> File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
>>>>>>>>>     self.test(*self.arg)
>>>>>>>>> File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 142, in wrapped_test
>>>>>>>>>     test()
>>>>>>>>> File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 60, in wrapper
>>>>>>>>>     return func(get_test_prefix(), *args, **kwargs)
>>>>>>>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 417, in add_master_storage_domain
>>>>>>>>>     add_iscsi_storage_domain(prefix)
>>>>>>>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 561, in add_iscsi_storage_domain
>>>>>>>>>     host=_random_host_from_dc(api, DC_NAME),
>>>>>>>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 122, in _random_host_from_dc
>>>>>>>>>     return _hosts_in_dc(api, dc_name, True)
>>>>>>>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 119, in _hosts_in_dc
>>>>>>>>>     raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
>>>>>>>>> 'Could not find hosts that are up in DC test-dc
>>>>>>>>> -------------------- >> begin captured logging << --------------------
>>>>>>>>> lago.ssh: DEBUG: start task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:
>>>>>>>>> lago.ssh: DEBUG: end task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:
>>>>>>>>> lago.ssh: DEBUG: Running c07b5ee2 on lago-basic-suite-4-3-engine: cat /root/multipath.txt
>>>>>>>>> lago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine returned with 0
>>>>>>>>> lago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine output:
>>>>>>>>> 3600140516f88cafa71243648ea218995
>>>>>>>>> 360014053e28f60001764fed9978ec4b3
>>>>>>>>> 360014059edc777770114a6484891dcf1
>>>>>>>>> 36001405d93d8585a50d43a4ad0bd8d19
>>>>>>>>> 36001405e31361631de14bcf87d43e55a
>>>>>>>>> -----------
> --
>
> SANDRO BONAZZOLA
>
> MANAGER, SOFTWARE ENGINEERING, EMEA R&D RHV
>
> Red Hat EMEA <https://www.redhat.com/>
>
> sbona...@redhat.com
> <https://red.ht/sig>

--

Eyal Edri

MANAGER

RHV/CNV DevOps

EMEA VIRTUALIZATION R&D

Red Hat EMEA <https://www.redhat.com/>
<https://red.ht/sig> TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>
phone: +972-9-7692018
irc: eedri (on #tlv #rhev-dev #rhev-integ)
_______________________________________________
Devel mailing list -- devel@ovirt.org
To unsubscribe send an email to devel-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/ABB6UAM37JOSSDIJLHQC3RUER7ADT3GZ/