We never remove any packages from tested :) I am actually suspecting a problem that is caused by something that is shared by all projects, like the API. We currently have 3 different tests that are randomly failing because of engine/host communication, and I am starting to think they are all connected.
On Tue, Mar 26, 2019 at 2:03 PM Eyal Edri <[email protected]> wrote:

>
> On Tue, Mar 26, 2019 at 3:56 PM Sandro Bonazzola <[email protected]> wrote:
>
>>
>> On Tue, Mar 26, 2019 at 2:42 PM Marcin Sobczyk <[email protected]> wrote:
>>
>>> Ok, maybe I've found something. I was comparing two sets of log files -
>>> one for a basic-suite-4.3 run that succeeded and one that did not.
>>>
>>> The deployment procedure for host-1 is not complete in the unsuccessful
>>> case. The last log entry is:
>>>
>>> 2019-03-22 17:54:49,973-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [6f2e097b] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host lago-basic-suite-4-3-host-1. Starting vdsm.
>>>
>>> In the successful case we can see:
>>>
>>> 2019-03-25 06:16:31,789-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host lago-basic-suite-4-3-host-1. Starting vdsm.
>>> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: **%EventEnd STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vdsm.packages.Plugin._start (odeploycons.packages.vdsm.started)
>>> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: **%EventStart STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)
>>> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: ***L:INFO Starting ovirt-vmconsole-host-sshd
>>> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] nextEvent: Log INFO Starting ovirt-vmconsole-host-sshd
>>> 2019-03-25 06:16:38,261-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host lago-basic-suite-4-3-host-1. Starting ovirt-vmconsole-host-sshd.
>>> 2019-03-25 06:16:39,479-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: **%EventEnd STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)
>>> ...
>>>
>>> It looks like the ovirt-host-deploy plugin did not respond?
>>>
>>> The timing of:
>>>
>>> https://gerrit.ovirt.org/#/c/98762/
>>>
>>> kinda aligns with my patch, which was merged for vdsm-4.3 and was a suspect:
>>>
>>> https://gerrit.ovirt.org/#/c/98748/
>>>
>>> Sandro, are you aware of any issues with the ovirt-host-deploy plugin that might have caused these?
>>
>> I don't really see how a patch touching the automation configuration for
>> STDCI v2 can be the cause of this, but please go ahead and blacklist the
>> last build. It should differ from the previous one only by the timestamp
>> of the rpm :-)
>
> If you want to verify that a specific version of an RPM works, you can use
> the manual job for it.
>
>>> Dafna, can we temporarily blacklist the latest version of
>>> ovirt-host-deploy to see if it fixes CQ?
>>>
>>> Marcin
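>>>
>>> To make that comparison less manual, a rough, untested sketch like this
>>> could diff the (VdsDeploy) event streams of the two engine.log files and
>>> print where the failed run stops (the two log paths are passed as
>>> arguments; this is just how I eyeballed it, not something CQ runs):
>>>
>>> import re
>>> import sys
>>>
>>> # Collect the message part of every (VdsDeploy) line from an engine.log.
>>> def vdsdeploy_events(path):
>>>     events = []
>>>     with open(path) as f:
>>>         for line in f:
>>>             # e.g. "... (VdsDeploy) [3add0cac] Got: **%EventEnd STAGE ..."
>>>             m = re.search(r'\(VdsDeploy\) \[\w+\] (.*)', line)
>>>             if m:
>>>                 events.append(m.group(1).strip())
>>>     return events
>>>
>>> good = vdsdeploy_events(sys.argv[1])  # engine.log from the successful run
>>> bad = vdsdeploy_events(sys.argv[2])   # engine.log from the failed run
>>>
>>> # Report the first event of the good run that the failed run never reached.
>>> for i, event in enumerate(good):
>>>     if i >= len(bad) or bad[i] != event:
>>>         print('runs diverge at event #%d' % i)
>>>         print('  good: %s' % event)
>>>         print('  bad:  %s' % (bad[i] if i < len(bad) else '<deploy stopped here>'))
>>>         break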
>>> On 3/25/19 2:45 PM, Dafna Ron wrote:
>>>
>>> yes. I suspect this is the problem.
>>>
>>> In the host I can see in the yum log the times the vdsm packages were installed:
>>>
>>> [dron@dron post-002_bootstrap.py]$ less lago-basic-suite-4-3-host-1/_var_log/yum.log |grep vdsm
>>> Mar 25 07:48:49 Installed: vdsm-api-4.30.11-23.git276b602.el7.noarch
>>> Mar 25 07:48:52 Installed: vdsm-yajsonrpc-4.30.11-23.git276b602.el7.noarch
>>> Mar 25 07:49:00 Installed: vdsm-common-4.30.11-23.git276b602.el7.noarch
>>> Mar 25 07:49:00 Installed: vdsm-hook-vhostmd-4.30.11-23.git276b602.el7.noarch
>>> Mar 25 07:49:38 Installed: vdsm-network-4.30.11-23.git276b602.el7.x86_64
>>> Mar 25 07:49:38 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch
>>> Mar 25 07:49:39 Installed: vdsm-client-4.30.11-23.git276b602.el7.noarch
>>> Mar 25 07:49:39 Installed: vdsm-jsonrpc-4.30.11-23.git276b602.el7.noarch
>>> Mar 25 07:49:39 Installed: vdsm-http-4.30.11-23.git276b602.el7.noarch
>>> Mar 25 07:49:42 Installed: vdsm-hook-openstacknet-4.30.11-23.git276b602.el7.noarch
>>> Mar 25 07:49:57 Installed: vdsm-hook-vmfex-dev-4.30.11-23.git276b602.el7.noarch
>>> Mar 25 07:49:58 Installed: vdsm-4.30.11-23.git276b602.el7.x86_64
>>> Mar 25 07:49:58 Installed: vdsm-hook-ethtool-options-4.30.11-23.git276b602.el7.noarch
>>> Mar 25 07:49:58 Installed: vdsm-hook-fcoe-4.30.11-23.git276b602.el7.noarch
>>>
>>> and in the engine log you can see that the engine had already failed the installation:
>>>
>>> 2019-03-25 07:48:33,920-04 INFO [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand internal: false. Entities affected : ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group CREATE_HOST with role type ADMIN
>>>
>>> On Mon, Mar 25, 2019 at 1:35 PM Dafna Ron <[email protected]> wrote:
>>>
>>>> I am also adding didi.
>>>> Should this not be logged in host-deploy?
>>>>
>>>> On Mon, Mar 25, 2019 at 1:33 PM Marcin Sobczyk <[email protected]> wrote:
>>>>
>>>>> Indeed, in my failed job I can see that 'vdsm-python' is installed
>>>>> *after* the engine's attempt to run 'vdsm-tool':
>>>>>
>>>>> lago-basic-suite-4-3-host-1/_var_log/yum.log
>>>>> 403:Mar 22 17:54:12 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch
>>>>>
>>>>> lago-basic-suite-4-3-engine/_var_log/ovirt-engine/engine.log
>>>>> 1580:2019-03-22 17:53:03,793-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Executing: '/usr/bin/vdsm-tool vdsm-id'
>>>>> 1581:2019-03-22 17:53:04,043-04 WARN [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Failed to initiate vdsm-id request on host: Command returned failure code 127 during SSH session 'root@lago-basic-suite-4-3-host-1'
>>>>>
>>>>> On 3/25/19 2:22 PM, Dafna Ron wrote:
>>>>>
>>>>> and it's only failing on one host as well.
>>>>>
>>>>> On Mon, Mar 25, 2019 at 1:17 PM Marcin Sobczyk <[email protected]> wrote:
>>>>>
>>>>>> Hmmm, that would suggest that the 'vdsm-tool' command is not found on
>>>>>> the host. There was no recent work done around 'vdsm-tool's 'vdsm-id'.
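>>>>>>
>>>>>> Failure code 127 is bash's "command not found", which fits the
>>>>>> install-order race above. A quick, hand-rolled check of the race could
>>>>>> be to compare the vdsm-python install time from yum.log with the
>>>>>> engine's first vdsm-tool probe - an untested sketch, where the
>>>>>> hardcoded year and the relative log paths are assumptions:
>>>>>>
>>>>>> from datetime import datetime
>>>>>>
>>>>>> # Was vdsm-python installed only *after* the engine's first vdsm-tool
>>>>>> # probe? Both logs here happen to use the same -04 local time.
>>>>>> def yum_install_time(path, package):
>>>>>>     with open(path) as f:
>>>>>>         for line in f:
>>>>>>             if 'Installed: ' + package in line:
>>>>>>                 # "Mar 22 17:54:12 Installed: vdsm-python-..."
>>>>>>                 return datetime.strptime('2019 ' + line[:15], '%Y %b %d %H:%M:%S')
>>>>>>
>>>>>> def first_vdsm_tool_probe(path):
>>>>>>     with open(path) as f:
>>>>>>         for line in f:
>>>>>>             if "Executing: '/usr/bin/vdsm-tool" in line:
>>>>>>                 # "2019-03-22 17:53:03,793-04 DEBUG ..."
>>>>>>                 return datetime.strptime(line[:19], '%Y-%m-%d %H:%M:%S')
>>>>>>
>>>>>> installed = yum_install_time('lago-basic-suite-4-3-host-1/_var_log/yum.log', 'vdsm-python')
>>>>>> probed = first_vdsm_tool_probe('lago-basic-suite-4-3-engine/_var_log/ovirt-engine/engine.log')
>>>>>> if installed and probed and probed < installed:
>>>>>>     print('race: engine probed vdsm-tool at %s, vdsm-python installed at %s'
>>>>>>           % (probed, installed))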
>>>>>> On 3/25/19 2:08 PM, Dafna Ron wrote:
>>>>>>
>>>>>> Actually, I can see in the engine that it is trying to start vdsm installation on the host:
>>>>>>
>>>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] Entered SsoRestApiAuthFilter
>>>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter authenticating with sso
>>>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter authenticating using BEARER header
>>>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter successfully authenticated using BEARER header
>>>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) [] Entered SsoRestApiNegotiationFilter
>>>>>> 2019-03-25 07:48:33,124-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) [] SsoRestApiNegotiationFilter Not performing Negotiate Auth
>>>>>> 2019-03-25 07:48:33,143-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getclusterforuserbyclustername(?, ?, ?)}]
>>>>>> 2019-03-25 07:48:33,143-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetClusterForUserByClusterName] compiled
>>>>>> 2019-03-25 07:48:33,145-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: runQuery, params: [GetClusterByName, NameQueryParameters:{refresh='false', filtered='false'}], timeElapsed: 7ms
>>>>>> 2019-03-25 07:48:33,161-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [] Compiled stored procedure. Call string is [{call getvmstaticbyname(?)}]
>>>>>> 2019-03-25 07:48:33,161-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [] SqlCall for procedure [GetVmStaticByName] compiled
>>>>>> 2019-03-25 07:48:33,206-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getvdsbyname(?)}]
>>>>>> 2019-03-25 07:48:33,206-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetVdsByName] compiled
>>>>>> 2019-03-25 07:48:33,223-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getByName, params: [lago-basic-suite-4-3-host-1], timeElapsed: 20ms
>>>>>> 2019-03-25 07:48:33,225-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getvdsbyhostname(?)}]
>>>>>> 2019-03-25 07:48:33,225-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetVdsByHostName] compiled
>>>>>> 2019-03-25 07:48:33,252-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getAllForHostname, params: [lago-basic-suite-4-3-host-1], timeElapsed: 28ms
>>>>>> 2019-03-25 07:48:33,254-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getstorage_poolsbyclusterid(?)}]
>>>>>> 2019-03-25 07:48:33,254-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [Getstorage_poolsByClusterId] compiled
>>>>>> 2019-03-25 07:48:33,262-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connecting 'root@lago-basic-suite-4-3-host-1'
>>>>>> 2019-03-25 07:48:33,461-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connected: 'root@lago-basic-suite-4-3-host-1'
>>>>>> 2019-03-25 07:48:33,467-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticating: 'root@lago-basic-suite-4-3-host-1'
>>>>>> 2019-03-25 07:48:33,599-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticated: 'root@lago-basic-suite-4-3-host-1'
>>>>>> 2019-03-25 07:48:33,600-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Executing: '/usr/bin/vdsm-tool vdsm-id'
>>>>>> 2019-03-25 07:48:33,902-04 WARN [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Failed to initiate vdsm-id request on host: Command returned failure code 127 during SSH session 'root@lago-basic-suite-4-3-host-1'
>>>>>> 2019-03-25 07:48:33,920-04 INFO [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand internal: false. Entities affected : ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group CREATE_HOST with role type ADMIN
>>>>>> On Mon, Mar 25, 2019 at 12:53 PM Dafna Ron <[email protected]> wrote:
>>>>>>
>>>>>>> we have another failing patch with the same test:
>>>>>>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/360/
>>>>>>>
>>>>>>> It's obviously not related to the patch, but there is something that
>>>>>>> is causing these failures randomly.
>>>>>>> From what I can see in the current failed job, you are correct about
>>>>>>> host-1, and the engine does not even try to deploy it.
>>>>>>>
>>>>>>> I can see host-1 is getting error 127 in lago for ntpdate, and I can
>>>>>>> see NetworkManager errors in the host's messages log.
>>>>>>>
>>>>>>> In the host's messages log I can see several messages that I suspect
>>>>>>> may cause issues in communication between engine and host:
>>>>>>>
>>>>>>> Mar 25 07:50:09 lago-basic-suite-4-3-host-1 sasldblistusers2: _sasldb_getkeyhandle has failed
>>>>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
>>>>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
>>>>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
>>>>>>>
>>>>>>> ces/12)
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-UMJa2S is being created...
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-UMJa2S is being deleted...
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.7774] manager: (bondscan-UMJa2S): new Bond device (/org/freedesktop/NetworkManager/Devices/13)
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being created...
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option fail_over_mac: invalid value (3)
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option ad_select: invalid value (3)
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option lacp_rate: mode dependency failed, not supported in mode balance-rr(0)
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_validate: invalid value (7)
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.8002] manager: (bondscan-liwvMR): new Bond device (/org/freedesktop/NetworkManager/Devices/14)
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option primary_reselect: invalid value (3)
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_all_targets: invalid value (2)
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being deleted...
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being created...
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option fail_over_mac: invalid value (3)
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option ad_select: invalid value (3)
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option lacp_rate: mode dependency failed, not supported in mode active-backup(1)
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_validate: invalid value (7)
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.8429] manager: (bondscan-liwvMR): new Bond device (/org/freedesktop/NetworkManager/Devices/15)
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option primary_reselect: invalid value (3)
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_all_targets: invalid value (2)
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being deleted...
>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
>>>>>>>
>>>>>>> On Mon, Mar 25, 2019 at 11:24 AM Marcin Sobczyk <[email protected]> wrote:
>>>>>>>
>>>>>>>> For the failed job, the engine didn't even try to deploy on host-1:
>>>>>>>>
>>>>>>>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/lago-basic-suite-4-3-engine/_var_log/ovirt-engine/host-deploy/
>>>>>>>>
>>>>>>>> Martin, do you know what could be the reason for that?
>>>>>>>>
>>>>>>>> I can see in the logs of both the successful and the unsuccessful
>>>>>>>> basic-suite-4.3 runs that there is no 'ntpdate' on host-1:
>>>>>>>>
>>>>>>>> 2019-03-25 10:14:46,350::ssh.py::ssh::58::lago.ssh::DEBUG::Running d0c49b54 on lago-basic-suite-4-3-host-1: ntpdate -4 lago-basic-suite-4-3-engine
>>>>>>>> 2019-03-25 10:14:46,383::ssh.py::ssh::81::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1 returned with 127
>>>>>>>> 2019-03-25 10:14:46,384::ssh.py::ssh::96::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1 errors:
>>>>>>>> bash: ntpdate: command not found
>>>>>>>>
>>>>>>>> On host-0 everything is ok:
>>>>>>>>
>>>>>>>> 2019-03-25 10:14:46,917::ssh.py::ssh::58::lago.ssh::DEBUG::Running d11b2a64 on lago-basic-suite-4-3-host-0: ntpdate -4 lago-basic-suite-4-3-engine
>>>>>>>> 2019-03-25 10:14:53,088::ssh.py::ssh::81::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 returned with 0
>>>>>>>> 2019-03-25 10:14:53,088::ssh.py::ssh::89::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 output:
>>>>>>>> 25 Mar 06:14:53 ntpdate[6646]: adjust time server 192.168.202.2 offset 0.017150 sec
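>>>>>>>>
>>>>>>>> If we want the suite to fail loudly on this, a tiny untested sketch
>>>>>>>> like the following could probe the hosts for the binary up front
>>>>>>>> (host names are copied from above; calling plain ssh instead of
>>>>>>>> lago's wrapper is an assumption):
>>>>>>>>
>>>>>>>> import subprocess
>>>>>>>>
>>>>>>>> # 'command -v' exits non-zero when the binary is missing, so this
>>>>>>>> # surfaces a host-1-vs-host-0 package difference before time sync.
>>>>>>>> for host in ('lago-basic-suite-4-3-host-0', 'lago-basic-suite-4-3-host-1'):
>>>>>>>>     rc = subprocess.call(['ssh', 'root@' + host, 'command -v ntpdate'])
>>>>>>>>     print('%s: ntpdate %s' % (host, 'present' if rc == 0 else 'MISSING'))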
>>>>>>>> On 3/25/19 10:13 AM, Eyal Edri wrote:
>>>>>>>>
>>>>>>>> Still fails, now on a different component (ovirt-web-ui-extentions):
>>>>>>>>
>>>>>>>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/
>>>>>>>>
>>>>>>>> On Fri, Mar 22, 2019 at 3:59 PM Dan Kenigsberg <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> On Fri, Mar 22, 2019 at 3:21 PM Marcin Sobczyk <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Dafna,
>>>>>>>>>>
>>>>>>>>>> in 'verify_add_hosts' we specifically wait for a single host to be
>>>>>>>>>> up, with a timeout:
>>>>>>>>>>
>>>>>>>>>> 144 up_hosts = hosts_service.list(search='datacenter={} AND status=up'.format(DC_NAME))
>>>>>>>>>> 145 if len(up_hosts):
>>>>>>>>>> 146     return True
>>>>>>>>>>
>>>>>>>>>> The log files say that it took ~50 secs for one of the hosts to be
>>>>>>>>>> up (seems reasonable) and no timeout is being reported.
>>>>>>>>>> Just after running 'verify_add_hosts', we run
>>>>>>>>>> 'add_master_storage_domain', which calls the '_hosts_in_dc'
>>>>>>>>>> function. That function does the exact same check, but it fails:
>>>>>>>>>>
>>>>>>>>>> 113 hosts = hosts_service.list(search='datacenter={} AND status=up'.format(dc_name))
>>>>>>>>>> 114 if hosts:
>>>>>>>>>> 115     if random_host:
>>>>>>>>>> 116         return random.choice(hosts)
>>>>>>>>>
>>>>>>>>> I don't think it is relevant to our current failure, but I consider
>>>>>>>>> random_host=True a bad practice. As if we do not have enough moving
>>>>>>>>> parts, we are adding intentional randomness. Reproducibility is far
>>>>>>>>> more important than coverage - particularly for a shared system test
>>>>>>>>> like OST.
>>>>>>>>>
>>>>>>>>>> 117     else:
>>>>>>>>>> 118         return sorted(hosts, key=lambda host: host.name)
>>>>>>>>>> 119 raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
>>>>>>>>>>
>>>>>>>>>> I'm also not able to reproduce this issue locally on my server.
>>>>>>>>>> The investigation continues...
>>>>>>>>>
>>>>>>>>> I think that it would be fair to take the filtering by host state
>>>>>>>>> out of Engine and into the test, where we can easily log the current
>>>>>>>>> status of each host. Then we'd have better understanding on the next
>>>>>>>>> failure.
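>>>>>>>>>
>>>>>>>>> Roughly what I have in mind - an untested sketch against the same
>>>>>>>>> ovirt-engine-sdk4 calls used above, with deterministic ordering
>>>>>>>>> instead of random.choice (the logger setup is just an assumption,
>>>>>>>>> use whatever the suite has):
>>>>>>>>>
>>>>>>>>> import logging
>>>>>>>>>
>>>>>>>>> from ovirtsdk4 import types
>>>>>>>>>
>>>>>>>>> LOGGER = logging.getLogger(__name__)
>>>>>>>>>
>>>>>>>>> def _hosts_in_dc(api, dc_name):
>>>>>>>>>     # Search without the status filter, so the test decides what
>>>>>>>>>     # "up" means and can log the state of every host when the
>>>>>>>>>     # check fails.
>>>>>>>>>     hosts_service = api.system_service().hosts_service()
>>>>>>>>>     all_hosts = hosts_service.list(search='datacenter={}'.format(dc_name))
>>>>>>>>>     for host in all_hosts:
>>>>>>>>>         LOGGER.debug('host %s is %s', host.name, host.status)
>>>>>>>>>     up_hosts = [h for h in all_hosts if h.status == types.HostStatus.UP]
>>>>>>>>>     if not up_hosts:
>>>>>>>>>         raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
>>>>>>>>>     # Deterministic order instead of random.choice - reproducibility
>>>>>>>>>     # matters more than coverage in a shared system test.
>>>>>>>>>     return sorted(up_hosts, key=lambda host: host.name)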
>>>>>>>>> On 3/22/19 1:17 PM, Marcin Sobczyk wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> sure, I'm on it - it's weird though, I did run the 4.3 basic suite
>>>>>>>>>> for this patch manually and everything was ok.
>>>>>>>>>>
>>>>>>>>>> On 3/22/19 1:05 PM, Dafna Ron wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> We are failing branch 4.3 for test: 002_bootstrap.add_master_storage_domain
>>>>>>>>>>
>>>>>>>>>> It seems that on one of the hosts vdsm is not starting; there is
>>>>>>>>>> nothing in vdsm.log or in supervdsm.log.
>>>>>>>>>>
>>>>>>>>>> CQ identified this patch as the suspected root cause:
>>>>>>>>>>
>>>>>>>>>> https://gerrit.ovirt.org/#/c/98748/ - vdsm: client: Add support for flow id
>>>>>>>>>>
>>>>>>>>>> Milan, Marcin, can you please have a look?
>>>>>>>>>>
>>>>>>>>>> full logs:
>>>>>>>>>>
>>>>>>>>>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/326/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/
>>>>>>>>>>
>>>>>>>>>> The only error I can see is about the host not being up (makes
>>>>>>>>>> sense, as vdsm is not running):
>>>>>>>>>>
>>>>>>>>>> Stacktrace
>>>>>>>>>>
>>>>>>>>>> File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
>>>>>>>>>>     testMethod()
>>>>>>>>>> File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
>>>>>>>>>>     self.test(*self.arg)
>>>>>>>>>> File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 142, in wrapped_test
>>>>>>>>>>     test()
>>>>>>>>>> File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 60, in wrapper
>>>>>>>>>>     return func(get_test_prefix(), *args, **kwargs)
>>>>>>>>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 417, in add_master_storage_domain
>>>>>>>>>>     add_iscsi_storage_domain(prefix)
>>>>>>>>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 561, in add_iscsi_storage_domain
>>>>>>>>>>     host=_random_host_from_dc(api, DC_NAME),
>>>>>>>>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 122, in _random_host_from_dc
>>>>>>>>>>     return _hosts_in_dc(api, dc_name, True)
>>>>>>>>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 119, in _hosts_in_dc
>>>>>>>>>>     raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
>>>>>>>>>> 'Could not find hosts that are up in DC test-dc
>>>>>>>>>> -------------------- >> begin captured logging << --------------------
>>>>>>>>>> lago.ssh: DEBUG: start task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:
>>>>>>>>>> lago.ssh: DEBUG: end task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:
>>>>>>>>>> lago.ssh: DEBUG: Running c07b5ee2 on lago-basic-suite-4-3-engine: cat /root/multipath.txt
>>>>>>>>>> lago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine returned with 0
>>>>>>>>>> lago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine output:
>>>>>>>>>> 3600140516f88cafa71243648ea218995
>>>>>>>>>> 360014053e28f60001764fed9978ec4b3
>>>>>>>>>> 360014059edc777770114a6484891dcf1
>>>>>>>>>> 36001405d93d8585a50d43a4ad0bd8d19
>>>>>>>>>> 36001405e31361631de14bcf87d43e55a
>>>>>>>>>> -----------
