On Mon, Mar 2, 2020 at 12:52 AM Nir Soffer <[email protected]> wrote:
>
> On Sun, Mar 1, 2020 at 10:10 AM Yedidyah Bar David <[email protected]> wrote:
> >
> > Hi all,
> >
> > On Sun, Mar 1, 2020 at 6:06 AM <[email protected]> wrote:
> > >
> > > Project:
> > > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/
> > > Build:
> > > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/366/
> >
> > I think the root cause is:
> >
> > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/366/artifact/exported-artifacts/test_logs/he-basic-suite-4.3/post-008_restart_he_vm.py/lago-he-basic-suite-4-3-host-0/_var_log/ovirt-hosted-engine-ha/broker.log
> >
> > StatusStorageThread::ERROR::2020-02-29 23:03:04,671::status_broker::90::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to update state.
> > Traceback (most recent call last):
> >   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 82, in run
> >     if (self._status_broker._inquire_whiteboard_lock() or
> >   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 195, in _inquire_whiteboard_lock
> >     self.host_id, self._lease_file)
> > SanlockException: (104, 'Sanlock lockspace inquire failure', 'Connection reset by peer')
>
> Can you point us to the source using the sanlock API?

I think it's right where the above error message says it is:

https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/broker/status_broker.py#L193

>
> The messages looks like client error accessing sanlock server socket
> (maybe someone restarted sanlock at that point?)

Maybe, but I failed to find evidence of that :-)

> but it may also be some error code reused for sanlock internal error for
> accessing the storage.
>
> Usually you can find more info about the error in /var/sanlock.log

Couldn't find anything:

https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/366/artifact/exported-artifacts/test_logs/he-basic-suite-4.3/post-008_restart_he_vm.py/lago-he-basic-suite-4-3-host-0/_var_log/sanlock.log

>
> > This caused the broker to restart itself,
>
> Restarting because sanlock failed does sound like useful error handling
> for broker clients.
>
> > and while it was doing that,
> > OST did 'hosted-engine --vm-status --json', which failed, thus failing
> > the build.
>
> If the broker may restart itself on errors, clients need to use a retry
> mechanism to deal with the restarts, so the test should probably have
> a retry mechanism before it fails.

I am not sure whether it's the test that should have that retry
mechanism, or the command 'hosted-engine --vm-status' itself. Opinions?
If the latter, we probably need to add user controls for this (time
between retries, maximum number of retries).

>
> > This seems to me like another case of a communication problem in CI.
> > Not sure what else could have caused it to fail to inquire the status
> > of the lock. This (communication) issue was mentioned several times in
> > the past already. Are we doing anything re this?

I still haven't seen any concrete reply to this point, but perhaps the
reply should be: if our CI is not completely perfect and sometimes has
communication issues, that's simply normal life - our users' networks
are like that too. We should simply expect that, and do what's needed
(the retry mechanism above)...

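To make the retry question a bit more concrete, here is a minimal sketch of what a client-side retry around 'hosted-engine --vm-status --json' could look like. It is only an illustration, not existing OST or hosted-engine code: the function name, the max_retries and delay parameters (standing in for the "user controls" mentioned above) and the injectable run/sleep hooks are all my own assumptions, and it is written for Python 3 even though the hosts in this build run Python 2.7.

```python
import json
import subprocess
import time


def vm_status_with_retries(max_retries=5, delay=2.0, run=None, sleep=time.sleep):
    """Return the parsed output of 'hosted-engine --vm-status --json'.

    Hypothetical helper: retries while the command exits non-zero, e.g.
    because the HA broker is restarting. max_retries and delay are the
    assumed user controls; run and sleep exist only to make this testable.
    """
    if run is None:
        def run():
            # Run the real command and return (exit code, stdout, stderr).
            proc = subprocess.run(
                ["hosted-engine", "--vm-status", "--json"],
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                universal_newlines=True,
            )
            return proc.returncode, proc.stdout, proc.stderr

    last_error = ""
    # One initial attempt plus up to max_retries further attempts.
    for attempt in range(1 + max_retries):
        code, out, err = run()
        if code == 0:
            return json.loads(out)
        last_error = err.strip()
        if attempt < max_retries:
            sleep(delay)  # wait before the next attempt
    raise RuntimeError(
        "hosted-engine --vm-status failed after %d attempts: %s"
        % (1 + max_retries, last_error)
    )
```

If the retry lived in the test, a wrapper like this would sit in the test scenario; if it lived in the command, the same loop would presumably move into the hosted-engine client code, with the two parameters exposed as options.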
Thanks,

> >
> > Thanks and best regards,
> >
> > > Build Number: 366
> > > Build Status: Failure
> > > Triggered By: Started by timer
> > >
> > > -------------------------------------
> > > Changes Since Last Success:
> > > -------------------------------------
> > > Changes for Build #366
> > > [Marcin Sobczyk] el8: Don't try to collect whole '/etc/httpd' dir
> > >
> > > -----------------
> > > Failed Tests:
> > > -----------------
> > > 1 tests failed.
> > > FAILED: 008_restart_he_vm.clear_global_maintenance
> > >
> > > Error Message:
> > > 1 != 0
> > > -------------------- >> begin captured logging << --------------------
> > > root: INFO: Waiting For System Stability...
> > > lago.ssh: DEBUG: start task:29a79ef5-e211-4672-ac5b-12bf0e5f8ee9:Get ssh client for lago-he-basic-suite-4-3-host-0:
> > > lago.ssh: DEBUG: end task:29a79ef5-e211-4672-ac5b-12bf0e5f8ee9:Get ssh client for lago-he-basic-suite-4-3-host-0:
> > > lago.ssh: DEBUG: Running 9a90ca60 on lago-he-basic-suite-4-3-host-0: hosted-engine --set-maintenance --mode=none
> > > lago.ssh: DEBUG: Command 9a90ca60 on lago-he-basic-suite-4-3-host-0 returned with 1
> > > lago.ssh: DEBUG: Command 9a90ca60 on lago-he-basic-suite-4-3-host-0 errors:
> > >  Cannot connect to the HA daemon, please check the logs.
> > >
> > > ovirtlago.testlib: ERROR: * Unhandled exception in <function <lambda> at 0x7f52673872a8>
> > > Traceback (most recent call last):
> > >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 234, in assert_equals_within
> > >     res = func()
> > >   File "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py", line 87, in <lambda>
> > >     lambda: _set_and_test_maintenance_mode(host, False)
> > >   File "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py", line 108, in _set_and_test_maintenance_mode
> > >     nt.assert_equals(ret.code, 0)
> > >   File "/usr/lib64/python2.7/unittest/case.py", line 553, in assertEqual
> > >     assertion_func(first, second, msg=msg)
> > >   File "/usr/lib64/python2.7/unittest/case.py", line 546, in _baseAssertEqual
> > >     raise self.failureException(msg)
> > > AssertionError: 1 != 0
> > > --------------------- >> end captured logging << ---------------------
> > >
> > > Stack Trace:
> > >   File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
> > >     testMethod()
> > >   File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
> > >     self.test(*self.arg)
> > >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 142, in wrapped_test
> > >     test()
> > >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 60, in wrapper
> > >     return func(get_test_prefix(), *args, **kwargs)
> > >   File "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py", line 87, in clear_global_maintenance
> > >     lambda: _set_and_test_maintenance_mode(host, False)
> > >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 282, in assert_true_within_short
> > >     assert_equals_within_short(func, True, allowed_exceptions)
> > >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 266, in assert_equals_within_short
> > >     func, value, SHORT_TIMEOUT, allowed_exceptions=allowed_exceptions
> > >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 234, in assert_equals_within
> > >     res = func()
> > >   File "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py", line 87, in <lambda>
> > >     lambda: _set_and_test_maintenance_mode(host, False)
> > >   File "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py", line 108, in _set_and_test_maintenance_mode
> > >     nt.assert_equals(ret.code, 0)
> > >   File "/usr/lib64/python2.7/unittest/case.py", line 553, in assertEqual
> > >     assertion_func(first, second, msg=msg)
> > >   File "/usr/lib64/python2.7/unittest/case.py", line 546, in _baseAssertEqual
> > >     raise self.failureException(msg)
> > >
> >
> > --
> > Didi
> > _______________________________________________
> > Infra mailing list -- [email protected]
> > To unsubscribe send an email to [email protected]
> > Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> > oVirt Code of Conduct:
> > https://www.ovirt.org/community/about/community-guidelines/
> > List Archives:
> > https://lists.ovirt.org/archives/list/[email protected]/message/QGRYTQWRPEF5Y2UUQI7UML5JE66GNXVA/
>

--
Didi
_______________________________________________
Infra mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/HJMZGCXXMYJOB55GQUZVD7XPYCIL4LLU/
