On Mon, Mar 2, 2020 at 12:52 AM Nir Soffer <[email protected]> wrote:
>
> On Sun, Mar 1, 2020 at 10:10 AM Yedidyah Bar David <[email protected]> wrote:
> >
> > Hi all,
> >
> > On Sun, Mar 1, 2020 at 6:06 AM <[email protected]> wrote:
> > >
> > > Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/
> > > Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/366/
> >
> > I think the root cause is:
> >
> > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/366/artifact/exported-artifacts/test_logs/he-basic-suite-4.3/post-008_restart_he_vm.py/lago-he-basic-suite-4-3-host-0/_var_log/ovirt-hosted-engine-ha/broker.log
> >
> > StatusStorageThread::ERROR::2020-02-29 23:03:04,671::status_broker::90::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to update state.
> > Traceback (most recent call last):
> >   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 82, in run
> >     if (self._status_broker._inquire_whiteboard_lock() or
> >   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 195, in _inquire_whiteboard_lock
> >     self.host_id, self._lease_file)
> > SanlockException: (104, 'Sanlock lockspace inquire failure', 'Connection reset by peer')
>
> Can you point us to the source using the sanlock API?

I think it's right where the above error message says it is:

https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/broker/status_broker.py#L193

>
> The message looks like a client error accessing the sanlock server socket
> (maybe someone restarted sanlock at that point?)

Maybe, but I failed to find evidence of that :-)

> but it may also be an error code reused for a sanlock internal error when
> accessing the storage.
>
> Usually you can find more info about the error in /var/log/sanlock.log

Couldn't find anything:

https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/366/artifact/exported-artifacts/test_logs/he-basic-suite-4.3/post-008_restart_he_vm.py/lago-he-basic-suite-4-3-host-0/_var_log/sanlock.log

>
> > This caused the broker to restart itself,
>
> Restarting because sanlock failed does sound like useful error handling
> for broker clients.
>
> > and while it was doing that,
> > OST did 'hosted-engine --vm-status --json', which failed, thus failing
> > the build.
>
> If the broker may restart itself on errors, clients need to use a retry
> mechanism to deal with the restarts, so the test should probably have
> a retry mechanism before it fails.

I am not sure whether it's the test that should have that retry mechanism, or
the command 'hosted-engine --vm-status'. Opinions? If the latter, we probably
need to add user controls for this (time between retries, max number of retries).
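
To make the question a bit more concrete, here is a rough sketch of what a
client-side retry could look like, with the two controls mentioned above.
This is only an illustration, assuming Python 3.7+: the names
run_with_retries, get_vm_status, max_retries and delay are made up for this
example and are not existing hosted-engine options or OST helpers. The same
loop could live either in the test or inside the command itself.

```python
import json
import subprocess
import time


def run_with_retries(cmd, max_retries=5, delay=2.0,
                     run=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with 0 or the retries are exhausted.

    Hypothetical helper. Returns the result of the last attempt,
    whether it succeeded or not; the caller checks returncode.
    """
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    for attempt in range(max_retries + 1):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            break
        # Sleep only if another attempt will follow.
        if attempt < max_retries:
            sleep(delay)
    return result


def get_vm_status(max_retries=5, delay=2.0, **kwargs):
    """Return the parsed output of 'hosted-engine --vm-status --json'.

    Retries while the command fails, e.g. because the broker is
    restarting itself, and raises only after the last attempt fails.
    """
    result = run_with_retries(
        ["hosted-engine", "--vm-status", "--json"],
        max_retries=max_retries, delay=delay, **kwargs)
    if result.returncode != 0:
        raise RuntimeError(
            "hosted-engine --vm-status failed after %d attempts: %s"
            % (max_retries + 1, result.stderr.strip()))
    return json.loads(result.stdout)
```

With something like this, a short broker restart would only delay the caller
by a few seconds instead of failing it, while a broker that stays down still
produces an error once max_retries is exhausted.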

>
> > This seems to me like another case of a communication problem in CI.
> > Not sure what else could have caused it to fail to inquire the status
> > of the lock. This (communication) issue was mentioned several times in
> > the past already. Are we doing anything re this?

I still haven't seen any concrete reply to this point, but perhaps the reply
should be: if our CI is not completely perfect, and sometimes has communication
issues, that's simply normal life. Our users' networks are like that too.
We should simply expect that, and do what's needed (above)...

Thanks,

> >
> > Thanks and best regards,
> >
> > > Build Number: 366
> > > Build Status:  Failure
> > > Triggered By: Started by timer
> > >
> > > -------------------------------------
> > > Changes Since Last Success:
> > > -------------------------------------
> > > Changes for Build #366
> > > [Marcin Sobczyk] el8: Don't try to collect whole '/etc/httpd' dir
> > >
> > >
> > >
> > >
> > > -----------------
> > > Failed Tests:
> > > -----------------
> > > 1 tests failed.
> > > FAILED:  008_restart_he_vm.clear_global_maintenance
> > >
> > > Error Message:
> > > 1 != 0
> > > -------------------- >> begin captured logging << --------------------
> > > root: INFO: Waiting For System Stability...
> > > lago.ssh: DEBUG: start task:29a79ef5-e211-4672-ac5b-12bf0e5f8ee9:Get ssh 
> > > client for lago-he-basic-suite-4-3-host-0:
> > > lago.ssh: DEBUG: end task:29a79ef5-e211-4672-ac5b-12bf0e5f8ee9:Get ssh 
> > > client for lago-he-basic-suite-4-3-host-0:
> > > lago.ssh: DEBUG: Running 9a90ca60 on lago-he-basic-suite-4-3-host-0: 
> > > hosted-engine --set-maintenance --mode=none
> > > lago.ssh: DEBUG: Command 9a90ca60 on lago-he-basic-suite-4-3-host-0 
> > > returned with 1
> > > lago.ssh: DEBUG: Command 9a90ca60 on lago-he-basic-suite-4-3-host-0  
> > > errors:
> > >  Cannot connect to the HA daemon, please check the logs.
> > >
> > > ovirtlago.testlib: ERROR:     * Unhandled exception in <function <lambda> 
> > > at 0x7f52673872a8>
> > > Traceback (most recent call last):
> > >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 234, 
> > > in assert_equals_within
> > >     res = func()
> > >   File 
> > > "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py",
> > >  line 87, in <lambda>
> > >     lambda: _set_and_test_maintenance_mode(host, False)
> > >   File 
> > > "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py",
> > >  line 108, in _set_and_test_maintenance_mode
> > >     nt.assert_equals(ret.code, 0)
> > >   File "/usr/lib64/python2.7/unittest/case.py", line 553, in assertEqual
> > >     assertion_func(first, second, msg=msg)
> > >   File "/usr/lib64/python2.7/unittest/case.py", line 546, in 
> > > _baseAssertEqual
> > >     raise self.failureException(msg)
> > > AssertionError: 1 != 0
> > > --------------------- >> end captured logging << ---------------------
> > >
> > > Stack Trace:
> > >   File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
> > >     testMethod()
> > >   File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in 
> > > runTest
> > >     self.test(*self.arg)
> > >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 142, 
> > > in wrapped_test
> > >     test()
> > >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 60, 
> > > in wrapper
> > >     return func(get_test_prefix(), *args, **kwargs)
> > >   File 
> > > "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py",
> > >  line 87, in clear_global_maintenance
> > >     lambda: _set_and_test_maintenance_mode(host, False)
> > >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 282, 
> > > in assert_true_within_short
> > >     assert_equals_within_short(func, True, allowed_exceptions)
> > >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 266, 
> > > in assert_equals_within_short
> > >     func, value, SHORT_TIMEOUT, allowed_exceptions=allowed_exceptions
> > >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 234, 
> > > in assert_equals_within
> > >     res = func()
> > >   File 
> > > "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py",
> > >  line 87, in <lambda>
> > >     lambda: _set_and_test_maintenance_mode(host, False)
> > >   File 
> > > "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py",
> > >  line 108, in _set_and_test_maintenance_mode
> > >     nt.assert_equals(ret.code, 0)
> > >   File "/usr/lib64/python2.7/unittest/case.py", line 553, in assertEqual
> > >     assertion_func(first, second, msg=msg)
> > >   File "/usr/lib64/python2.7/unittest/case.py", line 546, in 
> > > _baseAssertEqual
> > >     raise self.failureException(msg)
> >
> >
> >
> > --
> > Didi
> > _______________________________________________
> > Infra mailing list -- [email protected]
> > To unsubscribe send an email to [email protected]
> > Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> > oVirt Code of Conduct: 
> > https://www.ovirt.org/community/about/community-guidelines/
> > List Archives: 
> > https://lists.ovirt.org/archives/list/[email protected]/message/QGRYTQWRPEF5Y2UUQI7UML5JE66GNXVA/
>


-- 
Didi
