On Thu, Apr 20, 2017 at 13:05, Piotr Kliczewski <[email protected]> wrote:
> On Thu, Apr 20, 2017 at 11:49 AM, Yaniv Kaul <[email protected]> wrote:
> >
> > On Thu, Apr 20, 2017 at 11:55 AM, Piotr Kliczewski
> > <[email protected]> wrote:
> >>
> >> On Thu, Apr 20, 2017 at 10:32 AM, Yaniv Kaul <[email protected]> wrote:
> >> > No, that's not the issue.
> >> > I've seen it happening a few times.
> >> >
> >> > 1. It is always with the ISO domain (which we don't use anyway in o-s-t).
> >> > 2. Apparently, only one host is asking for a mount:
> >> > authenticated mount request from 192.168.201.4:713 for /exports/nfs/iso (/exports/nfs/iso)
> >> >
> >> > (/var/log/messages of the NFS server)
> >> >
> >> > And indeed, you can see in [1] that host1 made the request and all is
> >> > well on it.
> >> >
> >> > However, there are connection issues with host0 which cause a timeout
> >> > in connectStorageServer(). From [2]:
> >> >
> >> > 2017-04-19 18:58:58,465-04 DEBUG
> >> > [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) []
> >> > Message received:
> >> > {"jsonrpc":"2.0","error":{"code":"lago-basic-suite-master-host0:192912448","message":"Vds timeout occured"},"id":null}
> >> >
> >> > 2017-04-19 18:58:58,475-04 ERROR
> >> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> >> > (org.ovirt.thread.pool-7-thread-37) [755b908a] EVENT_ID:
> >> > VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null,
> >> > Custom Event ID: -1, Message: VDSM lago-basic-suite-master-host0 command
> >> > ConnectStorageServerVDS failed: Message timeout which can be caused by
> >> > communication issues
> >> > 2017-04-19 18:58:58,475-04 INFO
> >> > [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand]
> >> > (org.ovirt.thread.pool-7-thread-37) [755b908a] Command
> >> > 'org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand'
> >> > return value '
> >> > ServerConnectionStatusReturn:{status='Status [code=5022, message=Message
> >> > timeout which can be caused by communication issues]'}
> >> >
> >> >
> >> > I wonder why, but in /var/log/messages [3] I'm seeing:
> >> > Apr 19 18:56:58 lago-basic-suite-master-host0 journal: vdsm Executor WARN
> >> > Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask
> >> > {'params': {u'connectionParams': [{u'id': u'4ca8fc84-d872-4a7f-907f-9445bda7b6d1',
> >> > u'connection': u'192.168.201.3:/exports/nfs/share1', u'iqn': u'', u'user': u'',
> >> > u'tpgt': u'1', u'protocol_version': u'4.2', u'password': '********', u'port': u''}],
> >> > u'storagepoolID': u'00000000-0000-0000-0000-000000000000', u'domainType': 1},
> >> > 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer',
> >> > 'id': u'057da9c2-1e67-4c2f-9511-7d9de250386b'} at 0x2f44110> timeout=60,
> >> > duration=60 at 0x2f44310> task#=9 at 0x2ac11d0>
> >> > ...
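The "Worker blocked" line above is vdsm's executor reporting that the worker's
current task has been running longer than its timeout (here 60 seconds). A
rough, scaled-down sketch of that idea in Python follows; the names and
numbers are illustrative only, not vdsm's actual Executor code:

import logging
import threading
import time

logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("Executor")


class Worker(threading.Thread):
    def __init__(self, name, task, timeout):
        super().__init__(name=name, daemon=True)
        self.task = task          # callable that may block, e.g. an NFS mount
        self.timeout = timeout    # seconds before the worker is reported as blocked
        self.started = None

    def run(self):
        self.started = time.monotonic()
        self.task()

    def duration(self):
        return 0.0 if self.started is None else time.monotonic() - self.started


def watchdog(workers, interval):
    # Periodically report every worker still running past its timeout,
    # so a stuck call shows up in the log instead of hanging silently.
    while any(w.is_alive() for w in workers):
        for w in workers:
            if w.is_alive() and w.duration() > w.timeout:
                log.warning("Worker blocked: <%s timeout=%d, duration=%d>",
                            w.name, w.timeout, w.duration())
        time.sleep(interval)


if __name__ == "__main__":
    # Scaled down: timeout=2s instead of 60s; the task blocks for 6s.
    stuck = Worker("jsonrpc/3", task=lambda: time.sleep(6), timeout=2)
    stuck.start()
    watchdog([stuck], interval=1)

With the scaled-down numbers the watchdog keeps logging the blocked worker
every second once the timeout is exceeded, much like the real executor kept
reporting jsonrpc/3 at duration=60 and later duration=180 below.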
> >> >
> >>
> >> I see the following sequence:
> >>
> >> The message is sent:
> >>
> >> 2017-04-19 18:55:58,020-04 DEBUG
> >> [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient]
> >> (org.ovirt.thread.pool-7-thread-37) [755b908a] Message sent: SEND
> >> destination:jms.topic.vdsm_requests
> >> content-length:381
> >> ovirtCorrelationId:755b908a
> >> reply-to:jms.topic.vdsm_responses
> >>
> >> <JsonRpcRequest id: "057da9c2-1e67-4c2f-9511-7d9de250386b", method:
> >> StoragePool.connectStorageServer, params:
> >> {storagepoolID=00000000-0000-0000-0000-000000000000, domainType=1,
> >> connectionParams=[{password=, protocol_version=4.2, port=, iqn=,
> >> connection=192.168.201.3:/exports/nfs/share1,
> >> id=4ca8fc84-d872-4a7f-907f-9445bda7b6d1, user=, tpgt=1}]}>
> >>
> >> There is no response within the specified amount of time and we time out:
> >>
> >> 2017-04-19 18:58:58,465-04 DEBUG
> >> [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker]
> >> (ResponseWorker) [] Message received:
> >> {"jsonrpc":"2.0","error":{"code":"lago-basic-suite-master-host0:192912448","message":"Vds timeout occured"},"id":null}
> >>
> >> As Yaniv pointed out, here is why we never get the response:
> >>
> >> Apr 19 18:58:58 lago-basic-suite-master-host0 journal: vdsm Executor
> >> WARN Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask
> >> {'params': {u'connectionParams': [{u'id': u'4ca8fc84-d872-4a7f-907f-9445bda7b6d1',
> >> u'connection': u'192.168.201.3:/exports/nfs/share1', u'iqn': u'', u'user': u'',
> >> u'tpgt': u'1', u'protocol_version': u'4.2', u'password': '********', u'port': u''}],
> >> u'storagepoolID': u'00000000-0000-0000-0000-000000000000', u'domainType': 1},
> >> 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer',
> >> 'id': u'057da9c2-1e67-4c2f-9511-7d9de250386b'} at 0x2f44110> timeout=60,
> >> duration=180 at 0x2f44310> task#=9 at 0x2ac11d0>

This means the connection attempt was stuck for 180 seconds. We need to check
whether the mount itself was stuck, or whether there is some issue in supervdsm
running it. This warning was introduced recently; before that, a stuck worker
was simply hidden.

Do we check that the NFS server is up before we start the tests? (A possible
pre-flight check is sketched at the end of this mail.)

> >> >
> >> >
> >> > 3. Also, there is still the infamous "unable to update response" issue.
> >> >
> >>
> >> When we see a timeout on a call, our default behavior is to reconnect,
> >> at which point we clean pending messages.
> >> As a result, when we reconnect and then receive a response to a message
> >> sent before the disconnect, we say it is unknown to us.
> >
> >
> > But the example I've given was earlier than the storage issue?
>
> The specific message that you refer to was a ping command, but it timed out
> (3 secs) before the response arrived and it was removed from tracking. When
> the response finally arrived it was not tracked anymore.
>
> We may want to increase the timeout to give it more time to arrive.
>
> > Y.
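To make the "unknown to us" behaviour Piotr describes concrete, here is a
minimal Python illustration of request tracking with expiry. This is not the
real ovirt-vdsm-jsonrpc-java client, just the idea: a call whose timeout fires
is removed from the pending table, so a response that arrives afterwards cannot
be matched and is reported the same way as the "Not able to update response"
error in the log excerpt that follows.

import logging
import time
import uuid

logging.basicConfig(format="%(levelname)s %(message)s", level=logging.DEBUG)
log = logging.getLogger("JsonRpcClient")


class CallTracker:
    def __init__(self):
        self._pending = {}  # request id -> (method, deadline)

    def track(self, method, timeout):
        request_id = str(uuid.uuid4())
        self._pending[request_id] = (method, time.monotonic() + timeout)
        return request_id

    def expire(self):
        # Drop every pending call whose deadline has passed - this is what
        # happens when Host.ping gets no answer within its 3-second timeout.
        now = time.monotonic()
        for request_id, (method, deadline) in list(self._pending.items()):
            if now > deadline:
                log.debug("timeout for %s (%s), removed from tracking",
                          method, request_id)
                del self._pending[request_id]

    def on_response(self, request_id, result):
        # A response that arrives after expiry has no matching entry left.
        if request_id not in self._pending:
            log.error('Not able to update response for "%s"', request_id)
            return
        del self._pending[request_id]
        log.debug("response for %s: %r", request_id, result)


if __name__ == "__main__":
    tracker = CallTracker()
    ping_id = tracker.track("Host.ping", timeout=0.1)  # stands in for the 3s ping timeout
    time.sleep(0.2)                                    # the reply is late
    tracker.expire()                                   # ping dropped from tracking here
    tracker.on_response(ping_id, True)                 # -> "Not able to update response"

Increasing the ping timeout, as Piotr suggests, simply pushes the deadline out
so a slow response still finds its entry in the table.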
> >
> >>
> >> >
> >> >
> >> > {"jsonrpc":"2.0","method":"Host.ping","params":{},"id":"7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0"}
> >> > 2017-04-19 18:54:27,843-04 DEBUG
> >> > [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient]
> >> > (org.ovirt.thread.pool-7-thread-15) [62d198cc] Message sent: SEND
> >> > destination:jms.topic.vdsm_requests
> >> > content-length:94
> >> > ovirtCorrelationId:62d198cc
> >> > reply-to:jms.topic.vdsm_responses
> >> >
> >> > <JsonRpcRequest id: "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", method:
> >> > Host.ping, params: {}>
> >> > 2017-04-19 18:54:27,885-04 DEBUG
> >> > [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message]
> >> > (org.ovirt.thread.pool-7-thread-16) [1f9aac13] SEND
> >> > ovirtCorrelationId:1f9aac13
> >> > destination:jms.topic.vdsm_requests
> >> > reply-to:jms.topic.vdsm_responses
> >> > content-length:94
> >> >
> >> > ...
> >> >
> >> > {"jsonrpc": "2.0", "id": "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", "result": true}
> >> > 2017-04-19 18:54:32,132-04 DEBUG
> >> > [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) []
> >> > Message received: {"jsonrpc": "2.0", "id": "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", "result": true}
> >> > 2017-04-19 18:54:32,133-04 ERROR
> >> > [org.ovirt.vdsm.jsonrpc.client.JsonRpcClient] (ResponseWorker) [] Not able
> >> > to update response for "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0"
> >> >
> >> > Would be nice to understand why.
> >> >
> >> >
> >> > 4. Lastly, MOM is not running. Why?
> >> >
> >> > Please open a bug with the details from item #2 above.
> >> > Y.
> >> >
> >> > [1] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host1/_var_log/vdsm/supervdsm.log
> >> > [2] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log
> >> > [3] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host0/_var_log/messages
> >> >
> >> >
> >> > On Thu, Apr 20, 2017 at 9:27 AM, Gil Shinar <[email protected]> wrote:
> >> >>
> >> >> Test failed: add_secondary_storage_domains
> >> >> Link to suspected patches:
> >> >> Link to Job: http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403
> >> >> Link to all logs:
> >> >> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py
> >> >>
> >> >> Error seems to be:
> >> >> 2017-04-19 18:58:58,774-0400 ERROR (jsonrpc/2) [storage.TaskManager.Task]
> >> >> (Task='8f9699ed-cc2f-434b-aa1e-b3c8ff30324a') Unexpected error (task:871)
> >> >> Traceback (most recent call last):
> >> >>   File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 878, in _run
> >> >>     return fn(*args, **kargs)
> >> >>   File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 52, in wrapper
> >> >>     res = f(*args, **kwargs)
> >> >>   File "/usr/share/vdsm/storage/hsm.py", line 2709, in getStorageDomainInfo
> >> >>     dom = self.validateSdUUID(sdUUID)
> >> >>   File "/usr/share/vdsm/storage/hsm.py", line 298, in validateSdUUID
> >> >>     sdDom = sdCache.produce(sdUUID=sdUUID)
> >> >>   File "/usr/share/vdsm/storage/sdc.py", line 112, in produce
> >> >>     domain.getRealDomain()
> >> >>   File "/usr/share/vdsm/storage/sdc.py", line 53, in getRealDomain
> >> >>     return self._cache._realProduce(self._sdUUID)
> >> >>   File "/usr/share/vdsm/storage/sdc.py", line 136, in _realProduce
> >> >>     domain = self._findDomain(sdUUID)
> >> >>   File "/usr/share/vdsm/storage/sdc.py", line 153, in _findDomain
> >> >>     return findMethod(sdUUID)
> >> >>   File "/usr/share/vdsm/storage/sdc.py", line 178, in _findUnfetchedDomain
> >> >>     raise se.StorageDomainDoesNotExist(sdUUID)
> >> >> StorageDomainDoesNotExist: Storage domain does not exist:
> >> >> (u'ac3bbc93-26ba-4ea8-8e76-c5b761f01931',)
> >> >> 2017-04-19 18:58:58,777-0400 INFO (jsonrpc/2) [storage.TaskManager.Task]
> >> >> (Task='8f9699ed-cc2f-434b-aa1e-b3c8ff30324a') aborting: Task is aborted:
> >> >> 'Storage domain does not exist' - code 358 (task:1176)
> >> >> 2017-04-19 18:58:58,777-0400 ERROR (jsonrpc/2) [storage.Dispatcher]
> >> >> {'status': {'message': "Storage domain does not exist:
> >> >> (u'ac3bbc93-26ba-4ea8-8e76-c5b761f01931',)", 'code': 358}}
> >> >> (dispatcher:78)
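Coming back to the question above about verifying that the NFS server is up
before the tests start: one cheap pre-flight check would be to ask the server
for its export list with showmount and fail fast if the expected export is
missing. A sketch follows; the helper name, the addresses (taken from the logs
above) and how it would be wired into the suite are only a suggestion:

import subprocess


def nfs_export_available(server, export, timeout=10):
    """Return True if `export` appears in the server's export list."""
    try:
        out = subprocess.run(["showmount", "-e", server],
                             capture_output=True, text=True,
                             timeout=timeout, check=True).stdout
    except (OSError, subprocess.SubprocessError):
        return False
    # First line is the "Export list for <server>:" heading; entries follow.
    return any(line.split()[0] == export
               for line in out.splitlines()[1:] if line.strip())


if __name__ == "__main__":
    if not nfs_export_available("192.168.201.3", "/exports/nfs/share1"):
        raise SystemExit("NFS export is not reachable - failing fast before the tests")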
_______________________________________________
Devel mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/devel
