On Wed, Jan 11, 2017 at 12:42 PM, Francesco Romani <[email protected]> wrote:
> On 01/11/2017 09:29 AM, Yaniv Kaul wrote:
>> On Wed, Jan 11, 2017 at 10:26 AM, Francesco Romani <[email protected]> wrote:
>>> Hi all
>>>
>>> On 01/11/2017 08:52 AM, Eyal Edri wrote:
>>>> Adding Tomas from Virt.
>>>>
>>>> On Tue, Jan 10, 2017 at 10:54 AM, Piotr Kliczewski <[email protected]> wrote:
>>>>> On Tue, Jan 10, 2017 at 9:29 AM, Daniel Belenky <[email protected]> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> test-repo_ovirt_experimental_master (link to Jenkins) job failed on the
>>>>>> basic_sanity scenario.
>>>>>> The job was triggered by https://gerrit.ovirt.org/#/c/69845/
>>>>>>
>>>>>> From looking at the logs, it seems that the reason is VDSM.
>>>>>>
>>>>>> In the VDSM log, I see the following error:
>>>>>>
>>>>>> 2017-01-09 16:47:41,331 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL
>>>>>> error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1',
>>>>>> 34942, 0, 0) at 0x36b95f0>: unexpected eof (betterAsyncore:119)
>>>
>>> Daniel, could you please remind me of the jenkins link? I see something
>>> suspicious in the Vdsm log.
>>
>> Please use my live system:
>> ssh [email protected] (redacted)
>> then run a console to the VM:
>> lagocli --prefix-path /dev/shm/run/current shell engine
>> (or 'host0' for the host)
>>
>>> Most notably, Vdsm received SIGTERM. Is this expected and part of the test?
>>
>> It's not.
>
> I fooled myself. We have two hosts here. host0 was holding the VM up until
> the suspend. Then Engine decided to resume the VM on the other one, host1.
> While the VM was resuming, host0 experienced network issues, which led to
> soft-fencing. That explains the mess on host0, even though it is still
> unclear why host0 had network issues and exceeded its heartbeat in the
> first place.
>
> On host1 the waters are even darker. The VM was resumed around 02:36:
>
> 2017-01-11 02:36:04,700-05 INFO [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand]
> (default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] FINISH, CreateVmVDSCommand,
> return: RestoringState, log id: 378da701
> 2017-01-11 02:36:04,700-05 INFO [org.ovirt.engine.core.bll.RunVmCommand]
> (default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] Lock freed to object
> 'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM,
> ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
> 2017-01-11 02:36:04,704-05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] Correlation ID:
> 72c41f12-649b-4833-8485-44e8d20d2b49, Job ID: a93b571e-aed1-40e7-8d71-831f646255fb,
> Call Stack: null, Custom Event ID: -1, Message: VM vm0 was started by
> admin@internal-authz (Host: lago-basic-suite-master-host1).
>
> While well within the timeout limit (600s), the VM was still restoring its
> state:

This is an issue I (wrongfully, privately) complained about to Michal: resume
takes far more time than it used to, and the 'lzop' process decompressing the
memory takes ages to complete. This is reproducible, regardless of the other
issues. I did not open a bug about it :(
Y.
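To put a number on that, one quick check is to time the decompression step by
itself on the saved memory image, independent of the rest of the resume flow.
A rough sketch (the image path is a placeholder, and it assumes the saved state
really was compressed with lzop, as the process seen on the host suggests):

import subprocess
import time

# Placeholder: point this at the saved memory image of the suspended VM.
MEMORY_IMAGE = '/path/to/vm0-memory-volume'

start = time.time()
# Stream the image through `lzop -dc` and discard the output, so only the
# decompression itself is measured, not any disk writes.
with open(MEMORY_IMAGE, 'rb') as src, open('/dev/null', 'wb') as sink:
    subprocess.check_call(['lzop', '-dc'], stdin=src, stdout=sink)
print('lzop decompression took %.1fs' % (time.time() - start))

That would at least tell whether the extra time is all in lzop or also spent
elsewhere in the restore path.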
> 2017-01-11 02:37:31,059-05 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand]
> (DefaultQuartzScheduler8) [5582058d] START, GetAllVmStatsVDSCommand(HostName =
> lago-basic-suite-master-host1, VdsIdVDSCommandParametersBase:{runAsync='true',
> hostId='0336661b-721f-4c55-9327-8fd2fd3a0542'}), log id: 1803c51b
> 2017-01-11 02:37:31,059-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message]
> (DefaultQuartzScheduler8) [5582058d] SEND
> destination:jms.topic.vdsm_requests
> reply-to:jms.topic.vdsm_responses
> content-length:103
>
> {"jsonrpc":"2.0","method":"Host.getAllVmStats","params":{},"id":"f2997c1d-49cc-4d2e-937c-e4910fbb75df"}^@
> 2017-01-11 02:37:31,059-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient]
> (DefaultQuartzScheduler8) [5582058d] Message sent: SEND
> destination:jms.topic.vdsm_requests
> content-length:103
> reply-to:jms.topic.vdsm_responses
>
> <JsonRpcRequest id: "f2997c1d-49cc-4d2e-937c-e4910fbb75df", method: Host.getAllVmStats, params: {}>
> 2017-01-11 02:37:31,062-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message]
> (SSL Stomp Reactor) [5e453618] MESSAGE
> content-length:829
> destination:jms.topic.vdsm_responses
> content-type:application/json
> subscription:fe930de2-aa67-4bc4-a34c-be22edd1623e
>
> {"jsonrpc": "2.0", "id": "f2997c1d-49cc-4d2e-937c-e4910fbb75df", "result": [{"username": "Unknown",
> "displayInfo": [{"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "vnc", "port": "-1"},
> {"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "spice", "port": "-1"}],
> "hash": "-4942054084956770103", "acpiEnable": "true", "displayIp": "192.168.201.4",
> "guestFQDN": "", "vmId": "42860011-acc3-44d6-9ddf-dea642caf083", "pid": "0", "timeOffset": "0",
> "session": "Unknown", "displaySecurePort": "-1", "displayPort": "-1", "memUsage": "0",
> "guestIPs": "", "kvmEnable": "true", "displayType": "vnc", "cpuUser": "0.00",
> "monitorResponse": "0", "elapsedTime": "389", "vmType": "kvm", "cpuSys": "0.00",
> "status": "Restoring state", "guestCPUCount": -1, "appsList": [], "clientIp": "",
> "statusTime": "4295562220", "vmName": "vm0"}]}^@
> 2017-01-11 02:37:31,062-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker]
> (ResponseWorker) [] Message received: {"jsonrpc": "2.0", "id": "f2997c1d-49cc-4d2e-937c-e4910fbb75df",
> "result": [{"username": "Unknown", "displayInfo": [{"tlsPort": "-1", "ipAddress": "192.168.201.4",
> "type": "vnc", "port": "-1"}, {"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "spice",
> "port": "-1"}], "hash": "-4942054084956770103", "acpiEnable": "true", "displayIp": "192.168.201.4",
> "guestFQDN": "", "vmId": "42860011-acc3-44d6-9ddf-dea642caf083", "pid": "0", "timeOffset": "0",
> "session": "Unknown", "displaySecurePort": "-1", "displayPort": "-1", "memUsage": "0",
> "guestIPs": "", "kvmEnable": "true", "displayType": "vnc", "cpuUser": "0.00",
> "monitorResponse": "0", "elapsedTime": "389", "vmType": "kvm", "cpuSys": "0.00",
> "status": "Restoring state", "guestCPUCount": -1, "appsList": [], "clientIp": "",
> "statusTime": "4295562220", "vmName": "vm0"}]}
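A note on the exchange above: the engine is just polling Host.getAllVmStats over
JSON-RPC framed in STOMP, and the reply shows the VM still in "Restoring state"
with elapsedTime 389s. For reference, a minimal sketch of how such a SEND frame
is composed, reconstructed only from the logged headers, is below; actually
delivering it to vdsm would also require the TLS connection and the STOMP
CONNECT/SUBSCRIBE handshake that the jsonrpc client performs.

import json
import uuid

def get_all_vm_stats_frame():
    """Compose a STOMP SEND frame carrying a JSON-RPC 2.0 Host.getAllVmStats
    call, matching the shape of the frame in the engine DEBUG log above."""
    body = json.dumps({
        "jsonrpc": "2.0",
        "method": "Host.getAllVmStats",
        "params": {},
        "id": str(uuid.uuid4()),
    }, separators=(",", ":")).encode("utf-8")
    headers = (
        "SEND\n"
        "destination:jms.topic.vdsm_requests\n"
        "reply-to:jms.topic.vdsm_responses\n"
        "content-length:" + str(len(body)) + "\n"
        "\n"
    ).encode("utf-8")
    # STOMP frames are NUL-terminated; that is the '^@' visible in the log.
    return headers + body + b"\x00"

if __name__ == "__main__":
    print(repr(get_all_vm_stats_frame()))

The "id" field is what ties the MESSAGE reply back to the original request.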
>
> Then there is a migration attempt, which fails:
>
> 2017-01-11 02:46:07,826-05 INFO [org.ovirt.engine.core.bll.MigrateVmToServerCommand]
> (default task-4) [27938262-f5fe-41ff-a662-5177a3a25ada] Lock Acquired to object
> 'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM,
> ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED$VmName vm0>]', sharedLocks='null'}'
> 2017-01-11 02:46:07,827-05 WARN [org.ovirt.engine.core.bll.MigrateVmToServerCommand]
> (default task-4) [27938262-f5fe-41ff-a662-5177a3a25ada] Validation of action
> 'MigrateVmToServer' failed for user admin@internal-authz. Reasons:
> VAR__ACTION__MIGRATE,VAR__TYPE__VM,VAR__HOST_STATUS__UP,
> ACTION_TYPE_FAILED_VDS_STATUS_ILLEGAL
> 2017-01-11 02:46:07,827-05 INFO [org.ovirt.engine.core.bll.MigrateVmToServerCommand]
> (default task-4) [27938262-f5fe-41ff-a662-5177a3a25ada] Lock freed to object
> 'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM,
> ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED$VmName vm0>]', sharedLocks='null'}'
>
> which makes sense, since host0 was being reinstalled (!?) in the meantime.
>
> After that I see no more evidence of 'vm0' in the logs. I couldn't yet
> cross-correlate with the tests; perhaps the test failed in the meantime?
>
> TL;DR: we need a reproducer to keep investigating this. So far I've seen
> quite some suspicious activity, but nothing really concrete.
>
> PS: the last few lago failures I've seen have been triggered by completely
> unrelated patches, so far I think they are sporadic failures.
>
> --
> Francesco Romani
> Red Hat Engineering Virtualization R & D
> IRC: fromani
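On the reproducer: since the slow resume is reproducible on its own, a plain
suspend/resume loop against the engine API should be enough to time the restore
path outside the lago suite. A rough sketch, assuming ovirt-engine-sdk-python 4,
with placeholder engine URL and credentials (use whatever the lago prefix exposes):

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholders: engine address and credentials of the lago prefix.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=vm0')[0]
vm_service = vms_service.vm_service(vm.id)

def wait_for(status, timeout=600):
    """Poll until the VM reaches the given status; return the elapsed seconds."""
    start = time.time()
    while vm_service.get().status != status:
        if time.time() - start > timeout:
            raise RuntimeError('timed out waiting for %s' % status)
        time.sleep(2)
    return time.time() - start

vm_service.suspend()
print('suspend took %.0fs' % wait_for(types.VmStatus.SUSPENDED))

vm_service.start()
print('resume took %.0fs' % wait_for(types.VmStatus.UP))

connection.close()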
_______________________________________________
Devel mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/devel
