On Wed, Jan 11, 2017 at 12:42 PM, Francesco Romani <[email protected]> wrote:
> On 01/11/2017 09:29 AM, Yaniv Kaul wrote:
>> On Wed, Jan 11, 2017 at 10:26 AM, Francesco Romani <[email protected]> wrote:
>>> Hi all
>>>
>>> On 01/11/2017 08:52 AM, Eyal Edri wrote:
>>>> Adding Tomas from Virt.
>>>>
>>>> On Tue, Jan 10, 2017 at 10:54 AM, Piotr Kliczewski <[email protected]> wrote:
>>>>> On Tue, Jan 10, 2017 at 9:29 AM, Daniel Belenky <[email protected]> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> test-repo_ovirt_experimental_master (link to Jenkins) job failed on the
>>>>>> basic_sanity scenario.
>>>>>> The job was triggered by https://gerrit.ovirt.org/#/c/69845/
>>>>>>
>>>>>> From looking at the logs, it seems that the reason is VDSM.
>>>>>>
>>>>>> In the VDSM log, I see the following error:
>>>>>>
>>>>>> 2017-01-09 16:47:41,331 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL
>>>>>> error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1',
>>>>>> 34942, 0, 0) at 0x36b95f0>: unexpected eof (betterAsyncore:119)
>>>
>>> Daniel, could you please remind me of the jenkins link? I see something
>>> suspicious in the Vdsm log.
>>
>> Please use my live system:
>> ssh [email protected] (redacted)
>> then run a console to the VM:
>> lagocli --prefix-path /dev/shm/run/current shell engine
>> (or 'host0' for the host)
>>
>>> Most notably, Vdsm received SIGTERM. Is this expected and part of the test?
>>
>> It's not.
>
> I fooled myself. We have two hosts here. host0 was holding the VM up until
> the suspend. Then Engine decided to resume the VM on the other one, host1.
> While the VM was resuming, host0 experienced network issues, which led to
> soft-fencing. That explains the mess on host0, even though it is still
> unclear why host0 had network issues and exceeded its heartbeat in the
> first place.
>
> On host1 the waters are even darker. The VM was resumed around 02:36:
>
> 2017-01-11 02:36:04,700-05 INFO [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand]
> (default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] FINISH, CreateVmVDSCommand,
> return: RestoringState, log id: 378da701
> 2017-01-11 02:36:04,700-05 INFO [org.ovirt.engine.core.bll.RunVmCommand]
> (default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] Lock freed to object
> 'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM,
> ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
> 2017-01-11 02:36:04,704-05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] Correlation ID:
> 72c41f12-649b-4833-8485-44e8d20d2b49, Job ID: a93b571e-aed1-40e7-8d71-831f646255fb,
> Call Stack: null, Custom Event ID: -1, Message: VM vm0 was started by
> admin@internal-authz (Host: lago-basic-suite-master-host1).
>
> While well within the timeout limit (600s), the VM was still restoring its
> state:

This is an issue I (wrongfully, privately) complained about to Michal: resume
takes far more time than it used to, and the 'lzop' process decompressing the
memory takes ages to complete. This is reproducible, regardless of the other
issues. I did not open a bug about it :(
Y.
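To put a number on that, one quick check is to time the decompression step by
itself on the saved memory image, independent of the rest of the resume flow.
A rough sketch (the image path is a placeholder, and it assumes the saved state
really was compressed with lzop, as the process seen on the host suggests):

import subprocess
import time

# Placeholder: point this at the saved memory image of the suspended VM.
MEMORY_IMAGE = '/path/to/vm0-memory-volume'

start = time.time()
# Stream the image through `lzop -dc` and discard the output, so only the
# decompression itself is measured, not any disk writes.
with open(MEMORY_IMAGE, 'rb') as src, open('/dev/null', 'wb') as sink:
    subprocess.check_call(['lzop', '-dc'], stdin=src, stdout=sink)
print('lzop decompression took %.1fs' % (time.time() - start))

That would at least tell whether the extra time is all in lzop or also spent
elsewhere in the restore path.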
> 2017-01-11 02:37:31,059-05 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand]
> (DefaultQuartzScheduler8) [5582058d] START, GetAllVmStatsVDSCommand(HostName =
> lago-basic-suite-master-host1, VdsIdVDSCommandParametersBase:{runAsync='true',
> hostId='0336661b-721f-4c55-9327-8fd2fd3a0542'}), log id: 1803c51b
> 2017-01-11 02:37:31,059-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message]
> (DefaultQuartzScheduler8) [5582058d] SEND
> destination:jms.topic.vdsm_requests
> reply-to:jms.topic.vdsm_responses
> content-length:103
>
> {"jsonrpc":"2.0","method":"Host.getAllVmStats","params":{},"id":"f2997c1d-49cc-4d2e-937c-e4910fbb75df"}^@
> 2017-01-11 02:37:31,059-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient]
> (DefaultQuartzScheduler8) [5582058d] Message sent: SEND
> destination:jms.topic.vdsm_requests
> content-length:103
> reply-to:jms.topic.vdsm_responses
>
> <JsonRpcRequest id: "f2997c1d-49cc-4d2e-937c-e4910fbb75df", method: Host.getAllVmStats, params: {}>
> 2017-01-11 02:37:31,062-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message]
> (SSL Stomp Reactor) [5e453618] MESSAGE
> content-length:829
> destination:jms.topic.vdsm_responses
> content-type:application/json
> subscription:fe930de2-aa67-4bc4-a34c-be22edd1623e
>
> {"jsonrpc": "2.0", "id": "f2997c1d-49cc-4d2e-937c-e4910fbb75df", "result": [{"username": "Unknown",
> "displayInfo": [{"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "vnc", "port": "-1"},
> {"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "spice", "port": "-1"}],
> "hash": "-4942054084956770103", "acpiEnable": "true", "displayIp": "192.168.201.4",
> "guestFQDN": "", "vmId": "42860011-acc3-44d6-9ddf-dea642caf083", "pid": "0", "timeOffset": "0",
> "session": "Unknown", "displaySecurePort": "-1", "displayPort": "-1", "memUsage": "0",
> "guestIPs": "", "kvmEnable": "true", "displayType": "vnc", "cpuUser": "0.00",
> "monitorResponse": "0", "elapsedTime": "389", "vmType": "kvm", "cpuSys": "0.00",
> "status": "Restoring state", "guestCPUCount": -1, "appsList": [], "clientIp": "",
> "statusTime": "4295562220", "vmName": "vm0"}]}^@
> 2017-01-11 02:37:31,062-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker]
> (ResponseWorker) [] Message received: {"jsonrpc": "2.0", "id": "f2997c1d-49cc-4d2e-937c-e4910fbb75df",
> "result": [{"username": "Unknown", "displayInfo": [{"tlsPort": "-1", "ipAddress": "192.168.201.4",
> "type": "vnc", "port": "-1"}, {"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "spice",
> "port": "-1"}], "hash": "-4942054084956770103", "acpiEnable": "true", "displayIp": "192.168.201.4",
> "guestFQDN": "", "vmId": "42860011-acc3-44d6-9ddf-dea642caf083", "pid": "0", "timeOffset": "0",
> "session": "Unknown", "displaySecurePort": "-1", "displayPort": "-1", "memUsage": "0",
> "guestIPs": "", "kvmEnable": "true", "displayType": "vnc", "cpuUser": "0.00",
> "monitorResponse": "0", "elapsedTime": "389", "vmType": "kvm", "cpuSys": "0.00",
> "status": "Restoring state", "guestCPUCount": -1, "appsList": [], "clientIp": "",
> "statusTime": "4295562220", "vmName": "vm0"}]}
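A note on the exchange above: the engine is just polling Host.getAllVmStats over
JSON-RPC framed in STOMP, and the reply shows the VM still in "Restoring state"
with elapsedTime 389s. For reference, a minimal sketch of how such a SEND frame
is composed, reconstructed only from the logged headers, is below; actually
delivering it to vdsm would also require the TLS connection and the STOMP
CONNECT/SUBSCRIBE handshake that the jsonrpc client performs.

import json
import uuid

def get_all_vm_stats_frame():
    """Compose a STOMP SEND frame carrying a JSON-RPC 2.0 Host.getAllVmStats
    call, matching the shape of the frame in the engine DEBUG log above."""
    body = json.dumps({
        "jsonrpc": "2.0",
        "method": "Host.getAllVmStats",
        "params": {},
        "id": str(uuid.uuid4()),
    }, separators=(",", ":")).encode("utf-8")
    headers = (
        "SEND\n"
        "destination:jms.topic.vdsm_requests\n"
        "reply-to:jms.topic.vdsm_responses\n"
        "content-length:" + str(len(body)) + "\n"
        "\n"
    ).encode("utf-8")
    # STOMP frames are NUL-terminated; that is the '^@' visible in the log.
    return headers + body + b"\x00"

if __name__ == "__main__":
    print(repr(get_all_vm_stats_frame()))

The "id" field is what ties the MESSAGE reply back to the original request.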
>
> Then there is a migration attempt, which fails:
>
> 2017-01-11 02:46:07,826-05 INFO [org.ovirt.engine.core.bll.MigrateVmToServerCommand]
> (default task-4) [27938262-f5fe-41ff-a662-5177a3a25ada] Lock Acquired to object
> 'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM,
> ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED$VmName vm0>]', sharedLocks='null'}'
> 2017-01-11 02:46:07,827-05 WARN [org.ovirt.engine.core.bll.MigrateVmToServerCommand]
> (default task-4) [27938262-f5fe-41ff-a662-5177a3a25ada] Validation of action
> 'MigrateVmToServer' failed for user admin@internal-authz. Reasons:
> VAR__ACTION__MIGRATE,VAR__TYPE__VM,VAR__HOST_STATUS__UP,
> ACTION_TYPE_FAILED_VDS_STATUS_ILLEGAL
> 2017-01-11 02:46:07,827-05 INFO [org.ovirt.engine.core.bll.MigrateVmToServerCommand]
> (default task-4) [27938262-f5fe-41ff-a662-5177a3a25ada] Lock freed to object
> 'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM,
> ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED$VmName vm0>]', sharedLocks='null'}'
>
> which makes sense, since host0 was being reinstalled (!?) in the meantime.
>
> After that I see no more evidence of 'vm0' in the logs. I couldn't yet
> cross-correlate with the tests; perhaps the test failed in the meantime?
>
> TL;DR: we need a reproducer to keep investigating this. So far I've seen
> quite some suspicious activity, but nothing really concrete.
>
> PS: the last few lago failures I've seen have been triggered by completely
> unrelated patches, so far I think they are sporadic failures.
>
> --
> Francesco Romani
> Red Hat Engineering Virtualization R & D
> IRC: fromani
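On the reproducer: since the slow resume is reproducible on its own, a plain
suspend/resume loop against the engine API should be enough to time the restore
path outside the lago suite. A rough sketch, assuming ovirt-engine-sdk-python 4,
with placeholder engine URL and credentials (use whatever the lago prefix exposes):

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholders: engine address and credentials of the lago prefix.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=vm0')[0]
vm_service = vms_service.vm_service(vm.id)

def wait_for(status, timeout=600):
    """Poll until the VM reaches the given status; return the elapsed seconds."""
    start = time.time()
    while vm_service.get().status != status:
        if time.time() - start > timeout:
            raise RuntimeError('timed out waiting for %s' % status)
        time.sleep(2)
    return time.time() - start

vm_service.suspend()
print('suspend took %.0fs' % wait_for(types.VmStatus.SUSPENDED))

vm_service.start()
print('resume took %.0fs' % wait_for(types.VmStatus.UP))

connection.close()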
_______________________________________________
Devel mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/devel
