On 12/02/2015 15:31, Nir Soffer wrote:
>
>
> ----- Original Message -----
>> From: "Sandro Bonazzola" <sbona...@redhat.com>
>> To: "Nir Soffer" <nsof...@redhat.com>
>> Cc: "Yeela Kaplan" <ykap...@redhat.com>, "Alon Bar-Lev" <alo...@redhat.com>,
>> "Allon Mureinik" <amure...@redhat.com>, "Yaniv Dary" <yd...@redhat.com>,
>> "Federico Simoncelli" <fsimo...@redhat.com>, "Sahina Bose" <sab...@redhat.com>,
>> "Yedidyah Bar David" <d...@redhat.com>, "Simone Tiraboschi" <stira...@redhat.com>,
>> devel@ovirt.org, "Michal Skrivanek" <mskri...@redhat.com>,
>> "David Teigland" <teigl...@redhat.com>
>> Sent: Thursday, February 12, 2015 4:24:55 PM
>> Subject: Re: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
>>
>> On 12/02/2015 14:03, Nir Soffer wrote:
>>> ----- Original Message -----
>>>> From: "Sandro Bonazzola" <sbona...@redhat.com>
>>>> To: "Yeela Kaplan" <ykap...@redhat.com>, "Alon Bar-Lev" <alo...@redhat.com>,
>>>> "Allon Mureinik" <amure...@redhat.com>, "Yaniv Dary" <yd...@redhat.com>,
>>>> "Federico Simoncelli" <fsimo...@redhat.com>, "Sahina Bose" <sab...@redhat.com>,
>>>> "Yedidyah Bar David" <d...@redhat.com>, "Simone Tiraboschi" <stira...@redhat.com>,
>>>> devel@ovirt.org, "Michal Skrivanek" <mskri...@redhat.com>
>>>> Sent: Thursday, February 12, 2015 2:55:57 PM
>>>> Subject: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
>>>>
>>>> While deploying Hosted Engine on Hyper Converged Gluster Storage I've hit
>>>> the following issue.
>>>>
>>>> ovirt-hosted-engine-setup:
>>>>
>>>> [ INFO ] Engine replied: DB Up!Welcome to Health Status!
>>>>          Enter the name of the cluster to which you want to add the host
>>>>          (Default) [Default]:
>>>> [ INFO ] Waiting for the host to become operational in the engine. This
>>>>          may take several minutes...
>>>> [ INFO ] Still waiting for VDSM host to become operational...
>>>>
>>>> I accessed the engine; both the VM and the engine are working correctly,
>>>> installed using the 3.5 snapshot repositories.
>>>> host-deploy started on the host (see attached logs) and got stuck at:
>>>>
>>>> 2015-02-12 13:08:33 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:215 DIALOG:SEND ### Response is VALUE TIME=type:value or ABORT TIME
>>>>
>>>> until several minutes later, when:
>>>>
>>>> 2015-02-12 13:24:31 DEBUG otopi.context context._executeMethod:152 method exception
>>>> Traceback (most recent call last):
>>>>   File "/tmp/ovirt-ePMfjwrlpe/pythonlib/otopi/context.py", line 142, in _executeMethod
>>>>     method['method']()
>>>>   File "/tmp/ovirt-ePMfjwrlpe/otopi-plugins/otopi/system/clock.py", line 111, in _set_clock
>>>>     format="YYYYmmddHHMMSS+0000"
>>>>   File "/tmp/ovirt-ePMfjwrlpe/otopi-plugins/otopi/dialog/machine.py", line 235, in queryValue
>>>>     opcode, variable = self._readline().split(' ', 1)
>>>>   File "/tmp/ovirt-ePMfjwrlpe/pythonlib/otopi/dialog.py", line 259, in _readline
>>>>     raise IOError(_('End of file'))
>>>> IOError: End of file
>>>>
>>>> resumed the process.
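A note on the traceback above: otopi's machine dialog raises IOError('End of file') when a read on the dialog channel returns nothing, i.e. the peer closed the stream while queryValue was still blocked waiting for the TIME response. A minimal sketch of that pattern, with hypothetical names (not otopi's actual code):

    import io

    def readline_or_eof(stream):
        # readline() blocks until a full line arrives; an empty return
        # value means the other side closed the channel.
        line = stream.readline()
        if not line:
            raise IOError('End of file')
        return line.rstrip('\n')

    # An exhausted stream raises immediately:
    try:
        readline_or_eof(io.StringIO(''))
    except IOError as exc:
        print(exc)  # -> End of file

That would match host-deploy resuming only once the channel itself died, rather than on a reply.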
>>>>
>>>> Around the time when host-deploy got stuck I see in the sanlock logs:
>>>>
>>>> 2015-02-12 13:08:25+0100 7254 [683]: open error -1 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/images/51156c24-5ea1-42df-bf33-daebf2c4780c/3719b006-68a2-4742-964c-413a3f77c0b4.lease
>>>> 2015-02-12 13:08:25+0100 7254 [683]: r4 release_token open error -2
>>>> 2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
>>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
>>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
>>>> 2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
>>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
>>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
>>>> 2015-02-12 13:08:32+0100 7261 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
>>>> 2015-02-12 13:08:32+0100 7261 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
>>>> 2015-02-12 13:08:32+0100 7261 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
>>>
>>> Sanlock cannot access your storage. The rest of the errors are less
>>> interesting, as nothing will work without storage.
>>>
>>> While sanlock is failing, can you access that storage from another host or
>>> from the host running sanlock?
>>>
>>> Adding David to help with debugging the sanlock errors.
>>
>>
>> If you mean something like:
>>
>> # mount -t glusterfs minidell.home:/hosted_engine_glusterfs /tmp/test
>> # ll /tmp/test/
>> total 4
>> drwxr-xr-x. 6 vdsm kvm 4096 Feb 12 11:49 a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
>> -rwxr-xr-x. 1 vdsm kvm    0 Feb 12 13:04 __DIRECT_IO_TEST__
>>
>> then yes, I can still access the gluster volume.
>
> While sanlock was failing to update the lease, or only now?
>
> Try also to write to this storage.
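Two notes may help here. First, sanlock's "result -107" / "read rv -107" values are negative errno codes, and 107 is ENOTCONN ("Transport endpoint is not connected"), which on a FUSE glusterfs mount usually means the client lost its connection to the brick. Second, sanlock performs its lease I/O with O_DIRECT, so a plain buffered write can succeed even while sanlock's renewals fail; a direct-I/O write against the same mount is a closer reproduction. A sketch in Python (Linux-only; the target file name under the /tmp/test mount is made up):

    import errno
    import mmap
    import os

    # Decode sanlock's error code.
    print(errno.errorcode[107], os.strerror(107))  # ENOTCONN

    # O_DIRECT requires an aligned buffer; mmap memory is page-aligned.
    buf = mmap.mmap(-1, 4096)
    buf.write(b'\0' * 4096)

    fd = os.open('/tmp/test/direct_io_test',
                 os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
    try:
        # Should fail (e.g. ENOTCONN/EIO) while the client-brick link is down.
        os.write(fd, buf)
    finally:
        os.close(fd)

(The equivalent one-liner would be dd if=/dev/zero of=/tmp/test/direct_io_test bs=4096 count=1 oflag=direct.)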
dd if=/dev/zero of=/tmp/test/my_write_test bs=4k count=1k
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 0.0289396 s, 145 MB/s

ll /tmp/test/
total 4100
-rwxr-xr-x. 1 vdsm kvm        0 Feb 12 13:04 __DIRECT_IO_TEST__
drwxr-xr-x. 6 vdsm kvm     4096 Feb 12 11:49 a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
-rw-r--r--. 1 root root 4194304 Feb 12 15:33 my_write_test

Looks like it works.

>
>>
>> # gluster volume info hosted_engine_glusterfs
>>
>> Volume Name: hosted_engine_glusterfs
>> Type: Distribute
>> Volume ID: 8d78571d-69ef-41fb-bc0a-34dc7d903df5
>> Status: Started
>> Number of Bricks: 1
>> Transport-type: tcp
>> Bricks:
>> Brick1: minidell.home:/path1
>> Options Reconfigured:
>> nfs.disable: on
>> storage.owner-gid: 36
>> storage.owner-uid: 36
>> network.ping-timeout: 10
>> auth.allow: *
>> cluster.server-quorum-type: server
>> cluster.quorum-type: auto
>> network.remote-dio: enable
>> cluster.eager-lock: enable
>> performance.stat-prefetch: off
>> performance.io-cache: off
>> performance.read-ahead: off
>> performance.quick-read: off
>> server.allow-insecure: on
>> user.cifs: disable
>>
>> # gluster volume status hosted_engine_glusterfs
>> Status of volume: hosted_engine_glusterfs
>> Gluster process                                 Port    Online  Pid
>> ------------------------------------------------------------------------------
>> Brick minidell.home:/path1                      49217   Y       16660
>>
>> Task Status of Volume hosted_engine_glusterfs
>> ------------------------------------------------------------------------------
>> There are no active volume tasks
>>
>>
>>
>>>
>>>>
>>>>
>>>> And in VDSM logs:
>>>>
>>>> MainThread::DEBUG::2015-02-12 13:08:25,529::vdsm::58::vds::(sigtermHandler) Received signal 15
>>>> MainThread::DEBUG::2015-02-12 13:08:25,529::protocoldetector::144::vds.MultiProtocolAcceptor::(stop) Stopping Acceptor
>>>> MainThread::INFO::2015-02-12 13:08:25,530::__init__::565::jsonrpc.JsonRpcServer::(stop) Stopping JsonRPC Server
>>>> Detector thread::DEBUG::2015-02-12 13:08:25,530::protocoldetector::115::vds.MultiProtocolAcceptor::(_cleanup) Cleaning Acceptor
>>>> MainThread::INFO::2015-02-12 13:08:25,531::vmchannels::188::vds::(stop) VM channels listener was stopped.
>>>> MainThread::INFO::2015-02-12 13:08:25,535::momIF::91::MOM::(stop) Shutting down MOM
>>>> MainThread::DEBUG::2015-02-12 13:08:25,535::task::595::Storage.TaskManager.Task::(_updateState) Task=`c52bd13a-24f0-4d8a-97d8-f6c53ca8bddc`::moving from state init -> state preparing
>>>> MainThread::INFO::2015-02-12 13:08:25,535::logUtils::44::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
>>>> Thread-11::DEBUG::2015-02-12 13:08:25,535::storageServer::706::Storage.ConnectionMonitor::(_monitorConnections) Monitoring stopped
>>>> ioprocess communication (16795)::ERROR::2015-02-12 13:08:25,530::__init__::152::IOProcessClient::(_communicate) IOProcess failure
>>>> Traceback (most recent call last):
>>>>   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 107, in _communicate
>>>>     raise Exception("FD closed")
>>>> Exception: FD closed
>>>> ioprocess communication (16795)::DEBUG::2015-02-12 13:08:25,537::__init__::298::IOProcessClient::(_run) Starting IOProcess...
>>>> >>>> >>>> Followed by: >>>> >>>> MainThread::INFO::2015-02-12 >>>> 13:08:25,548::domainMonitor::137::Storage.DomainMonitor::(close) Stopping >>>> domain monitors >>>> MainThread::INFO::2015-02-12 >>>> 13:08:25,549::domainMonitor::114::Storage.DomainMonitor::(stopMonitoring) >>>> Stop monitoring >>>> a7dba6e3-09ac-430a-8d6e-eea17cf18b8f >>>> Thread-94::DEBUG::2015-02-12 >>>> 13:08:25,549::domainMonitor::192::Storage.DomainMonitorThread::(_monitorLoop) >>>> Stopping domain monitor for >>>> a7dba6e3-09ac-430a-8d6e-eea17cf18b8f >>>> Thread-94::INFO::2015-02-12 >>>> 13:08:25,549::clusterlock::245::Storage.SANLock::(releaseHostId) Releasing >>>> host id for domain >>>> a7dba6e3-09ac-430a-8d6e-eea17cf18b8f (id: 1) >>>> Thread-94::DEBUG::2015-02-12 >>>> 13:08:25,549::domainMonitor::201::Storage.DomainMonitorThread::(_monitorLoop) >>>> Unable to release the host id 1 for domain >>>> a7dba6e3-09ac-430a-8d6e-eea17cf18b8f >>>> Traceback (most recent call last): >>>> File "/usr/share/vdsm/storage/domainMonitor.py", line 198, in >>>> _monitorLoop >>>> self.domain.releaseHostId(self.hostId, unused=True) >>>> File "/usr/share/vdsm/storage/sd.py", line 480, in releaseHostId >>>> self._clusterLock.releaseHostId(hostId, async, unused) >>>> File "/usr/share/vdsm/storage/clusterlock.py", line 252, in >>>> releaseHostId >>>> raise se.ReleaseHostIdFailure(self._sdUUID, e) >>>> ReleaseHostIdFailure: Cannot release host id: >>>> (u'a7dba6e3-09ac-430a-8d6e-eea17cf18b8f', SanlockException(16, 'Sanlock >>>> lockspace remove failure', >>>> 'Device or resource busy')) >>>> >>>> supervdsm logs doesn't show anything interesting but: >>>> >>>> # service supervdsmd status >>>> Redirecting to /bin/systemctl status supervdsmd.service >>>> supervdsmd.service - "Auxiliary vdsm service for running helper functions >>>> as >>>> root" >>>> Loaded: loaded (/usr/lib/systemd/system/supervdsmd.service; static) >>>> Active: inactive (dead) since gio 2015-02-12 13:08:30 CET; 27min ago >>>> Main PID: 15859 (code=exited, status=0/SUCCESS) >>>> CGroup: /system.slice/supervdsmd.service >>>> >>>> feb 12 13:08:30 minidell.home systemd[1]: Stopping "Auxiliary vdsm service >>>> for running helper functions as root"... >>>> feb 12 13:08:30 minidell.home daemonAdapter[15859]: Traceback (most recent >>>> call last): >>>> feb 12 13:08:30 minidell.home daemonAdapter[15859]: File >>>> "/usr/lib64/python2.7/multiprocessing/util.py", line 268, in >>>> _run_finalizers >>>> feb 12 13:08:30 minidell.home daemonAdapter[15859]: finalizer() >>>> feb 12 13:08:30 minidell.home daemonAdapter[15859]: File >>>> "/usr/lib64/python2.7/multiprocessing/util.py", line 201, in __call__ >>>> feb 12 13:08:30 minidell.home daemonAdapter[15859]: res = >>>> self._callback(*self._args, **self._kwargs) >>>> feb 12 13:08:30 minidell.home daemonAdapter[15859]: OSError: [Errno 2] No >>>> such file or directory: '/var/run/vdsm/svdsm.sock' >>>> feb 12 13:08:30 minidell.home python[15859]: DIGEST-MD5 client mech >>>> dispose >>>> feb 12 13:08:30 minidell.home python[15859]: DIGEST-MD5 common mech >>>> dispose >>>> feb 12 13:08:30 minidell.home systemd[1]: Stopped "Auxiliary vdsm service >>>> for >>>> running helper functions as root". 
>>>>
>>>>
>>>>
>>>> After a while the VM became unreachable, as ovirt-hosted-engine-setup shows:
>>>>
>>>> 2015-02-12 13:10:33 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:183 Error fetching host state:
>>>> [ERROR]::oVirt API connection failure, [Errno 110] Connection timed out
>>>> 2015-02-12 13:10:33 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:189 VDSM host in state
>>>>
>>>>
>>>> Gluster logs don't show anything special.
>>>>
>>>> I'm guessing that the 15 minutes during which host-deploy was stuck caused the issue here, but I don't see any relevant change that may have caused it.
>>>>
>>>> Attaching the tar.gz with the code used for ovirt-hosted-engine-setup; the rest is 3.5-snapshot over RHEL 7.1.
>>>>
>>>> Any hint is welcome.
>>>>
>>>> --
>>>> Sandro Bonazzola
>>>> Better technology. Faster innovation. Powered by community collaboration.
>>>> See how it works at redhat.com
>>>>
>>>> _______________________________________________
>>>> Devel mailing list
>>>> Devel@ovirt.org
>>>> http://lists.ovirt.org/mailman/listinfo/devel
>>
>>
>> --
>> Sandro Bonazzola
>> Better technology. Faster innovation. Powered by community collaboration.
>> See how it works at redhat.com
>>

--
Sandro Bonazzola
Better technology. Faster innovation. Powered by community collaboration.
See how it works at redhat.com

_______________________________________________
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel