Hello, I have a cluster made of 3 nodes in a "self-hosted-engine" topology.
Storage is implemented with Gluster in a replica 2 + arbiter topology.
I have two Gluster volumes:
glen - the volume used by the hosted-engine VM
gv0 - the volume used by the other VMs

The physical disks are 4 TB SSDs used only to host the VMs (including the hosted engine).

I have continuous VM hangs, including the hosted engine, and this causes a lot of 
trouble: the hangs happen at random times, even while management operations 
(migration, cloning, ...) are running on the VMs.

After a while the VM is released, but on its console the kernel complains about 
CPU stalls or timer hangs, and the only solution is to shut down / power off the 
VM. This affects the hosted engine as well: hosted-engine --vm-status reports 
"state=EngineUpBadHealth".
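
When it gets into that state I check it from one of the hosts like this (plus the 
HA agent and broker logs, assuming the default log locations on oVirt Node):

hosted-engine --vm-status
# the hosted-engine HA daemons log here by default
tail -n 50 /var/log/ovirt-hosted-engine-ha/agent.log
tail -n 50 /var/log/ovirt-hosted-engine-ha/broker.log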

This is the log on the host while the event happens:
Sep 01 07:48:58 ovirt-node3.ovirt NetworkManager[1923]: <info>  
[1662018538.0166] device (vnet73): state change: activated -> unmanaged (reason 
'unmanaged', sys-iface-state: 'removed')
Sep 01 07:48:58 ovirt-node3.ovirt NetworkManager[1923]: <info>  
[1662018538.0168] device (vnet73): released from master device ovirtmgmt
Sep 01 07:48:58 ovirt-node3.ovirt libvirtd[2496]: Unable to read from monitor: 
Connection reset by peer
Sep 01 07:48:58 ovirt-node3.ovirt libvirtd[2496]: internal error: qemu 
unexpectedly closed the monitor: 2022-09-01T07:48:57.930955Z qemu-kvm: -device 
virtio-blk-pci,iothread=iothread1,bus=pci.6,addr=0x0,drive=libvirt-1-format,id=ua-0a1a501c-fc45-430f-bfd3-076172cec406,bootindex=1,write-cache=on,serial=0a1a501c-fc45-430f-bfd3-076172cec406,werror=stop,rerror=stop:
 Failed to get "write" lock Is another process using the image 
[/run/vdsm/storage/3577c21e-f757-4405-97d1-0f827c9b4e22/0a1a501c-fc45-430f-bfd3-076172cec406/f65dab86-67f1-46fa-87c0-f9076f479741]?
Sep 01 07:48:58 ovirt-node3.ovirt kvm[268578]: 5 guests now active
Sep 01 07:48:58 ovirt-node3.ovirt systemd[1]: 
machine-qemu\x2d67\x2dHostedEngine.scope: Succeeded.
Sep 01 07:48:58 ovirt-node3.ovirt systemd-machined[1613]: Machine 
qemu-67-HostedEngine terminated.
Sep 01 07:49:08 ovirt-node3.ovirt systemd[1]: 
NetworkManager-dispatcher.service: Succeeded.
Sep 01 07:49:08 ovirt-node3.ovirt ovirt-ha-agent[3338]: ovirt-ha-agent 
ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Engine VM stopped 
on localhost
Sep 01 07:49:14 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted 
Engine HA info: timed out
Sep 01 07:49:18 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted 
Engine HA info: timed out
Sep 01 07:49:18 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted 
Engine HA info: timed out
Sep 01 07:49:28 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted 
Engine HA info: timed out
Sep 01 07:49:28 ovirt-node3.ovirt sanlock[1633]: 2022-09-01 07:49:28 1706161 
[5083]: s4 delta_renew long write time 11 sec
Sep 01 07:49:28 ovirt-node3.ovirt sanlock[1633]: 2022-09-01 07:49:28 1706161 
[5033]: s3 delta_renew long write time 11 sec
Sep 01 07:49:34 ovirt-node3.ovirt libvirtd[2496]: Domain id=65 
name='ocr-Brain-28-ovirt.dmz.ssis' uuid=00425fb1-c24b-4eaa-9683-534d66b2cb04 is 
tainted: custom-ga-command
Sep 01 07:49:47 ovirt-node3.ovirt sudo[268984]:     root : TTY=unknown ; PWD=/ 
; USER=root ; COMMAND=/bin/privsep-helper --privsep_context 
os_brick.privileged.default --privsep_sock_path /tmp/tmp1iolt06i/privsep.sock
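
Next time the "write" lock error shows up I will also check whether some stale 
process still holds the disk image open, with something like this (assuming lsof 
is available on the node; the path is the one from the qemu error above):

# check whether any process still has the image open
lsof /run/vdsm/storage/3577c21e-f757-4405-97d1-0f827c9b4e22/0a1a501c-fc45-430f-bfd3-076172cec406/f65dab86-67f1-46fa-87c0-f9076f479741
# list all qemu domains libvirt still knows about on this host (read-only connection)
virsh -r list --all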

This is what Gluster reports:
[root@ovirt-node3 ~]# gluster volume heal gv0 info
Brick ovirt-node2.ovirt:/brickgv0/_gv0
Status: Connected
Number of entries: 0

Brick ovirt-node3.ovirt:/brickgv0/gv0_1
Status: Connected
Number of entries: 0

Brick ovirt-node4.ovirt:/dati/_gv0
Status: Connected
Number of entries: 0


[root@ovirt-node3 ~]# gluster volume heal glen info
Brick ovirt-node2.ovirt:/brickhe/_glen
Status: Connected
Number of entries: 0

Brick ovirt-node3.ovirt:/brickhe/glen
Status: Connected
Number of entries: 0

Brick ovirt-node4.ovirt:/dati/_glen
Status: Connected
Number of entries: 0

So it seems healthy.
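
If it helps, I can also post the output of the following from the nodes (a rough 
list, assuming the standard gluster and sanlock CLIs shipped with oVirt Node):

# full volume configuration and per-brick status
gluster volume info gv0
gluster volume status gv0 detail
# sanlock view of the leases; the delta_renew "long write time" warnings above make me suspect slow writes on the storage
sanlock client status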

I don't know how to address this issue, but it is a serious problem for us.