[ovirt-users] Re: VMs hang periodically: gluster problem?

2022-09-22 Thread Diego Ercolani
I did it following the {read,write}-perf examples reported in paragraphs 12.6 and 12.7 of
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.5/html/administration_guide/sect-running_the_volume_top_command;
here are the results:
https://cloud.ssis.sm/index.php/s/9bncnNSopnFReRS
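
For reference, a minimal sketch of the kind of commands behind that test, assuming the gv1 volume and the node2 brick named later in this thread (block size and count are arbitrary illustration values, not the ones actually used):

gluster volume top gv1 read-perf bs 4096 count 1024 brick ovirt-node2.ovirt:/brickgv1/gv1 list-cnt 10
gluster volume top gv1 write-perf bs 4096 count 1024 brick ovirt-node2.ovirt:/brickgv1/gv1 list-cnt 10
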
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/NND3WXVLXSKYSIRXZTQJU4IWJYFO67K7/


[ovirt-users] Re: VMs hang periodically: gluster problem?

2022-09-21 Thread Strahil Nikolov via Users
It's just a guess until there is proof in gluster's top read-perf/write-perf.
Can you share at least the read-perf?
I'm pretty confident that the issue is not network-related, as
cluster.choose-local requires all reads to be local (reducing network usage).
Best Regards,
Strahil Nikolov
 
 
  On Wed, Sep 21, 2022 at 10:40, Diego Ercolani wrote:  
I tried to measure IO using gluster volume top, but its results seem very
cryptic to me (I need to analyze them in depth and don't have the time now).

Thank you very much for your analysis. If I understood correctly, the problem is
that the consumer SSD cache is too weak to keep up even with a small number
(~15) of not particularly IO-intensive VMs, so IO stalls because performance is
poor and this hangs the VMs. The VMs' kernels think the CPU has hung and so they
crash.

This seems to be the case.

If possible, a sort of profiler in the gluster environment that surfaces
evidence of issues related to the speed of the underlying storage infrastructure
would be very useful; the problem can be related to the disks or to the network,
but in any case the errors reported to the user are rather misleading, as they
suggest a data integrity issue ("cannot read..." or something like that).
Just for reference, these are the first lines of the "open" top command output
(currently I am not experiencing problems):
[root@ovirt-node2 ~]# gluster volume top gv1 open
Brick: ovirt-node2.ovirt:/brickgv1/gv1
Current open fds: 15, Max open fds: 38, Max openfd time: 2022-09-19 
07:27:20.033304 +
Count          filename
===
331763          /45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/inbox
66284          /45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/leases
53939          /45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/metadata.new
169            
/45b4f14c-8323-482f-90ab-99d8fd610018/images/910fa026-d30b-4be2-9111-3c9f4f646fde/b7d6f39a-1481-4f5c-84fd-fc43f9e14d71
[...]
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/DEPVELZ2YE336RMKVOYONDOILQAFCXQB/


[ovirt-users] Re: VMs hang periodically: gluster problem?

2022-09-21 Thread Diego Ercolani
I tried to measure IO using gluster volume top, but its results seem very
cryptic to me (I need to analyze them in depth and don't have the time now).

Thank you very much for your analysis. If I understood correctly, the problem is
that the consumer SSD cache is too weak to keep up even with a small number
(~15) of not particularly IO-intensive VMs, so IO stalls because performance is
poor and this hangs the VMs. The VMs' kernels think the CPU has hung and so they
crash.

This seems to be the case.

If possible, a sort of profiler in the gluster environment that surfaces
evidence of issues related to the speed of the underlying storage infrastructure
would be very useful; the problem can be related to the disks or to the network,
but in any case the errors reported to the user are rather misleading, as they
suggest a data integrity issue ("cannot read..." or something like that).
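
Incidentally, gluster already ships a built-in profiler that exposes per-brick latency and FOP statistics, which may be close to what is wished for above; a minimal sketch, assuming the gv1 volume:

gluster volume profile gv1 start   # enable per-brick I/O statistics
gluster volume profile gv1 info    # dump cumulative and interval latency/FOP counters
gluster volume profile gv1 stop    # disable profiling when done
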
Just for reference, these are the first lines of the "open" top command output
(currently I am not experiencing problems):
[root@ovirt-node2 ~]# gluster volume top gv1 open
Brick: ovirt-node2.ovirt:/brickgv1/gv1
Current open fds: 15, Max open fds: 38, Max openfd time: 2022-09-19 
07:27:20.033304 +
Count   filename
===
331763  /45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/inbox
66284   /45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/leases
53939   /45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/metadata.new
169 
/45b4f14c-8323-482f-90ab-99d8fd610018/images/910fa026-d30b-4be2-9111-3c9f4f646fde/b7d6f39a-1481-4f5c-84fd-fc43f9e14d71
[...]
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/F7FKIJHYOANZM657KDZMIKC23CHXKRDS/


[ovirt-users] Re: VMs hang periodically: gluster problem?

2022-09-18 Thread Strahil Nikolov via Users
I hope you do realize that modern consumer SSDs have a small cache (at least 
according to  https://www.storagereview.com/review/samsung-860-evo-ssd-review ) 
and we can't rule out the disks.
Use gluster's top command to view the read (17.2.6) and write (17.2.7) 
performance of the bricks before (regular usage),  during (high load)  and 
after the backup (regular usage):
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.5/html/administration_guide/sect-running_the_volume_top_command


Best Regards,
Strahil Nikolov
 
 
  On Sat, Sep 17, 2022 at 17:49, Diego Ercolani wrote:  
Parameter cluster.choose-local set to off.
I confirm the filesystems of the bricks are all XFS, as required.
I started the farm only as a test bench for an oVirt implementation, so I used 3
hosts based on a Ryzen 5 desktop platform, each equipped with 4 DDR modules
(4 x 32GB), one disk for the OS and the other disks used as data bricks or NFS
targets; the data disks are all SATA, while the OS is installed on an internal
M.2 disk.
Node4, since it is the arbiter and doesn't require much space, uses only the
internal M.2 disk.
Every host is equipped with a dual-channel Intel X520 chipset with 2 SFP+ ports
configured with a 9000-byte packet size. The access LAN is the management LAN
(and also the LAN used by gluster); the VLANs are the "production" VLANs.

node2:
/dev/mapper/glustervg-glhe on /brickhe type xfs 
(rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/mapper/glustervg-gv0 on /brickgv0 type xfs 
(rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/mapper/glustervg-gv1 on /brickgv1 type xfs 
(rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)

  Devices file mpath_uuid part4-mpath-Samsung_SSD_860_EVO_500GB_S4XBNF0N334942Y PVID 9iyl5761LWcy3AYy36fNcPk0fADjNYtC last seen on /dev/mapper/Samsung_SSD_860_EVO_500GB_S4XBNF0N334942Y4 not found.
  PV /dev/mapper/Samsung_SSD_870_EVO_4TB_S6BCNG0R300064E  VG glustervg        lvm2 [<3.64 TiB / 1.54 TiB free]
  PV /dev/nvme0n1p4                                        VG glustervg        lvm2 [<287.02 GiB / <287.02 GiB free]
  PV /dev/nvme0n1p3                                        VG onn_ovirt-node2  lvm2 [177.15 GiB / <33.71 GiB free]
  PV /dev/mapper/ST4000NM000A-2HZ100_WJG1ZC85              VG daticold         lvm2 [<3.64 TiB / 2.44 TiB free]

node3:
/dev/mapper/glustervg-glhe on /brickhe type xfs 
(rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/mapper/glustervg-gv1 on /brickgv1 type xfs 
(rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/mapper/glustervg-gv0 on /brickgv0 type xfs 
(rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)

[root@ovirt-node3 ~]# pvscan -v
  PV /dev/sda        VG glustervg        lvm2 [<3.64 TiB / <1.64 TiB free]
  PV /dev/nvme0n1p4  VG glustervg        lvm2 [<287.02 GiB / <187.02 GiB free]
  PV /dev/nvme0n1p3  VG onn_ovirt-node3  lvm2 [177.15 GiB / <33.71 GiB free]
  Total: 3 [4.09 TiB] / in use: 3 [4.09 TiB] / in no VG: 0 [0  ]

node4:
/dev/mapper/onn_ovirt--node4-gluster on /dati type xfs 
(rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=64k,sunit=128,swidth=128,noquota)
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/RJJU22F44AFIDMY6A77WQS4Z2NUQMZOC/


[ovirt-users] Re: VMs hang periodically: gluster problem?

2022-09-17 Thread Diego Ercolani
Parameter cluster.choose-local set to off.
I confirm the filesystems of the bricks are all XFS, as required.
I started the farm only as a test bench for an oVirt implementation, so I used 3
hosts based on a Ryzen 5 desktop platform, each equipped with 4 DDR modules
(4 x 32GB), one disk for the OS and the other disks used as data bricks or NFS
targets; the data disks are all SATA, while the OS is installed on an internal
M.2 disk.
Node4, since it is the arbiter and doesn't require much space, uses only the
internal M.2 disk.
Every host is equipped with a dual-channel Intel X520 chipset with 2 SFP+ ports
configured with a 9000-byte packet size. The access LAN is the management LAN
(and also the LAN used by gluster); the VLANs are the "production" VLANs.

node2:
/dev/mapper/glustervg-glhe on /brickhe type xfs 
(rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/mapper/glustervg-gv0 on /brickgv0 type xfs 
(rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/mapper/glustervg-gv1 on /brickgv1 type xfs 
(rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)

  Devices file mpath_uuid part4-mpath-Samsung_SSD_860_EVO_500GB_S4XBNF0N334942Y PVID 9iyl5761LWcy3AYy36fNcPk0fADjNYtC last seen on /dev/mapper/Samsung_SSD_860_EVO_500GB_S4XBNF0N334942Y4 not found.
  PV /dev/mapper/Samsung_SSD_870_EVO_4TB_S6BCNG0R300064E  VG glustervg        lvm2 [<3.64 TiB / 1.54 TiB free]
  PV /dev/nvme0n1p4                                        VG glustervg        lvm2 [<287.02 GiB / <287.02 GiB free]
  PV /dev/nvme0n1p3                                        VG onn_ovirt-node2  lvm2 [177.15 GiB / <33.71 GiB free]
  PV /dev/mapper/ST4000NM000A-2HZ100_WJG1ZC85              VG daticold         lvm2 [<3.64 TiB / 2.44 TiB free]

node3:
/dev/mapper/glustervg-glhe on /brickhe type xfs 
(rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/mapper/glustervg-gv1 on /brickgv1 type xfs 
(rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/mapper/glustervg-gv0 on /brickgv0 type xfs 
(rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)

[root@ovirt-node3 ~]# pvscan -v
  PV /dev/sda VG glustervg lvm2 [<3.64 TiB / <1.64 TiB free]
  PV /dev/nvme0n1p4   VG glustervg lvm2 [<287.02 GiB / <187.02 GiB free]
  PV /dev/nvme0n1p3   VG onn_ovirt-node3   lvm2 [177.15 GiB / <33.71 GiB free]
  Total: 3 [4.09 TiB] / in use: 3 [4.09 TiB] / in no VG: 0 [0   ]

node4:
/dev/mapper/onn_ovirt--node4-gluster on /dati type xfs 
(rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=64k,sunit=128,swidth=128,noquota)
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/PO7D7UJUAZV2KAD2KNFZY3TTZNDUXLSL/


[ovirt-users] Re: VMs hang periodically: gluster problem?

2022-09-15 Thread Strahil Nikolov via Users
Set it back to the original value.
The option picks the local brick for reading instead of the fastest one (which
could be either a remote or a local one), which can help with bandwidth issues.

Can you provide details about the bricks, like HW RAID/JBOD, RAID type
(0,5,6,10), stripe size, stripe width, filesystem (I expect XFS but it's nice to
know), etc.?
Also share the gluster client log from the node where the backup proxy runs. It
should be something like:
/var/log/glusterfs/rhev-data-center-mnt-glusterSD-:_gv1.log
Best Regards,
Strahil Nikolov
 
  On Thu, Sep 15, 2022 at 17:01, Diego Ercolani wrote:  
During this time (Hosted-Engine hung), this appears on the host where the
Hosted-Engine is supposed to be running:
2022-09-15 13:59:27,762+ WARN  (Thread-10) [virt.vm] 
(vmId='8486ed73-df34-4c58-bfdc-7025dec63b7f') Shutdown by QEMU Guest Agent 
failed (agent probably inactive) (vm:5490)
2022-09-15 13:59:27,762+ WARN  (Thread-10) [virt.vm] 
(vmId='8486ed73-df34-4c58-bfdc-7025dec63b7f') Shutting down with guest agent 
FAILED (vmpowerdown:115)
2022-09-15 13:59:28,780+ ERROR (qgapoller/1) [virt.periodic.Operation] 
> 
operation failed (periodic:204)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 202, in 
__call__
    self._func()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 
493, in _poller
    vm_id, self._qga_call_get_vcpus(vm_obj))
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 
814, in _qga_call_get_vcpus
    if 'online' in vcpus:
TypeError: argument of type 'NoneType' is not iterable
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/RANBGQMMFTPMRL4PIMFPGUBJPKGQ7CZ2/


[ovirt-users] Re: VMs hang periodically: gluster problem?

2022-09-15 Thread Diego Ercolani
During this time (Hosted-Engine hung), this appears on the host where the
Hosted-Engine is supposed to be running:
2022-09-15 13:59:27,762+ WARN  (Thread-10) [virt.vm] 
(vmId='8486ed73-df34-4c58-bfdc-7025dec63b7f') Shutdown by QEMU Guest Agent 
failed (agent probably inactive) (vm:5490)
2022-09-15 13:59:27,762+ WARN  (Thread-10) [virt.vm] 
(vmId='8486ed73-df34-4c58-bfdc-7025dec63b7f') Shutting down with guest agent 
FAILED (vmpowerdown:115)
2022-09-15 13:59:28,780+ ERROR (qgapoller/1) [virt.periodic.Operation] 
> 
operation failed (periodic:204)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 202, in 
__call__
self._func()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 
493, in _poller
vm_id, self._qga_call_get_vcpus(vm_obj))
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 
814, in _qga_call_get_vcpus
if 'online' in vcpus:
TypeError: argument of type 'NoneType' is not iterable
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/ZMZ3V5E4ZFNWPW3R74ZXYFZA5RR3BV7R/


[ovirt-users] Re: VMs hang periodically: gluster problem?

2022-09-15 Thread Diego Ercolani
The current setting is:
[root@ovirt-node2 ~]# gluster volume get glen cluster.choose-local| awk 
'/choose-local/ {print $2}'
off
[root@ovirt-node2 ~]# gluster volume get gv0 cluster.choose-local| awk 
'/choose-local/ {print $2}'
off
[root@ovirt-node2 ~]# gluster volume get gv1 cluster.choose-local| awk 
'/choose-local/ {print $2}'
off

It is set in the "virt" group file:
/var/lib/glusterd/groups/virt:cluster.choose-local=off

I set cluster.choose-local to true on every gluster volume and started migrating
the Hosted Engine around... a bunch of VMs froze and after a while the
Hosted-Engine hung as well.
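
A minimal sketch of how the setting can be reverted on all three volumes after such a test, assuming the volume names above (off is the value shipped in the virt group):

for vol in glen gv0 gv1; do
    gluster volume set "$vol" cluster.choose-local off   # restore the virt-group default
done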

To complete the picture of the environment, here is the complete option set for
glen (the Hosted-Engine volume) and for gv0 and gv1 (the volumes used by VMs):

[root@ovirt-node3 ~]# gluster volume info gv1
Volume Name: gv1
Type: Replicate
Volume ID: 863221f4-e11c-4589-95e9-aa3948e177f5
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: ovirt-node2.ovirt:/brickgv1/gv1
Brick2: ovirt-node3.ovirt:/brickgv1/gv1
Brick3: ovirt-node4.ovirt:/dati/gv1 (arbiter)
Options Reconfigured:
storage.build-pgfid: off
cluster.granular-entry-heal: enable
storage.owner-gid: 36
storage.owner-uid: 36
cluster.lookup-optimize: off
server.keepalive-count: 5
server.keepalive-interval: 2
server.keepalive-time: 10
server.tcp-user-timeout: 20
network.ping-timeout: 30
performance.client-io-threads: on
server.event-threads: 4
client.event-threads: 4
cluster.choose-local: true
features.shard: on
cluster.shd-wait-qlength: 1
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
performance.strict-o-direct: on
network.remote-dio: off
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
transport.address-family: inet
nfs.disable: on
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/ARHV3AX7I7NZ5LYMZR7FHBXMENHSVVYN/


[ovirt-users] Re: VMs hang periodically: gluster problem?

2022-09-15 Thread Strahil Nikolov via Users
Can you test the backup after setting:
status=$(gluster volume get <VOLNAME> cluster.choose-local | awk '/choose-local/ {print $2}')
gluster volume set <VOLNAME> cluster.choose-local true
And after the test:
gluster volume set <VOLNAME> cluster.choose-local $status
Best Regards,
Strahil Nikolov
 
 
  On Thu, Sep 15, 2022 at 12:26, Diego Ercolani wrote:  
Sorry, I see that the editor stripped away all the leading spaces that indented
the timestamps.
I retried the test, hoping to reproduce the same error, and I found it, on
node3. I changed the code of the read routine:
cd /rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1; while sleep 0.1 ; do 
date +'Timestamp:%s.%N'; cat testfile  ; done

I also have to point out that in my gluster configuration node2 and node3 are
the replicating bricks, while node4 is the arbiter.

I find this:
node2:
1663233449.088250919
1663233449.192357508
1663233449.296979848
1663233449.401279036
1663233449.504945285
1663233449.609107728
1663233449.713468581
1663233449.817435890
1663233449.922132348
1663233450.030449768
1663233450.134975317
1663233450.239171022
1663233450.342905278
1663233450.447466303
1663233450.551867180
1663233450.658387123
1663233450.762761972
1663233450.868063254
1663233450.973718716
1663233451.077074998
1663233451.181540916
1663233451.286831549
1663233451.393060700
1663233451.500488204
1663233451.606233103
1663233451.711308978
1663233451.816455012
1663233451.922142384
1663233452.028786138
1663233452.134080858
1663233452.239052098
1663233452.343540758
1663233452.449015706
1663233452.553832377
1663233452.658255495
1663233452.762774092
1663233452.866525770
1663233452.970784862
1663233453.075297458
1663233453.178379039
1663233453.281728609
1663233453.385722608
1663233453.489965321
1663233453.593885612
1663233453.698436388
1663233453.802415640
1663233453.906987275
1663233454.010658544
1663233454.114877122
1663233454.218459344
1663233454.322761948
1663233454.428025821
1663233454.533464752
1663233454.637652754
1663233454.741783087
1663233454.845600527
1663233454.950286885
1663233455.055143240
1663233455.161169524
1663233455.265582394
1663233455.369963173
1663233455.475453048
1663233455.580044209
1663233455.684503325
1663233455.788750947
1663233455.894135415
1663233455.998738750


node3:
Timestamp:1663233450.000172185
1663233449.296979848
Timestamp:1663233450.101871259
1663233449.296979848
Timestamp:1663233450.204006554
1663233449.296979848
Timestamp:1663233450.306014420
1663233449.296979848
Timestamp:1663233450.407890669
1663233450.342905278
Timestamp:1663233450.511435794
1663233450.342905278
Timestamp:1663233450.613144044
1663233450.342905278
Timestamp:1663233450.714936282
1663233450.342905278
Timestamp:1663233450.816689957
1663233450.342905278
Timestamp:1663233450.919563686
1663233450.342905278
Timestamp:1663233451.021558628
1663233450.342905278
Timestamp:1663233451.123617850
1663233450.342905278
Timestamp:1663233451.225769366
1663233450.342905278
Timestamp:1663233451.327726226
1663233450.342905278
Timestamp:1663233451.429934369
1663233451.393060700
Timestamp:1663233451.532945857
1663233451.393060700
Timestamp:1663233451.634935468
1663233451.393060700
Timestamp:1663233451.737058041
1663233451.393060700
Timestamp:1663233451.839167797
1663233451.393060700
Timestamp:1663233451.941486148
1663233451.393060700
Timestamp:1663233452.043288336
1663233451.393060700
Timestamp:1663233452.145090644
1663233451.393060700
Timestamp:1663233452.246825425
1663233451.393060700
Timestamp:1663233452.348501234
1663233451.393060700
Timestamp:1663233452.450351853
Timestamp:1663233452.553106458
Timestamp:1663233452.655222156
Timestamp:1663233452.757315704
Timestamp:1663233452.859298562
Timestamp:1663233452.961655817
Timestamp:1663233453.063383043
Timestamp:1663233453.165180993
Timestamp:1663233453.266883792
Timestamp:1663233453.368890215
Timestamp:1663233453.470586924
1663233453.385722608
Timestamp:1663233453.573171648
1663233453.385722608
Timestamp:1663233453.675160288
1663233453.385722608
Timestamp:1663233453.777281257
1663233453.385722608
Timestamp:1663233453.879306084
1663233453.385722608
Timestamp:1663233453.981588858
1663233453.385722608
Timestamp:1663233454.083371309
1663233453.385722608
Timestamp:1663233454.185268095
1663233453.385722608
Timestamp:1663233454.287256013
1663233453.385722608
Timestamp:1663233454.389068540
1663233453.385722608
Timestamp:1663233454.490809573
1663233454.428025821
Timestamp:1663233454.593597380
1663233454.428025821
Timestamp:1663233454.695329646
1663233454.428025821
Timestamp:1663233454.797029330
1663233454.428025821
Timestamp:1663233454.899000216
1663233454.428025821

node4:
Timestam:1663233450.043398632
1663233449.817435890
Timestam:1663233450.144889219
1663233449.817435890
Timestam:1663233450.246423969
1663233449.817435890
Timestam:1663233450.347730771
1663233449.817435890
Timestam:1663233450.449109919
1663233449.817435890
Timestam:1663233450.550659616
1663233449.817435890
Timestam:1663233450.652173237
1663233449.817435890
Timestam:1663233450.753610724
1663233449.817435890
Timestam:1663233450.855978621

[ovirt-users] Re: VMs hang periodically: gluster problem?

2022-09-15 Thread Diego Ercolani
Sorry, I see that the editor stripped away all the leading spaces that indented
the timestamps.
I retried the test, hoping to reproduce the same error, and I found it, on
node3. I changed the code of the read routine:
cd /rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1; while sleep 0.1 ; do 
date +'Timestamp:%s.%N'; cat testfile  ; done

I also have to point out that in my gluster configuration node2 and node3 are
the replicating bricks, while node4 is the arbiter.

I find this:
node2:
1663233449.088250919
1663233449.192357508
1663233449.296979848
1663233449.401279036
1663233449.504945285
1663233449.609107728
1663233449.713468581
1663233449.817435890
1663233449.922132348
1663233450.030449768
1663233450.134975317
1663233450.239171022
1663233450.342905278
1663233450.447466303
1663233450.551867180
1663233450.658387123
1663233450.762761972
1663233450.868063254
1663233450.973718716
1663233451.077074998
1663233451.181540916
1663233451.286831549
1663233451.393060700
1663233451.500488204
1663233451.606233103
1663233451.711308978
1663233451.816455012
1663233451.922142384
1663233452.028786138
1663233452.134080858
1663233452.239052098
1663233452.343540758
1663233452.449015706
1663233452.553832377
1663233452.658255495
1663233452.762774092
1663233452.866525770
1663233452.970784862
1663233453.075297458
1663233453.178379039
1663233453.281728609
1663233453.385722608
1663233453.489965321
1663233453.593885612
1663233453.698436388
1663233453.802415640
1663233453.906987275
1663233454.010658544
1663233454.114877122
1663233454.218459344
1663233454.322761948
1663233454.428025821
1663233454.533464752
1663233454.637652754
1663233454.741783087
1663233454.845600527
1663233454.950286885
1663233455.055143240
1663233455.161169524
1663233455.265582394
1663233455.369963173
1663233455.475453048
1663233455.580044209
1663233455.684503325
1663233455.788750947
1663233455.894135415
1663233455.998738750


node3:
Timestamp:1663233450.000172185
1663233449.296979848
Timestamp:1663233450.101871259
1663233449.296979848
Timestamp:1663233450.204006554
1663233449.296979848
Timestamp:1663233450.306014420
1663233449.296979848
Timestamp:1663233450.407890669
1663233450.342905278
Timestamp:1663233450.511435794
1663233450.342905278
Timestamp:1663233450.613144044
1663233450.342905278
Timestamp:1663233450.714936282
1663233450.342905278
Timestamp:1663233450.816689957
1663233450.342905278
Timestamp:1663233450.919563686
1663233450.342905278
Timestamp:1663233451.021558628
1663233450.342905278
Timestamp:1663233451.123617850
1663233450.342905278
Timestamp:1663233451.225769366
1663233450.342905278
Timestamp:1663233451.327726226
1663233450.342905278
Timestamp:1663233451.429934369
1663233451.393060700
Timestamp:1663233451.532945857
1663233451.393060700
Timestamp:1663233451.634935468
1663233451.393060700
Timestamp:1663233451.737058041
1663233451.393060700
Timestamp:1663233451.839167797
1663233451.393060700
Timestamp:1663233451.941486148
1663233451.393060700
Timestamp:1663233452.043288336
1663233451.393060700
Timestamp:1663233452.145090644
1663233451.393060700
Timestamp:1663233452.246825425
1663233451.393060700
Timestamp:1663233452.348501234
1663233451.393060700
Timestamp:1663233452.450351853
Timestamp:1663233452.553106458
Timestamp:1663233452.655222156
Timestamp:1663233452.757315704
Timestamp:1663233452.859298562
Timestamp:1663233452.961655817
Timestamp:1663233453.063383043
Timestamp:1663233453.165180993
Timestamp:1663233453.266883792
Timestamp:1663233453.368890215
Timestamp:1663233453.470586924
1663233453.385722608
Timestamp:1663233453.573171648
1663233453.385722608
Timestamp:1663233453.675160288
1663233453.385722608
Timestamp:1663233453.777281257
1663233453.385722608
Timestamp:1663233453.879306084
1663233453.385722608
Timestamp:1663233453.981588858
1663233453.385722608
Timestamp:1663233454.083371309
1663233453.385722608
Timestamp:1663233454.185268095
1663233453.385722608
Timestamp:1663233454.287256013
1663233453.385722608
Timestamp:1663233454.389068540
1663233453.385722608
Timestamp:1663233454.490809573
1663233454.428025821
Timestamp:1663233454.593597380
1663233454.428025821
Timestamp:1663233454.695329646
1663233454.428025821
Timestamp:1663233454.797029330
1663233454.428025821
Timestamp:1663233454.899000216
1663233454.428025821

node4:
Timestam:1663233450.043398632
1663233449.817435890
Timestam:1663233450.144889219
1663233449.817435890
Timestam:1663233450.246423969
1663233449.817435890
Timestam:1663233450.347730771
1663233449.817435890
Timestam:1663233450.449109919
1663233449.817435890
Timestam:1663233450.550659616
1663233449.817435890
Timestam:1663233450.652173237
1663233449.817435890
Timestam:1663233450.753610724
1663233449.817435890
Timestam:1663233450.855978621
1663233450.762761972
Timestam:1663233450.958988505
1663233450.762761972
Timestam:1663233451.060495133
1663233450.762761972
Timestam:1663233451.162022459
1663233450.762761972
Timestam:1663233451.263371279
1663233450.762761972
Timestam:1663233451.364879118
1663233450.762761972
Timestam:1663233451.466311416
1663233450.762761972

[ovirt-users] Re: VMs hang periodically: gluster problem?

2022-09-15 Thread Diego Ercolani
Thank you for the analysis:

The versions are the latest distributed in the oVirt / CentOS 8 repositories:
[root@ovirt-node2 ~]# rpm -qa | grep '\(glusterfs-server\|ovirt-node\)'
ovirt-node-ng-image-update-placeholder-4.5.2-1.el8.noarch
glusterfs-server-10.2-1.el8s.x86_64
ovirt-node-ng-nodectl-4.4.2-1.el8.noarch
python3-ovirt-node-ng-nodectl-4.4.2-1.el8.noarch
ovirt-node-ng-image-update-4.5.2-1.el8.noarch

[root@ovirt-node3 ~]# rpm -qa | grep '\(glusterfs-server\|ovirt-node\)'
ovirt-node-ng-image-update-placeholder-4.5.2-1.el8.noarch
glusterfs-server-10.2-1.el8s.x86_64
ovirt-node-ng-nodectl-4.4.2-1.el8.noarch
python3-ovirt-node-ng-nodectl-4.4.2-1.el8.noarch
ovirt-node-ng-image-update-4.5.2-1.el8.noarch

[root@ovirt-node4 ~]# rpm -qa | grep '\(glusterfs-server\|ovirt-node\)'
ovirt-node-ng-image-update-placeholder-4.5.2-1.el8.noarch
glusterfs-server-10.2-1.el8s.x86_64
ovirt-node-ng-nodectl-4.4.2-1.el8.noarch
python3-ovirt-node-ng-nodectl-4.4.2-1.el8.noarch
ovirt-node-ng-image-update-4.5.2-1.el8.noarch

During a backup (or whenever there is I/O, even not too intensive judging by the
SSD LEDs), the only thing I noticed is that sometimes there is a sort of lag:
I issue "gluster volume heal glen|gv0|gv1 info" and the answer takes 4-5 seconds
to arrive... even though the answer reports 0 missing objects... the nodes are
always connected, e.g.:
Brick ovirt-node2.ovirt:/brickhe/_glen
Status: Connected
Number of entries: 0

Brick ovirt-node3.ovirt:/brickhe/glen
Status: Connected
Number of entries: 0

Brick ovirt-node4.ovirt:/dati/_glen
Status: Connected
Number of entries: 0
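
A simple way to quantify that lag is to time the same query during regular usage and again during the backup window; a sketch assuming the glen volume:

time gluster volume heal glen info > /dev/null   # wall-clock time of the heal-info query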

Regarding the "rate limit": I didn't work on QoS, but the destination is an NFS
SATA RAID5 NAS published via a 1Gb link, so I think I have a 20MB/s "cap" by
architecture; the gluster bricks are all built on SATA SSD drives, where I
recorded a throughput of 200MB/s.
I also tried to monitor performance via the iotop command, but I didn't record a
bandwidth problem, and I also monitored the network via iftop, recording no
bandwidth saturation and no errors.

Searching in the gluster mailing list
(https://lists.gluster.org/pipermail/gluster-users/2022-September/040063.html)
I tried the same test, but with writes and reads every 1/10 of a second:
[root@ovirt-node2 ~]# su - vdsm -s /bin/bash
Last login: Wed Sep 14 15:33:45 UTC 2022 on pts/1
nodectl must be run as root!
nodectl must be run as root!
cd /rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1; while sleep 0.1; do 
date +'%s.%N' | tee testfile ; done

[root@ovirt-node3 ~]# su - vdsm -s /bin/bash
nodectl must be run as root!
nodectl must be run as root!
[vdsm@ovirt-node3 ~]$ cd 
/rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1; while sleep 0.1 ; do 
date +' %s.%N'; cat testfile  ; done

[root@ovirt-node4 ~]# su - vdsm -s /bin/bash
Last login: Wed Aug 24 16:52:55 UTC 2022
nodectl must be run as root!
nodectl must be run as root!
[vdsm@ovirt-node4 ~]$ cd 
/rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1; while sleep 0.1 ; do 
date +' %s.%N'; cat testfile  ; done
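
For clarity, a condensed sketch of the test above (one writer on node2 and one reader each on node3/node4, all as the vdsm user on the same gluster mount; paths and the 0.1 s interval are the ones quoted above):

# writer (node2): rewrite testfile with the current timestamp every 0.1 s
cd /rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1
while sleep 0.1; do date +'%s.%N' | tee testfile; done

# reader (node3 and node4): print the local time, then the last value visible in testfile
cd /rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1
while sleep 0.1; do date +' %s.%N'; cat testfile; done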

The result is that on the nodes reading from glusterfs I record only about one
update per second, more or less.
To report the test I selected timestamps for node2 (the writing node) between
1663228352 and 1663228356, and for node3 and node4 between 1663228353 and
1663228356:

node2:
1663228352.589998302
1663228352.695887198
1663228352.801681699
1663228352.907548634
1663228353.011931276
1663228353.115904115
1663228353.222383590
1663228353.329941123
1663228353.436480791
1663228353.540536995
1663228353.644858473
1663228353.749470221
1663228353.853969491
1663228353.958703186
1663228354.062732971
1663228354.166616934
1663228354.270398507
1663228354.373989214
1663228354.477149100
1663228354.581862187
1663228354.686177524
1663228354.790362507
1663228354.894673446
1663228354.999136257
1663228355.102889616
1663228355.207043913
1663228355.312522545
1663228355.416667384
1663228355.520897473
1663228355.624582255
1663228355.728590069
1663228355.832979634
1663228355.937309737
1663228356.042289521
1663228356.146565174
1663228356.250773672
1663228356.356361818
1663228356.460048755
1663228356.565054968
1663228356.669126850
1663228356.773807899
1663228356.878011739
1663228356.983842597

node3:
 1663228353.027991911
1663228352.064562785
 1663228353.129696675
1663228353.115904115
 1663228353.232351572
1663228353.115904115
 1663228353.334188748
1663228353.115904115
 1663228353.436208688
1663228353.115904115
 1663228353.538268493
1663228353.115904115
 1663228353.641266519
1663228353.115904115
 1663228353.743094997
1663228353.115904115
 1663228353.845244131
1663228353.115904115
 1663228353.947049766
1663228353.115904115
 1663228354.048876741
1663228353.115904115
 1663228354.150979017
1663228354.062732971
 1663228354.254198339
1663228354.062732971
 1663228354.356197640
1663228354.270398507
 1663228354.459541685
1663228354.270398507
 1663228354.561548541
1663228354.270398507
 1663228354.664280563
1663228354.270398507
 1663228354.766557007
1663228354.270398507
 

[ovirt-users] Re: VMs hang periodically: gluster problem?

2022-09-14 Thread Strahil Nikolov via Users
I see some entries that are not good:
[2022-09-11 03:50:26.131393 +] W [MSGID: 108001] 
[afr-transaction.c:1016:afr_handle_quorum] 0-gv1-replicate-0: 
228740f8-1d14-4253-b95b-47e5feb6a3cc: Failing WRITE as quorum is not met 
[Invalid argument]

While the backup is running, what is the output of 'gluster pool list'?
Which gluster version are you using: rpm -qa | grep gluster
Is it possible to rate limit the backup?
Best Regards,
Strahil Nikolov
 
 
  On Mon, Sep 12, 2022 at 19:22, Diego Ercolani wrote:  
Hello. I did a full backup using Veeam, but I recorded many errors in the
gluster log.
This is the log (https://cloud.ssis.sm/index.php/s/KRimf5MLXK3Ds3d). The log is
from the same node where the Veeam proxy and the backed-up VMs reside.
Both are running in the gv1 storage domain.

Note that times are reported in GMT; the backup started at 15:55:48 GMT and
ended at 16:04:35.
The log is full of errors that I cannot understand:
...
[2022-09-12 15:56:24.983887 +] W [MSGID: 114031] 
[client-rpc-fops_v2.c:2620:client4_0_lookup_cbk] 0-gv1-client-2: remote 
operation failed. [{path=}, 
{gfid=2ca980cf-3cda-4f56-84bd-c089cdeb5878}, {errno=2}, {error=No such file or 
directory}]
...
[2022-09-12 15:56:42.570987 +] W [fuse-bridge.c:2979:fuse_readv_cbk] 
0-glusterfs-fuse: 891795: READ => -1 gfid=2906fff6-2292-451c-879c-547e189d0bb3 
fd=0x55b916d199b8 (Invalid argument)
...

e.g.
[root@ovirt-node3 ~]# mount -t glusterfs -o aux-gfid-mount,ro 
ovirt-node2.ovirt:/gv1 /mnt/
[root@ovirt-node3 ~]# gluster volume set gv1 build-pgfid on
volume set: success
[root@ovirt-node3 ~]# getfattr -n glusterfs.ancestry.path -e text 
/mnt/.gfid/2ca980cf-3cda-4f56-84bd-c089cdeb5878
getfattr: /mnt/.gfid/2ca980cf-3cda-4f56-84bd-c089cdeb5878: No such file or 
directory
[root@ovirt-node3 ~]# getfattr -n glusterfs.ancestry.path -e text 
/mnt/.gfid/2906fff6-2292-451c-879c-547e189d0bb3
getfattr: /mnt/.gfid/2906fff6-2292-451c-879c-547e189d0bb3: No such file or 
directory
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/Y4QI63IUYTKDHAZTI43Y5S6S5N2UR26M/


[ovirt-users] Re: VMs hang periodically: gluster problem?

2022-09-12 Thread Diego Ercolani
Hello. I did a full backup using Veeam, but I recorded many errors in the
gluster log.
This is the log (https://cloud.ssis.sm/index.php/s/KRimf5MLXK3Ds3d). The log is
from the same node where the Veeam proxy and the backed-up VMs reside.
Both are running in the gv1 storage domain.

Note that times are reported in GMT; the backup started at 15:55:48 GMT and
ended at 16:04:35.
The log is full of errors that I cannot understand:
...
[2022-09-12 15:56:24.983887 +] W [MSGID: 114031] 
[client-rpc-fops_v2.c:2620:client4_0_lookup_cbk] 0-gv1-client-2: remote 
operation failed. [{path=}, 
{gfid=2ca980cf-3cda-4f56-84bd-c089cdeb5878}, {errno=2}, {error=No such file or 
directory}]
...
[2022-09-12 15:56:42.570987 +] W [fuse-bridge.c:2979:fuse_readv_cbk] 
0-glusterfs-fuse: 891795: READ => -1 gfid=2906fff6-2292-451c-879c-547e189d0bb3 
fd=0x55b916d199b8 (Invalid argument)
...

e.g.
[root@ovirt-node3 ~]# mount -t glusterfs -o aux-gfid-mount,ro 
ovirt-node2.ovirt:/gv1 /mnt/
[root@ovirt-node3 ~]# gluster volume set gv1 build-pgfid on
volume set: success
[root@ovirt-node3 ~]# getfattr -n glusterfs.ancestry.path -e text 
/mnt/.gfid/2ca980cf-3cda-4f56-84bd-c089cdeb5878
getfattr: /mnt/.gfid/2ca980cf-3cda-4f56-84bd-c089cdeb5878: No such file or 
directory
[root@ovirt-node3 ~]# getfattr -n glusterfs.ancestry.path -e text 
/mnt/.gfid/2906fff6-2292-451c-879c-547e189d0bb3
getfattr: /mnt/.gfid/2906fff6-2292-451c-879c-547e189d0bb3: No such file or 
directory
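
When the ancestry lookup fails like this, another way to check whether a gfid still exists is to look for its hard link under the brick's .glusterfs directory, which stores each regular file as .glusterfs/<first two hex chars>/<next two>/<full gfid>; a sketch assuming the /brickgv1/gv1 brick path used on node2 and node3:

gfid=2ca980cf-3cda-4f56-84bd-c089cdeb5878
ls -l "/brickgv1/gv1/.glusterfs/${gfid:0:2}/${gfid:2:2}/${gfid}"   # gone if the file was deleted
# if the link is still there, locate the real path sharing the same inode
find /brickgv1/gv1 -samefile "/brickgv1/gv1/.glusterfs/${gfid:0:2}/${gfid:2:2}/${gfid}" -not -path '*/.glusterfs/*'
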
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/GRFUCFTWFUZUP5GQ74SHKDHMAUTHR57S/


[ovirt-users] Re: VMs hang periodically: gluster problem?

2022-09-11 Thread Diego Ercolani
Yes, it seems so, but I cannot record any error on the interfaces: I have 0 TX
errors and 0 RX errors, and all three nodes are connected through a single
switch. I set the MTU to 9000 to help gluster transfers, but I cannot record any
error.
In /var/log/vdsm/vdsm.log I periodically see the following on all the nodes, and
I cannot understand what is happening. I have unsaturated 10Gb/s links between
the hosts:
2022-09-11 12:30:31,708+ ERROR (periodic/134) 
[ovirt_hosted_engine_ha.client.client.HAClient] Malformed metadata for host 2: 
received 0 of 512 expected bytes (client:137)
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/ECAPNBSMZ4TYA3VSA62C77G6NT5O32UN/


[ovirt-users] Re: VMs hang periodically: gluster problem?

2022-09-09 Thread Strahil Nikolov via Users
I suspect you have network issues. Check the gluster log on the client side:
/var/log/glusterfs/rhev-data-center-mnt-glusterSD-:_.log

Best Regards,
Strahil Nikolov
 
  On Tue, Sep 6, 2022 at 17:19, Diego Ercolani wrote:   
I really don't understand. I was monitoring vdsm.log on one node (node2) and I
saw a complaint:
2022-09-06 14:08:27,105+ ERROR (check/loop) [storage.monitor] Error 
checking path 
/rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1/45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/metadata
 (monitor:511)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 509, in 
_pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 398, in 
delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: 
('/rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1/45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/metadata',
 1, 'Read timeout')
2022-09-06 14:08:27,105+ INFO  (check/loop) [storage.monitor] Domain 
45b4f14c-8323-482f-90ab-99d8fd610018 became INVALID (monitor:482)
2022-09-06 14:08:27,149+ ERROR (check/loop) [storage.monitor] Error 
checking path 
/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/dom_md/metadata
 (monitor:511)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 509, in 
_pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 398, in 
delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: 
('/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/dom_md/metadata',
 1, 'Read timeout')
2022-09-06 14:08:27,814+ INFO  (jsonrpc/5) [api.virt] START getStats() 
from=::1,54242, vmId=8486ed73-df34-4c58-bfdc-7025dec63b7f (api:48)
2022-09-06 14:08:27,814+ INFO  (jsonrpc/5) [api] FINISH getStats 
error=Virtual machine does not exist: {'vmId': 
'8486ed73-df34-4c58-bfdc-7025dec63b7f'} (api:129)
2022-09-06 14:08:27,814+ INFO  (jsonrpc/5) [api.virt] FINISH getStats 
return={'status': {'code': 1, 'message': "Virtual machine does not exist: 
{'vmId': '8486ed73-df34-4c58-bfdc-7025dec63b7f'}"}} from=::1,54242, 
vmId=8486ed73-df34-4c58-bfdc-7025dec63b7f (api:54)
2022-09-06 14:08:27,814+ INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call 
VM.getStats failed (error 1) in 0.00 seconds (__init__:312)
2022-09-06 14:08:31,357+ ERROR (check/loop) [storage.monitor] Error 
checking path 
/rhev/data-center/mnt/glusterSD/localhost:_glen/3577c21e-f757-4405-97d1-0f827c9b4e22/dom_md/metadata
 (monitor:511)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 509, in 
_pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 398, in 
delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: 
('/rhev/data-center/mnt/glusterSD/localhost:_glen/3577c21e-f757-4405-97d1-0f827c9b4e22/dom_md/metadata',
 1, 'Read timeout')
2022-09-06 14:08:32,918+ INFO  (periodic/5) [Executor] Worker was discarded 
(executor:305)

but, on the same node, from the command line I can issue a simple cat without
any problem:
[root@ovirt-node2 ~]# cat 
"/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/dom_md/metadata"
ALIGNMENT=1048576
BLOCK_SIZE=512
CLASS=Data
DESCRIPTION=gv0
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
POOL_UUID=da146814-f823-40e0-bd7b-8478dcfa38cd
REMOTE_PATH=localhost:/gv0
ROLE=Regular
SDUUID=60b7f172-08ed-4a22-8414-31fd5b100d72
TYPE=GLUSTERFS
VERSION=5
_SHA_CKSUM=a63324fa9b3030c3ffa35891c2d6c4e129c76af9

and
[root@ovirt-node2 ~]# cat 
'/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/dom_md/metadata'
ALIGNMENT=1048576
BLOCK_SIZE=512
CLASS=Data
DESCRIPTION=gv0
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
POOL_UUID=da146814-f823-40e0-bd7b-8478dcfa38cd
REMOTE_PATH=localhost:/gv0
ROLE=Regular
SDUUID=60b7f172-08ed-4a22-8414-31fd5b100d72
TYPE=GLUSTERFS
VERSION=5
_SHA_CKSUM=a63324fa9b3030c3ffa35891c2d6c4e129c76af9

After a while I retried the same cat and the host console hung. So sometimes
gluster revokes access to the file?! Why?

I think this "hang" is the source of all my problems.
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 

[ovirt-users] Re: VMs hang periodically: gluster problem?

2022-09-06 Thread Diego Ercolani
I really don't understand. I was monitoring vdsm.log on one node (node2) and I
saw a complaint:
2022-09-06 14:08:27,105+ ERROR (check/loop) [storage.monitor] Error 
checking path 
/rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1/45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/metadata
 (monitor:511)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 509, in 
_pathChecked
delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 398, in 
delay
raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: 
('/rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1/45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/metadata',
 1, 'Read timeout')
2022-09-06 14:08:27,105+ INFO  (check/loop) [storage.monitor] Domain 
45b4f14c-8323-482f-90ab-99d8fd610018 became INVALID (monitor:482)
2022-09-06 14:08:27,149+ ERROR (check/loop) [storage.monitor] Error 
checking path 
/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/dom_md/metadata
 (monitor:511)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 509, in 
_pathChecked
delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 398, in 
delay
raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: 
('/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/dom_md/metadata',
 1, 'Read timeout')
2022-09-06 14:08:27,814+ INFO  (jsonrpc/5) [api.virt] START getStats() 
from=::1,54242, vmId=8486ed73-df34-4c58-bfdc-7025dec63b7f (api:48)
2022-09-06 14:08:27,814+ INFO  (jsonrpc/5) [api] FINISH getStats 
error=Virtual machine does not exist: {'vmId': 
'8486ed73-df34-4c58-bfdc-7025dec63b7f'} (api:129)
2022-09-06 14:08:27,814+ INFO  (jsonrpc/5) [api.virt] FINISH getStats 
return={'status': {'code': 1, 'message': "Virtual machine does not exist: 
{'vmId': '8486ed73-df34-4c58-bfdc-7025dec63b7f'}"}} from=::1,54242, 
vmId=8486ed73-df34-4c58-bfdc-7025dec63b7f (api:54)
2022-09-06 14:08:27,814+ INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call 
VM.getStats failed (error 1) in 0.00 seconds (__init__:312)
2022-09-06 14:08:31,357+ ERROR (check/loop) [storage.monitor] Error 
checking path 
/rhev/data-center/mnt/glusterSD/localhost:_glen/3577c21e-f757-4405-97d1-0f827c9b4e22/dom_md/metadata
 (monitor:511)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 509, in 
_pathChecked
delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 398, in 
delay
raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: 
('/rhev/data-center/mnt/glusterSD/localhost:_glen/3577c21e-f757-4405-97d1-0f827c9b4e22/dom_md/metadata',
 1, 'Read timeout')
2022-09-06 14:08:32,918+ INFO  (periodic/5) [Executor] Worker was discarded 
(executor:305)

but, on the same node, from the command line I can issue a simple cat without
any problem:
[root@ovirt-node2 ~]# cat 
"/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/dom_md/metadata"
ALIGNMENT=1048576
BLOCK_SIZE=512
CLASS=Data
DESCRIPTION=gv0
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
POOL_UUID=da146814-f823-40e0-bd7b-8478dcfa38cd
REMOTE_PATH=localhost:/gv0
ROLE=Regular
SDUUID=60b7f172-08ed-4a22-8414-31fd5b100d72
TYPE=GLUSTERFS
VERSION=5
_SHA_CKSUM=a63324fa9b3030c3ffa35891c2d6c4e129c76af9

and
[root@ovirt-node2 ~]# cat 
'/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/dom_md/metadata'
ALIGNMENT=1048576
BLOCK_SIZE=512
CLASS=Data
DESCRIPTION=gv0
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
POOL_UUID=da146814-f823-40e0-bd7b-8478dcfa38cd
REMOTE_PATH=localhost:/gv0
ROLE=Regular
SDUUID=60b7f172-08ed-4a22-8414-31fd5b100d72
TYPE=GLUSTERFS
VERSION=5
_SHA_CKSUM=a63324fa9b3030c3ffa35891c2d6c4e129c76af9

After a while I retried the same cat and the host console hung. So sometimes
gluster revokes access to the file?! Why?

I think this "hang" is the source of all my problems.
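
A possible explanation for the cat succeeding while vdsm flags the domain is that the storage monitor performs a small direct-I/O read with a timeout, while a plain cat can be served from cache; a rough shell approximation of that check, reusing the gv1 metadata path from the log above:

timeout 10 dd if='/rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1/45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/metadata' \
    of=/dev/null bs=4096 count=1 iflag=direct || echo "direct read stalled or failed"
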
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/DY5WC45NQSQK2IPWSUKUT5U4MIFMPNW7/


[ovirt-users] Re: VMs hang periodically: gluster problem?

2022-09-05 Thread Diego Ercolani
I don't have disk problems, as I enabled smartd and perform a periodic test
(smartctl -t long), but in sanlock I have some problems, and the gluster heal
logs are also not clean:

The last event I recorded was today at 00:28 (2022-09-04 22:28 GMT); this is
when node3 sent the mail:
ovirt-hosted-engine state transition EngineMaybeAway-EngineDown
Received: from ovirt-node3.ovirt 
Mon,  5 Sep 2022 00:28:45 +0200 (CEST)
These are a bunch of logs from the three nodes. The gluster volume of the
hosted-engine is "glen":
https://cloud.ssis.sm/index.php/s/dSRyT3MM6ESnrb9
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/2RD5WE2A6SLSL54K5YZKLNU4DEUCVOSB/


[ovirt-users] Re: VMs hang periodically: gluster problem?

2022-09-03 Thread Strahil Nikolov via Users
Any sanlock errors to indicate storage problems? Have you checked the Gluster
logs for errors or indications of network disruption?
Best Regards,
Strahil Nikolov
 
 
  On Thu, Sep 1, 2022 at 12:18, Diego Ercolani wrote:   
Hello, I have a cluster made of 3 nodes in a "self-hosted-engine" topology.
I implemented the storage with gluster in a 2-replica + arbiter topology.
I have two gluster volumes:
glen - the volume used by the hosted-engine VM
gv0 - the volume used by the VMs

The physical disks are 4TB SSDs used only to accommodate VMs (including the hosted-engine).

I have continuous VM hangs, even of the hosted-engine; this causes plenty of
trouble, and the hangs happen asynchronously, even while there are management
operations on the VMs (migration, cloning...).

After a while the VM is freed, but on the VM console the kernel complains about
CPU or timer hangs and the only solution is to shutdown/poweroff the VM... even
the hosted engine: in fact it happens that "hosted-engine --vm-status" gives
"state=EngineUpBadHealth".

This is the log on the host during the event:
Sep 01 07:48:58 ovirt-node3.ovirt NetworkManager[1923]:   
[1662018538.0166] device (vnet73): state change: activated -> unmanaged (reason 
'unmanaged', sys-iface-state: 'removed')
Sep 01 07:48:58 ovirt-node3.ovirt NetworkManager[1923]:   
[1662018538.0168] device (vnet73): released from master device ovirtmgmt
Sep 01 07:48:58 ovirt-node3.ovirt libvirtd[2496]: Unable to read from monitor: 
Connection reset by peer
Sep 01 07:48:58 ovirt-node3.ovirt libvirtd[2496]: internal error: qemu 
unexpectedly closed the monitor: 2022-09-01T07:48:57.930955Z qemu-kvm: -device 
virtio-blk-pci,iothread=iothread1,bus=pci.6,addr=0x0,drive=libvirt-1-format,id=ua-0a1a501c-fc45-430f-bfd3-076172cec406,bootindex=1,write-cache=on,serial=0a1a501c-fc45-430f-bfd3-076172cec406,werror=stop,rerror=stop:
 Failed to get "write" lock Is another process using the image 
[/run/vdsm/storage/3577c21e-f757-4405-97d1-0f827c9b4e22/0a1a501c-fc45-430f-bfd3-076172cec406/f65dab86-67f1-46fa-87c0-f9076f479741]?
Sep 01 07:48:58 ovirt-node3.ovirt kvm[268578]: 5 guests now active
Sep 01 07:48:58 ovirt-node3.ovirt systemd[1]: 
machine-qemu\x2d67\x2dHostedEngine.scope: Succeeded.
Sep 01 07:48:58 ovirt-node3.ovirt systemd-machined[1613]: Machine 
qemu-67-HostedEngine terminated.
Sep 01 07:49:08 ovirt-node3.ovirt systemd[1]: 
NetworkManager-dispatcher.service: Succeeded.
Sep 01 07:49:08 ovirt-node3.ovirt ovirt-ha-agent[3338]: ovirt-ha-agent 
ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Engine VM stopped 
on localhost
Sep 01 07:49:14 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted 
Engine HA info: timed out
Sep 01 07:49:18 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted 
Engine HA info: timed out
Sep 01 07:49:18 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted 
Engine HA info: timed out
Sep 01 07:49:28 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted 
Engine HA info: timed out
Sep 01 07:49:28 ovirt-node3.ovirt sanlock[1633]: 2022-09-01 07:49:28 1706161 
[5083]: s4 delta_renew long write time 11 sec
Sep 01 07:49:28 ovirt-node3.ovirt sanlock[1633]: 2022-09-01 07:49:28 1706161 
[5033]: s3 delta_renew long write time 11 sec
Sep 01 07:49:34 ovirt-node3.ovirt libvirtd[2496]: Domain id=65 
name='ocr-Brain-28-ovirt.dmz.ssis' uuid=00425fb1-c24b-4eaa-9683-534d66b2cb04 is 
tainted: custom-ga-command
Sep 01 07:49:47 ovirt-node3.ovirt sudo[268984]:    root : TTY=unknown ; PWD=/ ; 
USER=root ; COMMAND=/bin/privsep-helper --privsep_context 
os_brick.privileged.default --privsep_sock_path /tmp/tmp1iolt06i/privsep.sock

This is the indication I have on gluster:
[root@ovirt-node3 ~]# gluster volume heal gv0 info
Brick ovirt-node2.ovirt:/brickgv0/_gv0
Status: Connected
Number of entries: 0

Brick ovirt-node3.ovirt:/brickgv0/gv0_1
Status: Connected
Number of entries: 0

Brick ovirt-node4.ovirt:/dati/_gv0
Status: Connected
Number of entries: 0


[root@ovirt-node3 ~]# gluster volume heal glen info
Brick ovirt-node2.ovirt:/brickhe/_glen
Status: Connected
Number of entries: 0

Brick ovirt-node3.ovirt:/brickhe/glen
Status: Connected
Number of entries: 0

Brick ovirt-node4.ovirt:/dati/_glen
Status: Connected
Number of entries: 0

So it seems healthy.

I don't know how to address the issue, but this is a big problem.
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/62EZCKA5HCAY2CP4O7RAXVNYEW4RAQOF/
  
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: 

[ovirt-users] Re: VMs hang periodically: gluster problem?

2022-09-01 Thread Diego Ercolani
Versions are the latest:
ovirt-host-4.5.0-3.el8.x86_64
ovirt-engine-4.5.2.4-1.el8.noarch
glusterfs-server-10.2-1.el8s.x86_64
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/FPKATK2I5E7EEHVX3NDJYL6PIFUXUCAQ/