Re: [Gluster-users] VMs blocked for more than 120 seconds

2019-05-21 Thread Krutika Dhananjay
Hi Martin,

Glad it worked! And yes, 3.7.6 is really old! :)

So the issue occurs when the VM flushes outstanding data to disk, and this
is taking > 120s because there are a lot of buffered writes to flush, possibly
followed by an fsync which then needs to sync them to disk (the volume profile
would have been helpful in confirming this). All these two options do is truly
honor the O_DIRECT flag (which is what we want anyway, given the VMs are
opened with the 'cache=none' qemu option). This skips write-caching on the
gluster client side and also bypasses the page cache on the gluster bricks,
so data gets flushed faster, thereby eliminating these timeouts.
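The buffered-then-flush pattern is easy to see locally with dd (a sketch, not from the cluster in question; "ddtest.img" is just a scratch file, and oflag=direct fails on filesystems without O_DIRECT support, e.g. tmpfs):

```shell
# Buffered write + fsync: pages pile up in the page cache and are flushed
# in one large burst at the end -- the pattern behind the >120s stalls.
dd if=/dev/zero of=ddtest.img bs=1M count=16 conv=fsync status=none \
  && echo "buffered+fsync write completed"

# Direct write: each 1M write bypasses the page cache, so nothing is left
# to flush at the end (errors out where O_DIRECT is unsupported).
dd if=/dev/zero of=ddtest.img bs=1M count=16 oflag=direct status=none \
  && echo "direct write completed"

rm -f ddtest.img
```

With direct I/O the cost is paid per write rather than in one giant flush, which is exactly why the 120s hung-task watchdog stops firing.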

-Krutika


On Mon, May 20, 2019 at 3:38 PM Martin  wrote:

> Hi Krutika,
>
> Also, gluster version please?
>
> I am running old 3.7.6. (Yes I know I should upgrade asap)
>
> I first applied "network.remote-dio off"; behaviour did not change,
> VMs got stuck after some time again.
> Then I set "performance.strict-o-direct on" and the problem completely
> disappeared. No more freezes at all (7 days without any problems at all).
> This SOLVED the issue.
>
> Can you explain what the remote-dio and strict-o-direct options changed
> in the behaviour of my Gluster? It would be great for the archive and
> later readers to understand what solved my issue, and why.
>
> Anyway, Thanks a LOT!!!
>
> BR,
> Martin
>
> On 13 May 2019, at 10:20, Krutika Dhananjay  wrote:
>
> OK. In that case, can you check if the following two changes help:
>
> # gluster volume set $VOL network.remote-dio off
> # gluster volume set $VOL performance.strict-o-direct on
>
> preferably one option changed at a time, its impact tested and then the
> next change applied and tested.
>
> Also, gluster version please?
>
> -Krutika
>
> On Mon, May 13, 2019 at 1:02 PM Martin Toth  wrote:
>
>> Cache in qemu is none. That should be correct. This is the full command:
>>
>> /usr/bin/qemu-system-x86_64 -name one-312 -S -machine
>> pc-i440fx-xenial,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp
>> 4,sockets=4,cores=1,threads=1 -uuid e95a774e-a594-4e98-b141-9f30a3f848c1
>> -no-user-config -nodefaults -chardev
>> socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-one-312/monitor.sock,server,nowait
>> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime
>> -no-shutdown -boot order=c,menu=on,splash-time=3000,strict=on -device
>> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2
>>
>> -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4
>> -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5
>> -drive file=/var/lib/one//datastores/116/312/*disk.0*
>> ,format=raw,if=none,id=drive-virtio-disk1,cache=none
>> -device
>> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1
>> -drive file=gluster://localhost:24007/imagestore/
>> *7b64d6757acc47a39503f68731f89b8e*
>> ,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none
>> -device
>> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0
>> -drive file=/var/lib/one//datastores/116/312/*disk.1*
>> ,format=raw,if=none,id=drive-ide0-0-0,readonly=on
>> -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0
>>
>> -netdev tap,fd=26,id=hostnet0
>> -device 
>> e1000,netdev=hostnet0,id=net0,mac=02:00:5c:f0:e4:39,bus=pci.0,addr=0x3
>> -chardev pty,id=charserial0 -device
>> isa-serial,chardev=charserial0,id=serial0
>> -chardev 
>> socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-one-312/org.qemu.guest_agent.0,server,nowait
>> -device
>> virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0
>> -vnc 0.0.0.0:312,password -device
>> cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device
>> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on
>>
>> I’ve highlighted the disks. The first is the VM context disk (accessed via
>> FUSE), the second is SDA, where the OS is installed (accessed via libgfapi),
>> and the third is SWAP (accessed via FUSE).
>>
>> Krutika,
>> I will start profiling on the Gluster volumes and wait for the next VM to
>> fail. Then I will attach/send the profiling info after some VM has failed.
>> I suppose this is the correct profiling strategy.
>>
>
> About this, how many vms do you need to recreate it? A single vm? Or
> multiple vms doing IO in parallel?
>
>
>> Thanks,
>> BR!
>> Martin
>>
>> On 13 May 2019, at 09:21, Krutika Dhananjay  wrote:
>>
>> Also, what's the caching policy that qemu is using on the affected vms?
>> Is it cache=none? Or something else? You can get this information in the
>> command line of qemu-kvm process corresponding to your vm in the ps output.
>>
>> -Krutika
>>
>> On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay 
>> wrote:
>>
>>> What version of gluster are you using?
>>> Also, can you capture and share volume-profile output for a run where
>>> you manage to recreate this issue?
>>>
>>> https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command
>>> Let me know if you have any questions.

Re: [Gluster-users] VMs blocked for more than 120 seconds

2019-05-20 Thread Martin
Hi Krutika,

> Also, gluster version please?

I am running old 3.7.6. (Yes I know I should upgrade asap)

I first applied "network.remote-dio off"; behaviour did not change, VMs
got stuck after some time again.
Then I set "performance.strict-o-direct on" and the problem completely
disappeared. No more freezes at all (7 days without any problems at all).
This SOLVED the issue.
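For the archive, the sequence looks like this (a sketch; "imagestore" stands in for the real volume name, and `gluster volume info` is just one way to confirm an option took effect):

```shell
#!/bin/sh
# Sketch only: substitute your real volume name for "imagestore".
VOL=imagestore

if command -v gluster >/dev/null 2>&1; then
  # One option at a time, observing VM behaviour between changes:
  gluster volume set "$VOL" network.remote-dio off
  # ... VMs still froze after this one ...
  gluster volume set "$VOL" performance.strict-o-direct on
  # Confirm both options are now active:
  gluster volume info "$VOL" | grep -E 'remote-dio|strict-o-direct'
else
  echo "gluster CLI not available on this host"
fi
```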

Can you explain what the remote-dio and strict-o-direct options changed in
the behaviour of my Gluster? It would be great for the archive and later
readers to understand what solved my issue, and why.

Anyway, Thanks a LOT!!!

BR, 
Martin

> On 13 May 2019, at 10:20, Krutika Dhananjay  wrote:
> 
> OK. In that case, can you check if the following two changes help:
> 
> # gluster volume set $VOL network.remote-dio off
> # gluster volume set $VOL performance.strict-o-direct on
> 
> preferably one option changed at a time, its impact tested and then the next 
> change applied and tested.
> 
> Also, gluster version please?
> 
> -Krutika
> 
> On Mon, May 13, 2019 at 1:02 PM Martin Toth wrote:
> Cache in qemu is none. That should be correct. This is the full command:
> 
> /usr/bin/qemu-system-x86_64 -name one-312 -S -machine 
> pc-i440fx-xenial,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp 
> 4,sockets=4,cores=1,threads=1 -uuid e95a774e-a594-4e98-b141-9f30a3f848c1 
> -no-user-config -nodefaults -chardev 
> socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-one-312/monitor.sock,server,nowait
>  -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime 
> -no-shutdown -boot order=c,menu=on,splash-time=3000,strict=on -device 
> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 
> 
> -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4
> -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5
> -drive 
> file=/var/lib/one//datastores/116/312/disk.0,format=raw,if=none,id=drive-virtio-disk1,cache=none
>   -device 
> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1
> -drive file=gluster://localhost:24007/imagestore/7b64d6757acc47a39503f68731f89b8e,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none
>   -device 
> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0
> -drive 
> file=/var/lib/one//datastores/116/312/disk.1,format=raw,if=none,id=drive-ide0-0-0,readonly=on
>   -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0
> 
> -netdev tap,fd=26,id=hostnet0 -device 
> e1000,netdev=hostnet0,id=net0,mac=02:00:5c:f0:e4:39,bus=pci.0,addr=0x3 
> -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 
> -chardev 
> socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-one-312/org.qemu.guest_agent.0,server,nowait
>  -device 
> virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0
>  -vnc 0.0.0.0:312,password -device 
> cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device 
> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on
> 
> I’ve highlighted disks. First is VM context disk - Fuse used, second is SDA 
> (OS is installed here) - libgfapi used, third is SWAP - Fuse used.
> 
> Krutika,
> I will start profiling on Gluster Volumes and wait for next VM to fail. Than 
> I will attach/send profiling info after some VM will be failed. I suppose 
> this is correct profiling strategy.
> 
> About this, how many vms do you need to recreate it? A single vm? Or multiple 
> vms doing IO in parallel?
> 
> 
> Thanks,
> BR!
> Martin
> 
>> On 13 May 2019, at 09:21, Krutika Dhananjay wrote:
>> 
>> Also, what's the caching policy that qemu is using on the affected vms?
>> Is it cache=none? Or something else? You can get this information in the 
>> command line of qemu-kvm process corresponding to your vm in the ps output.
>> 
>> -Krutika
>> 
>> On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay wrote:
>> What version of gluster are you using?
>> Also, can you capture and share volume-profile output for a run where you 
>> manage to recreate this issue?
>> https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command
>>  
>> 
>> Let me know if you have any questions.
>> 
>> -Krutika
>> 
>> On Mon, May 13, 2019 at 12:34 PM Martin Toth wrote:
>> Hi,
>> 
>> there is no healing operation, not peer disconnects, no readonly filesystem. 
>> Yes, storage is slow and unavailable for 120 seconds, but why, its SSD with 
>> 10G, performance is good.
>> 
>> > you'd have it's log on qemu's standard output,
>> 
>> If you mean /var/log/libvirt/qemu/vm.log, there is nothing. I have been
>> looking for the problem for more than a month and tried everything. Can’t
>> find anything.

Re: [Gluster-users] VMs blocked for more than 120 seconds

2019-05-13 Thread Krutika Dhananjay
OK. In that case, can you check if the following two changes help:

# gluster volume set $VOL network.remote-dio off
# gluster volume set $VOL performance.strict-o-direct on

preferably one option changed at a time, its impact tested and then the
next change applied and tested.

Also, gluster version please?

-Krutika

On Mon, May 13, 2019 at 1:02 PM Martin Toth  wrote:

> Cache in qemu is none. That should be correct. This is full command :
>
> /usr/bin/qemu-system-x86_64 -name one-312 -S -machine
> pc-i440fx-xenial,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp
> 4,sockets=4,cores=1,threads=1 -uuid e95a774e-a594-4e98-b141-9f30a3f848c1
> -no-user-config -nodefaults -chardev
> socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-one-312/monitor.sock,server,nowait
> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime
> -no-shutdown -boot order=c,menu=on,splash-time=3000,strict=on -device
> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2
>
> -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4
> -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5
> -drive file=/var/lib/one//datastores/116/312/*disk.0*
> ,format=raw,if=none,id=drive-virtio-disk1,cache=none
> -device
> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1
> -drive file=gluster://localhost:24007/imagestore/
> *7b64d6757acc47a39503f68731f89b8e*
> ,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none
> -device
> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0
> -drive file=/var/lib/one//datastores/116/312/*disk.1*
> ,format=raw,if=none,id=drive-ide0-0-0,readonly=on
> -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0
>
> -netdev tap,fd=26,id=hostnet0
> -device e1000,netdev=hostnet0,id=net0,mac=02:00:5c:f0:e4:39,bus=pci.0,addr=0x3
> -chardev pty,id=charserial0 -device
> isa-serial,chardev=charserial0,id=serial0
> -chardev 
> socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-one-312/org.qemu.guest_agent.0,server,nowait
> -device
> virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0
> -vnc 0.0.0.0:312,password -device cirrus-vga,id=video0,bus=pci.0,addr=0x2
> -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on
>
> I’ve highlighted disks. First is VM context disk - Fuse used, second is
> SDA (OS is installed here) - libgfapi used, third is SWAP - Fuse used.
>
> Krutika,
> I will start profiling on Gluster Volumes and wait for next VM to fail.
> Than I will attach/send profiling info after some VM will be failed. I
> suppose this is correct profiling strategy.
>

About this, how many vms do you need to recreate it? A single vm? Or
multiple vms doing IO in parallel?


> Thanks,
> BR!
> Martin
>
> On 13 May 2019, at 09:21, Krutika Dhananjay  wrote:
>
> Also, what's the caching policy that qemu is using on the affected vms?
> Is it cache=none? Or something else? You can get this information in the
> command line of qemu-kvm process corresponding to your vm in the ps output.
>
> -Krutika
>
> On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay 
> wrote:
>
>> What version of gluster are you using?
>> Also, can you capture and share volume-profile output for a run where you
>> manage to recreate this issue?
>>
>> https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command
>> Let me know if you have any questions.
>>
>> -Krutika
>>
>> On Mon, May 13, 2019 at 12:34 PM Martin Toth 
>> wrote:
>>
>>> Hi,
>>>
>>> there is no healing operation, not peer disconnects, no readonly
>>> filesystem. Yes, storage is slow and unavailable for 120 seconds, but why,
>>> its SSD with 10G, performance is good.
>>>
>>> > you'd have it's log on qemu's standard output,
>>>
>>> If you mean /var/log/libvirt/qemu/vm.log there is nothing. I am looking
>>> for problem for more than month, tried everything. Can’t find anything. Any
>>> more clues or leads?
>>>
>>> BR,
>>> Martin
>>>
>>> > On 13 May 2019, at 08:55, lemonni...@ulrar.net wrote:
>>> >
>>> > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote:
>>> >> Hi all,
>>> >
>>> > Hi
>>> >
>>> >>
>>> >> I am running replica 3 on SSDs with 10G networking, everything works
>>> OK but VMs stored in Gluster volume occasionally freeze with “Task XY
>>> blocked for more than 120 seconds”.
>>> >> Only solution is to poweroff (hard) VM and than boot it up again. I
>>> am unable to SSH and also login with console, its stuck probably on some
>>> disk operation. No error/warning logs or messages are store in VMs logs.
>>> >>
>>> >
>>> > As far as I know this should be unrelated, I get this during heals
>>> > without any freezes, it just means the storage is slow I think.
>>> >
>>> >> KVM/Libvirt(qemu) using libgfapi and fuse mount to access VM disks on
>>> replica volume. Can someone advice  how to debug this problem or what can
>>> cause these issues?
>>> >> It’s really annoying, I’ve tried to google everything but nothing came up.

Re: [Gluster-users] VMs blocked for more than 120 seconds

2019-05-13 Thread Andrey Volodin
What is the context from dmesg?
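For reference, something like this pulls the relevant context out of the kernel log on the affected VM (a sketch; the sample line below is typical hung-task output, not taken from Martin's hosts):

```shell
# On the affected VM, the hung-task context lives in the kernel log:
#   dmesg -T | grep -A 10 'blocked for more than 120 seconds'
# Illustrative sample of such a line (invented, not from this thread):
sample='INFO: task jbd2/vda1-8:141 blocked for more than 120 seconds.'

# Extract which task was blocked:
echo "$sample" | grep -o 'task [^ ]*'
```

The lines that follow the INFO line in dmesg (the call trace) show whether the task was stuck in a filesystem flush path, which is what matters here.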

On Mon, May 13, 2019 at 7:33 AM Andrey Volodin 
wrote:

> as per
> https://helpful.knobs-dials.com/index.php/INFO:_task_blocked_for_more_than_120_seconds.
> the informational warning could be suppressed with:
>
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>
> Moreover, as per their website : "*This message is not an error*.
> It is an indication that a program has had to wait for a very long time,
> and what it was doing. "
> More reference:
> https://serverfault.com/questions/405210/can-high-load-cause-server-hang-and-error-blocked-for-more-than-120-seconds
>
> Regards,
> Andrei
>
> On Mon, May 13, 2019 at 7:32 AM Martin Toth  wrote:
>
>> Cache in qemu is none. That should be correct. This is full command :
>>
>> /usr/bin/qemu-system-x86_64 -name one-312 -S -machine
>> pc-i440fx-xenial,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp
>> 4,sockets=4,cores=1,threads=1 -uuid e95a774e-a594-4e98-b141-9f30a3f848c1
>> -no-user-config -nodefaults -chardev
>> socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-one-312/monitor.sock,server,nowait
>> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime
>> -no-shutdown -boot order=c,menu=on,splash-time=3000,strict=on -device
>> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2
>>
>> -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4
>> -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5
>> -drive file=/var/lib/one//datastores/116/312/*disk.0*
>> ,format=raw,if=none,id=drive-virtio-disk1,cache=none
>> -device
>> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1
>> -drive file=gluster://localhost:24007/imagestore/
>> *7b64d6757acc47a39503f68731f89b8e*
>> ,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none
>> -device
>> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0
>> -drive file=/var/lib/one//datastores/116/312/*disk.1*
>> ,format=raw,if=none,id=drive-ide0-0-0,readonly=on
>> -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0
>>
>> -netdev tap,fd=26,id=hostnet0
>> -device 
>> e1000,netdev=hostnet0,id=net0,mac=02:00:5c:f0:e4:39,bus=pci.0,addr=0x3
>> -chardev pty,id=charserial0 -device
>> isa-serial,chardev=charserial0,id=serial0
>> -chardev 
>> socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-one-312/org.qemu.guest_agent.0,server,nowait
>> -device
>> virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0
>> -vnc 0.0.0.0:312,password -device
>> cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device
>> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on
>>
>> I’ve highlighted disks. First is VM context disk - Fuse used, second is
>> SDA (OS is installed here) - libgfapi used, third is SWAP - Fuse used.
>>
>> Krutika,
>> I will start profiling on Gluster Volumes and wait for next VM to fail.
>> Than I will attach/send profiling info after some VM will be failed. I
>> suppose this is correct profiling strategy.
>>
>> Thanks,
>> BR!
>> Martin
>>
>> On 13 May 2019, at 09:21, Krutika Dhananjay  wrote:
>>
>> Also, what's the caching policy that qemu is using on the affected vms?
>> Is it cache=none? Or something else? You can get this information in the
>> command line of qemu-kvm process corresponding to your vm in the ps output.
>>
>> -Krutika
>>
>> On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay 
>> wrote:
>>
>>> What version of gluster are you using?
>>> Also, can you capture and share volume-profile output for a run where
>>> you manage to recreate this issue?
>>>
>>> https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command
>>> Let me know if you have any questions.
>>>
>>> -Krutika
>>>
>>> On Mon, May 13, 2019 at 12:34 PM Martin Toth 
>>> wrote:
>>>
 Hi,

 there is no healing operation, not peer disconnects, no readonly
 filesystem. Yes, storage is slow and unavailable for 120 seconds, but why,
 its SSD with 10G, performance is good.

 > you'd have it's log on qemu's standard output,

 If you mean /var/log/libvirt/qemu/vm.log there is nothing. I am looking
 for problem for more than month, tried everything. Can’t find anything. Any
 more clues or leads?

 BR,
 Martin

 > On 13 May 2019, at 08:55, lemonni...@ulrar.net wrote:
 >
 > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote:
 >> Hi all,
 >
 > Hi
 >
 >>
 >> I am running replica 3 on SSDs with 10G networking, everything works
 OK but VMs stored in Gluster volume occasionally freeze with “Task XY
 blocked for more than 120 seconds”.
 >> Only solution is to poweroff (hard) VM and than boot it up again. I
 am unable to SSH and also login with console, its stuck probably on some
 disk operation. No error/warning logs or messages are store in VMs logs.
 >>
 >
>>> > As far as I know this should be unrelated, I get this during heals
>>> > without any freezes, it just means the storage is slow I think.

Re: [Gluster-users] VMs blocked for more than 120 seconds

2019-05-13 Thread Andrey Volodin
as per
https://helpful.knobs-dials.com/index.php/INFO:_task_blocked_for_more_than_120_seconds.
the informational warning could be suppressed with:

"echo 0 > /proc/sys/kernel/hung_task_timeout_secs"

Moreover, as per their website : "*This message is not an error*.
It is an indication that a program has had to wait for a very long time,
and what it was doing. "
More reference:
https://serverfault.com/questions/405210/can-high-load-cause-server-hang-and-error-blocked-for-more-than-120-seconds
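For completeness, the knob can be inspected and tuned like this (a sketch; these are standard Linux sysctls, and note that suppressing the message only silences the watchdog, it does not remove the underlying stall):

```shell
# Current timeout in seconds (0 means the hung-task check is disabled);
# the file exists only on kernels built with CONFIG_DETECT_HUNG_TASK:
cat /proc/sys/kernel/hung_task_timeout_secs 2>/dev/null \
  || echo "hung_task sysctl not available"

# Disable the warning for the running kernel only (needs root):
#   echo 0 > /proc/sys/kernel/hung_task_timeout_secs
# Or persistently across reboots:
#   echo 'kernel.hung_task_timeout_secs = 0' >> /etc/sysctl.d/99-hung-task.conf
#   sysctl --system
```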

Regards,
Andrei

On Mon, May 13, 2019 at 7:32 AM Martin Toth  wrote:

> Cache in qemu is none. That should be correct. This is the full command:
>
> /usr/bin/qemu-system-x86_64 -name one-312 -S -machine
> pc-i440fx-xenial,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp
> 4,sockets=4,cores=1,threads=1 -uuid e95a774e-a594-4e98-b141-9f30a3f848c1
> -no-user-config -nodefaults -chardev
> socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-one-312/monitor.sock,server,nowait
> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime
> -no-shutdown -boot order=c,menu=on,splash-time=3000,strict=on -device
> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2
>
> -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4
> -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5
> -drive file=/var/lib/one//datastores/116/312/*disk.0*
> ,format=raw,if=none,id=drive-virtio-disk1,cache=none
> -device
> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1
> -drive file=gluster://localhost:24007/imagestore/
> *7b64d6757acc47a39503f68731f89b8e*
> ,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none
> -device
> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0
> -drive file=/var/lib/one//datastores/116/312/*disk.1*
> ,format=raw,if=none,id=drive-ide0-0-0,readonly=on
> -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0
>
> -netdev tap,fd=26,id=hostnet0
> -device e1000,netdev=hostnet0,id=net0,mac=02:00:5c:f0:e4:39,bus=pci.0,addr=0x3
> -chardev pty,id=charserial0 -device
> isa-serial,chardev=charserial0,id=serial0
> -chardev 
> socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-one-312/org.qemu.guest_agent.0,server,nowait
> -device
> virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0
> -vnc 0.0.0.0:312,password -device cirrus-vga,id=video0,bus=pci.0,addr=0x2
> -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on
>
> I’ve highlighted disks. First is VM context disk - Fuse used, second is
> SDA (OS is installed here) - libgfapi used, third is SWAP - Fuse used.
>
> Krutika,
> I will start profiling on Gluster Volumes and wait for next VM to fail.
> Than I will attach/send profiling info after some VM will be failed. I
> suppose this is correct profiling strategy.
>
> Thanks,
> BR!
> Martin
>
> On 13 May 2019, at 09:21, Krutika Dhananjay  wrote:
>
> Also, what's the caching policy that qemu is using on the affected vms?
> Is it cache=none? Or something else? You can get this information in the
> command line of qemu-kvm process corresponding to your vm in the ps output.
>
> -Krutika
>
> On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay 
> wrote:
>
>> What version of gluster are you using?
>> Also, can you capture and share volume-profile output for a run where you
>> manage to recreate this issue?
>>
>> https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command
>> Let me know if you have any questions.
>>
>> -Krutika
>>
>> On Mon, May 13, 2019 at 12:34 PM Martin Toth 
>> wrote:
>>
>>> Hi,
>>>
>>> there is no healing operation, not peer disconnects, no readonly
>>> filesystem. Yes, storage is slow and unavailable for 120 seconds, but why,
>>> its SSD with 10G, performance is good.
>>>
>>> > you'd have it's log on qemu's standard output,
>>>
>>> If you mean /var/log/libvirt/qemu/vm.log there is nothing. I am looking
>>> for problem for more than month, tried everything. Can’t find anything. Any
>>> more clues or leads?
>>>
>>> BR,
>>> Martin
>>>
>>> > On 13 May 2019, at 08:55, lemonni...@ulrar.net wrote:
>>> >
>>> > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote:
>>> >> Hi all,
>>> >
>>> > Hi
>>> >
>>> >>
>>> >> I am running replica 3 on SSDs with 10G networking, everything works
>>> OK but VMs stored in Gluster volume occasionally freeze with “Task XY
>>> blocked for more than 120 seconds”.
>>> >> Only solution is to poweroff (hard) VM and than boot it up again. I
>>> am unable to SSH and also login with console, its stuck probably on some
>>> disk operation. No error/warning logs or messages are store in VMs logs.
>>> >>
>>> >
>>> > As far as I know this should be unrelated, I get this during heals
>>> > without any freezes, it just means the storage is slow I think.
>>> >
>>> >> KVM/Libvirt(qemu) using libgfapi and fuse mount to access VM disks on
>>> replica volume. Can someone advise how to debug this problem or what can cause these issues?

Re: [Gluster-users] VMs blocked for more than 120 seconds

2019-05-13 Thread Martin Toth
Cache in qemu is none. That should be correct. This is the full command:

/usr/bin/qemu-system-x86_64 -name one-312 -S -machine 
pc-i440fx-xenial,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp 
4,sockets=4,cores=1,threads=1 -uuid e95a774e-a594-4e98-b141-9f30a3f848c1 
-no-user-config -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-one-312/monitor.sock,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime 
-no-shutdown -boot order=c,menu=on,splash-time=3000,strict=on -device 
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 

-device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5
-drive 
file=/var/lib/one//datastores/116/312/disk.0,format=raw,if=none,id=drive-virtio-disk1,cache=none
-device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1
-drive 
file=gluster://localhost:24007/imagestore/7b64d6757acc47a39503f68731f89b8e,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none
-device 
scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0
-drive 
file=/var/lib/one//datastores/116/312/disk.1,format=raw,if=none,id=drive-ide0-0-0,readonly=on
-device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0

-netdev tap,fd=26,id=hostnet0 -device 
e1000,netdev=hostnet0,id=net0,mac=02:00:5c:f0:e4:39,bus=pci.0,addr=0x3 -chardev 
pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev 
socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-one-312/org.qemu.guest_agent.0,server,nowait
 -device 
virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0
 -vnc 0.0.0.0:312,password -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on

I’ve highlighted the disks. The first is the VM context disk (accessed via
FUSE), the second is SDA, where the OS is installed (accessed via libgfapi),
and the third is SWAP (accessed via FUSE).

Krutika,
I will start profiling on the Gluster volumes and wait for the next VM to
fail. Then I will attach/send the profiling info after some VM has failed. I
suppose this is the correct profiling strategy.
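That workflow, sketched out ("imagestore" stands in for the real volume name; `gluster volume profile` is the command described in the docs link quoted below):

```shell
#!/bin/sh
# Sketch only: substitute the real volume name for "imagestore".
VOL=imagestore

if command -v gluster >/dev/null 2>&1; then
  # Start collecting per-brick latency/fop statistics:
  gluster volume profile "$VOL" start
  # ... wait for a VM to freeze, then snapshot the counters:
  gluster volume profile "$VOL" info > "profile-$VOL-$(date +%Y%m%d-%H%M%S).txt"
  # Stop profiling once the data is captured:
  gluster volume profile "$VOL" stop
else
  echo "gluster CLI not available on this host"
fi
```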

Thanks,
BR!
Martin

> On 13 May 2019, at 09:21, Krutika Dhananjay  wrote:
> 
> Also, what's the caching policy that qemu is using on the affected vms?
> Is it cache=none? Or something else? You can get this information in the 
> command line of qemu-kvm process corresponding to your vm in the ps output.
> 
> -Krutika
> 
> On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay wrote:
> What version of gluster are you using?
> Also, can you capture and share volume-profile output for a run where you 
> manage to recreate this issue?
> https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command
>  
> 
> Let me know if you have any questions.
> 
> -Krutika
> 
> On Mon, May 13, 2019 at 12:34 PM Martin Toth wrote:
> Hi,
> 
> there is no healing operation, no peer disconnects, no read-only filesystem.
> Yes, the storage is slow and unavailable for 120 seconds, but why? It's SSD
> with 10G networking; performance is good.
> 
> > you'd have it's log on qemu's standard output,
> 
> If you mean /var/log/libvirt/qemu/vm.log, there is nothing. I have been
> looking for the problem for more than a month and tried everything. Can’t
> find anything. Any more clues or leads?
> 
> BR,
> Martin
> 
> > On 13 May 2019, at 08:55, lemonni...@ulrar.net wrote:
> > 
> > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote:
> >> Hi all,
> > 
> > Hi
> > 
> >> 
> >> I am running replica 3 on SSDs with 10G networking, everything works OK 
> >> but VMs stored in Gluster volume occasionally freeze with “Task XY blocked 
> >> for more than 120 seconds”.
> >> The only solution is to power off (hard) the VM and then boot it up
> >> again. I am unable to SSH or log in via console; it's probably stuck on
> >> some disk operation. No error/warning logs or messages are stored in the
> >> VM's logs.
> >> 
> > 
> > As far as I know this should be unrelated, I get this during heals
> > without any freezes, it just means the storage is slow I think.
> > 
> >> KVM/Libvirt(qemu) using libgfapi and a fuse mount to access VM disks on
> >> a replica volume. Can someone advise how to debug this problem or what
> >> can cause these issues?
> >> It’s really annoying, I’ve tried to google everything but nothing came up. 
> >> I’ve tried changing virtio-scsi-pci to virtio-blk-pci disk drivers, but 
> >> it's not related.
> >> 
> > 
> > Any chance your gluster goes readonly ? Have you checked your gluster
> > logs to see if maybe they lose each other some times ?
> > /var/log/glusterfs
> > 
> > For libgfapi accesses you'd have its log on qemu's standard output, which
> > might contain the actual error at the time of the freeze.

Re: [Gluster-users] VMs blocked for more than 120 seconds

2019-05-13 Thread Krutika Dhananjay
Also, what's the caching policy that qemu is using on the affected vms?
Is it cache=none? Or something else? You can get this information in the
command line of qemu-kvm process corresponding to your vm in the ps output.
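One way to pull that out of ps (a sketch; demonstrated here on a fragment of the command line quoted later in this thread, since the live process is only on Martin's hosts):

```shell
# In practice, on the hypervisor:
#   ps -eo args | grep '[q]emu-system' | grep -o 'cache=[a-z]*'
# Demonstrated on a -drive fragment from the command line in this thread:
cmdline='-drive file=/var/lib/one//datastores/116/312/disk.0,format=raw,if=none,id=drive-virtio-disk1,cache=none'

echo "$cmdline" | grep -o 'cache=[a-z]*'
```

Each -drive option carries its own cache= setting, so the grep can print one line per disk.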

-Krutika

On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay 
wrote:

> What version of gluster are you using?
> Also, can you capture and share volume-profile output for a run where you
> manage to recreate this issue?
>
> https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command
> Let me know if you have any questions.
>
> -Krutika
>
> On Mon, May 13, 2019 at 12:34 PM Martin Toth  wrote:
>
>> Hi,
>>
>> there is no healing operation, no peer disconnects, no read-only
>> filesystem. Yes, the storage is slow and unavailable for 120 seconds, but
>> why? It's SSD with 10G networking; performance is good.
>>
>> > you'd have it's log on qemu's standard output,
>>
>> If you mean /var/log/libvirt/qemu/vm.log, there is nothing. I have been
>> looking for the problem for more than a month and tried everything. Can’t
>> find anything. Any more clues or leads?
>>
>> BR,
>> Martin
>>
>> > On 13 May 2019, at 08:55, lemonni...@ulrar.net wrote:
>> >
>> > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote:
>> >> Hi all,
>> >
>> > Hi
>> >
>> >>
>> >> I am running replica 3 on SSDs with 10G networking, everything works
>> OK but VMs stored in Gluster volume occasionally freeze with “Task XY
>> blocked for more than 120 seconds”.
>> >> The only solution is to power off (hard) the VM and then boot it up
>> again. I am unable to SSH or log in via console; it's probably stuck on some
>> disk operation. No error/warning logs or messages are stored in the VM's logs.
>> >>
>> >
>> > As far as I know this should be unrelated, I get this during heals
>> > without any freezes, it just means the storage is slow I think.
>> >
>> >> KVM/Libvirt(qemu) using libgfapi and fuse mount to access VM disks on
>> replica volume. Can someone advise how to debug this problem or what can
>> cause these issues?
>> >> It’s really annoying, I’ve tried to google everything but nothing came
>> up. I’ve tried changing virtio-scsi-pci to virtio-blk-pci disk drivers, but
>> its not related.
>> >>
>> >
>> > Any chance your gluster goes readonly? Have you checked your gluster
>> > logs (/var/log/glusterfs) to see if maybe the peers lose each other
>> > sometimes?
>> >
>> > For libgfapi accesses you'd have its log on qemu's standard output;
>> > that might contain the actual error at the time of the freeze.
>> > ___
>> > Gluster-users mailing list
>> > Gluster-users@gluster.org
>> > https://lists.gluster.org/mailman/listinfo/gluster-users
>>
>
>

Re: [Gluster-users] VMs blocked for more than 120 seconds

2019-05-13 Thread Krutika Dhananjay
What version of gluster are you using?
Also, can you capture and share volume-profile output for a run where you
manage to recreate this issue?
https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command
Let me know if you have any questions.
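A sketch of such a profiling run, assuming the gluster CLI is available on
one of the server nodes. The volume name "myvol" and the output filename are
examples only:

```shell
VOL=myvol   # assumption: replace with your actual volume name

# Start collecting per-brick latency and FOP statistics.
gluster volume profile "$VOL" start

# ... reproduce the VM freeze while profiling is active ...

# Dump the collected statistics to a file for sharing, then stop profiling.
gluster volume profile "$VOL" info > profile-during-freeze.txt
gluster volume profile "$VOL" stop
```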

-Krutika

On Mon, May 13, 2019 at 12:34 PM Martin Toth  wrote:

> Hi,
>
> there is no healing operation, no peer disconnects, no readonly
> filesystem. Yes, storage is slow and unavailable for 120 seconds, but why?
> It's SSD with 10G networking, and performance is good.
>
> > you'd have its log on qemu's standard output,
>
> If you mean /var/log/libvirt/qemu/vm.log, there is nothing. I have been
> looking for the problem for more than a month and have tried everything. I
> can't find anything. Any more clues or leads?
>
> BR,
> Martin
>
> > On 13 May 2019, at 08:55, lemonni...@ulrar.net wrote:
> >
> > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote:
> >> Hi all,
> >
> > Hi
> >
> >>
> >> I am running replica 3 on SSDs with 10G networking. Everything works
> OK, but VMs stored in the Gluster volume occasionally freeze with “Task XY
> blocked for more than 120 seconds”.
> >> The only solution is to power off the VM (hard) and then boot it up
> again. I am unable to SSH in or log in via the console; it's stuck, probably
> on some disk operation. No error/warning logs or messages are stored in the
> VM's logs.
> >>
> >
> > As far as I know this should be unrelated, I get this during heals
> > without any freezes, it just means the storage is slow I think.
> >
> >> KVM/libvirt (qemu) uses libgfapi and a FUSE mount to access VM disks
> on the replica volume. Can someone advise how to debug this problem, or
> what can cause these issues?
> >> It's really annoying; I've tried to google everything, but nothing came
> up. I've tried changing virtio-scsi-pci to virtio-blk-pci disk drivers, but
> it's not related.
> >>
> >
> > Any chance your gluster goes readonly? Have you checked your gluster
> > logs (/var/log/glusterfs) to see if maybe the peers lose each other
> > sometimes?
> >
> > For libgfapi accesses you'd have its log on qemu's standard output;
> > that might contain the actual error at the time of the freeze.
>

Re: [Gluster-users] VMs blocked for more than 120 seconds

2019-05-13 Thread Martin Toth
Hi,

there is no healing operation, no peer disconnects, no readonly filesystem.
Yes, storage is slow and unavailable for 120 seconds, but why? It's SSD with
10G networking, and performance is good.

> you'd have its log on qemu's standard output,

If you mean /var/log/libvirt/qemu/vm.log, there is nothing. I have been
looking for the problem for more than a month and have tried everything. I
can't find anything. Any more clues or leads?

BR,
Martin

> On 13 May 2019, at 08:55, lemonni...@ulrar.net wrote:
> 
> On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote:
>> Hi all,
> 
> Hi
> 
>> 
>> I am running replica 3 on SSDs with 10G networking. Everything works OK,
>> but VMs stored in the Gluster volume occasionally freeze with “Task XY
>> blocked for more than 120 seconds”.
>> The only solution is to power off the VM (hard) and then boot it up again.
>> I am unable to SSH in or log in via the console; it's stuck, probably on
>> some disk operation. No error/warning logs or messages are stored in the
>> VM's logs.
>> 
> 
> As far as I know this should be unrelated, I get this during heals
> without any freezes, it just means the storage is slow I think.
> 
>> KVM/libvirt (qemu) uses libgfapi and a FUSE mount to access VM disks on
>> the replica volume. Can someone advise how to debug this problem, or what
>> can cause these issues?
>> It's really annoying; I've tried to google everything, but nothing came
>> up. I've tried changing virtio-scsi-pci to virtio-blk-pci disk drivers,
>> but it's not related.
>> 
> 
> Any chance your gluster goes readonly? Have you checked your gluster
> logs (/var/log/glusterfs) to see if maybe the peers lose each other
> sometimes?
> 
> For libgfapi accesses you'd have its log on qemu's standard output;
> that might contain the actual error at the time of the freeze.


Re: [Gluster-users] VMs blocked for more than 120 seconds

2019-05-13 Thread lemonnierk
On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote:
> Hi all,

Hi

> 
> I am running replica 3 on SSDs with 10G networking. Everything works OK,
> but VMs stored in the Gluster volume occasionally freeze with “Task XY
> blocked for more than 120 seconds”.
> The only solution is to power off the VM (hard) and then boot it up again.
> I am unable to SSH in or log in via the console; it's stuck, probably on
> some disk operation. No error/warning logs or messages are stored in the
> VM's logs.
> 

As far as I know this should be unrelated; I get this during heals
without any freezes. I think it just means the storage is slow.
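For context, that message comes from the guest kernel's hung-task watchdog
(khungtaskd), which fires when a task stays in uninterruptible sleep longer
than kernel.hung_task_timeout_secs, 120 seconds by default. A sketch of
checking this inside a Linux guest follows; the sample log line (task name
included) is purely illustrative:

```shell
# The watchdog threshold that produces the "120 seconds" figure
# (run inside the guest; 120 is the usual default):
sysctl kernel.hung_task_timeout_secs

# A sample of what the guest's kernel log contains during such a stall;
# the same grep works on real dmesg output:
msg="INFO: task jbd2/vda1-8:320 blocked for more than 120 seconds."
echo "$msg" | grep -o 'blocked for more than [0-9]* seconds'
# prints: blocked for more than 120 seconds
```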

> KVM/libvirt (qemu) uses libgfapi and a FUSE mount to access VM disks on
> the replica volume. Can someone advise how to debug this problem, or what
> can cause these issues?
> It's really annoying; I've tried to google everything, but nothing came
> up. I've tried changing virtio-scsi-pci to virtio-blk-pci disk drivers,
> but it's not related.
> 

Any chance your gluster goes readonly? Have you checked your gluster logs
(/var/log/glusterfs) to see if maybe the peers lose each other sometimes?
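A rough sketch of scanning those logs for peers dropping out. The exact
message wording varies across gluster versions, so the pattern below is only
a starting point, and it assumes the default log directory:

```shell
# Look for disconnect / read-only events across all gluster logs and
# show the most recent hits.
grep -riE 'disconnect|peer .*down|read-only' /var/log/glusterfs/ | tail -n 20
```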

For libgfapi accesses you'd have its log on qemu's standard output;
that might contain the actual error at the time of the freeze.