[ceph-users] IO Hang on rbd
Hi all! We have an annoying problem: when we launch an intensive read workload against rbd, the client on which the image is mapped hangs in this state:

Device:   rrqm/s  wrqm/s    r/s    w/s   rMB/s   wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda         0.00    0.00   0.00   1.20    0.00    0.00     8.00     0.00   0.00    0.00    0.00   0.00   0.00
dm-0        0.00    0.00   0.00   1.20    0.00    0.00     8.00     0.00   0.00    0.00    0.00   0.00   0.00
dm-1        0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
rbd0        0.00    0.00   0.00   0.00    0.00    0.00     0.00    32.00   0.00    0.00    0.00   0.00 100.00

Only a reboot helps. The logs are clean. The fastest way to trigger the hang is to run an fio read with a 512K block size; 4K usually works fine. But the client may also hang without fio, simply under heavy load. We have used different versions of the Linux kernel and Ceph; right now the OSDs and MONs run Ceph 0.87-1 on Linux kernel 3.18. On the clients we have tried the latest builds from http://gitbuilder.ceph.com/, for example Ceph 0.87-68. Through libvirt everything works fine - we also use KVM and stgt (but stgt is slow).

Here is my config:

[global]
fsid = 566d9cab-793e-47e0-a0cd-e5da09f8037a
mon_initial_members = srt-mon-001-02,amz-mon-001-000601,db24-mon-001-000105
mon_host = 10.201.20.31,10.203.20.56,10.202.20.58
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public network = 10.201.20.0/22
cluster network = 10.212.36.0/22
osd crush update on start = false

[mon]
debug mon = 0
debug paxos = 0/0
debug auth = 0

[mon.srt-mon-001-02]
host = srt-mon-001-02
mon addr = 10.201.20.31:6789

[mon.db24-mon-001-000105]
host = db24-mon-001-000105
mon addr = 10.202.20.58:6789

[mon.amz-mon-001-000601]
host = amz-mon-001-000601
mon addr = 10.203.20.56:6789

[osd]
osd crush update on start = false
osd mount options xfs = rw,noatime,inode64,allocsize=4M
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
osd op threads = 20
osd disk threads = 8
journal block align = true
journal dio = true
journal aio = true
osd recovery max active = 1
filestore max sync interval = 100
filestore min sync interval = 10
filestore queue max ops = 2000
filestore queue max bytes = 536870912
filestore queue committing max ops = 2000
filestore queue committing max bytes = 536870912
osd max backfills = 1
osd client op priority = 63

[osd.5]
host = srt-osd-001-050204
[osd.6]
host = srt-osd-001-050204
[osd.7]
host = srt-osd-001-050204
[osd.8]
host = srt-osd-001-050204
[osd.109]

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
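The reported reproducer ("fio read with block size 512K") can be expressed as an fio job file. This is a hypothetical reconstruction: the device path, io engine, queue depth, and runtime are assumptions, not taken from the thread.

```ini
; seqread-512k.fio -- sketch of the reported reproducer.
; /dev/rbd0, libaio, iodepth=32, and runtime are assumed values.
[seqread-512k]
filename=/dev/rbd0
rw=read
bs=512k
direct=1
ioengine=libaio
iodepth=32
runtime=300
time_based=1
```

Run with `fio seqread-512k.fio`; the 4K case that reportedly survives would be the same job with `bs=4k`.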
Re: [ceph-users] IO Hang on rbd
Try lowering filestore max sync interval and filestore min sync interval. It looks like during the hung period data is flushed from some overly big buffer. If this does not help, you can monitor perf stats on the OSDs to see if some queue is unusually large.

--
Tomasz Kuzemko
tomasz.kuze...@ovh.net

On Thu, Dec 11, 2014 at 07:57:48PM +0300, reistlin87 wrote:
> [original message and config quoted in full above - trimmed]
Re: [ceph-users] IO Hang on rbd
On Thu, Dec 11, 2014 at 7:57 PM, reistlin87 79026480...@yandex.ru wrote:
> [original message and config quoted in full above - trimmed]

Is there anything in dmesg around the time it hangs? If possible, don't change anything about your config - the number of OSDs, number of PGs, pools, etc. - so that you can reproduce it with logging enabled.

Thanks,

                Ilya
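For the "reproduce with logging enabled" step on the client side, one option is the kernel's dynamic debug facility. This is a sketch, not a command given in the thread; it assumes the client kernel was built with CONFIG_DYNAMIC_DEBUG and that debugfs is mounted at the default location.

```shell
# Enable debug output for the kernel rbd client and its messenger
# layer (run as root on the client before reproducing the hang).
echo 'module rbd +p'     > /sys/kernel/debug/dynamic_debug/control
echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control

# ...reproduce the hang, then capture the kernel log to share:
dmesg > rbd-hang-dmesg.txt
```

This can be very verbose under load, so it is best enabled only for the reproduction window.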
Re: [ceph-users] IO Hang on rbd
On Mon, Dec 15, 2014 at 4:11 PM, Tomasz Kuzemko tomasz.kuze...@ovh.net wrote:
> Try lowering filestore max sync interval and filestore min sync
> interval. It looks like during the hung period data is flushed from
> some overly big buffer. If this does not help, you can monitor perf
> stats on the OSDs to see if some queue is unusually large.

This must be a kernel client issue. OP, please don't change any settings - I need you to reproduce it so I can gather more info.

Thanks,

                Ilya
Re: [ceph-users] IO Hang on rbd
We tried the default configuration, without the additional parameters, but it still hangs. How can we see an OSD queue?

15.12.2014, 16:11, Tomasz Kuzemko tomasz.kuze...@ovh.net:
> Try lowering filestore max sync interval and filestore min sync
> interval. It looks like during the hung period data is flushed from
> some overly big buffer. If this does not help, you can monitor perf
> stats on the OSDs to see if some queue is unusually large.
> [original message and config quoted in full above - trimmed]
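On the OSD-queue question: `ceph daemon osd.N perf dump`, run on the OSD host against the admin socket, prints the perf counters as JSON, including the filestore queue depths. Below is a minimal sketch of picking out a saturated queue from that JSON. The embedded sample and the exact counter names are assumptions (they vary by release), so grep your own dump for "queue" to find the right keys.

```python
import json

# Embedded sample standing in for `ceph daemon osd.5 perf dump` output.
# Counter names here are assumptions; check your own dump for the
# exact keys your release exposes.
sample_dump = json.loads("""
{
  "filestore": {
    "op_queue_ops": 1980,
    "op_queue_max_ops": 2000,
    "journal_queue_ops": 12,
    "journal_queue_max_ops": 500
  }
}
""")

def saturated_queues(dump, threshold=0.9):
    """Return names of filestore queues filled beyond threshold * max."""
    fs = dump["filestore"]
    hits = []
    for name in ("op_queue", "journal_queue"):
        depth = fs[name + "_ops"]
        cap = fs[name + "_max_ops"]
        if cap and depth >= threshold * cap:
            hits.append(name)
    return hits

print(saturated_queues(sample_dump))
```

A queue that sits near its configured maximum during the hang (here, an op queue at 1980 of 2000) would support the overly-big-buffer theory; an empty one would point back at the kernel client.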
Re: [ceph-users] IO Hang on rbd
No, there is nothing about hangs in dmesg.

Here are the software versions:

root@ceph-esx-conv03-001:~# uname -a
Linux ceph-esx-conv03-001 3.17.0-ceph #1 SMP Sun Oct 5 19:47:51 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
root@ceph-esx-conv03-001:~# ceph --version
ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)

15.12.2014, 16:20, Ilya Dryomov ilya.dryo...@inktank.com:
> Is there anything in dmesg around the time it hangs? If possible,
> don't change anything about your config - the number of OSDs, number
> of PGs, pools, etc. - so that you can reproduce it with logging
> enabled.
> [rest of quoted message trimmed]
Re: [ceph-users] IO Hang on rbd
On Mon, Dec 15, 2014 at 7:05 PM, reistlin87 79026480...@yandex.ru wrote:
> No, there is nothing about hangs in dmesg.

Not necessarily about hangs - any "socket closed" messages? Can you pastebin the entire kernel log for me?

> Here are the software versions:
>
> root@ceph-esx-conv03-001:~# uname -a
> Linux ceph-esx-conv03-001 3.17.0-ceph #1 SMP Sun Oct 5 19:47:51 UTC
> 2014 x86_64 x86_64 x86_64 GNU/Linux

Which kernel are you running on the client box? 3.17 or 3.18? If 3.17, can you try 3.18?

Thanks,

                Ilya