[ceph-users] IO Hang on rbd

2014-12-15 Thread reistlin87
Hi all!

We have an annoying problem: when we launch intensive reads against rbd, the
client to which the image is mapped hangs in this state:

Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda        0.00    0.00  0.00  1.20   0.00   0.00      8.00      0.00   0.00     0.00     0.00   0.00   0.00
dm-0       0.00    0.00  0.00  1.20   0.00   0.00      8.00      0.00   0.00     0.00     0.00   0.00   0.00
dm-1       0.00    0.00  0.00  0.00   0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
rbd0       0.00    0.00  0.00  0.00   0.00   0.00      0.00     32.00   0.00     0.00     0.00   0.00 100.00
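The signature above - 100% utilization and a pinned queue with zero completing IOPS - can be spotted programmatically. A sketch in Python, assuming `iostat -x` columns in the order shown (the sample lines and parsing indices are assumptions):

```python
# Flag block devices that look hung in `iostat -x` output: a non-empty queue
# (avgqu-sz > 0) at ~100% utilization while no reads or writes complete.
SAMPLE = """\
sda   0.00 0.00 0.00 1.20 0.00 0.00 8.00  0.00 0.00 0.00 0.00 0.00   0.00
rbd0  0.00 0.00 0.00 0.00 0.00 0.00 0.00 32.00 0.00 0.00 0.00 0.00 100.00"""

def hung_devices(iostat_lines):
    hung = []
    for line in iostat_lines:
        f = line.split()
        # Columns: dev rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz
        #          await r_await w_await svctm %util
        dev, rps, wps = f[0], float(f[3]), float(f[4])
        avgqu, util = float(f[8]), float(f[13])
        if rps + wps == 0 and avgqu > 0 and util >= 99.0:
            hung.append(dev)
    return hung

print(hung_devices(SAMPLE.splitlines()))  # -> ['rbd0']
```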

Only a reboot helps. The logs are clean.

The fastest way to trigger the hang is to run an fio read with a 512K block
size; 4K usually works fine. But the client may hang without fio, simply under
heavy load.
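The reproduction described above can be captured in an fio job file. A sketch under assumptions (the device path /dev/rbd0, iodepth, and runtime are not given in the thread):

```ini
; read-hang.fio - hypothetical job approximating the reproduction above
[readtest]
filename=/dev/rbd0
rw=read
bs=512k
ioengine=libaio
iodepth=32
direct=1
runtime=60
time_based=1
```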

We have used different versions of the Linux kernel and Ceph; currently the
OSDs and MONs run Ceph 0.87-1 on Linux kernel 3.18. On the clients we have
tried the latest builds from http://gitbuilder.ceph.com/, for example Ceph
0.87-68. Through libvirt everything works fine; we also use KVM and stgt
(but stgt is slow).

Here is my config:
[global]
fsid = 566d9cab-793e-47e0-a0cd-e5da09f8037a
mon_initial_members = srt-mon-001-02,amz-mon-001-000601,db24-mon-001-000105
mon_host = 10.201.20.31,10.203.20.56,10.202.20.58
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public network = 10.201.20.0/22
cluster network = 10.212.36.0/22
osd crush update on start = false
[mon]
debug mon = 0
debug paxos = 0/0
debug auth = 0

[mon.srt-mon-001-02]
host = srt-mon-001-02
mon addr = 10.201.20.31:6789
[mon.db24-mon-001-000105]
host = db24-mon-001-000105
mon addr = 10.202.20.58:6789
[mon.amz-mon-001-000601]
host = amz-mon-001-000601
mon addr = 10.203.20.56:6789
[osd]
osd crush update on start = false
osd mount options xfs = rw,noatime,inode64,allocsize=4M
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
osd op threads = 20
osd disk threads = 8
journal block align = true
journal dio = true
journal aio = true
osd recovery max active = 1
filestore max sync interval = 100
filestore min sync interval = 10
filestore queue max ops = 2000
filestore queue max bytes = 536870912
filestore queue committing max ops = 2000
filestore queue committing max bytes = 536870912
osd max backfills = 1
osd client op priority = 63
[osd.5]
host = srt-osd-001-050204
[osd.6]
host = srt-osd-001-050204
[osd.7]
host = srt-osd-001-050204
[osd.8]
host = srt-osd-001-050204
[osd.109]
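For reference, filestore queue max bytes = 536870912 is 512 MiB, and with filestore max sync interval = 100 a single FileStore sync can flush a large burst at once. A back-of-the-envelope sketch in Python (the sustained write rate is a hypothetical assumption):

```python
# Rough worst-case estimate of dirty data accumulated between FileStore syncs,
# using the values from the ceph.conf above.
max_sync_interval_s = 100              # filestore max sync interval
queue_cap_bytes = 536870912            # filestore queue max bytes
assumed_write_rate = 20 * 1024 * 1024  # 20 MiB/s per OSD (hypothetical)

accumulated = max_sync_interval_s * assumed_write_rate
flush_burst = min(accumulated, queue_cap_bytes)  # burst is bounded by the cap
print(queue_cap_bytes // (1024 * 1024), "MiB queue cap")
print(flush_burst // (1024 * 1024), "MiB worst-case flush burst")
```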

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] IO Hang on rbd

2014-12-15 Thread Tomasz Kuzemko
Try lowering filestore max sync interval and filestore min sync
interval. It looks like during the hanged period data is flushed from
some overly big buffer.

If this does not help you can monitor perf stats on OSDs to see if some
queue is unusually large.
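The perf-stats check suggested above can be scripted against the JSON from `ceph daemon osd.N perf dump`. A sketch - the counter names and threshold are assumptions based on FileStore-era counters, and the sample values are made up:

```python
import json

# Hypothetical excerpt of `ceph daemon osd.5 perf dump` output.
sample = json.loads("""{
  "filestore": {"journal_queue_ops": 1980, "op_queue_ops": 12},
  "osd": {"op_wip": 3}
}""")

def large_queues(perf, threshold=1000):
    """Return (section, counter, value) for queue-like counters over threshold."""
    hits = []
    for section, counters in perf.items():
        for name, value in counters.items():
            if "queue" in name and isinstance(value, (int, float)) and value > threshold:
                hits.append((section, name, value))
    return hits

print(large_queues(sample))  # -> [('filestore', 'journal_queue_ops', 1980)]
```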

-- 
Tomasz Kuzemko
tomasz.kuze...@ovh.net

On Thu, Dec 11, 2014 at 07:57:48PM +0300, reistlin87 wrote:
 Hi all!
 
 We have an annoying problem: when we launch intensive reads against rbd, the
 client to which the image is mapped hangs in this state:

 [iostat output trimmed; see the original message above]
 
 Only a reboot helps. The logs are clean.
 
 The fastest way to trigger the hang is to run an fio read with a 512K block
 size; 4K usually works fine. But the client may hang without fio, simply
 under heavy load.

 We have used different versions of the Linux kernel and Ceph; currently the
 OSDs and MONs run Ceph 0.87-1 on Linux kernel 3.18. On the clients we have
 tried the latest builds from http://gitbuilder.ceph.com/, for example Ceph
 0.87-68. Through libvirt everything works fine; we also use KVM and stgt
 (but stgt is slow).
 
 Here is my config:
 [ceph.conf trimmed; see the original message above]
 




Re: [ceph-users] IO Hang on rbd

2014-12-15 Thread Ilya Dryomov
On Thu, Dec 11, 2014 at 7:57 PM, reistlin87 79026480...@yandex.ru wrote:
 Hi all!

 We have an annoying problem: when we launch intensive reads against rbd, the
 client to which the image is mapped hangs in this state:

 [iostat output trimmed; see the original message above]

 Only a reboot helps. The logs are clean.

 The fastest way to trigger the hang is to run an fio read with a 512K block
 size; 4K usually works fine. But the client may hang without fio, simply
 under heavy load.

 We have used different versions of the Linux kernel and Ceph; currently the
 OSDs and MONs run Ceph 0.87-1 on Linux kernel 3.18. On the clients we have
 tried the latest builds from http://gitbuilder.ceph.com/, for example Ceph
 0.87-68. Through libvirt everything works fine; we also use KVM and stgt
 (but stgt is slow).

Is there anything in dmesg around the time it hangs?

If possible, don't change anything about your config - number of OSDs,
number of PGs, pools, etc. - so you can reproduce with logging enabled.

Thanks,

Ilya


Re: [ceph-users] IO Hang on rbd

2014-12-15 Thread Ilya Dryomov
On Mon, Dec 15, 2014 at 4:11 PM, Tomasz Kuzemko tomasz.kuze...@ovh.net wrote:
 Try lowering filestore max sync interval and filestore min sync
 interval. It looks like during the hanged period data is flushed from
 some overly big buffer.

 If this does not help you can monitor perf stats on OSDs to see if some
 queue is unusually large.

This must be a kernel client issue.  OP, please don't change any
settings - I need it to reproduce so I can gather more info.

Thanks,

Ilya


Re: [ceph-users] IO Hang on rbd

2014-12-15 Thread reistlin87
We tried the default configuration without additional parameters, but it
still hangs. How can we see the OSD queue?

15.12.2014, 16:11, Tomasz Kuzemko tomasz.kuze...@ovh.net:
 Try lowering filestore max sync interval and filestore min sync
 interval. It looks like during the hanged period data is flushed from
 some overly big buffer.

 If this does not help you can monitor perf stats on OSDs to see if some
 queue is unusually large.

 --
 Tomasz Kuzemko
 tomasz.kuze...@ovh.net

 On Thu, Dec 11, 2014 at 07:57:48PM +0300, reistlin87 wrote:
  Hi all!

  We have an annoying problem: when we launch intensive reads against rbd,
  the client to which the image is mapped hangs in this state:

  [iostat output and ceph.conf trimmed; see the original message above]


Re: [ceph-users] IO Hang on rbd

2014-12-15 Thread reistlin87
No, there is nothing in dmesg about hangs.
Here is the versions of software:
root@ceph-esx-conv03-001:~# uname -a
Linux ceph-esx-conv03-001 3.17.0-ceph #1 SMP Sun Oct 5 19:47:51 UTC 2014 x86_64 
x86_64 x86_64 GNU/Linux
root@ceph-esx-conv03-001:~# ceph --version
ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)


15.12.2014, 16:20, Ilya Dryomov ilya.dryo...@inktank.com:
 On Thu, Dec 11, 2014 at 7:57 PM, reistlin87 79026480...@yandex.ru wrote:
  Hi all!

  We have an annoying problem: when we launch intensive reads against rbd,
  the client to which the image is mapped hangs in this state:

  [iostat output trimmed; see the original message above]

 Is there anything in dmesg around the time it hangs?

 If possible, don't change anything about your config - number of osds,
 number of pgs, pools, etc so you can reproduce with logging enabled.

 Thanks,

 Ilya


Re: [ceph-users] IO Hang on rbd

2014-12-15 Thread Ilya Dryomov
On Mon, Dec 15, 2014 at 7:05 PM, reistlin87 79026480...@yandex.ru wrote:
 No, in dmesg is nothing about hangs

Not necessarily about hangs. Any "socket closed" messages? Can you
pastebin the entire kernel log for me?
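The client-side messages being asked about can be pulled out of the kernel log before pastebinning. A sketch in Python over dmesg text (the sample log lines, timestamps, and addresses are hypothetical):

```python
import re

# Hypothetical dmesg excerpt; the libceph "socket closed" line is the kind
# of message being asked about.
DMESG = """\
[ 1200.1] libceph: osd3 10.201.20.41:6806 socket closed (con state OPEN)
[ 1200.2] EXT4-fs (dm-0): mounted filesystem
[ 1201.0] rbd: rbd0: encountered watch error"""

def ceph_lines(log):
    """Keep only lines emitted by the libceph/rbd kernel modules."""
    return [l for l in log.splitlines() if re.search(r"\b(libceph|rbd)\b", l)]

for line in ceph_lines(DMESG):
    print(line)
```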

 Here is the versions of software:
 root@ceph-esx-conv03-001:~# uname -a
 Linux ceph-esx-conv03-001 3.17.0-ceph #1 SMP Sun Oct 5 19:47:51 UTC 2014 
 x86_64 x86_64 x86_64 GNU/Linux

Which kernel are you running on the client box?  3.17 or 3.18?
If 3.17, can you try 3.18?

Thanks,

Ilya