When those processes become blocked, are the drives busy or idle? Can you post the output of "ps -awexo pid,tt,user,fname,tmout,f,wchan" for those processes when that happens?
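For example, to catch just the D-state daemons (a minimal sketch; adjust the filter to your process names):

    # List uninterruptible (D-state) processes and the kernel
    # function each one is sleeping in (WCHAN)
    ps -eo pid,stat,user,comm,wchan | awk '$2 ~ /^D/'

The wchan column usually tells you whether they are stuck in the block layer, the filesystem, or memory reclaim.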
My guess would be that they really are waiting on the disk array for some reason. Can you check whether you can still read from and write to the OSD partitions when this happens? iostat output would help too.

I'm not sure what your HBA is, but very bad things can happen when you saturate the drives and the controller cache completely, like the LUN queue depth being lowered to 1 (which completely kills IO) or commands even being dropped (though that would likely show up in dmesg). The driver often masks this situation completely. And Ceph is really good at saturating drives.
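Something like the following would cover those checks; /dev/sdb and /dev/sdc below are just placeholders for your OSD LUNs:

    # Per-device utilization and queue size, 5-second samples
    iostat -xd /dev/sdb /dev/sdc 5

    # Can the device still service a small direct (uncached) read?
    dd if=/dev/sdb of=/dev/null bs=4k count=1 iflag=direct

    # Current SCSI queue depth per LUN
    grep . /sys/block/sd*/device/queue_depth

If %util sits at 100 with a large avgqu-sz while the dd hangs, the array itself is the bottleneck; if queue_depth has dropped to 1, the HBA driver has throttled the LUN.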
Also, how much memory does the machine have? vm.min_free_kbytes = 2640322 looks pretty high to me (roughly 2.5 GB held free), and it could block anything at any time if kswapd kicks in and starts reclaiming pages. (But then the whole system would be unusable, so you'd likely notice that.)
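A rough way to check, assuming a stock /proc layout: compare free memory against the watermark and watch the kswapd scan counters while an OSD is stuck, e.g.

    # How close is free memory to the reclaim watermark?
    grep MemFree /proc/meminfo
    cat /proc/sys/vm/min_free_kbytes

    # Are the kswapd scan/steal counters climbing?
    grep -E 'pgscan_kswapd|pgsteal' /proc/vmstat

If the pgscan_kswapd_* counters increase noticeably between two samples taken while an OSD hangs, reclaim is at least part of the picture.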
Jan

> On 27 Jul 2015, at 13:24, Simion Rad <[email protected]> wrote:
>
> Hello all,
>
> When I try to add more than one OSD to a host and the backfilling process
> starts, all the osd daemons except one of them become stuck in D state. When
> this happens they are shown as out and down (when running ceph osd tree).
>
> The only way I can kill the processes is to remove the osds from the
> crushmap, then run kill -9 on them and then wait for a couple of minutes.
> There are no exception messages in the osd logs and dmesg looks ok too
> (nothing out of the ordinary).
> I run ceph firefly 0.80.10 on Ubuntu 14.04 (linux 3.13).
> The osds are running on RAID0 LUNs (2 drives for every diskgroup) created on
> a Dell MD3000 array with Hitachi hard drives (450 GB, 15K RPM).
> The issue happens even with 2 or 3 osds active on the host.
> I have only a 1 Gb/s link to the host. Could the network bandwidth be the
> issue?
>
> The settings from sysctl.conf:
>
> net.core.netdev_max_backlog = 250000
> net.core.optmem_max = 16777216
> net.core.rmem_default = 16777216
> net.core.wmem_default = 16777216
> net.core.rmem_max = 16777216
> net.core.wmem_max = 16777216
> net.ipv4.tcp_mem = 16777216 16777216 16777216
> net.ipv4.tcp_rmem = 4096 87380 16777216
> net.ipv4.tcp_wmem = 4096 87380 16777216
> net.ipv4.tcp_low_latency = 1
> net.ipv4.tcp_sack = 0
> net.ipv4.tcp_timestamps = 0
> net.ipv4.conf.default.rp_filter = 0
> net.ipv4.conf.all.rp_filter = 0
> net.ipv4.ip_forward = 1
> net.ipv4.tcp_tw_recycle = 0
> net.ipv4.tcp_tw_reuse = 0
> net.ipv4.tcp_window_scaling = 0
> net.ipv4.route.flush = 1
> vm.min_free_kbytes = 2640322
> vm.swappiness = 0
> vm.overcommit_memory = 1
> vm.oom_kill_allocating_task = 0
> vm.dirty_expire_centisecs = 360000
> vm.dirty_writeback_centisecs = 360000
> kernel.pid_max = 4194303
> fs.file-max = 16815744
> vm.dirty_ratio = 99
> vm.dirty_background_ratio = 99
> vm.vfs_cache_pressure = 100
>
> Thanks,
> Simion Rad.

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com