Hi Jan,

Can you get perf top running? It should show you where the OSDs are spinning...
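
On the fd suspicion in the quoted message below: here is a rough sketch, not a Ceph tool, for watching how many fds each OSD process holds compared with the configured fd cache size. It assumes Linux /proc, a process name of "ceph-osd", and the 16384 value from the message. The helper names are made up for illustration. Note that the count includes sockets and log files, not only cached object files, so treat it as an approximation.

```python
import os

# Assumption: the "fd cache size" value quoted in the message; adjust as needed.
FD_CACHE_SIZE = 16384


def find_pids(name="ceph-osd", proc_root="/proc"):
    """Return sorted PIDs whose /proc/<pid>/comm matches name."""
    if not os.path.isdir(proc_root):
        return []
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            # Process exited or is not readable; skip it.
            continue
    return sorted(pids)


def count_open_fds(pid, proc_root="/proc"):
    """Return the number of entries in /proc/<pid>/fd, or None if unreadable."""
    try:
        return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))
    except OSError:
        return None


def report(name="ceph-osd", limit=FD_CACHE_SIZE, proc_root="/proc"):
    """Return (pid, open_fds, over_limit) for every matching process."""
    rows = []
    for pid in find_pids(name, proc_root):
        n = count_open_fds(pid, proc_root)
        if n is not None:
            rows.append((pid, n, n > limit))
    return rows


if __name__ == "__main__":
    # Reading another user's /proc/<pid>/fd usually requires root.
    for pid, n, over in report():
        flag = "OVER" if over else "ok"
        print(f"pid={pid} open_fds={n} limit={FD_CACHE_SIZE} {flag}")
```

Running this periodically and lining the counts up against per-OSD CPU usage should show whether the busy OSDs are the ones above the limit.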

Cheers, Dan

On Thu, Jun 11, 2015 at 11:21 AM, Jan Schermer <j...@schermer.cz> wrote:
> Hi,
> hoping someone can point me in the right direction.
>
> Some of my OSDs have a larger CPU usage (and ops latencies) than others. If I 
> restart the OSD everything runs nicely for some time, then it creeps up.
>
> 1) most of my OSDs have ~40% CPU (core) usage (user+sys), some are closer to 
> 80%. Restarting means the offending OSDs only use 40% again.
> 2) average latencies and CPU usage are the same across hosts - so it’s not 
> caused by the host that the OSD is running on
> 3) I can’t say exactly when or how the issue happens. I can’t even say if 
> it’s the same OSDs. It seems it either happens when something heavy happens 
> in a cluster (like dropping very old snapshots, rebalancing) and then doesn’t 
> come back, or maybe it happens slowly over time and I can’t find it in the 
> graphs. Looking at the graphs it seems to be the former.
>
> I have just one suspicion and that is the “fd cache size” - we have it set to 
> 16384 but the open fds suggest there are more open files for the osd process 
> (over 17K fds) - it varies by some hundreds between the osds. Maybe some are 
> just slightly over the limit and the misses cause this? Restarting the OSD 
> clears them (~2K) and they increase over time. I increased it to 32768 
> yesterday and it has been consistently fine since, but the problem might take 
> another few days to manifest…
> Could this explain it? Any other tips?
>
> Thanks
>
> Jan
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
