Hi folks,

I'm currently chewing on an issue regarding "slow requests are blocked". I'd 
like to identify the OSD that caused those events
after the cluster is back to HEALTH_OK (as I have no monitoring yet that would 
capture this info in real time).

Collecting this information could help identify aging disks, provided you can 
accumulate and analyze which OSDs had blocked
requests in the past and how often those events occur.

My research so far leads me to think that this information is only available as 
long as the requests are actually blocked. Is this correct?

MON logs only show that those events occur and how many requests are 
blocked, but give no indication of which OSD is
affected. Is there a way to identify blocking requests from the OSD log files?

On a side note: I was trying to write a small Python script that would extract 
this kind of information in real time. While I
was able to register a MonitorLog callback that receives the same messages 
as you would get with "ceph -w", I haven't seen in
the librados Python bindings documentation a way to do the equivalent 
of "ceph health detail". Any suggestions on how to
get the blocking OSDs via librados?
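
For reference, here is a minimal sketch of what I am after. It assumes that 
Rados.mon_command() accepts the JSON form of "health detail", and that the 
detail lines name the affected daemon as "osd.N" on lines mentioning "blocked" 
or "slow". Both are assumptions on my side, the message wording may well differ 
between releases, and the helper names and the conffile path are made up for 
illustration.

```python
import json
import re

# Assumption: affected daemons appear as "osd.<id>" in the health detail text.
OSD_RE = re.compile(r"\bosd\.(\d+)\b")


def blocked_osds(health_detail_text):
    """Return sorted OSD ids named on lines that mention blocked or slow ops."""
    ids = set()
    for line in health_detail_text.splitlines():
        if "blocked" in line or "slow" in line:
            ids.update(int(m) for m in OSD_RE.findall(line))
    return sorted(ids)


def fetch_health_detail(conffile="/etc/ceph/ceph.conf"):
    """Ask the monitors for the plain-text equivalent of 'ceph health detail'."""
    import rados  # imported lazily so blocked_osds() works without librados

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        # Assumption: this JSON mirrors what the ceph CLI sends for
        # "ceph health detail".
        cmd = json.dumps({"prefix": "health", "detail": "detail"})
        ret, outbuf, outs = cluster.mon_command(cmd, b"")
        if ret != 0:
            raise RuntimeError("health detail failed: %s" % outs)
        return outbuf.decode("utf-8")
    finally:
        cluster.shutdown()
```

The idea would be to call blocked_osds(fetch_health_detail()) periodically, or 
from the MonitorLog callback whenever a slow request warning arrives, and to 
store the result with a timestamp for later analysis.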


