Hi Yehuda,
Recently, on our pre-production clusters (with radosgw), we had an outage in
which all radosgw worker threads got stuck and every client request resulted
in a 500, because no worker thread was left to handle it.

What we observed on the cluster is that one PG was stuck in the *peering*
state. As a result, every request hitting that PG occupied a worker thread
indefinitely, and that gradually tied up all the workers.
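
To give a feel for "gradually", here is a back-of-the-envelope sketch. It
assumes each request touches exactly one PG, chosen uniformly at random, and
that a request hitting the stuck PG holds its worker forever; both are
simplifications. The worker count (100) and the request rate (500 req/s) are
hypothetical numbers picked for illustration, not measurements from our
cluster. Only the PG count (8192) is real.

```python
def requests_until_exhausted(num_pgs, stuck_pgs, workers):
    """Expected number of requests before every worker is stuck.

    Each request hits a stuck PG with probability p = stuck_pgs / num_pgs,
    so collecting `workers` such hits takes workers / p requests on average
    (mean of a negative binomial distribution).
    """
    p_stuck = stuck_pgs / num_pgs
    return workers / p_stuck


if __name__ == "__main__":
    NUM_PGS = 8192          # from the cluster described in this mail
    WORKERS = 100           # hypothetical radosgw worker thread count
    REQS_PER_SECOND = 500   # hypothetical request rate

    n = requests_until_exhausted(NUM_PGS, 1, WORKERS)
    print("expected requests until all workers are stuck:", n)
    print("expected seconds at the assumed rate:", n / REQS_PER_SECOND)
```

Under these assumptions, a single stuck PG exhausts the pool after a few
hundred thousand requests, which is consistent with a slow slide into a full
outage rather than an immediate failure.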

The reason the PG got stuck in peering is still under investigation, but on
the radosgw side, I am wondering if we can do anything to improve this case
(to be more specific, a problem with 1 out of 8192 PGs cascading into a
service outage across the entire cluster):

1. The first approach I can think of is to add a timeout at the objecter
layer for each op sent to an OSD. I think the complexity comes with writes,
that is, how do we ensure integrity if we abort at the objecter layer? But for
immutable ops, I think we certainly can do this, since at an upper layer we
already reply to the client with an error.
2. Shard the thread pool/work queue at radosgw, so that a partial failure
would (hopefully) impact only some of the worker threads and cause only a
partial outage.
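
To make the two ideas concrete, here is a minimal sketch in Python (not
radosgw code). Every name in it (ShardedPool, osd_op, handle_read, the shard
count, the timeout value) is hypothetical and chosen for illustration.
Requests are routed to a shard by PG id, each shard has its own small worker
pool, and the caller waits for a read with a timeout. Note that the timeout
here only unblocks the caller; the worker stays stuck, which is why the
sharding is what keeps the damage contained.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

NUM_SHARDS = 4          # hypothetical shard count
WORKERS_PER_SHARD = 2   # hypothetical workers per shard
READ_TIMEOUT_S = 0.2    # hypothetical per-read timeout


def osd_op(pg_id, stuck_pg, released):
    """Stand-in for an op sent to an OSD; blocks while its PG is 'peering'."""
    if pg_id == stuck_pg:
        released.wait()  # simulates waiting for the PG to finish peering
    return "ok"


class ShardedPool:
    """One independent worker pool per shard; a PG always maps to one shard."""

    def __init__(self, num_shards, workers_per_shard):
        self.shards = [ThreadPoolExecutor(max_workers=workers_per_shard)
                       for _ in range(num_shards)]

    def submit(self, pg_id, fn, *args):
        return self.shards[pg_id % len(self.shards)].submit(fn, *args)

    def shutdown(self):
        for shard in self.shards:
            shard.shutdown(wait=True)


def handle_read(pool, pg_id, stuck_pg, released):
    """Return an HTTP-like status: 200 on success, 500 if the op times out."""
    future = pool.submit(pg_id, osd_op, pg_id, stuck_pg, released)
    try:
        future.result(timeout=READ_TIMEOUT_S)
        return 200
    except FutureTimeout:
        return 500


def demo():
    stuck_pg = 5                    # maps to shard 1 (5 % 4)
    released = threading.Event()
    pool = ShardedPool(NUM_SHARDS, WORKERS_PER_SHARD)
    try:
        # PG 5 twice fills shard 1; PG 9 shares shard 1; PGs 2 and 7 do not.
        return [handle_read(pool, pg, stuck_pg, released)
                for pg in (5, 5, 9, 2, 7)]
    finally:
        released.set()              # let the stuck workers finish
        pool.shutdown()


if __name__ == "__main__":
    print(demo())
```

In this sketch, requests that share a shard with the stuck PG still fail once
that shard's workers are exhausted, but requests mapped to other shards keep
succeeding, which is the partial outage described in point 2.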

What do you think?

Thanks,
Guang