> =================Faster Peering/Lower Tail Latency====================
>
> https://wiki.ceph.com/Planning/Blueprints/Infernalis/osd%3A_Faster_Peering
>
> https://wiki.ceph.com/Planning/Blueprints/Infernalis/Improve_tail_latency
>
> http://pad.ceph.com/p/I-faster-peering_tailing
>
> In addition to what is in the blueprint, Sage suggested that the primary
> in some cases can keep the peer_info and peer_missing sets which it
> already has if the acting set stays the same or shrinks.
>
> We also touched on prepopulating pg_temp at the monitor, and having the
> monitor set a different temp pg primary in the map that marks an OSD
> back up, so that the OSD does not become primary immediately (and have
> to block reads and writes on recovery).
>
Hi Sam,
In our experience, peering is most painful when an OSD stays down (but still 
in) for a while and then comes back up, for example after the OSD crashed, or 
an OSD host crashed without notice (or the hardware takes time to repair). 
When the OSD comes back up, it needs to populate the PG::recovery_map: say 
there are N objects missing and M replicas; currently the complexity of the 
search for missing objects is N*M*logN. When N is large (the OSD was down for 
a while), M is large (EC pool), and many PGs are going through this process at 
once, the cost is non-trivial. Tracker #9558 has logs with more details. A 
simple optimization I have in mind is to detect the case where only one 
replica (in the actingbackfill set) has missing objects and all the others are 
complete; we can then populate the recovery_map directly by marking the 
(M - 1) replicas that have no missing objects as recovery sources, which 
improves the complexity to N*logN.
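To make the idea concrete, here is a minimal Python sketch (not Ceph's actual 
implementation; the function names and data shapes are hypothetical). The 
naive path probes every replica's sorted missing list for every missing 
object (N*M*logN); the shortcut detects the single-incomplete-replica case 
and names all other replicas as sources without any per-replica probing:

```python
from bisect import bisect_left

def _has(sorted_missing, obj):
    """log N membership test in a sorted missing list."""
    i = bisect_left(sorted_missing, obj)
    return i < len(sorted_missing) and sorted_missing[i] == obj

def find_recovery_sources(missing, replica_missing):
    """Naive search: for each of N missing objects, probe each of the
    M replicas' sorted missing lists -- O(N * M * log N)."""
    sources = {}
    for obj in missing:
        sources[obj] = [r for r, rmiss in replica_missing.items()
                        if not _has(rmiss, obj)]  # replica has the object
    return sources

def find_recovery_sources_fast(missing, replica_missing):
    """Proposed shortcut: if exactly one replica has any missing objects,
    every other replica is a valid source for every object, so we skip
    the per-object, per-replica probing -- roughly O(N log N)."""
    incomplete = [r for r, m in replica_missing.items() if m]
    if len(incomplete) == 1:
        complete = [r for r in replica_missing if r not in incomplete]
        return {obj: list(complete) for obj in missing}
    # fall back to the full search in the general case
    return find_recovery_sources(missing, replica_missing)
```

For example, with one freshly-recovered OSD missing objects "a" and "b" and 
two complete peers, both paths agree that the two complete peers are the 
recovery sources for every object, but the fast path never touches the 
missing lists.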

Does that make sense? If it does, I will go ahead and provide a patch.

Thanks,
Guang
