Re: Ceph backfilling explained ( maybe )

Loic Dachary Sat, 25 May 2013 07:27:26 -0700


On 05/25/2013 02:33 PM, Leen Besselink wrote:
Hi Leen,


> - a Ceph object can store keys/values, not just data

I did not know that. Could you explain or give me the URL ?

> - when using RBD the RBD client will create a 'directory' object which
>   contains general information like the version/type of RBD-image and a
>   list of names of the image parts. Each part is the same size, I think
>   it was 4MB ?

That's also my understanding : 4MB is the default.

> - when an OSD or client connects to an OSD they also communicate
>   information about at least the osdmap and monmap.
> - when one OSD or monitor can't reach another mon or OSD, they will use a
>   gossip protocol to communicate that to connected clients, OSDs or mons.
> - when a new OSD comes online the other OSDs talk to it to know what data
>   they might need to exchange, this is called peering.
> - the RADOS-algorithm works similar to consistent hashing, so a client can
>   talk directly to the OSD where the data is or should be stored.
> - backfilling is what a master OSD does when it is checking if the other
>   OSDs that should have a copy actually have a copy. It will send a copy
>   of missing objects.

I guess that's the part I'm still unsure about. I should look into the state 
machine in PG.{h,cc} to figure out how backfill-related messages are 
exchanged.

Thanks for taking the time to explain :-)

Cheers

> How the RADOS-algorithm calculates, based on the osdmap and pgmap, what pg
> and master-osd an object belongs to I'm not 100% sure.
> 
> Does that help ?
> 
>> Cheers
>>
>> Ceph stores objects in pools which are divided in placement groups.
>>
>>    +---------------------------- pool a ----+
>>    |+----- placement group 1 -------------+ |
>>    ||+-------+  +-------+                 | |
>>    |||object |  |object |                 | |
>>    ||+-------+  +-------+                 | |
>>    |+-------------------------------------+ |
>>    |+----- placement group 2 -------------+ |
>>    ||+-------+  +-------+                 | |
>>    |||object |  |object |   ...           | |
>>    ||+-------+  +-------+                 | |
>>    |+-------------------------------------+ |
>>    |               ....                     |
>>    |                                        |
>>    +----------------------------------------+
>>
>>    +---------------------------- pool b ----+
>>    |+----- placement group 1 -------------+ |
>>    ||+-------+  +-------+                 | |
>>    |||object |  |object |                 | |
>>    ||+-------+  +-------+                 | |
>>    |+-------------------------------------+ |
>>    |+----- placement group 2 -------------+ |
>>    ||+-------+  +-------+                 | |
>>    |||object |  |object |   ...           | |
>>    ||+-------+  +-------+                 | |
>>    |+-------------------------------------+ |
>>    |               ....                     |
>>    |                                        |
>>    +----------------------------------------+
>>
>>    ...
>>
>> The placement group is supported by OSDs to store the objects. They are 
>> daemons running on machines where storage is available. For instance, a 
>> placement group supporting three replicas has three OSDs at its disposal : 
>> one OSD is the primary and the other two store copies of each object.
>>
>>        +-------- placement group -------------+
>>        |+----------------+ +----------------+ |
>>        || object A       | | object B       | |
>>        |+----------------+ +----------------+ |
>>        +---+-------------+-----------+--------+
>>            |             |           |
>>            |             |           |
>>          OSD 0         OSD 1       OSD 2
>>         +------+      +------+    +------+
>>         |+---+ |      |+---+ |    |+---+ |
>>         || A | |      || A | |    || A | |
>>         |+---+ |      |+---+ |    |+---+ |
>>         |+---+ |      |+---+ |    |+---+ |
>>         || B | |      || B | |    || B | |
>>         |+---+ |      |+---+ |    |+---+ |
>>         +------+      +------+    +------+
>>
>> The OSDs are not for the exclusive use of the placement group : multiple 
>> placement groups can use the same OSDs to store their objects. However, the 
>> collocation of objects from various placement groups in the same OSD is 
>> transparent and is not discussed here.
>>
>> The placement group does not run as a single daemon as suggested above. 
>> Instead it is distributed and resides within each OSD. Whenever an OSD dies, 
>> the placement group for this OSD is gone and needs to be reconstructed using 
>> another OSD.
>>
>>                OSD 0                                           OSD 1 ...
>>         +----------------+---- placement group --------+  +------
>>         |+--- object --+ |+--------------------------+ |  |
>>         || name : B    | ||  pg_log_entry_t MODIFY   | |  |
>>         || key : 2     | ||  pg_log_entry_t DELETE   | |  |
>>         |+-------------+ |+--------------------------+ |  |
>>         |+--- object --+ >------ last_backfill         |  | ....
>>         || name : A    | |                             |  |
>>         || key : 5     | |                             |  |
>>         |+-------------+ |                             |  |
>>         |                |                             |  |
>>         |    ....        |                             |  |
>>         +----------------+-----------------------------+  +-----
>>
>>
>> When an object is deleted or modified in the placement group, it is recorded 
>> in a log to be replayed if needed. In the simplest case, if an OSD gets 
>> disconnected, reconnects and needs to catch up with the other OSDs, copies 
>> of the log entries will be sent to it. However, the logs have a limited size 
>> and it may be more efficient, in some cases, to just copy the objects over 
>> instead of replaying the logs.
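To make the trade-off above concrete, here is a minimal sketch of how the choice between replaying the log and copying objects could be made, assuming each log entry carries an increasing version number. The names BoundedLog, covers and recovery_plan are made up for the illustration and are not the ones used in PG.{h,cc}.

```python
from collections import deque


class BoundedLog:
    """Hypothetical bounded log of (version, op, object_name) entries."""

    def __init__(self, max_entries):
        # A deque with maxlen silently drops the oldest entry when full.
        self.entries = deque(maxlen=max_entries)

    def append(self, version, op, name):
        self.entries.append((version, op, name))

    def entries_after(self, version):
        return [e for e in self.entries if e[0] > version]

    def covers(self, version):
        # The log can bring a peer up to date only if it still holds
        # every entry written after the peer's last known version.
        # Assumption: an empty log means nothing was written.
        if not self.entries:
            return True
        return self.entries[0][0] <= version + 1


def recovery_plan(log, peer_version):
    """Return ("replay", entries) or ("backfill", [])."""
    if log.covers(peer_version):
        return ("replay", log.entries_after(peer_version))
    # Entries the peer needs were already trimmed: copy objects instead.
    return ("backfill", [])
```

With a log limited to three entries and five writes, a peer that stopped at version 2 can still be served from the log, while a peer that stopped at version 1 has to be backfilled.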
>>
>> Each object name is hashed into an integer that can be used to order them. 
>> For instance, the object B above has been hashed to key 2 and the object A 
>> above has been hashed to key 5. The last_backfill pointer of the placement 
>> group draws the limit separating the objects that have already been copied 
>> from other OSDs and those in the process of being copied. The objects that 
>> are lower than last_backfill have been copied (that would be object B 
>> above) and the objects that are greater than last_backfill are going to be 
>> copied.
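The split described above can be sketched as follows, using the keys from the diagram (B hashed to 2, A hashed to 5). The function name is hypothetical, and treating an object whose key equals last_backfill as already copied is my assumption, the text only covers "lower" and "greater".

```python
def split_by_last_backfill(keys, last_backfill):
    """Split object names into (copied, pending) lists.

    keys maps an object name to the integer its name hashes to.
    Assumption: a key equal to last_backfill counts as copied.
    """
    ordered = sorted(keys, key=lambda name: keys[name])
    copied = [n for n in ordered if keys[n] <= last_backfill]
    pending = [n for n in ordered if keys[n] > last_backfill]
    return copied, pending


if __name__ == "__main__":
    copied, pending = split_by_last_backfill({"A": 5, "B": 2}, 2)
    print("copied:", copied)
    print("pending:", pending)
```

Because the objects are ordered by key, a single value is enough to remember how far the copy went.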
>>
>> It may take time for an OSD to catch up and it is useful to allow replaying 
>> the logs while backfilling. Log entries related to objects lower than 
>> last_backfill are applied. However, log entries related to objects greater 
>> than last_backfill are discarded because those objects are scheduled to be 
>> copied at a later time anyway.
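A small sketch of that rule, assuming log entries are (op, name, data) tuples and the local objects live in a dict. The function and its arguments are invented for the example and only mirror the paragraph above, not the actual PG code. As before, a key equal to last_backfill is assumed to be on the copied side.

```python
def replay_during_backfill(entries, keys, last_backfill, store):
    """Apply log entries for already copied objects, discard the rest.

    entries: list of (op, name, data) with op in {"MODIFY", "DELETE"}
    keys: maps an object name to the integer its name hashes to
    store: dict of the objects already present on the backfilling OSD
    Returns the list of discarded entries.
    """
    discarded = []
    for op, name, data in entries:
        if keys[name] > last_backfill:
            # The object will be copied later, with this change included.
            discarded.append((op, name, data))
            continue
        if op == "DELETE":
            store.pop(name, None)
        elif op == "MODIFY":
            store[name] = data
    return discarded
```

With B at key 2, A at key 5 and last_backfill at 2, a MODIFY on B is applied right away while any entry on A is dropped, since A has not been copied yet.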
>>
>>
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre
>> All that is necessary for the triumph of evil is that good people do nothing.
>>
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
