Hi Leen,

On 05/25/2013 02:33 PM, Leen Besselink wrote:
> - a Ceph object can store keys/values, not just data
I did not know that. Could you explain or give me the URL ?
> - when using RBD the RBD client will create a 'directory' object which
> contains general information
> like the version/type of RBD-image and a list of names of the image parts.
> Each part is the same
> size, I think it was 4MB ?
That's also my understanding : 4MB is the default.
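
To make the fixed-size parts concrete, here is a minimal sketch of how a byte
offset in an image would map onto a part, assuming the 4MB default. The names
OBJECT_SIZE and locate are made up for this illustration and are not part of
the RBD client API.

```python
# Toy illustration only: OBJECT_SIZE and locate() are hypothetical names,
# not part of the RBD client. Assumes every part of the image is 4 MiB.
OBJECT_SIZE = 4 * 1024 * 1024

def locate(offset):
    """Return (part index, offset inside that part) for a byte offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, OBJECT_SIZE)

print(locate(0))                 # first byte of the first part
print(locate(OBJECT_SIZE))       # first byte of the second part
print(locate(10 * 1024 * 1024))  # somewhere inside the third part
```

A read or write that crosses a multiple of OBJECT_SIZE would touch more than
one part, so it has to be split at those boundaries.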
> - when an OSD or client connects to an OSD they also communicate information
> about at least the osdmap and monmap.
> - when one OSD or monitor can't reach another mon or OSD, they will use a
> gossip protocol to communicate that to connected clients, OSDs or mons.
> - when a new OSD comes online the other OSDs talk to it to know what data
> they might need to exchange;
> this is called peering.
> - the RADOS-algorithm works similarly to consistent hashing, so a client can
> talk directly to the OSD where the data is or should be stored.
> - backfilling is what a master OSD does when it is checking if the other
> OSDs that should have a copy actually have a copy. It will send a copy of
> missing objects.
I guess that's the area where I'm still unsure how it goes. I should look into
the state machine of PG.{h,cc} to figure out how backfill related messages are
exchanged.
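
The part I am reasonably sure about is the last_backfill rule from my
original message (quoted further down). A minimal sketch of it, as a toy
model: LogEntry and replay are names invented for this example and do not
come from PG.{h,cc}, and treating a key equal to last_backfill as already
copied is my assumption.

```python
# Toy model of the last_backfill rule; none of these names come from PG.{h,cc}.
from collections import namedtuple

# An entry records an operation ("MODIFY" or "DELETE") on an object,
# identified here by the integer its name hashes to.
LogEntry = namedtuple("LogEntry", ["op", "key", "data"])

def replay(store, last_backfill, entries):
    """Apply log entries to `store` (a dict: key -> data) during backfill.

    Entries for keys above last_backfill are discarded, because those
    objects will be copied later anyway. All other entries are applied
    (assumption: a key equal to last_backfill counts as already copied).
    Returns the list of discarded entries.
    """
    discarded = []
    for entry in entries:
        if entry.key > last_backfill:
            discarded.append(entry)
        elif entry.op == "DELETE":
            store.pop(entry.key, None)
        else:  # MODIFY
            store[entry.key] = entry.data
    return discarded

# Object B (key 2) has already been copied, object A (key 5) has not.
store = {2: "B v1"}
log = [LogEntry("MODIFY", 2, "B v2"), LogEntry("DELETE", 5, None)]
skipped = replay(store, last_backfill=3, entries=log)
print(store)    # B was updated in place
print(skipped)  # the entry for A was dropped
```

How the real messages are exchanged between OSDs is exactly what this sketch
leaves out.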
Thanks for taking the time to explain :-)
Cheers
> How the RADOS-algorithm calculates based on the osdmap and pgmap what pg and
> master-osd an object belongs to I'm not 100% sure.
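
The general idea can be sketched without the real algorithm. The following is
a toy only: Ceph's actual hash and its placement computation (CRUSH) are not
reproduced here, and stable_hash, pg_for and osds_for are hypothetical
helpers. Rendezvous hashing stands in for the real placement step.

```python
# Toy sketch only: Ceph's real hash and the CRUSH computation are not
# reproduced here. stable_hash(), pg_for() and osds_for() are hypothetical.
import hashlib

def stable_hash(text):
    """Hash a string to an integer that is the same on every run."""
    return int.from_bytes(hashlib.sha1(text.encode()).digest()[:8], "big")

def pg_for(object_name, pg_num):
    """Map an object name to a placement group number."""
    return stable_hash(object_name) % pg_num

def osds_for(pg, osds, replicas):
    """Pick `replicas` distinct OSDs for a PG; the first one is the primary.

    Each OSD gets a score derived from (pg, osd) and the highest scores win
    (rendezvous hashing, standing in for the real placement algorithm).
    """
    ranked = sorted(osds, key=lambda osd: stable_hash(f"{pg}:{osd}"), reverse=True)
    return ranked[:replicas]

osds = [0, 1, 2, 3, 4]
pg = pg_for("object A", pg_num=8)
print(pg, osds_for(pg, osds, replicas=3))
```

Anyone holding the same list of OSDs computes the same answer, which is what
lets a client talk directly to the right OSD without asking a central lookup
service.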
>
> Does that help ?
>
>> Cheers
>>
>> Ceph stores objects in pools which are divided into placement groups.
>>
>> +---------------------------- pool a ----+
>> |+----- placement group 1 -------------+ |
>> ||+-------+  +-------+                 | |
>> |||object |  |object |                 | |
>> ||+-------+  +-------+                 | |
>> |+-------------------------------------+ |
>> |+----- placement group 2 -------------+ |
>> ||+-------+  +-------+                 | |
>> |||object |  |object |  ...            | |
>> ||+-------+  +-------+                 | |
>> |+-------------------------------------+ |
>> |               ....                     |
>> |                                        |
>> +----------------------------------------+
>>
>> +---------------------------- pool b ----+
>> |+----- placement group 1 -------------+ |
>> ||+-------+  +-------+                 | |
>> |||object |  |object |                 | |
>> ||+-------+  +-------+                 | |
>> |+-------------------------------------+ |
>> |+----- placement group 2 -------------+ |
>> ||+-------+  +-------+                 | |
>> |||object |  |object |  ...            | |
>> ||+-------+  +-------+                 | |
>> |+-------------------------------------+ |
>> |               ....                     |
>> |                                        |
>> +----------------------------------------+
>>
>> ...
>>
>> The placement group is supported by OSDs to store the objects. They are
>> daemons running on machines where storage is available. For instance, a
>> placement group supporting three replicas will have three OSDs at its
>> disposal : one OSD is the primary and the other two store copies of each
>> object.
>>
>> +-------- placement group -------------+
>> |+----------------+ +----------------+ |
>> || object A       | | object B       | |
>> |+----------------+ +----------------+ |
>> +---+-------------+-----------+--------+
>>     |             |           |
>>     |             |           |
>>   OSD 0         OSD 1       OSD 2
>>  +------+      +------+    +------+
>>  |+---+ |      |+---+ |    |+---+ |
>>  || A | |      || A | |    || A | |
>>  |+---+ |      |+---+ |    |+---+ |
>>  |+---+ |      |+---+ |    |+---+ |
>>  || B | |      || B | |    || B | |
>>  |+---+ |      |+---+ |    |+---+ |
>>  +------+      +------+    +------+
>>
>> The OSDs are not for the exclusive use of the placement group : multiple
>> placement groups can use the same OSDs to store their objects. However, the
>> collocation of objects from various placement groups in the same OSD is
>> transparent and is not discussed here.
>>
>> The placement group does not run as a single daemon as suggested above.
>> Instead it is distributed and resides within each OSD. Whenever an OSD dies,
>> the placement group for this OSD is gone and needs to be reconstructed using
>> another OSD.
>>
>>                      OSD 0                        OSD 1 ...
>> +----------------+---- placement group --------+  +------
>> |+--- object --+ |+--------------------------+ |  |
>> || name : B    | || pg_log_entry_t MODIFY    | |  |
>> || key : 2     | || pg_log_entry_t DELETE    | |  |
>> |+-------------+ |+--------------------------+ |  |
>> |+--- object --+ >------ last_backfill         |  | ....
>> || name : A    | |                             |  |
>> || key : 5     | |                             |  |
>> |+-------------+ |                             |  |
>> |                |                             |  |
>> |   ....         |                             |  |
>> +----------------+-----------------------------+  +-----
>>
>>
>> When an object is deleted or modified in the placement group, it is recorded
>> in a log to be replayed if needed. In the simplest case, if an OSD gets
>> disconnected, reconnects and needs to catch up with the other OSDs, copies
>> of the log entries will be sent to it. However, the logs have a limited size
>> and it may be more efficient, in some cases, to just copy the objects over
>> instead of replaying the logs.
>>
>> Each object name is hashed into an integer that can be used to order them.
>> For instance, the object B above has been hashed to key 2 and the object A
>> above has been hashed to key 5. The last_backfill pointer of the placement
>> group draws the limit separating the objects that have already been copied
>> from other OSDs and those in the process of being copied. The objects that
>> are lower than last_backfill have been copied (that would be object B above)
>> and the objects that are greater than last_backfill are going to be copied.
>>
>> It may take time for an OSD to catch up and it is useful to allow replaying
>> the logs while backfilling. Log entries related to objects lower than
>> last_backfill are applied. However, log entries related to objects greater
>> than last_backfill are discarded because those objects are scheduled to be
>> copied at a later time anyway.
>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>> All that is necessary for the triumph of evil is that good people do nothing.
>>
>
>
--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
