On Sat, May 25, 2013 at 04:27:16PM +0200, Loic Dachary wrote:
> 
> 
> On 05/25/2013 02:33 PM, Leen Besselink wrote:
> Hi Leen,
> 
> > - a Ceph object can store keys/values, not just data
> 
> I did not know that. Could you explain or give me the URL ?
> 

Well, I got that impression from some of the earlier talks and from this blog 
post:

http://ceph.com/community/my-first-impressions-of-ceph-as-a-summer-intern/

But I haven't read it in a while.

At this time, though, I only see something like:

http://ceph.com/docs/master/rados/api/librados/?highlight=rados_getxattr#rados_getxattr

That looks like it stores them in filesystem extended attributes.

So maybe an object can be both a piece of data and a key/value store.
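
To make the idea concrete, here is how I picture it (just a toy model I made
up, not the librados API): an object carries a byte payload plus key/value
metadata attached to the same object.

```python
# Toy model of a RADOS object as I picture it: a byte payload plus
# key/value metadata attached to the same object. This is only an
# illustration of the concept, not the librados API.

class RadosObjectModel:
    def __init__(self, name):
        self.name = name
        self.data = b""    # the object's byte payload
        self.xattrs = {}   # small key/value attributes (cf. rados_getxattr)

    def write(self, data):
        self.data = data

    def set_xattr(self, key, value):
        self.xattrs[key] = value

    def get_xattr(self, key):
        return self.xattrs[key]

obj = RadosObjectModel("rb.0.1")
obj.write(b"payload")
obj.set_xattr("version", b"1")
print(obj.get_xattr("version"))  # b'1'
```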

> > - when using RBD the RBD client will create a 'directory' object which
> >   contains general information like the version/type of the RBD image
> >   and a list of the names of the image parts. Each part is the same
> >   size, I think it was 4MB ?
> 
> That's also my understanding : 4MB is the default.
> 
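
If the 4MB default is right, finding which object holds a given byte of an
RBD image is just integer arithmetic. A sketch (the object name format here
is made up for illustration; the real RBD naming scheme differs):

```python
# Sketch: map a byte offset inside an RBD image to (object name, offset
# within that object), assuming the 4MB default object size. The name
# format is made up for illustration; real RBD naming differs.

OBJECT_SIZE = 4 * 1024 * 1024  # 4MB default

def rbd_locate(image_prefix, byte_offset):
    index = byte_offset // OBJECT_SIZE    # which image part
    within = byte_offset % OBJECT_SIZE    # offset inside that part
    name = "%s.%012x" % (image_prefix, index)  # hypothetical name format
    return name, within

print(rbd_locate("rb.0.123", 9 * 1024 * 1024))
# -> ('rb.0.123.000000000002', 1048576): third part, 1MB into it
```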
> > - when an OSD or client connects to an OSD they also communicate
> >   information about at least the osdmap and monmap.
> > - when one OSD or monitor can't reach another mon or OSD, they will use a
> >   gossip protocol to communicate that to connected clients, OSDs or mons.
> > - when a new OSD comes online the other OSDs talk to it to determine what
> >   data they might need to exchange; this is called peering.
> > - the RADOS algorithm works similarly to consistent hashing, so a client
> >   can talk directly to the OSD where the data is or should be stored.
> > - backfilling is what a master OSD does when it checks whether the other
> >   OSDs that should have a copy actually have one. It will send a copy of
> >   any missing objects.
> 
> I guess that's the area where I'm still unsure how it works. I should look 
> into the state machine of PG.{h,cc} to figure out how backfill-related 
> messages are exchanged.
> 

Well, I assume the master of an object knows it is the master.

When it learns that an OSD has left the cluster, it knows it has to create a 
new copy.

When changes happen in the cluster, all placement groups/object locations in 
the pool will have to be re-calculated.

I assumed this might mean an OSD which was master of an object before isn't 
master anymore; the new master is then responsible for the object and the 
number of copies.
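
That re-calculation can be sketched with rendezvous (highest-random-weight)
hashing as a stand-in for CRUSH: each OSD gets a score per PG, the top
scorers form the acting set, and removing an OSD only remaps the PGs that
actually used it.

```python
import hashlib

def pg_to_osds(pg_id, osds, replicas=3):
    # Rendezvous hashing as a stand-in for CRUSH: score each OSD against
    # the PG id; the top `replicas` OSDs hold the PG, the first one being
    # the primary ("master" above).
    def score(osd):
        key = ("%s:%s" % (pg_id, osd)).encode()
        return int(hashlib.sha1(key).hexdigest(), 16)
    return sorted(osds, key=score, reverse=True)[:replicas]

osds = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]
before = pg_to_osds("3.1f", osds)
# the primary dies: recompute with the remaining OSDs
after = pg_to_osds("3.1f", [o for o in osds if o != before[0]])
print(before[0], "->", after[0])  # the next-ranked OSD becomes primary
```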

> Thanks for taking the time to explain :-)
> 

This was just from memory, that is why I left some things unanswered.

My idea was that if I transfer my knowledge to you, you can use it to search 
the source and documentation. :-)

> Cheers
> 
> > I'm not 100% sure how the RADOS algorithm calculates, based on the osdmap 
> > and pgmap, which pg and master OSD an object belongs to.
> > 
> > Does that help ?
> > 
> >> Cheers
> >>
> >> Ceph stores objects in pools which are divided into placement groups.
> >>
> >>    +---------------------------- pool a ----+
> >>    |+----- placement group 1 -------------+ |
> >>    ||+-------+  +-------+                 | |
> >>    |||object |  |object |                 | |
> >>    ||+-------+  +-------+                 | |
> >>    |+-------------------------------------+ |
> >>    |+----- placement group 2 -------------+ |
> >>    ||+-------+  +-------+                 | |
> >>    |||object |  |object |   ...           | |
> >>    ||+-------+  +-------+                 | |
> >>    |+-------------------------------------+ |
> >>    |               ....                     |
> >>    |                                        |
> >>    +----------------------------------------+
> >>
> >>    +---------------------------- pool b ----+
> >>    |+----- placement group 1 -------------+ |
> >>    ||+-------+  +-------+                 | |
> >>    |||object |  |object |                 | |
> >>    ||+-------+  +-------+                 | |
> >>    |+-------------------------------------+ |
> >>    |+----- placement group 2 -------------+ |
> >>    ||+-------+  +-------+                 | |
> >>    |||object |  |object |   ...           | |
> >>    ||+-------+  +-------+                 | |
> >>    |+-------------------------------------+ |
> >>    |               ....                     |
> >>    |                                        |
> >>    +----------------------------------------+
> >>
> >>    ...
> >>
> >> The placement group is supported by OSDs to store the objects. OSDs are 
> >> daemons running on machines where the storage is attached. For instance, 
> >> a placement group supporting three replicas will have three OSDs at its 
> >> disposal : one OSD is the primary and the two others store copies of 
> >> each object.
> >>
> >>        +-------- placement group -------------+
> >>        |+----------------+ +----------------+ |
> >>        || object A       | | object B       | |
> >>        |+----------------+ +----------------+ |
> >>        +---+-------------+-----------+--------+
> >>            |             |           |
> >>            |             |           |
> >>          OSD 0         OSD 1       OSD 2
> >>         +------+      +------+    +------+
> >>         |+---+ |      |+---+ |    |+---+ |
> >>         || A | |      || A | |    || A | |
> >>         |+---+ |      |+---+ |    |+---+ |
> >>         |+---+ |      |+---+ |    |+---+ |
> >>         || B | |      || B | |    || B | |
> >>         |+---+ |      |+---+ |    |+---+ |
> >>         +------+      +------+    +------+
> >>
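
The way I read the diagram: the client talks only to the primary, and the
primary fans the write out to the replica OSDs, acknowledging once every
copy is in place. A sketch of that idea (again just my model, not the real
OSD code):

```python
class OSD:
    def __init__(self, name):
        self.name = name
        self.store = {}   # object name -> data held by this OSD

def write_object(acting_set, obj_name, data):
    # Primary-copy replication: the first OSD in the acting set is the
    # primary; it stores the object and forwards it to the replicas.
    primary, *replicas = acting_set
    primary.store[obj_name] = data
    for replica in replicas:
        replica.store[obj_name] = data
    # acknowledge only when every OSD in the acting set has the object
    return all(obj_name in osd.store for osd in acting_set)

acting = [OSD("osd.0"), OSD("osd.1"), OSD("osd.2")]
print(write_object(acting, "A", b"payload"))  # True
```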
> >> The OSDs are not for the exclusive use of the placement group : multiple 
> >> placement groups can use the same OSDs to store their objects. However, 
> >> the collocation of objects from various placement groups in the same OSD 
> >> is transparent and is not discussed here.
> >>
> >> The placement group does not run as a single daemon as suggested above. 
> >> Instead it is distributed and resides within each OSD. Whenever an OSD 
> >> dies, the part of the placement group on that OSD is gone and needs to 
> >> be reconstructed using another OSD.
> >>
> >>                OSD 0                                           OSD 1 ...
> >>         +----------------+---- placement group --------+  +------
> >>         |+--- object --+ |+--------------------------+ |  |
> >>         || name : B    | ||  pg_log_entry_t MODIFY   | |  |
> >>         || key : 2     | ||  pg_log_entry_t DELETE   | |  |
> >>         |+-------------+ |+--------------------------+ |  |
> >>         |+--- object --+ >------ last_backfill         |  | ....
> >>         || name : A    | |                             |  |
> >>         || key : 5     | |                             |  |
> >>         |+-------------+ |                             |  |
> >>         |                |                             |  |
> >>         |    ....        |                             |  |
> >>         +----------------+-----------------------------+  +-----
> >>
> >>
> >> When an object is deleted or modified in the placement group, it is 
> >> recorded in a log to be replayed if needed. In the simplest case, if an 
> >> OSD gets disconnected, reconnects and needs to catch up with the other 
> >> OSDs, copies of the log entries will be sent to it. However, the logs have 
> >> a limited size and it may be more efficient, in some cases, to just copy 
> >> the objects over instead of replaying the logs.
> >>
> >> Each object name is hashed into an integer that can be used to order them. 
> >> For instance, the object B above has been hashed to key 2 and the object A 
> >> above has been hashed to key 5. The last_backfill pointer of the placement 
> >> group draws the limit separating the objects that have already been copied 
> >> from other OSDs and those in the process of being copied. The objects that 
> >> are lower than last_backfill have been copied (that would be object B 
> >> above) and the objects that are greater than last_backfill are going to 
> >> be copied.
> >>
> >> It may take time for an OSD to catch up, and it is useful to allow 
> >> replaying the logs while backfilling. Log entries related to objects 
> >> lower than last_backfill are applied. However, log entries related to 
> >> objects greater than last_backfill are discarded because those objects 
> >> are scheduled to be copied at a later time anyway.
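
The replay rule in that last paragraph boils down to a single comparison
(a sketch; the keys are the hashed object names from the example above):

```python
def apply_log_entry(entry_key, last_backfill):
    # Replay a log entry only for objects at or below last_backfill;
    # objects above it will be backfilled wholesale later anyway.
    return entry_key <= last_backfill

last_backfill = 2   # object B (key 2) has already been copied
print(apply_log_entry(2, last_backfill))  # True: B's log entries replay
print(apply_log_entry(5, last_backfill))  # False: A's entries are discarded
```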
> >>
> >>
> >> -- 
> >> Loïc Dachary, Artisan Logiciel Libre
> >> All that is necessary for the triumph of evil is that good people do 
> >> nothing.
> >>
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to [email protected]
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


