On Sat, May 25, 2013 at 04:27:16PM +0200, Loic Dachary wrote:
> 
> On 05/25/2013 02:33 PM, Leen Besselink wrote:
> 
> Hi Leen,
> 
> > - a Ceph object can store keys/values, not just data
> 
> I did not know that. Could you explain or give me the URL?
> 
Well, I got that impression from some of the earlier talks and from this
blog post:

http://ceph.com/community/my-first-impressions-of-ceph-as-a-summer-intern/

But I haven't read it in a while. At this time I only see something like:

http://ceph.com/docs/master/rados/api/librados/?highlight=rados_getxattr#rados_getxattr

which looks like it is storing it in filesystem attributes. So maybe an
object can be a piece of data or a key/value store.

> > - when using RBD the RBD client will create a 'directory' object which
> >   contains general information like the version/type of the RBD image
> >   and a list of names of the image parts. Each part is the same size,
> >   I think it was 4MB?
> 
> That's also my understanding: 4MB is the default.
> 
> > - when an OSD or client connects to an OSD they also communicate
> >   information about at least the osdmap and monmap.
> > - when one OSD or monitor can't reach another mon or OSD, they will
> >   use a gossip protocol to communicate that to connected clients,
> >   OSDs or mons.
> > - when a new OSD comes online the other OSDs talk to it to work out
> >   what data they might need to exchange; this is called peering.
> > - the RADOS algorithm works similarly to consistent hashing, so a
> >   client can talk directly to the OSD where the data is or should be
> >   stored.
> > - backfilling is what a master OSD does when it checks whether the
> >   other OSDs that should have a copy actually have one. It will send
> >   a copy of any missing objects.
> 
> I guess that's the area where I'm still unsure how it goes. I should
> look into the state machine of PG.{h,cc} to figure out how
> backfill-related messages are exchanged.

Well, I assume the master of an object knows it is the master. When it
knows an OSD has left the cluster, it knows it has to store a new copy.
When changes happen in the cluster, all placement group / object
locations in the pool will have to be re-calculated.
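To make the 4MB striping mentioned above concrete, here is a minimal Python sketch of how a client can compute which object holds a given byte of an RBD image. The object-name format below is a made-up stand-in; real RBD derives the object-name prefix from the image header object.

```python
# Hypothetical sketch of RBD striping: an image is cut into fixed-size
# objects (4MB by default) and a byte offset maps directly to one of
# them. The name format is illustrative, not the real RBD naming scheme.

RBD_OBJECT_SIZE = 4 * 1024 * 1024  # 4MB default object size

def rbd_locate(image_prefix, offset):
    """Return (object_name, offset_within_object) for a byte offset."""
    index = offset // RBD_OBJECT_SIZE
    object_name = "%s.%016x" % (image_prefix, index)
    return object_name, offset % RBD_OBJECT_SIZE

# A write at offset 9MB of image "vm0" lands 1MB into the third object:
print(rbd_locate("vm0", 9 * 1024 * 1024))
# -> ('vm0.0000000000000002', 1048576)
```

Because the parts all have the same size, no lookup is needed beyond reading the header once: the offset arithmetic alone identifies the object.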
I assumed this might mean that an OSD which was master of an object
before isn't master anymore; the new master is now responsible for the
object and the number of copies.

> Thanks for taking the time to explain :-)

This was just from memory, which is why I left some things unanswered.
My idea was that when I transfer my knowledge to you, you can use it to
search the source and documentation. :-)

> Cheers
> 
> > How the RADOS algorithm calculates, based on the osdmap and pgmap,
> > which pg and master OSD an object belongs to, I'm not 100% sure.
> > 
> > Does that help?
> > 
> >> Cheers
> >> 
> >> Ceph stores objects in pools which are divided in placement groups.
> >> 
> >> +---------------------------- pool a ----+
> >> |+----- placement group 1 -------------+ |
> >> ||+-------+ +-------+                  | |
> >> |||object | |object |                  | |
> >> ||+-------+ +-------+                  | |
> >> |+-------------------------------------+ |
> >> |+----- placement group 2 -------------+ |
> >> ||+-------+ +-------+                  | |
> >> |||object | |object | ...              | |
> >> ||+-------+ +-------+                  | |
> >> |+-------------------------------------+ |
> >> | ....                                   |
> >> |                                        |
> >> +----------------------------------------+
> >> 
> >> +---------------------------- pool b ----+
> >> |+----- placement group 1 -------------+ |
> >> ||+-------+ +-------+                  | |
> >> |||object | |object |                  | |
> >> ||+-------+ +-------+                  | |
> >> |+-------------------------------------+ |
> >> |+----- placement group 2 -------------+ |
> >> ||+-------+ +-------+                  | |
> >> |||object | |object | ...              | |
> >> ||+-------+ +-------+                  | |
> >> |+-------------------------------------+ |
> >> | ....                                   |
> >> |                                        |
> >> +----------------------------------------+
> >> 
> >> ...
> >> 
> >> The placement group is supported by OSDs to store the objects. OSDs
> >> are daemons running on machines where storage is attached. For
> >> instance, a placement group supporting three replicas will have
> >> three OSDs at its disposal: one OSD is the primary and the two
> >> others store copies of each object.
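The pool/PG/OSD layering above can be sketched as a toy placement calculation. This is only an illustration of the shape of the mapping: real Ceph hashes object names with its own hash function and maps PGs to OSDs with CRUSH against the osdmap, both far more sophisticated than the stand-ins below.

```python
# Toy placement sketch (NOT the real CRUSH algorithm): an object name is
# hashed into a placement group within its pool, and the pg id is hashed
# again to pick a deterministic set of OSDs. Because the mapping is a
# pure function of the maps, any client can compute it locally and talk
# to the primary OSD directly, without a central lookup service.

import hashlib

def object_to_pg(pool_id, object_name, pg_num):
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    return (pool_id, h % pg_num)  # pg id = (pool, hash of name mod pg_num)

def pg_to_osds(pg, osd_ids, replicas=3):
    """Stand-in for CRUSH: deterministically pick `replicas` distinct
    OSDs from the pg id. The first entry acts as the primary."""
    h = int(hashlib.md5(repr(pg).encode()).hexdigest(), 16)
    candidates, chosen = list(osd_ids), []
    while candidates and len(chosen) < replicas:
        chosen.append(candidates.pop(h % len(candidates)))
        h //= 65536
    return chosen  # [primary, replica, replica]

pg = object_to_pg(0, "foo", pg_num=64)
acting = pg_to_osds(pg, range(6))  # three distinct OSD ids
```

The key property the sketch preserves is determinism: every client and OSD holding the same maps computes the same acting set, which is what lets clients go straight to the right OSD.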
> >> 
> >>  +-------- placement group -------------+
> >>  |+----------------+ +----------------+ |
> >>  || object A       | | object B       | |
> >>  |+----------------+ +----------------+ |
> >>  +---+-------------+-----------+--------+
> >>      |             |           |
> >>      |             |           |
> >>    OSD 0         OSD 1       OSD 2
> >>   +------+      +------+    +------+
> >>   |+---+ |      |+---+ |    |+---+ |
> >>   || A | |      || A | |    || A | |
> >>   |+---+ |      |+---+ |    |+---+ |
> >>   |+---+ |      |+---+ |    |+---+ |
> >>   || B | |      || B | |    || B | |
> >>   |+---+ |      |+---+ |    |+---+ |
> >>   +------+      +------+    +------+
> >> 
> >> The OSDs are not for the exclusive use of the placement group:
> >> multiple placement groups can use the same OSDs to store their
> >> objects. However, the collocation of objects from various placement
> >> groups on the same OSD is transparent and is not discussed here.
> >> 
> >> The placement group does not run as a single daemon as suggested
> >> above. Instead it is distributed and resides within each OSD.
> >> Whenever an OSD dies, the copy of the placement group on this OSD
> >> is gone and needs to be reconstructed using another OSD.
> >> 
> >>  OSD 0                                           OSD 1 ...
> >> +----------------+---- placement group --------+ +------
> >> |+--- object --+ |+--------------------------+ | |
> >> || name : B    | || pg_log_entry_t MODIFY    | | |
> >> || key : 2     | || pg_log_entry_t DELETE    | | |
> >> |+-------------+ |+--------------------------+ | |
> >> |+--- object --+ >------ last_backfill       | | ....
> >> || name : A    | |                           | |
> >> || key : 5     | |                           | |
> >> |+-------------+ |                           | |
> >> |                |                           | |
> >> | ....           |                           | |
> >> +----------------+---------------------------+ +-----
> >> 
> >> When an object is deleted or modified in the placement group, the
> >> operation is recorded in a log to be replayed if needed. In the
> >> simplest case, if an OSD gets disconnected, reconnects and needs to
> >> catch up with the other OSDs, copies of the log entries will be
> >> sent to it.
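The log-based catch-up described above can be sketched as follows. The field names are simplified stand-ins for the real pg_log_entry_t; the point is only that a briefly-absent OSD can be repaired by replaying the operations it missed rather than copying every object.

```python
# Simplified sketch of pg log replay (field names are illustrative, not
# the real pg_log_entry_t layout). Each PG keeps a bounded log of recent
# operations; an OSD that reconnects replays the entries newer than the
# last version it applied before going down.

from collections import namedtuple

LogEntry = namedtuple("LogEntry", "version op name data")

def replay(objects, log, last_applied):
    """Catch up an OSD's object store from the pg log; returns the new
    last-applied version."""
    for e in log:
        if e.version <= last_applied:
            continue  # this OSD already saw the entry before going down
        if e.op == "MODIFY":
            objects[e.name] = e.data
        elif e.op == "DELETE":
            objects.pop(e.name, None)
    return max((e.version for e in log), default=last_applied)

log = [LogEntry(1, "MODIFY", "A", "v1"),
       LogEntry(2, "MODIFY", "B", "v1"),
       LogEntry(3, "DELETE", "A", None)]
store = {"A": "v1"}                 # the OSD went down after entry 1
version = replay(store, log, last_applied=1)
# store is now {'B': 'v1'} and version == 3
```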
> >> However, the logs have a limited size and it may be more efficient,
> >> in some cases, to just copy the objects over instead of replaying
> >> the logs.
> >> 
> >> Each object name is hashed into an integer that can be used to
> >> order the objects. For instance, object B above has been hashed to
> >> key 2 and object A has been hashed to key 5. The last_backfill
> >> pointer of the placement group draws the line between the objects
> >> that have already been copied from other OSDs and those still in
> >> the process of being copied. The objects lower than last_backfill
> >> have been copied (that would be object B above) and the objects
> >> greater than last_backfill are going to be copied.
> >> 
> >> It may take time for an OSD to catch up, and it is useful to allow
> >> replaying the logs while backfilling. Log entries related to
> >> objects lower than last_backfill are applied. However, log entries
> >> related to objects greater than last_backfill are discarded,
> >> because those objects are scheduled to be copied in full at a later
> >> time anyway.
> >> 
> >> -- 
> >> Loïc Dachary, Artisan Logiciel Libre
> >> All that is necessary for the triumph of evil is that good people
> >> do nothing.
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
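The interaction between backfill and log replay described in the thread above boils down to one comparison per log entry. A sketch under the same simplified log format, reusing the example keys (object B hashed to 2, object A to 5):

```python
# Sketch of replaying the pg log while a backfill is in progress:
# entries for objects at or below last_backfill are applied, entries for
# objects above it are discarded because those objects will be copied in
# full later anyway. Keys and field names are simplified for
# illustration, not taken from the real implementation.

from collections import namedtuple

LogEntry = namedtuple("LogEntry", "op name data")

# Hash keys from the example above: object B -> 2, object A -> 5.
OBJECT_KEY = {"B": 2, "A": 5}

def replay_during_backfill(objects, log, last_backfill):
    for e in log:
        if OBJECT_KEY[e.name] > last_backfill:
            continue  # not backfilled yet; a full copy arrives later
        if e.op == "MODIFY":
            objects[e.name] = e.data
        elif e.op == "DELETE":
            objects.pop(e.name, None)

store = {"B": "old"}                    # B (key 2) already backfilled
log = [LogEntry("MODIFY", "B", "new"),  # applied: 2 <= last_backfill
       LogEntry("MODIFY", "A", "new")]  # discarded: 5 > last_backfill
replay_during_backfill(store, log, last_backfill=3)
# store is now {'B': 'new'}; object A will be copied wholesale later
```

Discarding the entries above last_backfill is safe precisely because backfill copies those objects in their final state, so any intermediate modifications are subsumed by the copy.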
