On Sat, May 25, 2013 at 01:55:30PM +0200, Loic Dachary wrote:
> Hi,
> 

Hi Loic,

> Here is a draft of my current understanding of backfilling. Disclaimer : it 
> is possible that I completely misunderstood ;-)
> 

Maybe I'm wrong, but I think there are some flaws in your explanation.

Disclaimer: I'm not a Ceph developer, but a fellow Ceph tester/user.

I would expect almost any explanation of how Ceph works to use words like
hash or algorithm, because the RADOS algorithm determines where data ends up.
It is used to calculate how to balance the data over the different placement
groups, OSDs and machines.

I believe/assumed it works like this:
- an OSD belongs to one pool
- an OSD can serve many placement groups
- a placement group belongs to one pool
- the monitors know which OSDs exist in each pool and which are up
- this is sometimes called the topology, usually the osdmap and pgmap
- the monitors need to have quorum to be authoritative for their data before
  making changes.
- the list of monitors is called the monmap.
- when storing or retrieving an object, a client will have to do a RADOS
  calculation
- to do this calculation it will first need to talk to one of the monitors
  that has quorum to get the osdmap and pgmap; they serve as input for the
  RADOS algorithm
- one OSD is the master for one piece of data and serves as the contact point
  for clients
- that master OSD for a piece of data also runs the RADOS calculation to talk
  to the other OSDs when replicating data changes
- a piece of data is called an object. Ceph is an Object Storage system/Object 
Store.
- the Ceph Object Store is not the same as a Swift or RADOS-GW object store.
- a Ceph object can store keys/values, not just data
- when using RBD, the RBD client will create a 'directory' object which
  contains general information like the version/type of the RBD image and a
  list of names of the image parts. Each part is the same size, I think it
  was 4 MB?
- when an OSD or client connects to an OSD, they also exchange information
  about at least the osdmap and monmap.
- when one OSD or monitor can't reach another mon or OSD, they will use a
  gossip protocol to communicate that to connected clients, OSDs or mons.
- when a new OSD comes online, the other OSDs talk to it to find out what data
  they might need to exchange; this is called peering.
- the RADOS algorithm works similarly to consistent hashing, so a client can
  talk directly to the OSD where the data is or should be stored.
- backfilling is what a master OSD does when it checks whether the other OSDs
  that should have a copy actually have one. It will send a copy of any
  missing objects (see the sketch right after this list).
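
To illustrate that last point, here is a rough Python sketch of how I picture
the check a master OSD does against a replica. All the names (backfill_check,
push_object, ...) are mine and purely illustrative; the real Ceph code is C++
and far more involved (pg logs, missing sets, recovery throttling, ...):

    def backfill_check(primary_objects, replica_objects, push_object):
        # primary_objects: dict of name -> data held by the master OSD
        # replica_objects: set of object names the replica reports during peering
        # push_object:     callback that copies one object to the replica
        missing = [name for name in primary_objects if name not in replica_objects]
        for name in sorted(missing):
            push_object(name, primary_objects[name])
        return missing

    # Example: the replica lost object "B", so only "B" gets pushed.
    sent = backfill_check(
        {"A": b"data-a", "B": b"data-b"},      # what the master OSD has
        {"A"},                                 # what the replica says it has
        lambda name, data: print("pushing", name),
    )
    print("pushed:", sent)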

How exactly the RADOS algorithm calculates, based on the osdmap and pgmap,
which pg and master OSD an object belongs to, I'm not 100% sure.
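
If I had to write it down, my rough mental model looks like the sketch below.
The hash-modulo step matches the consistent hashing remark above, but the
exact hash function and the way OSDs are picked from the osdmap are
assumptions on my part, not the real implementation:

    import hashlib

    # Toy placement: object name -> placement group -> OSDs.
    # pg_num and the list of up OSDs would come from the pgmap/osdmap that
    # the client fetched from a monitor with quorum.

    def object_to_pg(name, pg_num):
        # Hash the object name to a stable integer and fold it into a pg id.
        h = int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "little")
        return h % pg_num

    def pg_to_osds(pg_id, up_osds, replicas=3):
        # Pick `replicas` OSDs for this pg in a deterministic way; the first
        # one is the master/primary the client talks to.
        ordered = sorted(up_osds, key=lambda osd: hash((pg_id, osd)))
        return ordered[:replicas]

    pg = object_to_pg("an-object-name", pg_num=64)
    osds = pg_to_osds(pg, up_osds=[0, 1, 2, 3, 4, 5])
    print("pg", pg, "-> master", osds[0], "replicas", osds[1:])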

Does that help?

> Cheers
> 
> Ceph stores objects in pools which are divided in placement groups.
> 
>    +---------------------------- pool a ----+
>    |+----- placement group 1 -------------+ |
>    ||+-------+  +-------+                 | |
>    |||object |  |object |                 | |
>    ||+-------+  +-------+                 | |
>    |+-------------------------------------+ |
>    |+----- placement group 2 -------------+ |
>    ||+-------+  +-------+                 | |
>    |||object |  |object |   ...           | |
>    ||+-------+  +-------+                 | |
>    |+-------------------------------------+ |
>    |               ....                     |
>    |                                        |
>    +----------------------------------------+
> 
>    +---------------------------- pool b ----+
>    |+----- placement group 1 -------------+ |
>    ||+-------+  +-------+                 | |
>    |||object |  |object |                 | |
>    ||+-------+  +-------+                 | |
>    |+-------------------------------------+ |
>    |+----- placement group 2 -------------+ |
>    ||+-------+  +-------+                 | |
>    |||object |  |object |   ...           | |
>    ||+-------+  +-------+                 | |
>    |+-------------------------------------+ |
>    |               ....                     |
>    |                                        |
>    +----------------------------------------+
> 
>    ...
> 
> The placement group is supported by OSDs to store the objects. They are 
> daemons running on machines where storage is attached. For instance, a 
> placement group supporting three replicas will have three OSDs at its 
> disposal: one OSD is the primary and the two others store copies of each 
> object.
> 
>        +-------- placement group -------------+
>        |+----------------+ +----------------+ |
>        || object A       | | object B       | |
>        |+----------------+ +----------------+ |
>        +---+-------------+-----------+--------+
>            |             |           |
>            |             |           |
>          OSD 0         OSD 1       OSD 2
>         +------+      +------+    +------+
>         |+---+ |      |+---+ |    |+---+ |
>         || A | |      || A | |    || A | |
>         |+---+ |      |+---+ |    |+---+ |
>         |+---+ |      |+---+ |    |+---+ |
>         || B | |      || B | |    || B | |
>         |+---+ |      |+---+ |    |+---+ |
>         +------+      +------+    +------+
> 
> The OSDs are not for the exclusive use of the placement group: multiple 
> placement groups can use the same OSDs to store their objects. However, the 
> co-location of objects from various placement groups in the same OSD is 
> transparent and is not discussed here.
> 
> The placement group does not run as a single daemon as suggested above. 
> Instead it is distributed and resides within each OSD. Whenever an OSD dies, 
> the placement group for this OSD is gone and needs to be reconstructed using 
> another OSD.
> 
>                OSD 0                                           OSD 1 ...
>         +----------------+---- placement group --------+  +------
>         |+--- object --+ |+--------------------------+ |  |
>         || name : B    | ||  pg_log_entry_t MODIFY   | |  |
>         || key : 2     | ||  pg_log_entry_t DELETE   | |  |
>         |+-------------+ |+--------------------------+ |  |
>         |+--- object --+ >------ last_backfill         |  | ....
>         || name : A    | |                             |  |
>         || key : 5     | |                             |  |
>         |+-------------+ |                             |  |
>         |                |                             |  |
>         |    ....        |                             |  |
>         +----------------+-----------------------------+  +-----
> 
> 
> When an object is deleted or modified in the placement group, it is recorded 
> in a log to be replayed if needed. In the simplest case, if an OSD gets 
> disconnected, reconnects and needs to catch up with the other OSDs, copies of 
> the log entries will be sent to it. However, the logs have a limited size and 
> it may be more efficient, in some cases, to just copy the objects over 
> instead of replaying the logs.
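
The way I picture that trade-off, as a toy Python sketch (the version numbers
and the exact condition are my own guesses, not the actual Ceph logic):

    def choose_recovery(osd_last_version, log_tail_version, log_head_version):
        # If the rejoining OSD's last known version is still covered by the
        # bounded pg log, replaying the missing entries is enough; if the log
        # has already been trimmed past it, copying the objects themselves
        # (backfill) is the only option.
        if osd_last_version >= log_tail_version:
            return "replay log entries %d..%d" % (osd_last_version + 1,
                                                  log_head_version)
        return "backfill: copy the objects instead"

    print(choose_recovery(osd_last_version=95, log_tail_version=90, log_head_version=100))
    print(choose_recovery(osd_last_version=40, log_tail_version=90, log_head_version=100))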
> 
> Each object name is hashed into an integer that can be used to order them. 
> For instance, the object B above has been hashed to key 2 and the object A 
> above has been hashed to key 5. The last_backfill pointer of the placement 
> group draws the limit separating the objects that have already been copied 
> from other OSDs and those in the process of being copied. The objects that 
> are lower than last_backfill have been copied (that would be object B above) 
> and the objects that are greater than last_backfill are going to be copied.
> 
> It may take time for an OSD to catch up and it is useful to allow replaying 
> the logs while backfilling. Log entries related to objects lower than 
> last_backfill are applied. However, log entries related to objects greater 
> than last_backfill are discarded because those objects are scheduled to be 
> copied at a later time anyway.
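
Putting the hashed keys and the last_backfill pointer together, my toy version
of that rule would look like this (plain integers instead of the real data
structures, purely for illustration):

    def should_apply_log_entry(entry_key, last_backfill):
        # Objects hashed below last_backfill have already been copied, so log
        # entries for them must be applied.  Entries for objects above
        # last_backfill can be discarded: those objects will be copied in
        # full later anyway.
        return entry_key < last_backfill

    last_backfill = 3   # some key between object B (2, copied) and A (5, not yet)
    print(should_apply_log_entry(2, last_backfill))   # True  -> apply (object B)
    print(should_apply_log_entry(5, last_backfill))   # False -> discard (object A)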
> 
> 
> -- 
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
> 

