There are a few reasons. At the most basic level, grouping objects into
PGs (you might think of them as "shards" of a pool) limits the amount of
metadata and tracking the system has to maintain: state is kept per PG
rather than per object. Calculating the mapping is pretty cheap but not
computationally free, and PGs help there too, since placement only has to
be recalculated per PG rather than per object.
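
To make the two-level mapping concrete, here's a toy Python sketch; the
hashes and names (obj_to_pg, pg_to_osds, pg_num) are made up for
illustration, and this is not the real CRUSH code:

import hashlib

def obj_to_pg(object_id: str, pg_num: int) -> int:
    """Hash the object name into one of pg_num placement groups."""
    h = int.from_bytes(hashlib.md5(object_id.encode()).digest()[:4], "little")
    return h % pg_num

def pg_to_osds(pool_id: int, pg: int, num_osds: int, replicas: int = 2):
    """Deterministic, pseudorandom choice of OSDs for a PG.
    Stand-in for CRUSH: any deterministic function of (pool, pg) will do here."""
    seed = int.from_bytes(hashlib.md5(f"{pool_id}.{pg}".encode()).digest()[:8], "little")
    osds = []
    while len(osds) < replicas:
        seed = (seed * 6364136223846793005 + 1442695040888963407) % 2**64
        osd = seed % num_osds
        if osd not in osds:
            osds.append(osd)
    return osds

# Per-object work is just a cheap hash; all tracked state is per PG.
pg = obj_to_pg("myobject", pg_num=256)
print(pg, pg_to_osds(pool_id=1, pg=pg, num_osds=100))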
A better answer has to do with reliability and the probability of data
loss after failures. If we _independently_ calculate a placement for every
object, it doesn't take long before you can pick any two (or even three)
nodes in the system and find some object stored on exactly those nodes.
That means _any_ double failure is guaranteed to lose some data. But
generally speaking we do want replication to be declustered, because it
allows massively parallel recovery and (in general) is a reliability
win(*). As a practical matter, we want to balance those two things and
mitigate where possible by imposing some order (e.g. separating replicas
across failure domains). PGs are the tool to do that.
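
Here's a rough simulation of the "any two nodes" point; the numbers
(100 OSDs, 2x replication) are illustrative only:

import itertools, random

NUM_OSDS, REPLICAS = 100, 2
all_pairs = list(itertools.combinations(range(NUM_OSDS), 2))  # 4950 pairs

def pairs_used(num_objects):
    """Distinct OSD pairs holding data when every object is placed independently."""
    rng = random.Random(42)
    used = set()
    for _ in range(num_objects):
        used.add(tuple(sorted(rng.sample(range(NUM_OSDS), REPLICAS))))
    return used

for n in (1000, 10000, 100000):
    frac = len(pairs_used(n)) / len(all_pairs)
    print(f"{n} objects -> {frac:.1%} of OSD pairs hold some object")

# Coverage saturates quickly: with enough objects every pair holds something,
# so any double failure loses data.  With PGs, the number of distinct OSD
# sets in use is capped at pg_num no matter how many objects there are.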
They are also nice when you have, say, a small pool with a small number of
objects. You can break it into a small number of PGs and talk to a small
number of OSDs (instead of talking to, say, 10 OSDs to read/write 10
objects).
From a more practical standpoint, PGs are a simple abstraction upon which
to implement all the peering and synchronization between OSDs. Map update
processing _only_ has to recalculate mappings for the PGs currently
stored locally, not for every single object stored locally. And sync is
expressed in terms of the PG version for the PGs shared between a pair of
OSDs, not the versions of every object they share. The peering protocols
are a delicate balance between ease of synchronization, simplicity, and
minimization of centralized metadata.
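
A minimal sketch of that point; the classes and fields below are
illustrative, not the actual Ceph data structures:

class ClusterMap:
    """Toy cluster map: an epoch plus a deterministic pg -> osds rule."""
    def __init__(self, epoch, pg_to_osds):
        self.epoch = epoch
        self.pg_to_osds = pg_to_osds   # stand-in for CRUSH

class OSD:
    def __init__(self, osd_id, local_pgs):
        self.osd_id = osd_id
        self.local_pgs = set(local_pgs)                  # PGs stored here
        self.pg_versions = {pg: 0 for pg in local_pgs}   # one version per PG

    def handle_map_update(self, new_map):
        """Recompute placement only for locally stored PGs; the objects
        inside them never need to be touched individually."""
        return [(pg, new_map.pg_to_osds(pg))
                for pg in self.local_pgs
                if self.osd_id not in new_map.pg_to_osds(pg)]

    def sync_with(self, peer):
        """Peering compares one version per shared PG, not per-object versions."""
        shared = self.local_pgs & peer.local_pgs
        return {pg: (self.pg_versions[pg], peer.pg_versions[pg])
                for pg in shared
                if self.pg_versions[pg] != peer.pg_versions[pg]}

# Example: PGs map to (pg % 4, pg % 4 + 1) in epoch 2.
m = ClusterMap(epoch=2, pg_to_osds=lambda pg: (pg % 4, pg % 4 + 1))
a, b = OSD(1, {0, 1, 4, 7}), OSD(2, {1, 2, 5, 6})
a.pg_versions[1] = 3
print(a.handle_map_update(m))   # PGs that must migrate off osd.1: [(7, (3, 4))]
print(a.sync_with(b))           # only shared, out-of-date PGs: {1: (3, 0)}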
sage
(* If you do the math, it's actually a wash for 2x. The probability of
any data loss is the same (although with declustering, any double failure
will lose some data). Once you account for effects at the margins,
declustered replication is a slight win. If you look at the expected
_amount_ of data lost, declustering is always a win. See Qin Xin's paper
on the publications page for more info.)
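
For reference, a back-of-the-envelope version of that footnote, under some
assumptions of my own (a second failure during recovery causes loss, and
recovery speed scales with the number of disks participating); the numbers
are made up:

N    = 100      # OSDs in the cluster
D    = 1.0      # data held per OSD (arbitrary units)
rate = 1.0      # recovery throughput of a single disk (D units per hour)
lam  = 1e-4     # per-OSD failure probability per hour

# Mirrored pairs: all of the failed OSD's data lives on exactly one partner.
t_mirrored = D / rate                        # one disk does all the copying
p_mirrored = lam * t_mirrored                # only that partner's failure hurts
loss_mirrored = p_mirrored * D               # ...but then you lose a whole disk

# Fully declustered: data spread over the other N-1 OSDs, recovered in parallel.
t_declustered = D / (rate * (N - 1))         # everyone helps, so recovery is fast
p_declustered = (N - 1) * lam * t_declustered    # any peer's failure hurts
loss_declustered = p_declustered * (D / (N - 1)) # ...but each peer holds little

print(p_mirrored, p_declustered)             # identical: the "wash" for 2x
print(loss_mirrored, loss_declustered)       # expected loss is ~1/(N-1) smaller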
On Tue, 15 Feb 2011, Tommi Virtanen wrote:
> Hi. I'm reading the thesis, and wondering what the thinking is behind
> how Ceph uses the placement groups (PGs).
>
> It seems that CRUSH is used for a deterministic, pseudorandom mapping,
> object_id --> pg --> osds. I'm wondering why the extra level of PGs
> was felt desirable, why that isn't just object_id --> osds.
>
> Colin explained on IRC that the primary OSD for a PG is responsible
> for some managerial duties, but that just tells me *how* PGs are used;
> not *why*. Surely you could organize these responsibilities
> differently, e.g. manage replication of an object on an
> object-by-object basis, by the primary OSD for that object.
>
> --
> :(){ :|:&};: