Hi Matthias, Marcel,
Thanks for the discussion yesterday on a cold storage tier[1]. I think
the idea of minimizing/preventing migration of data for a particular RADOS
pool while still minimizing placement metadata (i.e., no MDS) is very
interesting and has lots of possible applications. I thought I'd
summarize what I was suggesting yesterday in case it didn't come across
well verbally.
The basic tradeoff is around placement metadata. In RADOS we have a
compact OSDMap structure on the client that lets you calculate where any
object is based on a simple policy (pool properties, CRUSH map) and OSD
state (up/down/in/out, current IP address). If placement of an object
is purely a function of its name and the cluster state, then data will
generally move whenever the cluster changes.
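To make that concrete, here's a toy (non-CRUSH) placement function in
Python; everything in it is made up, but it shows why purely stateless
placement implies migration when the cluster changes:

  import hashlib

  # Toy stateless placement: object -> OSD is a pure function of the object
  # name and the current cluster membership.  (Illustrative only; real RADOS
  # hashes to a PG and runs CRUSH, but the property is the same.)
  def place(object_name, osds):
      h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
      return osds[h % len(osds)]

  before = place("foo", ["osd0", "osd1", "osd2"])
  after = place("foo", ["osd0", "osd1", "osd2", "osd3"])   # deploy new gear
  # 'before' and 'after' can differ: with no metadata to pin the old
  # location, the object has to move to wherever the function now says.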
To avoid that, my suggestion is to incorporate a timestamp into part of
the name (say, a prefix). Then placement becomes a function of an
arbitrary string, time written (which together form the object name), and
cluster state. This would normally mean a metadata layer so that you can
tell that 'foo' was written at time X and is actually 'X_foo'. But, if we
combine it with the proposed RADOS redirect mechanism, then the active
storage tier would have a zillion pointers (stored as 'foo') that point
off into some cold tier with the correct name ('X_foo'). Basically,
another RADOS pool becomes that metadata layer. At that point it needn't
even be 'X_foo'.. it could be X-anything, as long as it is unique and has
the timestamp X in there to inform placement.
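A rough sketch of that bookkeeping, with pools modeled as plain dicts and
all of the names invented (the real redirect would of course be a
RADOS-level mechanism, not client-side code):

  import time

  # Hypothetical sketch: the active ("hot") pool stores only a tiny pointer
  # under the user-visible name; the bulk data lives in the cold pool under
  # a timestamped name whose prefix informs placement.
  def archive(hot_pool, cold_pool, cold_pool_name, name, data):
      ts = int(time.time())
      cold_name = "%d_%s" % (ts, name)        # e.g. '1400000000_foo'
      cold_pool[cold_name] = data             # placed by its timestamped name
      hot_pool[name] = ("redirect", cold_pool_name, cold_name)
      return cold_name

  def read(hot_pool, pools, name):
      obj = hot_pool[name]
      if isinstance(obj, tuple) and obj[0] == "redirect":
          _, pool_name, cold_name = obj
          return pools[pool_name][cold_name]  # follow the pointer
      return obj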
For the placement thing, my suggestion is to look at the basic idea behind
the original RUSH-L algorithm (reimplemented as CRUSH list buckets),
originally described in this paper
http://pdf.aminer.org/000/409/291/replication_under_scalable_hashing_a_family_of_algorithms_for_scalable.pdf
The core idea is that at a point in time, data is distributed in a
particular way. In the base case, we just hash/stripe over a set of
identical nodes. Each time we deploy new gear, we "patch" the previous
distribution, so that some % of objects are instead placed on the new
gear. This approach has various flaws, mainly when it comes to removing
old gear, but I think the idea of patching the previous distribution can
be applied here.
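As a toy illustration of that descent (a sketch of the idea only, not the
actual RUSH or CRUSH math):

  import hashlib

  def _hash01(name, salt):
      # deterministic pseudo-random float in [0, 1) from (name, salt)
      h = hashlib.md5(("%s:%s" % (name, salt)).encode()).hexdigest()
      return int(h, 16) / 16.0 ** 32

  # 'epochs' is a list of (gear, weight) in deployment order, oldest first.
  # Walk from the newest addition backwards: with probability equal to the
  # new gear's share of the total weight so far, the object lands there;
  # otherwise it falls through to the older distribution.  Adding gear only
  # "patches" the previous mapping, so most objects stay put.
  def rush_place(name, epochs):
      for i in range(len(epochs) - 1, 0, -1):
          gear, w = epochs[i]
          total = sum(wt for _, wt in epochs[:i + 1])
          if _hash01(name, gear) < float(w) / total:
              return gear
      return epochs[0][0]       # base case: the original gear

  # e.g. rush_place("foo", [("rack-a", 10), ("rack-b", 10), ("rack-c", 5)])
  # moves only ~5/25 of the objects when rack-c is deployed.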
Currently, we do:
  object_name
  hash(object_name) % pg_num -> ps (placement seed)
  (ps, poolid) -> pgid
  crush(pgid) -> [set of osds]
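Or, in toy Python terms, with crush() as a stand-in for the real CRUSH
calculation against the OSDMap:

  import hashlib

  def stable_hash(s):
      return int(hashlib.md5(s.encode()).hexdigest(), 16)

  # Toy version of the current pipeline.
  def map_object(object_name, poolid, pg_num, crush):
      ps = stable_hash(object_name) % pg_num     # placement seed
      pgid = (poolid, ps)                        # pool + seed -> pgid
      return crush(pgid)                         # pgid -> [set of osds]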
Here, we could define a series of time intervals, and for each interval,
we would create a new set of PGs. More like:
  (name, timestamp)
  hash(object_name) % interval_pg_num -> ps
  (ps, poolid, interval #) -> tpgid
  crush(tpgid) -> [set of osds]
The trick would be that for each time interval, CRUSH would define how the
objects distribute. When that interval's hardware fills up or new hardware
is deployed, we'd close out the current interval and start a new one that
maps to new PGs that CRUSH places on the new hardware.
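A sketch of what that might look like; how the interval gets looked up
from the timestamp is an assumption here, and per-interval pg_num and
CRUSH rules are glossed over:

  import bisect
  import hashlib

  def stable_hash(s):
      return int(hashlib.md5(s.encode()).hexdigest(), 16)

  # 'interval_starts' is a sorted list of the start times of each interval;
  # conceptually each interval has its own pg_num and its own CRUSH rule
  # pointing at whatever hardware was current when it was open.
  def map_object_cold(name, timestamp, poolid, interval_starts,
                      interval_pg_num, crush):
      interval = bisect.bisect_right(interval_starts, timestamp) - 1
      ps = stable_hash(name) % interval_pg_num       # placement seed
      tpgid = (poolid, interval, ps)                 # time-scoped pgid
      return crush(tpgid)     # CRUSH maps tpgid onto that interval's gear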
Hmm, you could actually do this by simply creating a new RADOS pool for
every interval and not changing anything in the existing code at all. As
HW in old pools fails you'd have to include some new HW in the mix to
offload some content. There are probably some changes we could make
there to avoid writing anything new to the surviving full nodes (that case
is awkward to handle currently). Or, there may be benefits to pulling
this functionality into a new approach within CRUSH.. I'm not sure.
Would need to think about it a bit more ...
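For the pool-per-interval variant the client side is basically just a
pool-name lookup; a minimal sketch, assuming the writer knows the interval
boundaries (the pool naming is made up):

  import bisect
  import time

  # Hypothetical pool-per-interval selection: 'boundaries' are the start
  # times of the intervals opened so far, each backed by an ordinary RADOS
  # pool (e.g. 'cold-2') created when that interval began.
  def pool_for(timestamp, boundaries):
      return "cold-%d" % (bisect.bisect_right(boundaries, timestamp) - 1)

  boundaries = [0, 1400000000, 1420000000]        # made-up interval starts
  pool = pool_for(int(time.time()), boundaries)   # new writes -> newest pool
  # Older pools are never written again, so their data stays put unless the
  # hardware under them fails.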
sage
[1]
http://pad.ceph.com/p/hammer-cold_storage
https://wiki.ceph.com/Planning/Blueprints/Hammer/Towards_Ceph_Cold_Storage
http://youtu.be/FARNRvYMQJ4