Yeah, this is very much like what DreamHost is doing with their DreamCompute installation (you can find some talks about it online, I believe, though I'm not sure how much detail they include there versus in the Q&As).
On Fri, Apr 12, 2013 at 7:32 PM, Chen, Xiaoxi <[email protected]> wrote:
> We are also discussing this internally, and came up with an idea to work
> around it (only for the RBD case; we haven't thought about the object store
> yet), but it is not yet tested. If Mark and Greg can provide some feedback,
> that would be great.
>
> We are trying to write a script to generate some pools: for rack A there is
> a pool A, whose CRUSH ruleset chooses an OSD in rack A as the primary. So if
> we have 10 racks, we will have 10 pools and 10 rules.
>
> When the VM is migrated to another rack, or the volume is detached and
> attached to another VM hosted in another rack, a data migration is needed.
> We are thinking about how to smooth such a migration.

This is one of the use cases that layering is designed to handle (in addition
to standard cloning and snapshots). Just create a clone that lives in the new
pool, and either let it copy-up to the new position lazily or run the command
at a time when you know your network is less busy (sketched concretely below,
after the quoted thread).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

> Sent from my iPhone
>
> On 2013-4-13, at 0:20, "Gregory Farnum" <[email protected]> wrote:
>
>> I was in the middle of writing a response to this when Mark's email came
>> in, so I'll just add a few things:
>>
>> On Fri, Apr 12, 2013 at 9:08 AM, Mark Nelson <[email protected]> wrote:
>>> On 04/11/2013 10:59 PM, Matthias Urlichs wrote:
>>>>
>>>> As I understand it, in Ceph one can cluster storage nodes, but otherwise
>>>> every node is essentially identical, so if three storage nodes have a
>>>> file, Ceph randomly uses one of them.
>>>
>>> Ceph clusters have the concept of pools, where each pool has a certain
>>> number of placement groups. Placement groups are just collections of
>>> mappings to OSDs. Each PG has a primary OSD and a number of secondary
>>> ones, based on the replication level you set when you make the pool.
>>> When an object gets written to the cluster, CRUSH will determine which
>>> PG the data should be sent to. The data will first hit the primary OSD
>>> and then be replicated out to the other OSDs in the same placement group.
>>>
>>> Currently reads always come from the primary OSD in the placement group
>>> rather than a secondary, even if the secondary is closer to the client.
>>> I'm guessing there are probably some tricks that could be played here to
>>> best determine which machines should service which clients, but it's not
>>> exactly an easy problem. In many cases spreading reads out over all of
>>> the OSDs in the cluster is better than trying to optimize reads to only
>>> hit local OSDs. Ideally you probably want to prefer local OSDs first,
>>> but not exclusively.
>>
>> In addition to just determining the locality (which we've started on via
>> external interfaces), this has a number of consistency challenges
>> associated with it. The infrastructure we have to allow reading from
>> non-primaries tends to involve clients having different consistency
>> expectations, and it's not fully explored yet or set up so that clients
>> can choose to read from a specific non-primary -- the options currently
>> are "local if available and we can tell", "random", and "primary".
>>
>>>> This is not an efficient use of network resources in a distributed data
>>>> center, or even in a multi-rack situation.
>>>>
>>>> I want to prefer accessing nodes which are "local". The client in rack A
>>>> should prefer to read from the storage nodes that are also in rack A.
>>>> Ditto for rack B. Ditto for s/rack/data center/.
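
As an aside, the three read behaviours Greg lists above ("primary", "random",
and "local if available and we can tell") map onto per-operation flags in
librados. A minimal sketch using the C read-op API follows; the function and
flag names are taken from librados.h as best I recall, so treat the exact
spellings as an assumption to verify against the version you run:

    /* Sketch only: asking librados for a locality-preferring read.
     * Flag and function names should be checked against librados.h. */
    #include <rados/librados.h>

    int read_preferring_local(rados_ioctx_t io, const char *oid,
                              char *buf, size_t len)
    {
            size_t bytes_read = 0;
            int prval = 0;
            int ret;

            rados_read_op_t op = rados_create_read_op();
            rados_read_op_read(op, 0, len, buf, &bytes_read, &prval);

            /* flags = 0                           -> read from the PG's primary (default)
             * LIBRADOS_OPERATION_BALANCE_READS    -> read from a random replica
             * LIBRADOS_OPERATION_LOCALIZE_READS   -> prefer a replica the client
             *                                        believes is close to it   */
            ret = rados_read_op_operate(op, io, oid,
                                        LIBRADOS_OPERATION_LOCALIZE_READS);
            rados_release_read_op(op);
            return ret < 0 ? ret : prval;
    }

These flags carry the weaker consistency expectations Greg describes, since a
replica may not yet have applied the most recent write.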
>>
>> I do want to ask if you're sure this is as useful as you think it is.
>> There are use cases where it would be, but since writes have to traverse
>> these links (at a multiple of the actual write count) as well, you should
>> be very certain. :)
>>
>>>> As far as I understand, the Ceph clients can't do that. (Nor can Ceph
>>>> nodes among each other, but I care less about that, as most traffic is
>>>> reading data.)
>>>>
>>>> I think this is an important feature for many high-reliability
>>>> situations.
>>>>
>>>> What would be the next steps to get this feature, assuming I don't have
>>>> time to implement it myself? Persistently annoy this mailing list that
>>>> people need it? Offer to pay for implementing it? Shut up and look for
>>>> some other solution -- which I already did, but I didn't find any that's
>>>> otherwise as good as Ceph?
>>>
>>> I don't really have that much insight into the product roadmap, but I
>>> assume that if you spoke to some of our business folks about paying for
>>> development work you'd at least get a response.
>>
>> Yeah. It's not a feature in large enough demand right now that we can see
>> it being worth bumping up over other things, but I don't think anybody is
>> opposed to it existing. As with Mark, I have no idea whether you're best
>> off asking us or others to do things for money, but it would certainly
>> have to go through business channels. (If somebody outside Inktank did
>> want to implement this feature, I'd love to talk to them about it on an
>> informal but ongoing basis during development.)
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
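
To make the pool-per-rack workaround Xiaoxi describes at the top of this
thread concrete, here is a rough, untested sketch of the kind of CRUSH rule
each generated pool could point at. The bucket name (rackA), pool name
(rbd-rackA), and ruleset number are placeholders, and note that the second
take/emit pass does not exclude rack A, so a replica can land back there:

    rule rackA-primary {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            # primary copy: one host somewhere under the rackA bucket
            step take rackA
            step chooseleaf firstn 1 type host
            step emit
            # remaining replicas: anywhere in the tree (possibly rackA again)
            step take default
            step chooseleaf firstn -1 type host
            step emit
    }

Each per-rack pool would then be created and pointed at its ruleset, along
these lines:

    ceph osd pool create rbd-rackA 256 256
    ceph osd pool set rbd-rackA crush_ruleset 1

And Greg's clone-then-copy-up suggestion for moving a volume between such
pools maps onto a short rbd sequence. Names are again invented, the image is
assumed to be format 2 (layering requires it), and flatten is assumed to be
the command he has in mind for forcing the full copy:

    # snapshot the image in the pool that favours the old rack, and protect it
    rbd snap create rbd-rackA/vol1@migrate
    rbd snap protect rbd-rackA/vol1@migrate

    # clone into the pool whose rule favours the new rack, and point the VM
    # at the clone; untouched data is still read from the parent, new writes
    # land in the new pool (copy-up happens lazily)
    rbd clone rbd-rackA/vol1@migrate rbd-rackB/vol1

    # later, when the network is quiet, copy everything over and detach
    rbd flatten rbd-rackB/vol1

Once the clone is flattened, the parent snapshot can be unprotected and
removed.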

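Finally, the pool/PG/primary mapping Mark describes can be inspected per
object with "ceph osd map". The output below is invented, but the shape is
what you should see; the first OSD in the acting set is the primary that
serves reads and coordinates replication for that PG:

    $ ceph osd map rbd-rackA someobject
    osdmap e124 pool 'rbd-rackA' (3) object 'someobject' -> pg 3.7fc1f406 (3.6) -> up [4,9,15] acting [4,9,15]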