Thanks Greg. Technically speaking, it would still be workable even if someone wanted to make such a policy "per node".
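For reference, here is a rough, untested sketch of the per-rack pool + ruleset approach from the quoted mail below (the same loop could just as easily iterate over hosts to make the policy per node). It only prints the CRUSH rules and the pool commands; the rules still have to be pasted into the decompiled CRUSH map and recompiled, and the rack names, ruleset ids and PG counts are just placeholders. Note also that because the second step draws from the whole tree, a replica can still land back in the primary's rack.

    #!/usr/bin/env python
    # Rough sketch only: one pool + one CRUSH rule per rack, with the
    # primary taken from the local rack and the remaining replicas from
    # the whole cluster. Rack names, ruleset ids and PG counts are
    # placeholders; the printed rules must be merged into the decompiled
    # CRUSH map by hand (ceph osd getcrushmap / crushtool -d / edit /
    # crushtool -c / ceph osd setcrushmap).
    RACKS = ["rackA", "rackB", "rackC"]   # existing rack buckets under "default"
    PG_NUM = 128                          # placeholder PG count per pool

    RULE_TEMPLATE = """\
    rule {rack}-primary {{
        ruleset {ruleset}
        type replicated
        min_size 1
        max_size 10
        step take {rack}
        step chooseleaf firstn 1 type host
        step emit
        step take default
        step chooseleaf firstn -1 type host
        step emit
    }}
    """

    # Emit one rule per rack: first replica (the primary) from the local
    # rack, the rest chosen from the whole cluster.
    for i, rack in enumerate(RACKS):
        print(RULE_TEMPLATE.format(rack=rack, ruleset=10 + i))

    # After the edited CRUSH map has been recompiled and set, create the
    # matching per-rack pools and point each at its ruleset:
    print("# pool creation commands:")
    for i, rack in enumerate(RACKS):
        pool = "rbd-{0}".format(rack)
        print("ceph osd pool create {0} {1}".format(pool, PG_NUM))
        print("ceph osd pool set {0} crush_ruleset {1}".format(pool, 10 + i))
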
On 2013-4-16, 0:42, "Gregory Farnum" <[email protected]> wrote:

> Yeah, this is very much like what DreamHost is doing with their
> DreamCompute installation (you can find some talks about it online, I
> believe, though I'm not sure how much detail they include there versus
> in the Q&As).
>
> On Fri, Apr 12, 2013 at 7:32 PM, Chen, Xiaoxi <[email protected]> wrote:
>> We are also discussing this internally, and came up with an idea to
>> work around it (only for the RBD case; we haven't thought about the
>> object store yet), but it is not yet tested. If Mark and Greg can
>> provide some feedback, that would be great.
>>
>> We are trying to write a script to generate some pools: for rack A
>> there is a pool A, whose CRUSH ruleset chooses an OSD in rack A as the
>> primary. So if we have 10 racks, we will have 10 pools and 10 rules.
>>
>> When a VM is migrated to another rack, or a volume is detached and
>> attached to another VM hosted in another rack, a data migration is
>> needed. We are thinking about how to make such a migration smooth.
>
> This is one of the use cases that layering is designed to handle (in
> addition to standard cloning and snapshots). Just create a clone that
> lives in the new pool, and either let it copy up to the new position
> lazily or run the command at a time when you know your network is less
> busy.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>>
>> Sent from my iPhone
>>
>> On 2013-4-13, 0:20, "Gregory Farnum" <[email protected]> wrote:
>>
>>> I was in the middle of writing a response to this when Mark's email
>>> came in, so I'll just add a few things:
>>>
>>> On Fri, Apr 12, 2013 at 9:08 AM, Mark Nelson <[email protected]>
>>> wrote:
>>>> On 04/11/2013 10:59 PM, Matthias Urlichs wrote:
>>>>>
>>>>> As I understand it, in Ceph one can cluster storage nodes, but
>>>>> otherwise every node is essentially identical, so if three storage
>>>>> nodes have a file, Ceph randomly uses one of them.
>>>>
>>>> Ceph clusters have the concept of pools, where each pool has a
>>>> certain number of placement groups. Placement groups are just
>>>> collections of mappings to OSDs. Each PG has a primary OSD and a
>>>> number of secondary ones, based on the replication level you set
>>>> when you make the pool. When an object gets written to the cluster,
>>>> CRUSH will determine which PG the data should be sent to. The data
>>>> will first hit the primary OSD and then be replicated out to the
>>>> other OSDs in the same placement group.
>>>>
>>>> Currently reads always come from the primary OSD in the placement
>>>> group rather than a secondary, even if the secondary is closer to
>>>> the client. I'm guessing there are probably some tricks that could
>>>> be played here to best determine which machines should service
>>>> which clients, but it's not exactly an easy problem. In many cases
>>>> spreading reads out over all of the OSDs in the cluster is better
>>>> than trying to optimize reads to only hit local OSDs. Ideally you
>>>> probably want to prefer local OSDs first, but not exclusively.
>>>
>>> In addition to just determining the locality (which we've started on
>>> via external interfaces), this has a number of consistency challenges
>>> associated with it. The infrastructure we have to allow reading from
>>> non-primaries tends to involve clients having different consistency
>>> expectations, and it's not fully explored yet or set up so that
>>> clients can choose to read from a specific non-primary -- the options
>>> currently are "local if available and we can tell", "random", and
>>> "primary".
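[Interjecting on the layering suggestion above: with the rbd Python bindings the flow could look roughly like the sketch below. Snapshot and protect the volume in the source pool, clone it into the destination rack's pool, and flatten later when the network is quiet so the copy-up happens off-peak. The pool, image and snapshot names are only placeholders, and I have not tested this.]

    #!/usr/bin/env python
    # Rough, untested sketch of the clone-into-new-pool flow described
    # above. Pool, image and snapshot names are placeholders.
    import rados
    import rbd

    SRC_POOL = "rbd-rackA"      # pool whose ruleset favours the old rack
    DST_POOL = "rbd-rackB"      # pool whose ruleset favours the new rack
    IMAGE = "vol-1234"
    SNAP = "migrate"

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        src = cluster.open_ioctx(SRC_POOL)
        dst = cluster.open_ioctx(DST_POOL)
        try:
            # Layering requires cloning from a protected snapshot.
            img = rbd.Image(src, IMAGE)
            try:
                img.create_snap(SNAP)
                img.protect_snap(SNAP)
            finally:
                img.close()

            # Clone into the destination pool; until the data has been
            # copied up, reads of untouched blocks still fall back to the
            # parent image in SRC_POOL.
            rbd.RBD().clone(src, IMAGE, SNAP, dst, IMAGE,
                            features=rbd.RBD_FEATURE_LAYERING)

            # Later, during an off-peak window: flatten the clone so it
            # no longer depends on the parent image at all.
            clone = rbd.Image(dst, IMAGE)
            try:
                clone.flatten()
            finally:
                clone.close()
        finally:
            src.close()
            dst.close()
    finally:
        cluster.shutdown()
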
>>>
>>>>> This is not efficient use of network resources in a distributed
>>>>> data center. Or even in a multi-rack situation.
>>>>>
>>>>> I want to prefer accessing nodes which are "local".
>>>>> The client in rack A should prefer to read from the storage nodes
>>>>> that are also in rack A.
>>>>> Ditto for rack B.
>>>>> Ditto for s/rack/data center/.
>>>
>>> I do want to ask if you're sure this is as useful as you think it is.
>>> There are use cases where it would be, but since writes have to
>>> traverse these links (at a multiple of the actual write count) as
>>> well, you should be very certain. :)
>>>
>>>>> As far as I understand, the Ceph clients can't do that.
>>>>> (Nor can Ceph nodes among each other, but I care less about that,
>>>>> as most traffic is reading data.)
>>>>>
>>>>> I think this is an important feature for many high-reliability
>>>>> situations.
>>>>>
>>>>> What would be the next steps to get this feature, assuming I don't
>>>>> have time to implement it myself? Persistently annoy this mailing
>>>>> list that people need it? Offer to pay for implementing it? Shut up
>>>>> and look for some other solution -- which I already did, but I
>>>>> didn't find any that's as good as Ceph, otherwise?
>>>>
>>>> I don't really have that much insight into the product roadmap, but
>>>> I assume that if you spoke to some of our business folks about
>>>> paying for development work you'd at least get a response.
>>>
>>> Yeah. It's not a feature in large enough demand right now that we can
>>> see to be worth bumping up over other things, but I don't think
>>> anybody's opposed to it existing. As with Mark, I have no idea if
>>> you're best off asking us or others to do things for money, but it
>>> would certainly have to go through business channels. (If somebody
>>> outside Inktank did want to implement this feature, I'd love to talk
>>> to them about it on an informal but ongoing basis during development.)
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
