Thanks Greg.
Technically speaking, it would still be workable if someone even wanted to
make such a policy "per node".
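For instance, a purely hypothetical crushmap excerpt (the host bucket name
"node-a" is made up; it reuses the same firstn 1 / firstn -1 pattern as the
per-rack sketch quoted below):

rule node-a-primary {
        ruleset 11
        type replicated
        min_size 1
        max_size 10
        # primary OSD always comes from this one host
        step take node-a
        step choose firstn 1 type osd
        step emit
        # remaining replicas come from the rest of the cluster
        step take default
        step chooseleaf firstn -1 type host
        step emit
}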


On 2013-4-16, at 0:42, "Gregory Farnum" <[email protected]> wrote:

> Yeah, this is very much like DreamHost is doing with their
> DreamCompute installation (you can find some talks about it online, I
> believe, though I'm not sure how much detail they include there versus
> in the Q&As).
> 
> On Fri, Apr 12, 2013 at 7:32 PM, Chen, Xiaoxi <[email protected]> wrote:
>> We are also discussing this internally, and came up with an idea to work
>> around it (only for the RBD case; we haven't thought about the object
>> store yet), but it is not yet tested. If Mark and Greg can provide some
>> feedback, that would be great.
>> 
>> We are trying to write a script to generate some pools: for rack A there
>> is a pool A, whose CRUSH ruleset chooses an OSD in rack A as the primary.
>> So if we have 10 racks, we will have 10 pools and 10 rules.
>> 
>> When a VM is migrated to another rack, or a volume is detached and then
>> attached to a VM hosted in another rack, a data migration is needed. We
>> are thinking about how to smooth such a migration.
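>> 
>> A rough sketch of what such a ruleset and script might look like (purely
>> hypothetical: the bucket, pool, and rule names and the PG count are made
>> up, and the rule uses the standard two-step firstn 1 / firstn -1 pattern):
>> 
>> rule rack-a-primary {
>>         ruleset 10
>>         type replicated
>>         min_size 1
>>         max_size 10
>>         # primary comes from rack-a
>>         step take rack-a
>>         step chooseleaf firstn 1 type host
>>         step emit
>>         # remaining replicas come from the whole cluster
>>         step take default
>>         step chooseleaf firstn -1 type host
>>         step emit
>> }
>> 
>> # compile and inject the edited map, then create one pool per rack
>> crushtool -c crushmap.txt -o crushmap.bin
>> ceph osd setcrushmap -i crushmap.bin
>> ceph osd pool create rbd-rack-a 256
>> ceph osd pool set rbd-rack-a crush_ruleset 10
>> 
>> (One caveat: CRUSH does not de-duplicate across the two emit steps, so
>> the second take can occasionally put a replica on the same host as the
>> primary.)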
> 
> This is one of the use cases that layering is designed to handle (in
> addition to standard cloning and snapshots). Just create a clone that
> lives in the new pool, and either let it copy up to the new pool lazily
> or run the flatten command at a time when you know your network is less
> busy.
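> 
> A minimal sketch of that flow with the rbd CLI (pool and image names are
> made up; this assumes format 2 images, which layering requires):
> 
>   rbd snap create rbd-rack-a/vol1@migrate
>   rbd snap protect rbd-rack-a/vol1@migrate
>   rbd clone rbd-rack-a/vol1@migrate rbd-rack-b/vol1
>   # optional: force the full copy-up during a quiet period
>   rbd flatten rbd-rack-b/vol1
> 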
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 
>> 
>> Sent from my iPhone
>> 
>> On 2013-4-13, at 0:20, "Gregory Farnum" <[email protected]> wrote:
>> 
>>> I was in the middle of writing a response to this when Mark's email
>>> came in, so I'll just add a few things:
>>> 
>>> On Fri, Apr 12, 2013 at 9:08 AM, Mark Nelson <[email protected]> 
>>> wrote:
>>>> On 04/11/2013 10:59 PM, Matthias Urlichs wrote:
>>>>> 
>>>>> As I understand it, in Ceph one can cluster storage nodes, but otherwise
>>>>> every node is essentially identical, so if three storage nodes have a
>>>>> file, Ceph randomly uses one of them.
>>>> 
>>>> 
>>>> Ceph clusters have the concept of pools, where each pool has a certain
>>>> number of placement groups.  Placement groups are just collections of
>>>> mappings to OSDs.  Each PG has a primary OSD and a number of secondary 
>>>> ones,
>>>> based on the replication level you set when you make the pool. When an
>>>> object gets written to the cluster, CRUSH will determine which PG the data
>>>> should be sent to.  The data will first hit the primary OSD and then be
>>>> replicated out to the other OSDs in the same placement group.
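>>>> 
>>>> For example, you can ask the cluster where a given object will land
>>>> (the pool/object names and the output below are just illustrative):
>>>> 
>>>>   $ ceph osd map rbd myobject
>>>>   osdmap e42 pool 'rbd' (2) object 'myobject' -> pg 2.75bd7b35 (2.35)
>>>>     -> up [3,1,7] acting [3,1,7]
>>>> 
>>>> Here osd.3 is the primary, and osd.1 and osd.7 hold the replicas.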
>>>> 
>>>> Currently reads always come from the primary OSD in the placement group
>>>> rather than a secondary even if the secondary is closer to the client. I'm
>>>> guessing there are probably some tricks that could be played here to best
>>>> determine which machines should service which clients, but it's not exactly
>>>> an easy problem.  In many cases spreading reads out over all of the OSDs in
>>>> the cluster is better than trying to optimize reads to only hit local OSDs.
>>>> Ideally you probably want to prefer local OSDs first, but not exclusively.
>>> 
>>> In addition to just determining the locality (which we've started on
>>> via external interfaces), this has a number of consistency challenges
>>> associated with it. The infrastructure we have to allow reading from
>>> non-primaries tends to involve clients having different consistency
>>> expectations, and it's not fully explored yet or set up so that
>>> clients can choose to read from a specific non-primary; the options
>>> currently are "local if available and we can tell", "random", and
>>> "primary".
>>> 
>>> 
>>>>> This is not efficient use of network resources in a distributed data
>>>>> center.
>>>>> Or even in a multi-rack situation.
>>>>> 
>>>>> I want to prefer accessing nodes which are "local".
>>>>> The client in rack A should prefer to read from the storage nodes that are
>>>>> also in rack A.
>>>>> Ditto for rack B.
>>>>> Ditto for s/rack/data center/.
>>> 
>>> I do want to ask if you're sure this is as useful as you think it is.
>>> There are use cases where it would be, but since writes have to traverse
>>> these links as well (at a multiple of the actual write count: with 3x
>>> replication, a single write can cross them up to three times), you
>>> should be very certain. :)
>>> 
>>>>> As far as I understand, the Ceph clients can't do that.
>>>>> (Nor can Ceph nodes do so among each other, but I care less about that,
>>>>> as most traffic is reading data.)
>>>>> 
>>>>> I think this is an important feature for many high-reliability situations.
>>>>> 
>>>>> What would be the next steps to get this feature, assuming I don't have
>>>>> time to implement it myself? Persistently annoy this mailing list that
>>>>> people need it? Offer to pay for implementing it? Shut up and look for
>>>>> some other solution? (I already did that last one, but I didn't find
>>>>> anything that's otherwise as good as Ceph.)
>>>> 
>>>> 
>>>> I don't really have that much insight into the product roadmap, but I
>>>> assume that if you spoke to some of our business folks about paying for
>>>> development work you'd at least get a response.
>>> 
>>> Yeah. It's not a feature with enough demand right now for us to see it
>>> as worth bumping up over other things, but I don't think anybody's
>>> opposed to it existing. Like Mark, I have no idea whether you're best
>>> off asking us or others to do things for money, but it
>>> would certainly have to go through business channels. (If somebody
>>> outside Inktank did want to implement this feature, I'd love to talk
>>> to them about it on an informal but ongoing basis during development.)
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com