Hi Sam,

> - coll_t needs to include a chunk_id_t.
https://github.com/athanatos/ceph/blob/2234bdf7fc30738363160d598ae8b4d6f75e1dd1/doc/dev/osd_internals/erasure_coding.rst#distinguished-acting-set-positions

That would be for a sanity check? Since the rank of the chunk (chunk_id_t) 
matches the position in the acting set and a history of osdmaps is kept, would 
this be used when loading the PG from disk to make sure it matches the expected 
chunk_id_t?
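
To make sure I understand the check I'm asking about, here is a toy sketch of it (all names are mine, not Ceph's): on PG load, the chunk_id_t recorded in the collection would be compared against the rank this OSD holds in the acting set derived from the osdmap history.

```python
# Hypothetical sketch of the sanity check discussed above. Names are
# illustrative and do not correspond to actual Ceph code.

def expected_chunk_id(acting_set, osd_id):
    """Rank of osd_id in the acting set, i.e. the chunk it should hold."""
    return acting_set.index(osd_id)

def check_pg_on_load(coll_chunk_id, acting_set, osd_id):
    """True when the on-disk chunk_id_t matches the expected rank."""
    return coll_chunk_id == expected_chunk_id(acting_set, osd_id)

# osd.4 sits at rank 2 of the acting set, so its collection
# should carry chunk_id 2.
assert check_pg_on_load(2, [0, 1, 4], 4)
assert not check_pg_on_load(0, [0, 1, 4], 4)
```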

Cheers

On 02/08/2013 09:39, Loic Dachary wrote:
> Hi Sam,
> 
> I think I understand; I'm paraphrasing you to make sure I do. We may save 
> bandwidth because chunks are not moved as much if their position is not tied 
> to the position of the OSD containing them in the acting set. But this is 
> mitigated by the use of the indep crush mode, and it may require handling 
> tricky edge cases. In addition, you think that being able to know which OSD 
> contains which chunk by using only the OSDMap and the (v)hobject_t is going 
> to simplify the design.
> 
> For the record:
> 
> Back in April Sage suggested that
> 
> "- those PGs use the parity ('INDEP') crush mode so that placement is 
> intelligent"
> 
> http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/14579
> 
> "The indep placement avoids moving around a shard between ranks, because a 
> mapping of [0,1,2,3,4] will change to [0,6,2,3,4] (or something) if osd.1 
> fails and the shards on 2,3,4 won't need to be copied around."
> 
> http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/14582
> 
> and I assume that's what you refer to when you write "CRUSH has a mode which 
> will cause replacement to behave well for erasure codes:
> 
>  initial: [0,1,2]
>  0 fails: [3,1,2]
>  2 fails: [3,1,4]
>  0 recovers: [0,1,4]
> 
> I understand this is implemented here:
> 
> https://github.com/ceph/ceph/blob/e7d47827f0333c96ad43d257607fb92ed4176550/src/crush/mapper.c#L523
> 
> and will determine the order of the acting set 
> 
> https://github.com/ceph/ceph/blob/e7d47827f0333c96ad43d257607fb92ed4176550/src/osd/OSDMap.cc#L998
> 
> when called by the monitor when creating or updating a PG
> 
> https://github.com/ceph/ceph/blob/e7d47827f0333c96ad43d257607fb92ed4176550/src/mon/PGMonitor.cc#L814
> 
> Cheers
> 
> On 02/08/2013 03:34, Samuel Just wrote:
>> I think there are some tricky edge cases with the above approach.  You
>> might end up with two pg replicas in the same acting set which happen
>> for reasons of history to have the same chunk for one or more objects.
>>  That would have to be detected and repaired even though the object
>> would be missing from neither replica (and might not even be in the pg
>> log).  The erasure_code_rank would have to be somehow maintained
>> through recovery (do we remember the original holder of a particular
>> chunk in case it ever comes back?).
>>
>> The chunk rank doesn't *need* to match the acting set position, but
>> there are some good reasons to arrange for that to be the case:
>> 1) Otherwise, we need something else to assign the chunk ranks
>> 2) This way, a new primary can determine which osds hold which
>> replicas of which chunk rank by looking at past osd maps.
>>
>> It seems to me that given an OSDMap and an object, we should know
>> immediately where all chunks should be stored since a future primary
>> may need to do that without access to the objects themselves.
>>
>> Importantly, while it may be possible for an acting set transition
>> like [0,1,2]->[2,1,0] to occur in some pathological case, CRUSH has a
>> mode which will cause replacement to behave well for erasure codes:
>>
>> initial: [0,1,2]
>> 0 fails: [3,1,2]
>> 2 fails: [3,1,4]
>> 0 recovers: [0,1,4]
>>
>> We do, however, need to decouple primariness from position in the
>> acting set so that backfill can work well.
>> -Sam
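
The replacement sequence Sam gives above can be sketched as a toy indep-style remapping, in which only a failed slot is refilled and every surviving OSD keeps its rank (this is a deliberate simplification, not CRUSH's actual algorithm):

```python
# Toy sketch (not CRUSH itself) of indep-style replacement: when an OSD
# fails, only its slot is refilled; surviving OSDs keep their ranks, so
# chunks need not move between ranks.

def replace_failed(acting, failed_osd, spare):
    """Refill only the failed slot, preserving every other rank."""
    return [spare if osd == failed_osd else osd for osd in acting]

acting = [0, 1, 2]
acting = replace_failed(acting, 0, 3)   # 0 fails
assert acting == [3, 1, 2]
acting = replace_failed(acting, 2, 4)   # 2 fails
assert acting == [3, 1, 4]
acting = replace_failed(acting, 3, 0)   # 0 recovers
assert acting == [0, 1, 4]
```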
>>
>> On Thu, Aug 1, 2013 at 4:54 PM, Loic Dachary <[email protected]> wrote:
>>> Hi Sam,
>>>
>>> I'm under the impression that
>>> https://github.com/athanatos/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst#distinguished-acting-set-positions
>>> assumes acting[0] stores all chunk[0], acting[1] stores all chunk[1] etc.
>>>
>>> The chunk rank does not need to match the OSD position in the acting set. 
>>> As long as each object chunk is stored with its rank in an attribute, 
>>> changing the order of the acting set does not require moving the chunks 
>>> around.
>>>
>>> With M=2 data chunks, K=1 coding chunk, and an acting set of [0,1,2], 
>>> chunks M0,M1,K0 are written to [0,1,2] respectively, each of them with its 
>>> 'erasure_code_rank' attribute set to its rank.
>>>
>>> If the acting set changes to [2,1,0], a read would reorder the chunks based 
>>> on their 'erasure_code_rank' attribute instead of the rank of the OSD they 
>>> originate from in the current acting set, and would then be able to decode 
>>> them with the erasure code library, which requires that the chunks be 
>>> provided in a specific order.
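
The reordering I describe above might look like this (a minimal sketch with hypothetical names; the chunks are sorted by their stored rank, not by the acting-set position they were read from):

```python
# Sketch (hypothetical) of reordering chunks by their stored
# 'erasure_code_rank' attribute before handing them to the erasure code
# library, independent of the acting-set order they were read in.

def order_chunks(chunks):
    """chunks: list of (erasure_code_rank, data) as read from the OSDs."""
    return [data for rank, data in sorted(chunks)]

# Read back in acting-set order [2,1,0]: K0 arrives first, then M1, then M0,
# but the stored ranks restore the order the decoder expects.
read_back = [(2, b"K0"), (1, b"M1"), (0, b"M0")]
assert order_chunks(read_back) == [b"M0", b"M1", b"K0"]
```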
>>>
>>> When doing a full write, the chunks are written in the same order as the 
>>> acting set. This implies that the order of the chunks of the previous 
>>> version of the object may be different but I don't see a problem with that.
>>>
>>> When doing an append, the primary must first retrieve the order in which 
>>> the chunks are stored by reading their 'erasure_code_rank' attribute, 
>>> because the order of the acting set is not the same as the order of the 
>>> chunks. It then maps the chunks to the OSDs matching their rank and pushes 
>>> them to those OSDs.
>>>
>>> The only downside is that it may make it more complicated to implement 
>>> optimizations based on the fact that, sometimes, chunks can simply be 
>>> concatenated to recover the content of the object and don't need to be 
>>> decoded (when using systematic codes and the M data chunks are available).
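
The optimization I mention can be sketched as follows (hypothetical names; `decode` stands in for the erasure code library): with a systematic code, the first M chunks are the data itself, so when all of them are present the object is just their concatenation.

```python
# Sketch of the systematic-code fast path described above: if all M data
# chunks are present, concatenate them; otherwise fall back to decoding.
# Names are illustrative, not actual Ceph interfaces.

def reassemble(chunks_by_rank, m, decode):
    """chunks_by_rank: dict rank -> bytes; m: number of data chunks."""
    if all(r in chunks_by_rank for r in range(m)):
        # Fast path: no decode needed with a systematic code.
        return b"".join(chunks_by_rank[r] for r in range(m))
    return decode(chunks_by_rank)  # fall back to the erasure code library

assert reassemble({0: b"ab", 1: b"cd"}, 2, decode=None) == b"abcd"
```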
>>>
>>> Cheers
>>>
>>> On 01/08/2013 19:14, Loic Dachary wrote:
>>>>
>>>>
>>>> On 01/08/2013 18:42, Loic Dachary wrote:
>>>>> Hi Sam,
>>>>>
>>>>> When the acting set changes order, two chunks for the same object may 
>>>>> co-exist in the same placement group. The key should therefore also 
>>>>> contain the chunk number.
>>>>>
>>>>> That's probably the most sensible comment I have so far. This document is 
>>>>> immensely useful (even in its current state) because it shows me your 
>>>>> perspective on the implementation.
>>>>>
>>>>> I'm puzzled by:
>>>>
>>>> I get it (thanks to yanzheng). The object is deleted, then created again ... 
>>>> spurious unversioned chunks would get in the way.
>>>>
>>>> :-)
>>>>
>>>>>
>>>>> CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires 
>>>>> that we retain the deleted object until all replicas have persisted the 
>>>>> deletion event. ErasureCoded backend will therefore need to store objects 
>>>>> with the version at which they were created included in the key provided 
>>>>> to the filestore. Old versions of an object can be pruned when all 
>>>>> replicas have committed up to the log event deleting the object.
>>>>>
>>>>> because I don't understand why the version would be necessary. I thought 
>>>>> that deleting an erasure coded object could be even easier than deleting a 
>>>>> replicated object because it cannot be resurrected if enough chunks are 
>>>>> lost, therefore you don't need to wait for an ack from all OSDs in the up 
>>>>> set. I'm obviously missing something.
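
For what it's worth, a toy sketch of the version-qualified keys Sam describes (names hypothetical): including the creation version in the key means chunks of a deleted-then-recreated object can never collide, and old versions can be pruned once every replica has persisted the delete.

```python
# Toy sketch of version-qualified chunk keys: the creation version is part
# of the key, so chunks from before and after a delete/re-create are
# distinct, and old versions survive until pruning. Names are illustrative.

def chunk_key(object_name, created_version, chunk_rank):
    return f"{object_name}/{created_version}/{chunk_rank}"

store = {chunk_key("obj", 5, 0): b"old"}
# Object deleted at v6, re-created at v7: new chunks get distinct keys.
store[chunk_key("obj", 7, 0)] = b"new"
assert store[chunk_key("obj", 5, 0)] == b"old"  # kept for rollback
# Prune once all replicas have committed past the delete event:
del store[chunk_key("obj", 5, 0)]
assert chunk_key("obj", 5, 0) not in store
assert store[chunk_key("obj", 7, 0)] == b"new"
```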
>>>>>
>>>>> I failed to understand how important the pg logs were to maintaining the 
>>>>> consistency of the PG. For some reason I thought about them only as a 
>>>>> lightweight version of the operation logs. Adding a payload to the 
>>>>> pg_log_entry ( i.e. APPEND size or attribute ) is a new idea for me, and 
>>>>> I would never have thought the logs could be extended in such a way. 
>>>>> Given the recent problems with log writes having a high impact on 
>>>>> performance ( I'm referring to what forced you to introduce code that 
>>>>> writes only the log entries that have changed instead of the complete 
>>>>> logs ), I thought of the pg logs as something immutable.
>>>>>
>>>>> I'm still trying to figure out how PGBackend::perform_write / read / 
>>>>> try_rollback would fit in the current backfilling / write / read / 
>>>>> scrubbing ... code path.
>>>>>
>>>>> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/doc/dev/osd_internals/erasure_coding.rst
>>>>> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/src/osd/PGBackend.h
>>>>>
>>>>> Cheers
>>>>>
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>> All that is necessary for the triumph of evil is that good people do 
>>> nothing.
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to [email protected]
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
