For the record,

Sam suggested today that the chunks of a stripe ( which is a whole object if we 
limit ourselves to full writes ) be written without deleting the chunks from 
the previous version of the object. For instance :

object A1 contains "ABCDEFGHI" => version 1 of the object is written as chunks 
"ABC" "DEF" "GHI" and "XYZ" parity on OSD1, OSD2, OSD3, OSD4 respectively.
object A1 is updated to "ABCDEF123" => version 2 of the object is written as 
chunks "ABC" "DEF" "123" and "KLM" parity on OSD1, OSD2, OSD3, OSD4 
respectively.

At some point OSD3 contains both "GHI" ( chunk 3 object A1 version 1 ) and 
"123" ( chunk 3 object A1 version 2 ).
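
The example above can be sketched as follows. This is only an illustration, 
not Ceph code : the four OSDs are modelled as plain dictionaries, the helper 
names ( split_stripe, write_version ) are made up, and the parity chunks are 
the placeholder strings "XYZ" and "KLM" from the example rather than the 
output of a real erasure code.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: each OSD is a dict keyed by (object name, version).
ChunkKey = Tuple[str, int]
Osd = Dict[ChunkKey, str]


def split_stripe(data: str, k: int) -> List[str]:
    """Split data into k equal-sized data chunks (full writes only)."""
    if len(data) % k != 0:
        raise ValueError("data length must be a multiple of k")
    size = len(data) // k
    return [data[i * size:(i + 1) * size] for i in range(k)]


def write_version(osds: List[Osd], name: str, version: int,
                  data: str, parity: str) -> None:
    """Write one chunk per OSD without touching chunks of other versions."""
    # The last OSD holds the parity chunk, the others hold data chunks.
    chunks = split_stripe(data, len(osds) - 1) + [parity]
    for osd, chunk in zip(osds, chunks):
        # The version is part of the key, so version 1 is not overwritten.
        osd[(name, version)] = chunk


osds: List[Osd] = [dict() for _ in range(4)]  # OSD1 .. OSD4
write_version(osds, "A1", 1, "ABCDEFGHI", "XYZ")  # placeholder parity
write_version(osds, "A1", 2, "ABCDEF123", "KLM")  # placeholder parity

# OSD3 now holds chunk 3 of both versions.
print(sorted(osds[2].items()))
```

Because the version is part of the key, the second write adds entries instead 
of replacing the ones from version 1.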

When the PG receives an update of last_complete ( which should happen when the 
PG becomes active ) it knows that all objects with a version lower than 
last_complete can be discarded. It can then trim the chunks stored on the OSD 
that have a version older than last_complete. With ReplicatedPG this does not 
need to be done because the new version of the object overwrites the previous 
one. The trimming could be done together with pg_log trimming but that would 
waste more disk space : the log size is 3000 entries by default, meaning a 
chunk would only be deleted from disk after 3000 pg_log_entry were added to 
pg_log.
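
A minimal sketch of the trimming step, under stated assumptions : versions 
are simplified to one integer per object, the OSD is a dictionary keyed by 
( object name, version ), and trim_superseded is a made-up name. The sketch 
also assumes that the newest version at or below last_complete must be kept 
for each object, so that an object which was not updated recently does not 
lose its only copy ; that safeguard is an addition of this sketch.

```python
from typing import Dict, List, Tuple

ChunkKey = Tuple[str, int]  # (object name, version), hypothetical layout


def trim_superseded(store: Dict[ChunkKey, str],
                    last_complete: int) -> List[ChunkKey]:
    """Delete chunks superseded by a version known to be complete."""
    # For each object, find the newest version at or below last_complete.
    newest_complete: Dict[str, int] = {}
    for name, version in store:
        if version <= last_complete:
            newest_complete[name] = max(
                version, newest_complete.get(name, version))
    # A chunk is stale if a newer complete version of its object exists.
    stale = sorted(
        (name, version) for name, version in store
        if version < newest_complete.get(name, version)
    )
    for key in stale:
        del store[key]
    return stale


# OSD3 from the example, plus an unrelated object that was never updated.
osd3 = {("A1", 1): "GHI", ("A1", 2): "123", ("B7", 1): "QRS"}
removed = trim_superseded(osd3, last_complete=2)
print(removed)
print(osd3)
```

Calling it with last_complete=1 instead would remove nothing, because 
version 2 of A1 is not yet known to be complete and version 1 must stay 
available.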

The object name does not currently contain the version number and this would 
need to be changed to avoid name clashes.

Cheers

On 29/06/2013 18:56, Loic Dachary wrote:
> Hi Sage,
> 
> The level of understanding of ReplicatedPG/PG/OSD required to sketch the path 
> for implementing the erasure coding is beyond me at the moment. A few hours 
> of browsing demonstrated that a number of important areas are still unknown 
> to me. A meaningful example is probably the logic associated with 
> 
> struct AccessMode {
> 
> https://github.com/ceph/ceph/blob/962b64a83037ff79855c5261325de0cd1541f582/src/osd/ReplicatedPG.h#L114
> 
> I suspect there are a number of similarities with the erasure code that would 
> be relevant to ensure that a stripe is fully written to disk ( i.e. in 
> relation with the "ondisk" acknowledgment probably ) before removing the 
> previous version of the same stripe from all OSDs supporting it.
> 
> The time spent during this exploration was not wasted, I learnt a few things 
> that will be useful :-) But I think it would be more useful for me to work on 
> a more modest task to move in the direction of the erasure coding 
> implementation.
> 
> Cheers
> 
> On 06/25/2013 07:41 PM, Loic Dachary wrote:
>> Hi Sage,
>>
>> Paraphrasing what you suggested today : 
>>
>> The logic for writing a stripe ( i.e. all the chunks created by the erasure 
>> encoding function for a given object or part of a given object if it exceeds 
>> the maximum size of a stripe ) for a single object is going to be done in a 
>> way that is not the same as what we currently have for replicated objects. 
>> The object is consistent when all chunks ( or at least K if K+M ) are 
>> committed to disk. It may make sense to start writing all the chunks in 
>> parallel and when they are acknowledged, send a pg_log event that says : now 
>> switch to this new version of the object. This avoids ending up with some 
>> chunks belonging to one version of the object and other chunks belonging to 
>> another version, in which case we could not repair either of them. 
>>
>> I will try to sketch the path for implementing the erasure coding ( 
>> including the above ) by adding to 
>> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst
>>
>> Cheers
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
