Hi,

Here is an updated description of the "Erasure encoding as a storage backend"
proposed implementation that will be discussed during the ceph summit
( http://wiki.ceph.com/01Planning/Developer_Summit#Schedule ). The "strip" and
"stripe" terms are illustrated at
http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend#Proposed_model

I am well aware of the shortcomings of this proposal and it would be great to 
get feedback before the ceph summit to address the most prominent issues.

Cheers

http://pad.ceph.com/p/Erasure_encoding_as_a_storage_backend

        * PG and ReplicatedPG are reworked so that PG can be used as a base
          class for ErasureEncodedPG
                * Tests are written for ReplicatedPG to cover 100% of the lines
                  of code and most of the expected functionality.
                * Code is moved between PG and ReplicatedPG: code that is not
                  unique to replication moves from ReplicatedPG to PG, and code
                  that is not generic enough to be useful to ErasureEncodedPG
                  moves from PG to ReplicatedPG.
        * To isolate ceph from the actual library being used ( zfec, fecpp,
          ... ), a wrapper around the erasure encoding library is implemented.
          Each block is encoded into k data blocks and m parity blocks
                * encode(void* data, k, m) => void* data[k], void* parity[m]
                * decode(void* data[k], void* parity[m]) => void* data
                * repair(void* data[k], void* parity[m],
                  indices_of_damaged_blocks[]) => void* data
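
To make the three wrapper calls above concrete, here is a minimal sketch in Python. It is only an illustration under loud assumptions: it uses a single XOR parity block (m = 1) as a stand-in for a real erasure code, whereas libraries such as zfec or fecpp would provide a proper k + m code. A damaged block is represented by None, and repair() returns the list of data blocks rather than the flat buffer shown in the signature above. None of these conventions come from the proposal.

```python
from functools import reduce


def _xor(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def encode(data: bytes, k: int, m: int):
    """Split data into k data blocks plus m parity blocks (m == 1 only)."""
    if m != 1:
        raise NotImplementedError("XOR stand-in supports one parity block")
    if len(data) % k:
        raise ValueError("data length must be a multiple of k")
    size = len(data) // k
    blocks = [data[i * size:(i + 1) * size] for i in range(k)]
    return blocks, [_xor(blocks)]


def repair(data_blocks, parity_blocks, damaged):
    """Rebuild the damaged data block and return all k data blocks."""
    if len(damaged) > len(parity_blocks):
        raise ValueError("more damaged blocks than parity blocks")
    blocks = list(data_blocks)
    for i in damaged:
        # XOR of the surviving data blocks and the parity block
        # gives back the missing block.
        survivors = [b for j, b in enumerate(blocks) if j != i]
        blocks[i] = _xor(survivors + [parity_blocks[0]])
    return blocks


def decode(data_blocks, parity_blocks):
    """Concatenate the data blocks, repairing a block set to None first."""
    damaged = [i for i, b in enumerate(data_blocks) if b is None]
    return b"".join(repair(data_blocks, parity_blocks, damaged))
```

With k = 4, encode(b"abcdefgh", 4, 1) yields four 2-byte data blocks and one parity block, and decode() still returns the original bytes when any single data block is replaced by None.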
        * The ErasureEncodedPG configuration is set to encode each object into
          k data objects and m parity objects.
                * It uses the parity ('INDEP') crush mode so that placement is
                  intelligent. The indep placement avoids moving a shard
                  between ranks: a mapping of [0,1,2,3,4] will change to
                  [0,6,2,3,4] (or something similar) if osd.1 fails, so the
                  shards on 2,3,4 won't need to be copied around.
                * The ErasureEncodedPG uses k + m OSDs, numbered D0 ... Dk-1
                  and C0 ... Cm-1
                * Each object is a strip
                * Each stripe has a fixed size of B bytes
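
The effect of the indep placement described above can be shown with a toy comparison. This is not the CRUSH algorithm: both helper functions are hypothetical, and the "firstn" variant only assumes, for contrast, a mode in which the failed OSD is dropped and the remaining ones shift up. The OSD numbers are the ones from the example (osd.1 fails, osd.6 takes over).

```python
def remap_indep(mapping, failed, spare):
    """Replace the failed OSD in place; every other rank keeps its OSD."""
    return [spare if osd == failed else osd for osd in mapping]


def remap_firstn(mapping, failed, spare):
    """Drop the failed OSD and append the spare; later ranks shift left."""
    return [osd for osd in mapping if osd != failed] + [spare]


def moved_ranks(before, after):
    """Ranks whose OSD changed, i.e. shards that must be rebuilt or copied."""
    return [r for r, (a, b) in enumerate(zip(before, after)) if a != b]


before = [0, 1, 2, 3, 4]
indep = remap_indep(before, failed=1, spare=6)
firstn = remap_firstn(before, failed=1, spare=6)
print(indep, moved_ranks(before, indep))
print(firstn, moved_ranks(before, firstn))
```

In the in-place variant only rank 1 changes, which matters for an erasure coded PG because each rank holds a different shard (D0 ... Dk-1, C0 ... Cm-1) and a shard stored under the wrong rank is useless.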
        * ErasureEncodedPG implementation
                * Write offset, length
                        * read the stripes containing offset, length
                        * for each stripe, decode(void* data[k],
                          void* parity[m]) => void* data and append to a
                          bufferlist
                        * modify the bufferlist with the write request
                        * encode(void* data, k, m) => void* data[k],
                          void* parity[m]
                        * write data[0] to D0, data[1] to D1 ... data[k-1] to
                          Dk-1 and parity[0] to C0 ... parity[m-1] to Cm-1
                * Read offset, length
                        * read the stripes containing offset, length
                        * for each stripe, decode(void* data[k],
                          void* parity[m]) => void* data and append to a
                          bufferlist
                * Object attributes
                        * duplicate the object attributes on each OSD
                * Scrubbing
                        * for each object, read each stripe and write back if
                          a repair was necessary
                * Repair
                        * when an OSD is decommissioned and another OSD
                          replaces it, for each object contained in an
                          ErasureEncodedPG using this OSD, read the object,
                          repair each strip and write back the strip that
                          resides on the new OSD
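
The write and read paths listed above amount to a read-modify-write over whole stripes, which the following Python sketch illustrates. Everything here is an assumption made for the example: K and B are arbitrary values, a dict named store stands in for the k + m OSDs, a single XOR parity strip replaces the real erasure code, stripes that were never written read as zeros, and damaged strips are not handled.

```python
from functools import reduce

K, B = 4, 8  # hypothetical values: k = 4 data OSDs, stripes of B = 8 bytes


def encode(stripe):
    """Split one stripe into K data strips plus one XOR parity strip."""
    size = B // K
    data = [stripe[i * size:(i + 1) * size] for i in range(K)]
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*data))
    return data, [parity]


def decode(data, parity):
    """Reassemble a stripe; damaged strips are not handled in this sketch."""
    return b"".join(data)


def stripe_range(offset, length):
    """Indices of the stripes overlapping [offset, offset + length)."""
    return range(offset // B, (offset + length - 1) // B + 1)


def load(store, offset, length):
    """Decode every stripe touched by the request into one buffer."""
    buf = bytearray()
    for s in stripe_range(offset, length):
        buf += decode(*store[s]) if s in store else bytes(B)
    return buf


def read(store, offset, length):
    """Read path: decode the stripes, then cut out the requested range."""
    start = offset % B
    return bytes(load(store, offset, length)[start:start + length])


def write(store, offset, payload):
    """Write path: decode, modify the buffer, encode, write back."""
    buf = load(store, offset, len(payload))
    start = offset % B
    buf[start:start + len(payload)] = payload
    for n, s in enumerate(stripe_range(offset, len(payload))):
        store[s] = encode(bytes(buf[n * B:(n + 1) * B]))
```

For instance, writing 5 bytes at offset 6 touches stripes 0 and 1, so both are decoded, patched, encoded again and written back in full. The sketch makes visible that a small write costs at least one full stripe of reads and writes.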


-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
