>> The first, simplest implementation is likely to be fit to use with RGW and
>> probably too slow to use with RBD. Do you think we should try to optimize
>> for RBD right now ?
> 
> Yes, RGW is the obvious best candidate for the first implementation. We don't 
> need to implement for RBD and CephFS now, but we should consider how the 
> design would handle other applications in the future. The alternative is to 
> optimize purely for RGW and provide an API/plug-in capability suggested by 
> Harvey Skinner to make way for optimized solutions for other applications.
> 

I agree that the design should make room to plug in optimizations in the 
future. I've tried to figure out where the API/plug-in should fit:

a) pluggable placement group
b) pluggable erasure code library

The pluggable placement group capability is what I'm working on right now. It 
requires some re-architecting of the current code, and the API is starting to 
emerge. The implementation should eventually live in a separate shared library ( 
say ErasureCodePG ) loaded at run time and selected with a configuration option 
when creating a pool. I suspect that experimenting with new optimization 
strategies will be done by hacking ErasureCodePG and creating new pools 
that use it. 

Let's say we find a way to optimize for RBD and implement it in an 
RBDErasureCodePG placement group. We could then configure the RBD pool to use 
this placement group backend while keeping the ErasureCodePG backend for RGW. 
Later on it may make sense to merge the two, or to make sure they share 
similar code for maintenance purposes. But that probably leaves all the room 
we need to experiment until a general solution is found.

The pluggable erasure code library API will be something like what is described 
in http://pad.ceph.com/p/Erasure_encoding_as_a_storage_backend

    context(k, m, reed-solomon|...) => context* c 
    encode(context* c, void* data) => void* chunks[k+m]
    decode(context* c, void* chunks[k+m], int* indices_of_erased_chunks)
        => void* data // erased chunks are not used
    repair(context* c, void* chunks[k+m], int* indices_of_erased_chunks)
        => void* chunks[k+m] // erased chunks are rebuilt

It won't be enough for hierarchical codes but they don't seem to be considered 
attractive at the moment. It should be enough for LRC ( 
http://anrg.usc.edu/~maheswaran/Xorbas.pdf ) since it only requires an 
additional argument to the context ( the number of chunks required to do a 
local repair ).

The need for another API ( in addition to pluggable placement groups and a 
pluggable erasure code library ) may appear in the future, but I can't see it 
right now. I'm trying to refrain from over-engineering while making sure we 
don't need to re-architect because something obvious was overlooked. This 
discussion is helping a lot :-) 

What do you think ?

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.

