Hi Andreas,

On 03/07/2013 18:55, Andreas-Joachim Peters wrote:
> Dear Loic et al.,
> 
> I have/had some questions about the ideas behind the erasure encoding plans and OSD 
> scalability. 
> Please forgive me that I didn't study much of the source code or the details of 
> the current CEPH implementation (yet).
> 
> Some of my questions I found now already answered here,
> 
> ( 
> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst
>  )
> 
> but they also created some more ;-)
> 
> *ERASURE ENCODING*
> 
> 1.) I understand that you will cover only OSD outages with the implementation 
> and will delegate block corruption to be discovered by the file system 
> implementation (like BTRFS would do) Is that correct? 

Ceph also does scrubbing to detect block ( I assume you mean chunk ) 
corruption. The idea is to adapt the logic, which currently assumes replicas, 
so that it detects corruption ( for instance more than K missing chunks when an 
M+K code is used ).
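
To make the counting rule concrete, here is a rough Python sketch ( not Ceph 
code, just the arithmetic, assuming M data chunks plus K coding chunks ):

# Not Ceph code: just the counting rule, assuming M data chunks plus
# K coding chunks. The object can be rebuilt from any M surviving chunks.
def object_recoverable(m, k, bad_chunks):
    surviving = m + k - bad_chunks
    return surviving >= m   # equivalent to bad_chunks <= k

# Example: a 10+4 code tolerates 4 missing/corrupt chunks but not 5.
print(object_recoverable(10, 4, 4))  # True
print(object_recoverable(10, 4, 5))  # False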

> 2.) Blocks would be assembled always on the OSDs (?)

Yes. 

> 3.) I understood that the (3,2) RS sketched in the Blog is the easiest to 
> implement since it can be done with simple parity(XOR) operations but do you 
> intend to have a generic (M,K) implementation?

Yes. The idea is to use the jerasure library, which provides Reed-Solomon codes 
and can be configured in various ways.
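
For the simple XOR case mentioned in the blog, the principle fits in a few lines 
of Python ( purely illustrative: this is not the jerasure API and not the actual 
Ceph encoder, which generalizes the same idea to K coding chunks with 
Reed-Solomon ):

# Illustration only: M data chunks plus 1 XOR parity chunk ( K = 1 ).
def xor_chunks(chunks):
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)

def encode(data, m):
    """Split 'data' into m equal chunks ( len(data) must be a multiple
    of m ) and append one XOR parity chunk."""
    size = len(data) // m
    chunks = [data[i * size:(i + 1) * size] for i in range(m)]
    return chunks + [xor_chunks(chunks)]

def recover(chunks, lost):
    """Rebuild a single missing chunk by XOR-ing all the others."""
    return xor_chunks([c for i, c in enumerate(chunks) if i != lost])

chunks = encode(b"ABCDEFGHIJKL", m=3)    # 3 data chunks + 1 parity chunk
assert recover(chunks, lost=1) == chunks[1]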
 
> 4.) Would you split a 4M object into M x(4/M) objects? Would this not (even 
> more) degrade single disk performance to random IO performance when many 
> clients retrieve objects at random disk positions? Is 4M just a default or a 
> hard coded parameter of CEPHFS/S3 ?

It is just a default. I hope the updated document ( look for "Partials" ) at 
https://github.com/dachary/ceph/blob/5efcac8fa6e08119f0deaaf1ae9919080e90cf0a/doc/dev/osd_internals/erasure-code.rst
answers the rest of the question.
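
As a back of the envelope illustration of why partials matter, assuming the 
simplest possible layout where the object is cut into M contiguous data chunks 
( my simplification, not necessarily the exact stripe layout that will be 
implemented ):

# Naive layout assumed for illustration: a 4M object cut into M
# contiguous data chunks. A partial read only needs the chunks that
# cover the requested byte range.
OBJECT_SIZE = 4 * 1024 * 1024

def chunks_for_range(offset, length, m, object_size=OBJECT_SIZE):
    chunk_size = object_size // m
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    return list(range(first, last + 1))

# Reading 64K at offset 1M of a 4M object split into 8 chunks of 512K
# touches a single chunk instead of all 8.
print(chunks_for_range(offset=1024 * 1024, length=64 * 1024, m=8))  # [2]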

> 5.) Local Parity like in Xorbas makes sense for large M, but would a large M 
> not hit scalability limits given by a single OSD in terms of object 
> bookkeeping/scrubbing/synchronization, network packet limitations (at least in 
> 1GBit networks) etc ... 1 TB = 250k objects => M=10 => 2.5 million objects ( a 
> 100 TB disk server would have 250 million object fragments ?!?!) 

We are looking at M+K < 2^8 at the moment, which significantly reduces the 
problem you mention as well as the CPU consumption issues.
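
For reference, the arithmetic behind your estimate ( using decimal units so the 
round figures match ):

# Rough arithmetic behind the fragment counts above: 4M objects,
# M fragments per object, decimal units ( 1 TB = 10^12 bytes ).
TB = 10 ** 12
OBJECT_SIZE = 4 * 10 ** 6

def fragments(capacity_bytes, m):
    objects = capacity_bytes // OBJECT_SIZE   # 250k objects per TB
    return objects * m

print(fragments(1 * TB, m=10))    # 2,500,000 fragments on 1 TB
print(fragments(100 * TB, m=10))  # 250,000,000 fragments on 100 TB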

> 6.) Does a CEPH object know something like a parent object so it could 
> understand if it is still a 'connected' object (like part of a block 
> collection implementing a block, a file or container?)

At the level where erasure coding is implemented ( librados ) there is no 
notion of relationships between objects.

> *OSD SCALABILITY*

Please take my answers there with a grain of salt because there are many people 
with much more knowledge than I have :-)

> 1.) Are there some deployment numbers about the largest number of OSDs per 
> placement group and the number of objects you can handle well in a placement 
> group?

The acceptable range seems to be between ( number of OSDs ) * 100 and ( number 
of OSDs ) * 1000.
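
Written out as a quick helper ( just restating the rule of thumb, nothing 
official ):

# Restating the rule of thumb above, nothing official:
# between ( number of OSDs ) * 100 and ( number of OSDs ) * 1000.
def acceptable_range(num_osds):
    return num_osds * 100, num_osds * 1000

print(acceptable_range(48))   # (4800, 48000)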

> 2.) What is the largest number of OSDs people have ever tried out? Many 
> presentations say 10-10k nodes, but probably it should be more OSDs?

The largest deployment I'm aware of is Dream{Object,Compute} but I don't know 
the actual numbers.

> 3.) In our CC we operate disk servers with up to 100 TB (25 disks), next year 
> 200 TB (50 disks) and in the future even bigger. 
> If I remember right the recommendation is to have 2GB of memory per OSD. 
> Can the memory footprint be lowered or is it a 'feature' of the OSD 
> architecture?
> Is there in-memory information limiting scalability?

The OSD memory usage varies from a few hundred megabytes when running 
normal operations to about 2GB when recovering, which can be a problem if you 
have a large number of OSDs running on the same hardware. You can control this 
by grouping the disks together. For instance if your machine has 50 disks you 
could group them into 10 RAID0 arrays of 5 physical disks each and run 10 OSDs 
instead of 50. Of course it means that you'll lose 5 disks at once if one fails, 
but by putting 50 disks in a single machine you have already made a decision 
that leans in this direction.
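
A quick worst case comparison of the two layouts, using the figures above 
( roughly 2GB per OSD during recovery ):

# Worst case memory during recovery, using the ~2GB per OSD figure above.
RECOVERY_GB_PER_OSD = 2

def worst_case_memory_gb(num_osds):
    return num_osds * RECOVERY_GB_PER_OSD

print(worst_case_memory_gb(50))  # 100 GB with one OSD per disk
print(worst_case_memory_gb(10))  # 20 GB with 10 RAID0 groups of 5 disks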

> 4.) Today we run disk-only storage with 20k disks and 24 to 33 disks per node. 
> There is a weekly activity of repair & replacement and reboots.

I assume that's on the order of 1,000 machines, right? How many disks / machines 
do you need to replace on a weekly basis?

> A typical scenario is that after a reboot filesystem contents was not synced 
> and information is lost. Does CEPH OSD sync every block or if not use a 
> quorum on block contents when reading, or would it just return the block as is 
> and only scrubbing would mark a block as corrupted?

I don't think Ceph can ever return a corrupted object as if it were not. That 
would require either manual intervention by the operator to tamper with the 
file without notifying Ceph ( which would be the equivalent of shooting 
themselves in the foot ;-) or a bug in XFS ( or the underlying file system on 
which objects are stored ) that similarly corrupts the file. And all this would 
have to happen before deep scrubbing discovers the problem.

> 5.) When rebalancing is needed is there some time slice or scheduling 
> mechanism which regulates the block relocation with respect to the 'normal' 
> IO activity on the source and target OSD? Is there an overload protection in 
> particular on the block target OSD?

There is a reservation mechanism to avoid creating too many communication paths 
during recovery ( see 
http://ceph.com/docs/master/dev/osd_internals/backfill_reservation/ for 
instance ) and throttling to regulate the bandwidth usage ( not 100% sure how 
that works though ). In addition, when operating a large cluster, it is 
recommended to dedicate an interface to internal communications ( check 
http://ceph.com/docs/master/rados/configuration/network-config-ref/ for more 
information ).
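
To give a feel for the reservation idea, here is a toy Python sketch of the 
concept only ( not Ceph's actual implementation, which is described in the 
backfill_reservation document linked above ): a target hands out a bounded 
number of backfill slots so that recovery cannot overwhelm it.

# Toy sketch of the reservation concept only, not Ceph's implementation:
# a target OSD grants a bounded number of backfill slots.
import threading

class BackfillReservations:
    def __init__(self, max_backfills=1):
        self._slots = threading.BoundedSemaphore(max_backfills)

    def request(self):
        # The caller retries later if the slot is refused.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()

target = BackfillReservations(max_backfills=1)
print(target.request())  # True: the first backfill gets the slot
print(target.request())  # False: the second one has to wait and retry
target.release()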

Cheers

> 
> Thanks.
> 
> Andreas.
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
