September 6 2014 4:01 PM, "Christian Balzer" <ch...@gol.com> wrote: 
> On Sat, 6 Sep 2014 13:07:27 +0000 Dan van der Ster wrote:
> 
>> Hi Christian,
>> 
>> Let's keep debating until a dev corrects us ;)
> 
> For the time being, I give the recent:
> 
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12203.html
> 
> And not so recent:
> http://www.spinics.net/lists/ceph-users/msg04152.html
> http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10021
> 
> And I'm not going to use BTRFS for mainly RBD backed VM images
> (fragmentation city), never mind the other stability issues that crop up
> here ever so often.


Thanks for the links... So until I learn otherwise, I'd better assume the OSD
is lost when the journal fails, even though I haven't understood exactly why :(
I'm going to UTSL to understand the consistency model better. An op state
diagram would help, but I haven't found one yet.

BTW, do you happen to know: _if_ we re-use an OSD after its journal has failed,
would any object inconsistencies be found by a scrub/deep-scrub?
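
(If I do end up re-using one, my plan would be something along these lines --
an untested sketch, and note that "ceph osd deep-scrub" only covers the PGs
that OSD is primary for, so it's not a complete check; osd id 42 is made up:)

    #!/usr/bin/env python
    # Rough sketch: deep-scrub a re-used OSD, then look for inconsistent PGs.
    # Assumes the ceph CLI and an admin keyring are available on this host.
    import subprocess

    osd_id = 42
    # Ask the OSD to deep-scrub the PGs it is primary for.
    subprocess.check_call(['ceph', 'osd', 'deep-scrub', str(osd_id)])

    # ...wait for the scrubs to finish, then look for inconsistencies:
    health = subprocess.check_output(['ceph', 'health', 'detail']).decode()
    for line in health.splitlines():
        if 'inconsistent' in line:
            print(line)   # candidates for 'ceph pg repair <pgid>'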

>> 
>> We have 4 servers in a 3U rack, then each of those servers is connected
>> to one of these enclosures with a single SAS cable.
>> 
>>>> With the current config, when I dd to all drives in parallel I can
>>>> write at 24*74MB/s = 1776MB/s.
>>> 
>>> That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 2.0
>>> lanes, so as far as that bus goes, it can do 4GB/s.
>>> And given your storage pod I assume it is connected with 2 mini-SAS
>>> cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s SATA
>>> bandwidth.
>> 
>> From above, we are only using 4 lanes -- so around 2GB/s is expected.
> 
> Alright, that explains that then. Any reason for not using both ports?
> 

Probably to minimize costs, and because the single 10Gig-E link is the
bottleneck anyway. The whole setup is suboptimal, since this hardware was not
purchased for Ceph to begin with -- hence the retrofitted SSDs, etc.
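
For reference, the back-of-envelope I'm using for the single cable (assuming
6Gb/s lanes with 8b/10b encoding, before protocol overhead):

    # One mini-SAS cable: 4 lanes x 6 Gb/s, 8b/10b encoding.
    lanes = 4
    mb_per_lane = 6 * 1000 / 10      # ~600 MB/s usable per lane
    print(lanes * mb_per_lane)       # ~2400 MB/s, so ~2 GB/s in practice

    # versus what the 24 disks can stream in aggregate:
    print(24 * 74)                   # 1776 MB/s measured with parallel dd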

>>> Impressive, even given your huge cluster with 1128 OSDs.
>>> However that's not really answering my question: how much data is on an
>>> average OSD and thus gets backfilled in that hour?
>> 
>> That's true -- our drives have around 300GB on them. So I guess it will
>> take longer -- 3x longer -- when the drives hold 1TB.
> 
> On your slides, when the crazy user filled the cluster with 250 million
> objects and thus 1PB of data, I recall seeing a 7 hour backfill time?
> 

Yeah, that was fun :) It was 250 million (mostly) 4kB objects, so nowhere near
1PB. The point was that to fill the cluster with RBD we'd need 250 million
(4MB) objects. So object-count-wise this was a full cluster, but the real
volume was more like 70TB IIRC (there were some other, larger objects too).
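
In numbers (4MB being the default RBD object size, "full" by object count was
nowhere near full by volume):

    objects = 250e6
    print(objects * 4e6 / 1e15)   # 1.0 PB if they were all 4MB RBD objects
    print(objects * 4e3 / 1e12)   # ~1 TB if they are all 4kB test objects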

In that case the backfilling was CPU-bound, or perhaps wbthrottle-bound -- I
don't remember exactly. It was just that there were very many tiny objects to
synchronize.

> Anyway, I guess the lesson to take away from this is that size and
> parallelism does indeed help, but even in a cluster like yours recovering
> from a 2TB loss would likely be in the 10 hour range...

Bigger clusters probably backfill faster simply because more OSDs are involved
in the backfilling. In our cluster we initially get 30-40 backfills in parallel
after a single OSD fails, and that's even with max backfills = 1. The
backfilling sorta follows an 80/20 rule -- 80% of the time is spent backfilling
the last 20% of the PGs, just because some OSDs randomly end up with more new
PGs than others.
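
(For the archives, that's the "osd max backfills" knob -- it can go under
[osd] in ceph.conf, or be injected at runtime, something like:)

    # Limit each OSD to one concurrent backfill (runtime injection).
    import subprocess
    subprocess.check_call(
        ['ceph', 'tell', 'osd.*', 'injectargs', '--osd-max-backfills 1'])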

> Again, see the "Best practice K/M-parameters EC pool" thread. ^.^

Marked that one to read again.

Cheers, dan
