Re: [ceph-users] SSD journal deployment experiences

Dan Van Der Ster Thu, 04 Sep 2014 11:13:15 -0700

I've just been reading the bcache docs. It's a pity the mirrored writes aren't 
implemented yet. Do you know if you can use an md RAID1 as a cache dev? And is 
the graceful failover from wb to writethrough actually working without data 
loss?


Also, write behind sure would help the filestore, since I'm pretty sure the 
same 4k blocks are being overwritten many times (from our RBD clients).

Cheers, Dan

On Sep 4, 2014 7:44 PM, Robert LeBlanc <[email protected]> wrote:
So far it has worked really well; we can raise/lower/disable/enable the cache 
in real time and watch how the load and traffic change. There have been some 
positive subjective results, but definitive results are still forthcoming. 
bcache on CentOS 7 was not easy, which makes me wish we were running Debian or 
Ubuntu. If there are enough reasons to train our admins on Debian/Ubuntu in 
addition to learning CentOS7 for customer facing boxes, we may move that way 
for Ceph and OpenStack, but I'm not sure how Red Hat purchasing Inktank will 
shift the development from Debian/Ubuntu, so we don't want to make any big 
changes until we have a better idea of what the future looks like. I think the 
Enterprise versions of Ceph (n-1 or n-2) will be a bit too old from where we 
want to be, which I'm sure will work wonderfully on Red Hat, but how will n.1, 
n.2 or n.3 run?
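
The runtime toggling described above can be sketched in Python. This is a minimal sketch, assuming the standard bcache sysfs attribute `cache_mode` under `/sys/block/<dev>/bcache/`; the device name `bcache0` and the helper names are illustrative and not taken from the setup described in this message.

```python
from pathlib import Path

# Cache modes accepted by bcache's cache_mode sysfs attribute.
VALID_MODES = ("writethrough", "writeback", "writearound", "none")


def cache_mode_path(device: str, sysfs_root: str = "/sys/block") -> Path:
    """Return the sysfs file that controls the cache mode of e.g. 'bcache0'."""
    return Path(sysfs_root) / device / "bcache" / "cache_mode"


def parse_active_mode(sysfs_text: str) -> str:
    """Extract the active mode; the kernel marks it with square brackets,
    e.g. 'writethrough [writeback] writearound none'."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode marked in {sysfs_text!r}")


def set_cache_mode(device: str, mode: str, sysfs_root: str = "/sys/block") -> None:
    """Write a new cache mode (requires root on a real system)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown cache mode: {mode}")
    cache_mode_path(device, sysfs_root).write_text(mode + "\n")
```

On a real host, `set_cache_mode("bcache0", "writethrough")` would be the rough equivalent of echoing the mode into the sysfs file by hand; the `sysfs_root` parameter exists only so the helpers can be exercised against a scratch directory.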

Robert LeBlanc


On Thu, Sep 4, 2014 at 11:22 AM, Dan Van Der Ster 
<[email protected]> wrote:

Hi Robert,

That's actually a pretty good idea, since bcache would also accelerate the 
filestore flushes and leveldb. I actually wonder if an SSD-only pool would even 
be faster than such a setup... probably not.

We're using an ancient enterprise n distro, so it will be a bit of a headache 
to get the right kernel, etc. But my colleague is planning to use bcache to 
accelerate our hypervisors' ephemeral storage, so I guess that's a solved 
problem.

Hmm...

Thanks!

Dan

On Sep 4, 2014 6:42 PM, Robert LeBlanc 
<[email protected]> wrote:
We are still pretty early on in our testing of how to best use SSDs as well. 
What we are trying right now, for some of the reasons you mentioned already, is 
to use bcache as a cache for both journal and data. We have 10 spindles in our 
boxes with 2 SSDs. We created two bcaches (one for each SSD) and put five 
spindles behind each, with the journals as just files on the spindles (because 
the journal is hot, it should stay in SSD). This should have the advantage that 
if the SSD fails, it could automatically fail over to write-through mode 
(although I don't think it will help if the SSD suddenly fails). However, it 
seems that if any part of the journal is lost, the OSD is toast and needs to be 
rebuilt. Bcache was appealing to us because one SSD could front multiple 
backend disks and make the most efficient use of the SSD. It also has 
write-around for large sequential writes, so the cache is not evicted by large 
sequential writes, which spindles are good at. Since we have a high read cache 
hit rate from KVM and other layers, this is primarily intended to help 
accelerate writes more than reads (we are also more write-heavy in our 
environment).
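
The layout described above (one cache set per SSD, spindles split evenly behind them) can be sketched as a small planning helper. This is only an illustration: the device names are hypothetical, and the `make-bcache -C <cache> -B <backing...>` form should be checked against the installed bcache-tools before use. The code builds the command lines and does not run them.

```python
from typing import Dict, List


def plan_bcache_sets(ssds: List[str], spindles: List[str]) -> Dict[str, List[str]]:
    """Split the spindles evenly across the SSDs, one cache set per SSD."""
    if not ssds:
        raise ValueError("need at least one SSD")
    if len(spindles) % len(ssds) != 0:
        raise ValueError("spindles do not divide evenly across SSDs")
    per_ssd = len(spindles) // len(ssds)
    return {
        ssd: spindles[i * per_ssd:(i + 1) * per_ssd]
        for i, ssd in enumerate(ssds)
    }


def make_bcache_argv(ssd: str, backing: List[str]) -> List[str]:
    """Build (not run) a make-bcache command: -C cache device, -B backing devices."""
    return ["make-bcache", "-C", ssd, "-B", *backing]


if __name__ == "__main__":
    # Hypothetical device names: 2 SSDs and 10 spindles, as in the layout above.
    ssds = ["/dev/sda", "/dev/sdb"]
    spindles = [f"/dev/sd{c}" for c in "cdefghijkl"]
    for ssd, backing in plan_bcache_sets(ssds, spindles).items():
        print(" ".join(make_bcache_argv(ssd, backing)))
```

Whole disks are passed rather than partitions, in line with the observation below that bcache devices do not seem to like partitions.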

So far it seems to help, but we are going to start more in-depth testing soon. 
One drawback is that bcache devices don't seem to like partitions, so we have 
created the OSDs manually instead of using ceph-deploy.

I too am interested in others' experience with SSDs and trying to 
cache/accelerate Ceph. I think the Caching pool in the long run will be the 
best option, but it can still use some performance tweaking with small reads 
before it will be really viable for us.

Robert LeBlanc


On Thu, Sep 4, 2014 at 10:21 AM, Dan Van Der Ster 
<[email protected]> wrote:
Dear Cephalopods,

In a few weeks we will receive a batch of 200GB Intel DC S3700’s to augment our 
cluster, and I’d like to hear your practical experience and discuss options for how 
best to deploy these.

We’ll be able to equip each of our 24-disk OSD servers with 4 SSDs, so they 
will become 20 OSDs + 4 SSDs per server. Until recently I’ve been planning to 
use the traditional deployment: 5 journal partitions per SSD. But as SSD-day 
approaches, I’m growing less comfortable with the idea of 5 OSDs going down every 
time an SSD fails, so perhaps there are better options out there.
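
The arithmetic behind that concern can be sketched in a few lines of Python. The function names are illustrative; the only inputs are the figures given above (20 OSDs and 4 journal SSDs per server).

```python
import math


def osds_down_per_ssd_failure(n_osds: int, n_journal_ssds: int) -> int:
    """Worst-case number of OSDs that lose their journal when one SSD dies,
    assuming journals are spread as evenly as possible and not mirrored."""
    if n_osds <= 0 or n_journal_ssds <= 0:
        raise ValueError("counts must be positive")
    return math.ceil(n_osds / n_journal_ssds)


def fraction_of_server_down(n_osds: int, n_journal_ssds: int) -> float:
    """Share of a server's OSDs taken down by a single SSD failure."""
    return osds_down_per_ssd_failure(n_osds, n_journal_ssds) / n_osds


if __name__ == "__main__":
    # 20 OSDs + 4 SSDs per server, 5 journal partitions per SSD.
    print(osds_down_per_ssd_failure(20, 4))  # 5
    print(fraction_of_server_down(20, 4))    # 0.25
```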

Before getting into options, I’m curious about the real-world reliability of these drives:

1) How often are DC S3700's failing in your deployments?
2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is the 
backfilling which results from an SSD failure? Have you considered tricks like 
increasing the down out interval so backfilling doesn’t happen in this case 
(leaving time for the SSD to be replaced)?
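
The down-out-interval trick in question 2 comes down to a simple comparison, sketched here in Python. The relevant Ceph setting is `mon osd down out interval`, and the cluster-wide `noout` flag is the usual manual alternative; the timings in the example are hypothetical, not measured or default values.

```python
# Commands that set and clear the cluster-wide noout flag (built here, not executed).
NOOUT_SET = ["ceph", "osd", "set", "noout"]
NOOUT_UNSET = ["ceph", "osd", "unset", "noout"]


def backfill_would_start(replacement_seconds: float,
                         down_out_interval_seconds: float) -> bool:
    """True if down OSDs would be marked out (and backfilling would begin)
    before the failed SSD has been replaced."""
    return replacement_seconds > down_out_interval_seconds


if __name__ == "__main__":
    four_hours = 4 * 3600  # hypothetical time to swap an SSD
    print(backfill_would_start(four_hours, 300))        # short interval: True
    print(backfill_would_start(four_hours, 24 * 3600))  # one-day interval: False
```

A longer interval trades automatic recovery for a wider window in which the affected placement groups run with reduced redundancy, so the right value depends on how quickly hardware can actually be swapped.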

Beyond the usual 5-partition deployment, is anyone running a RAID1 or RAID10 
for the journals? If so, are you using the raw block devices or formatting it 
and storing the journals as files on the SSD array(s)? Recent discussions seem 
to indicate that XFS is just as fast as the block dev, since these drives are 
so fast.

Next, I wonder how people with puppet/chef/… are handling the 
creation/re-creation of the SSD devices. Are you just wiping and rebuilding all 
the dependent OSDs completely when the journal dev fails? I’m not keen on 
puppetizing the re-creation of journals for OSDs...

We also have this crazy idea of failing over to a local journal file in case an 
SSD fails. In this model, when an SSD fails we’d quickly create a new journal 
either on another SSD or on the local OSD filesystem, then restart the OSDs 
before backfilling started. Thoughts?
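
The mechanics of that failover idea might look roughly like the following Python sketch, which only builds a command plan and executes nothing. It assumes the default filestore journal location `/var/lib/ceph/osd/ceph-<id>/journal` and covers only the case where the new journal is a partition on another SSD; the device names are hypothetical. Stopping and starting the OSD daemons is init-system specific and deliberately left out. Note the caveat raised elsewhere in this thread: a journal that died with its SSD cannot be flushed, and an OSD that has lost part of its journal may need to be rebuilt, so this shows the steps, not a guarantee that they are safe.

```python
from typing import List


def journal_failover_plan(osd_ids: List[int],
                          new_journals: List[str]) -> List[List[str]]:
    """Return commands (as argv lists) for re-pointing OSD journals.

    The affected OSD daemons are assumed to be stopped already.
    """
    if len(osd_ids) != len(new_journals):
        raise ValueError("need exactly one new journal device per OSD")
    # Hold off marking OSDs out (and thus backfilling) while we work.
    plan = [["ceph", "osd", "set", "noout"]]
    for osd_id, journal in zip(osd_ids, new_journals):
        link = f"/var/lib/ceph/osd/ceph-{osd_id}/journal"  # assumed default path
        plan.append(["ln", "-sfn", journal, link])  # re-point the journal symlink
        plan.append(["ceph-osd", "-i", str(osd_id), "--mkjournal"])  # create a fresh journal
    # Start the OSD daemons here (init-system specific, omitted), then:
    plan.append(["ceph", "osd", "unset", "noout"])
    return plan


if __name__ == "__main__":
    # Hypothetical: OSDs 10 and 11 move to spare partitions on a surviving SSD.
    for argv in journal_failover_plan([10, 11], ["/dev/sdx6", "/dev/sdx7"]):
        print(" ".join(argv))
```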

Lastly, I would also consider using 2 of the SSDs in a data pool (with the 
other 2 SSDs to hold 20 journals — probably in a RAID1 to avoid backfilling 10 
OSDs when an SSD fails). If the 10:1 ratio of journals to SSDs performs adequately, 
that’d give us quite a few SSDs to build a dedicated high-IOPS pool.
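
Whether the higher ratio "performs adequately" depends largely on how much journal write bandwidth each OSD is left with. The Python sketch below compares the layouts mentioned in this message. The 300 MB/s figure is a placeholder, not a specification or a measurement, and the model ignores controller limits, write amplification and RAID overhead.

```python
def per_journal_write_mb_s(ssd_write_mb_s: float, n_ssds: int,
                           n_journals: int, raid1: bool = False) -> float:
    """Rough sequential write bandwidth available to each journal."""
    if n_ssds <= 0 or n_journals <= 0:
        raise ValueError("counts must be positive")
    if raid1 and n_ssds % 2 != 0:
        raise ValueError("RAID1 needs SSDs in pairs")
    # In RAID1 every write lands on both members of a pair, so a pair
    # delivers roughly the write bandwidth of a single SSD.
    independent_targets = n_ssds // 2 if raid1 else n_ssds
    return ssd_write_mb_s * independent_targets / n_journals


if __name__ == "__main__":
    ssd_mb_s = 300.0  # placeholder per-SSD write throughput
    print(per_journal_write_mb_s(ssd_mb_s, 4, 20))              # 4 SSDs, 5 journals each
    print(per_journal_write_mb_s(ssd_mb_s, 2, 20))              # 2 SSDs, 10 journals each
    print(per_journal_write_mb_s(ssd_mb_s, 2, 20, raid1=True))  # 2 SSDs mirrored, 20 journals
```

Under these assumptions the mirrored pair leaves each journal a quarter of the bandwidth it would have in the traditional 5-per-SSD layout, which is the price paid for not losing any OSDs to a single SSD failure.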

I’d also appreciate any other suggestions/experiences which might be relevant.

Thanks!
Dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --


_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
