Dear Janne,

On 06.05.20 at 09:18, Janne Johansson wrote:
On Wed, 6 May 2020 at 00:58, Oliver Freyermuth <[email protected]> wrote:

    Dear Cephalopodians,
    seeing the recent moves of major HDD vendors to sell SMR disks targeted
    for use in consumer NAS devices (including RAID systems), I got curious
    and wondered what the current status of SMR support in Bluestore is.
    Of course, I'd expect disk vendors to give us host-managed SMR disks for
    data center use cases (and to tell us when they actually do so...),
    but in that case, Bluestore surely needs some new intelligence for best
    performance in the shingled ages.


I've only run Filestore on SMR drives; it worked for a while in normal
operation for us, but it broke down horribly as soon as recovery needed to
be done. I have no idea whether Filestore was a worst case for SMR, whether
Bluestore will do better, or whether patches will help Bluestore become
useful, but all in all, I can't say anything to people wanting to experiment
with SMR other than "if you must use SMR, make sure you test the most evil
of corner cases".

Thanks for the input and especially the hands-on experience! That's very helpful (and 
"expensive" to gather), so thanks for sharing!

After my "small-scale" experiences, I would indeed have expected exactly that.
My sincere hope is that this hardware will become usable through
copy-on-write semantics, aligning writes into larger, consecutive batches.
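
To illustrate the idea (a toy sketch only, not Bluestore code; all names
below are made up), this is the write pattern a copy-on-write /
log-structured layout would produce on an SMR zone:

# Toy sketch: coalesce small random writes into one long sequential
# append, the pattern SMR zones want. "zone" is a stand-in for a
# sequential-only SMR zone; none of this is real Bluestore code.
class ZoneAppendBuffer:
    def __init__(self, flush_threshold=64 * 1024 * 1024):
        self.flush_threshold = flush_threshold  # e.g. 64 MiB batches
        self.pending = []                       # (key, payload) pairs
        self.pending_bytes = 0

    def write(self, key, payload, zone):
        # Small writes only accumulate in memory ...
        self.pending.append((key, payload))
        self.pending_bytes += len(payload)
        if self.pending_bytes >= self.flush_threshold:
            self.flush(zone)

    def flush(self, zone):
        # ... and hit the disk as one long linear write; an index
        # (key -> zone offset) would be updated copy-on-write instead
        # of rewriting data in place.
        blob = b"".join(payload for _, payload in self.pending)
        zone.append(blob)
        self.pending.clear()
        self.pending_bytes = 0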


As you noted, one can easily drop below 1 MB/s with SMR by doing anything
other than long linear writes, and you don't want to be in a place where
several hundred TB of data are recovering at that speed.

To me, SMR is a con; it's a trick to sell cheap crap to people who can't or
won't test properly. It doesn't matter whether it's Ceph recovery/backfill,
btrfs deletes or someone's NAS RAID sync job that places the final straw on
the camel's back and breaks it; the fact is that filesystems do lots more
than just nice long linear writes. Whether it is fsck, defrags or Ceph PG
splits/reshardings, there will be disk meta-operations that need to be done,
and they include tons of random small writes, and SMR drives will punish you
for them exactly when you need the drive up the most. 8-(

If I had some very special system which used cheap disks to pretend to be a
tape device and only did 10G-sized reads/writes like a tape would, then I
could see a use case for SMR.

I agree that in many cases SMR is not the correct hardware to use, and never
will be. Indeed, I also agree that in most cases the "trick to sell cheap
crap to people who can't or won't test properly" applies, even more so with
drive-managed SMR, which in some cases gives you zero control and maximum
frustration.

Still, my hope would be that especially for archiving purposes (think of a
pure Ceph-RGW cluster fed with Restic, Duplicati or similar tools), we can
make good use of the cheaper hardware (but then, this would of course need
to be host-managed SMR, and the file system would have to know about it).
I currently only know of Dropbox actively doing that (and I guess they can
do so easily, since they deduplicate data and probably rarely delete), and
they seem to have essentially developed their own file system to deal with
this.

It would be cool to have this with Ceph. You might also think about having a
separate, SMR-backed pool for "colder" objects (likely coupled with SSDs /
NVMes for WAL / BlockDB); a rough sketch follows below. In short, we'd never
even think about using it with CephFS in our HPC cluster (unless some
admin-controllable write-once-read-many use cases evolve, which we could
imagine for centrally managed high-energy physics data), or with RBD in our
virtualization cluster.
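
To make that concrete, a hypothetical sketch of how such a split could be
carved out with device classes (the class, rule and pool names and the OSD
IDs are all made up; the underlying commands are the standard ceph CLI,
driven from Python here only for illustration):

# Hypothetical: dedicate an "smr" device class and a pool bound to it.
import subprocess

def ceph(*args):
    subprocess.run(["ceph", *args], check=True)

# Tag the SMR OSDs with their own device class (IDs are examples);
# the auto-assigned class has to be removed first.
ceph("osd", "crush", "rm-device-class", "osd.10", "osd.11")
ceph("osd", "crush", "set-device-class", "smr", "osd.10", "osd.11")

# CRUSH rule that only places data on the "smr" class, and a pool
# for the cold objects bound to that rule.
ceph("osd", "crush", "rule", "create-replicated",
     "cold-smr", "default", "host", "smr")
ceph("osd", "pool", "create", "cold-objects", "128", "128",
     "replicated", "cold-smr")

The SSD/NVMe side (WAL / BlockDB) would then be set per OSD at creation
time, e.g. via ceph-volume's --block.db option.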
We're more interested in it for our backup cluster, which mostly sees data
ingest, and there the chunking into larger batches is even done client-side
(Restic, Duplicati etc.).
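
For that ingest pattern, the client side is essentially just large PUTs of
pre-packed chunks against RGW's S3 endpoint; a hypothetical boto3 sketch
(endpoint, bucket, keys and sizes are all placeholders):

# Push large, client-side-assembled chunks (Restic/Duplicati style)
# to a Ceph RGW S3 endpoint. All values below are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.org",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

def upload_chunk(bucket, key, chunk):
    # One big PUT per pre-packed chunk: long linear writes on the
    # OSDs instead of many small random ones.
    s3.put_object(Bucket=bucket, Key=key, Body=chunk)

# e.g. a 64 MiB pack file, as backup tools typically produce
upload_chunk("backups", "packs/0001.pack", b"\0" * (64 * 1024 * 1024))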

Of course, your point about resharding and PG splits fully applies, so this for 
sure needs careful development (and testing!) to reduce the randomness as far 
as possible
(if we want to make use of this hardware for the use cases it may fit).

Cheers and thanks for your input,
        Oliver


--
May the most significant bit of your life be positive.



