Hi,

On 10/01/2017 at 19:32, Brian Andrus wrote:
> [...]
>
>
> I think the main point I'm trying to address is - as long as the
> backing OSD isn't egregiously handling large amounts of writes and it
> has a good journal in front of it (that properly handles O_DSYNC [not
> D_SYNC as Sebastien's article states]), it is unlikely inconsistencies
> will occur upon a crash and subsequent restart.

I don't see how you can guess that it is "unlikely". If you need SSDs,
you are probably handling relatively large amounts of I/O (so large
amounts of writes aren't unlikely); otherwise you would have used cheap
7200rpm or even slower drives.

Remember that in the default configuration, if any 3 OSDs fail at the
same time, you have a chance of losing data. For <30 OSDs and size=3
this is highly probable: there are only a few thousand possible
combinations of 3 OSDs, and you typically have a thousand or two PGs
picking OSDs in a more or less random pattern.
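To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes each PG picks its 3 OSDs uniformly at random, which only approximates what CRUSH actually does, so treat the results as an order-of-magnitude illustration, not a Ceph calculation:

```python
from math import comb

def loss_probability(n_osds, n_pgs, size=3):
    """Probability that at least one PG maps exactly onto a given set
    of `size` simultaneously failed OSDs, assuming each PG picks its
    OSDs uniformly at random (a simplification of CRUSH)."""
    triples = comb(n_osds, size)          # possible OSD triples
    return 1 - (1 - 1 / triples) ** n_pgs # at least one PG hits the failed triple

# Small cluster: 12 OSDs, 1024 PGs -> losing 3 OSDs almost certainly loses data.
print(f"12 OSDs, 1024 PGs: {loss_probability(12, 1024):.0%}")
# 30 OSDs, 2048 PGs -> still a very uncomfortable probability.
print(f"30 OSDs, 2048 PGs: {loss_probability(30, 2048):.0%}")
```

With a dozen OSDs the probability is essentially 100%, and even at 30 OSDs it stays far too high to dismiss.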

With SSDs that don't handle write barriers properly, I wouldn't bet on
recovering the filesystems of all OSDs after a cluster-wide power loss
shuts down all the SSDs at the same time... In fact, since the hardware
lies about what has actually been stored, the filesystem might not even
detect the crash and might replay its own journal on top of outdated
data, leading to unexpected results.
So losing data is a real possibility, and testing for it is almost
impossible (you would have to reproduce every access pattern your Ceph
cluster could experience at the moment of a power loss, and trigger a
power loss in each case).
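For context, the O_DSYNC behaviour discussed above boils down to a simple contract, sketched below for Linux (the path is purely illustrative):

```python
import os
import tempfile

# Minimal sketch of the contract a journal write relies on: with
# O_DSYNC, write() must not return before the data has reached stable
# storage. A drive that acknowledges the write while the data still
# sits in its volatile cache breaks this contract invisibly -- which
# is exactly why a power loss can roll back "committed" data.
path = os.path.join(tempfile.mkdtemp(), "journal-test")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
try:
    # Durable when this returns -- if, and only if, the drive is honest.
    os.write(fd, b"committed record\n")
finally:
    os.close(fd)
```

Nothing at this layer can tell an honest drive from a lying one; only pulling the power mid-write reveals the difference.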

>
> Therefore - while not ideal to rely on journals to maintain consistency,

Ceph journals aren't designed to maintain filestore consistency. They
*might* restrict the access patterns to the filesystems in such a way
that running fsck on them after a "let's throw away committed data"
crash has better chances of restoring enough data, but if so it's only
a happy coincidence (and you would have to run these fscks *manually*,
as the filesystem can't detect the inconsistencies by itself).

> that is what they are there for.

No. They are there for Ceph's internal consistency, not for the
consistency of the filesystem backing the filestore. Ceph relies on
both the journal and a filesystem that maintains its own internal
consistency and supports syncfs; if either the journal or the
filesystem fails, the OSD is damaged. If 3 OSDs are damaged at the same
time on a size=3 pool, you enter "probable data loss" territory.

> There is a situation where "consumer-grade" SSDs could be used as
> OSDs. While not ideal, it can and has been done before, and may be
> preferable to tossing out $500k of SSDs (Seen it firsthand!)

For these I'd like to know:
- which SSD models were used?
- how long did the SSDs survive (some consumer SSDs not only lie to the
system about write completions, they usually don't handle large
amounts of writes nearly as well as DC models)?
- how many cluster-wide power losses did the cluster survive?
- what were the access patterns on the cluster during the power losses?

If a model isn't guaranteed for sync writes and hasn't survived dozens
of power losses on clusters under heavy load without any problem
detected in the following week (think deep-scrub), using it is playing
Russian roulette with your data.

AFAIK there have only been reports of data loss and/or heavy
maintenance afterwards when people tried to use consumer SSDs
(admittedly mainly for journals). I've yet to spot a long-running,
robust cluster built with consumer SSDs.

Lionel
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
