On 10-1-2017 20:35, Lionel Bouton wrote:
> Hi,

I usually don't top post, but this time it is just to agree
wholeheartedly with what you wrote. And you have added even more
arguments as to why.

Using SSDs that don't work right is a certain recipe for losing data.

--WjW

> On 10/01/2017 at 19:32, Brian Andrus wrote:
>> [...]
>>
>>
>> I think the main point I'm trying to address is this: as long as the
>> backing OSD isn't handling egregiously large amounts of writes and it
>> has a good journal in front of it (one that properly handles O_DSYNC
>> [not D_SYNC as Sebastien's article states]), it is unlikely that
>> inconsistencies will occur upon a crash and subsequent restart.
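
As a side note on the O_DSYNC point: a journal write is only safe if
the device does not acknowledge it before the data is on stable media.
A minimal sketch of what such a write looks like (Python, Linux only;
the path is just a made-up example):

    import os

    # O_DSYNC: write() may only return once the data (and the metadata
    # needed to read it back) has reached stable storage.
    fd = os.open("/tmp/journal-test",          # hypothetical test path
                 os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    try:
        os.write(fd, b"\0" * 4096)   # one 4 KiB synchronous write
    finally:
        os.close(fd)

    # An SSD that "properly handles O_DSYNC" must not acknowledge this
    # write from a volatile cache it can lose on power failure.

Sebastien's article benchmarks exactly this kind of synchronous write
pattern to decide whether an SSD is fit for journal duty.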
> 
> I don't see how you can know that it is "unlikely". If you need SSDs
> you are probably handling relatively large amounts of access (so large
> amounts of writes aren't unlikely); otherwise you would have used
> cheap 7200rpm or even slower drives.
> 
> Remember that in the default configuration, if any 3 OSDs fail at the
> same time, you have a chance of losing data. For <30 OSDs and size=3
> this is highly probable, as there are only a few thousand possible
> combinations of 3 OSDs (and you typically have a thousand or two PGs,
> each picking its OSDs in a more or less random pattern).
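
To put rough numbers on Lionel's point (a simplified sketch: the OSD
and PG counts below are made-up example values, and real CRUSH
placement is not uniformly random):

    from math import comb

    osds = 30     # small cluster
    pgs = 2048    # example pg count for a cluster this size
    size = 3

    triples = comb(osds, size)   # 4060 possible 3-OSD combinations
    # Probability that at least one PG maps exactly onto one given
    # failed triple, assuming each PG picks its 3 OSDs at random:
    p_lost = 1 - (1 - 1 / triples) ** pgs
    print(f"{triples} triples, P(at least one PG lost) ~ {p_lost:.0%}")
    # ~40% with these numbers; with fewer OSDs or more PGs it quickly
    # approaches certainty.

Even this crude model shows why "highly probable" is not an
exaggeration for small clusters.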
> 
> With SSDs not handling write barriers properly I wouldn't bet on
> recovering the filesystems of all OSDs given a cluster-wide power loss
> shutting down all the SSDs at the same time... In fact, as the
> hardware will lie about what data is actually stored, the filesystem
> might not even detect the crash and might replay its own journal on
> top of outdated data, leading to unexpected results.
> So losing data is a real possibility, and testing for it is almost
> impossible (you would have to reproduce all the different access
> patterns your Ceph cluster could experience at the time of a power
> loss and trigger a power loss in each case).
> 
>>
>> Therefore - while not ideal to rely on journals to maintain consistency,
> 
> Ceph journals aren't designed to maintain the consistency of the
> filesystem backing the filestore. They *might* restrict the access
> patterns to the filesystems in such a way that running fsck on them
> after a "let's throw away committed data" crash has better chances of
> restoring enough data, but if so it's only a happy coincidence (and
> you will have to run these fscks *manually*, as the filesystem can't
> detect the inconsistencies by itself).
> 
>> that is what they are there for.
> 
> No. They are there for Ceph's internal consistency, not for the
> consistency of the filesystem backing the filestore. Ceph relies on
> both the journal and a filesystem that maintains its own internal
> consistency and supports syncfs; if either the journal or the
> filesystem fails, the OSD is damaged. If 3 OSDs are damaged at the
> same time on a size=3 pool, you enter "probable data loss" territory.
> 
>> There is a situation where "consumer-grade" SSDs could be used as
>> OSDs. While not ideal, it can and has been done before, and may be
>> preferable to tossing out $500k of SSDs (Seen it firsthand!)
> 
> For these I'd like to know:
> - which SSD models were used?
> - how long did the SSDs survive (some consumer SSDs not only lie to
> the system about write completions, they also usually don't handle
> large amounts of writes nearly as well as DC models)?
> - how many cluster-wide power losses did the cluster survive?
> - what were the access patterns on the cluster during the power losses?
> 
> If a model isn't guaranteed for sync writes and hasn't been through
> dozens of power losses on clusters under heavy load without any
> problem detected in the following week (think deep-scrub), using it is
> playing Russian roulette with your data.
> 
> AFAIK there have only been reports of data losses and/or heavy
> maintenance later when people tried to use consumer SSDs (admittedly
> mainly for journals). I've yet to spot long-running robust clusters
> built with consumer SSDs.
> 
> Lionel

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
