On Mon, Jan 9, 2017 at 3:33 PM, Willem Jan Withagen <[email protected]> wrote:

> On 9-1-2017 23:58, Brian Andrus wrote:
> > Sorry for spam... I meant D_SYNC.
>
> That term doesn't turn up anything in Google...
> So I would expect it has to be O_DSYNC.
> (https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/)
>
> Now you tell me there are SSDs that take the correct action with
> O_SYNC but not with O_DSYNC... That makes no sense to me. O_DSYNC is
> a typical OS-level trade-off: more speed in exchange for a slightly
> less consistent FS.
>
> Either a device actually writes its data persistently (either in
> silicon cells, or held in RAM backed by a supercapacitor), or it does
> not. I cannot think of anything else. Maybe my EE background is sort
> of in the way here. And I know that it is rather hard to write correct
> SSD firmware; I have seen lots of firmware upgrades to actually fix
> serious corner cases.
>
> The second thing is how badly a drive lies when told that the
> requested write must be synchronised: OK should only be returned once
> the data is in stable storage and cannot be lost.
>
> If there is a possibility that a sync write to a drive is not
> persistent, then that is a serious breach of the sync-write contract.
> There will always be situations in which such drives lose data.
> And once the writing process thinks the data is on stable storage, it
> deletes it from the journal; at that point the data is permanently
> lost.
>
> Now with Ceph you have a second chance (even a third), because data
> is stored multiple times: you can go to another OSD and try to get it
> back.
>
> --WjW
>

I'm not disagreeing per se.


I think the main point I'm trying to address is: as long as the backing
OSD isn't handling an egregiously large volume of writes and it has a
good journal in front of it (one that properly handles O_DSYNC [not
D_SYNC, as Sebastien's article mistakenly calls it]), inconsistencies
are unlikely to occur upon a crash and subsequent restart.
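For anyone wanting to verify a drive themselves: the test in Sebastien's
article boils down to timing small synchronous writes. Below is a minimal
Python sketch of that idea, opening a file with O_DSYNC and measuring
average per-write latency; the block size and iteration count are
arbitrary choices of mine, not from the article. A drive without
power-loss protection will either show very high per-write latency or,
worse, report low latency by lying.

```python
import os
import time

def dsync_write_latency(path, block_size=4096, count=100):
    """Average seconds per O_DSYNC write of `block_size` bytes.

    Each write must reach stable storage before os.write() returns,
    so per-write latency reflects how the drive really handles
    synchronous writes.
    """
    # Fall back to O_SYNC on platforms that don't expose O_DSYNC.
    flags = os.O_WRONLY | os.O_CREAT | getattr(os, "O_DSYNC", os.O_SYNC)
    fd = os.open(path, flags, 0o600)
    buf = b"\0" * block_size
    try:
        start = time.monotonic()
        for _ in range(count):
            os.write(fd, buf)
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
    return elapsed / count
```

Sebastien's article does the equivalent with fio directly against the
block device; this sketch goes through the filesystem, which is enough
to compare drives side by side.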

Therefore, while it is not ideal to rely on journals to maintain
consistency, that is what they are there for. There are situations where
"consumer-grade" SSDs can be used as OSDs. While not ideal, it can be
and has been done, and it may be preferable to tossing out $500k of SSDs
(seen it firsthand!).



>
> >
> > On Mon, Jan 9, 2017 at 2:56 PM, Brian Andrus <[email protected]> wrote:
> >
> >     Hi Willem, the SSDs are probably fine for backing OSDs, it's the
> >     O_DSYNC writes they tend to lie about.
> >
> >     They may have a failure rate higher than enterprise-grade SSDs,
> >     but are otherwise suitable for use as OSDs if journals are placed
> >     elsewhere.
> >
> >     On Mon, Jan 9, 2017 at 2:39 PM, Willem Jan Withagen <[email protected]> wrote:
> >
> >         On 9-1-2017 18:46, Oliver Humpage wrote:
> >         >
> >         >> Why would you still be using journals when running fully
> >         >> OSDs on SSDs?
> >         >
> >         > In our case, we use cheaper large SSDs for the data (Samsung
> >         > 850 Pro 2TB), whose performance is excellent in the cluster,
> >         > but as has been pointed out in this thread they can lose
> >         > data if power is suddenly removed.
> >         >
> >         > We therefore put journals onto SM863 SSDs (1 journal SSD per
> >         > 3 OSD SSDs), which are enterprise quality and have power
> >         > outage protection. This seems to balance speed, capacity,
> >         > reliability and budget fairly well.
> >
> >         This would make me feel very uncomfortable...
> >
> >         So you have a reliable journal, so up to that point things
> >         work: once in the journal, your data is safe.
> >
> >         But then you transfer the data asynchronously to disk. And
> >         that is an SSD that lies to you? It will tell you that the
> >         data is written, but if you pull the power it turns out that
> >         the data is not really stored.
> >
> >         And then the only way to get the data consistent again is to
> >         (deep)scrub.
> >
> >         Not a very appealing outlook??
> >
> >         --WjW
> >
> >
> >
> >
> >
> >
> >     --
> >     Brian Andrus
> >     Cloud Systems Engineer
> >     DreamHost, LLC
> >
> >
> >
> >
> > --
> > Brian Andrus
> > Cloud Systems Engineer
> > DreamHost, LLC
>
>


-- 
Brian Andrus
Cloud Systems Engineer
DreamHost, LLC
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
