Re: [ceph-users] Do you see a data loss if a SSD hosting several OSD journals crashes

2016-05-20 Thread Anthony D'Atri
You should be protected against single component failures, yes, that's the 
point of journals.

It's important to ensure that on-disk volatile cache -- these days in the 
8-128MB range -- remains turned off, otherwise it presents an opportunity 
for data loss, especially when power drops.  Disk manufacturers tout 
cache size but are quiet about the fact that it's typically turned off by 
default for just this reason.  Some LSI firmware and storcli versions have 
a bug that silently turns it back on.

RAID HBAs also introduce an opportunity for data loss with their on-card 
caches.  HBA cache not protected by a BBU / supercap has a similar 
vulnerability, and some cards are just plain flaky and rife with hassles.
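
For anyone who wants to check, here's a quick sketch of querying and
disabling the volatile cache on Linux; the device names are placeholders
and the storcli syntax varies by version, so treat it as a starting point:

    # SATA drive: query, then disable, the on-disk volatile write cache
    hdparm -W /dev/sdX
    hdparm -W 0 /dev/sdX

    # the same via smartmontools (also works for SAS drives)
    smartctl -g wcache /dev/sdX
    smartctl -s wcache,off /dev/sdX

    # behind an LSI RAID controller, something along these lines
    storcli /c0/vall set pdcache=off

On some drives the setting doesn't survive a power cycle, so it's worth
enforcing it from a udev rule or init script as well.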

I'm not immediately finding a definitive statement about the number of journal 
writes required for ack.  
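
Either way, size and min_size are per-pool settings and trivial to check
('rbd' below is just an example pool name):

    ceph osd pool get rbd size
    ceph osd pool get rbd min_size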

> So, which is correct, all replicas must be written or only min_size before 
> ack?  
> 
> But for me the takeaway is that writes are protected - even if the journal 
> drive crashes, I am covered.
> 
> - epk



Re: [ceph-users] Do you see a data loss if a SSD hosting several OSD journals crashes

2016-05-20 Thread EP Komarla
So, which is correct, all replicas must be written or only min_size before ack?

But for me the takeaway is that writes are protected - even if the journal 
drive crashes, I am covered.

- epk

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Anthony D'Atri
Sent: Friday, May 20, 2016 1:32 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Do you see a data loss if a SSD hosting several OSD 
journals crashes


> Ceph will not acknowledge a client write before all journals (replica 
> size, 3 by default) have received the data, so losing one journal SSD 
> will NEVER result in an actual data loss.

Some say that all replicas must be written; others say that only min_size, 2 by 
default, must be written before ack.

--aad



Re: [ceph-users] Do you see a data loss if a SSD hosting several OSD journals crashes

2016-05-20 Thread Anthony D'Atri

> Ceph will not acknowledge a client write before all journals (replica
> size, 3 by default) have received the data, so losing one journal SSD
> will NEVER result in an actual data loss.

Some say that all replicas must be written; others say that only min_size, 2 by 
default, must be written before ack.

--aad



Re: [ceph-users] Do you see a data loss if a SSD hosting several OSD journals crashes

2016-05-19 Thread Christian Balzer

Hello,

On Fri, 20 May 2016 03:44:52 +0000 EP Komarla wrote:

> Thanks Christian.  Point noted.  Going forward I will write text to make
> it easy to read.
> 
> Thanks for your response.  Losing a journal drive seems expensive as I
> will have to rebuild 5 OSDs in this eventuality.
>
Potentially; there are ways to avoid a full rebuild, but that depends on
some factors and is pretty advanced stuff.

It's expensive, but as Dyweni wrote, it's an expected situation that your
cluster should be able to handle.

The chances of losing a journal SSD unexpectedly are of course going to be
very small if you choose the right type of SSD, an Intel DC S37xx or at
least S36xx for example.
Christian

> - epk


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] Do you see a data loss if a SSD hosting several OSD journals crashes

2016-05-19 Thread Dyweni - Ceph-Users

Hi,

Yes and no, for the actual data loss.  This depends on your crush map.

If you're using the original map (which came with the installation), 
then your smallest failure domain will be the host.  If you have replica 
size 3 and 3 hosts with 5 OSDs per host (15 OSDs total), then losing the 
journal SSD in one host will only result in the data on that specific 
host being lost, and having to be re-created from the other two hosts.  
This is an anticipated failure.
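
For reference, this is roughly what the stock replicated ruleset looks
like in a decompiled crush map -- the 'type host' in the chooseleaf step
is what makes the host the failure domain (exact names vary by release):

    rule replicated_ruleset {
            ruleset 0
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type host
            step emit
    }

'ceph osd crush rule dump' shows the same information on a live cluster.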


If you changed the crush map to make the OSD the smallest failure 
domain, and ceph places all copies of a piece of data on OSDs belonging 
to the SAME journal, then yes, you could end up with some pieces of data 
completely lost when that journal dies.


If I were in your shoes and didn't want the smallest failure domain to 
be the host, I would create a new level 'ssd' and make that my failure 
domain.  That way, if I had 10 OSDs and 2 SSD journals per host, my 
crush map would look like this: 5 hosts -> 2 journals/host -> 5 
OSDs/journal.  If I lost a journal, I would be losing only one copy of 
my data; Ceph would not place more than one copy of the data per journal 
(even though there are 5 OSDs behind that journal).
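
Roughly, in a decompiled crush map that could look like the sketch below.
The bucket names, ids and weights are made up, and note that inserting a
new bucket type means renumbering the whole type list consistently:

    # new bucket type between osd and host
    type 0 osd
    type 1 ssd
    type 2 host
    ...

    ssd node1-ssd0 {
            id -11
            alg straw
            hash 0  # rjenkins1
            item osd.0 weight 1.000
            item osd.1 weight 1.000
            item osd.2 weight 1.000
            item osd.3 weight 1.000
            item osd.4 weight 1.000
    }

    host node1 {
            id -2
            alg straw
            hash 0  # rjenkins1
            item node1-ssd0 weight 5.000
            item node1-ssd1 weight 5.000
    }

    rule replicated_by_ssd {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type ssd
            step emit
    }

The usual round trip: 'ceph osd getcrushmap -o map.bin', decompile with
'crushtool -d map.bin -o map.txt', edit, recompile with 'crushtool -c
map.txt -o map.new', sanity-check with 'crushtool --test -i map.new
--rule 1 --num-rep 3 --show-mappings', then inject with 'ceph osd
setcrushmap -i map.new'.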


I know this is a bit advanced, but I hope this clarifies things for you.

Dyweni






Re: [ceph-users] Do you see a data loss if a SSD hosting several OSD journals crashes

2016-05-19 Thread Christian Balzer

Hello,

first of all, wall of text. Don't do that. 
Use returns and paragraphs liberally to make reading easy.
I'm betting at least half of the people who could have answered your
question took a look at this blob of text and ignored it.

Secondly, search engines are your friend.
The first hit when googling for "ceph ssd journal failure" is this gem:
http://ceph.com/planet/ceph-recover-osds-after-ssd-journal-failure/

Losing a journal SSD will at most cost you the data on all associated
OSDs and thus the recovery/backfill traffic, if you don't feel like doing
what the link above describes.
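
The short version of that procedure, for a journal that is still
readable -- i.e. a planned swap rather than a dead SSD; the OSD id and
init syntax are placeholders:

    # stop the OSD and drain its journal into the filestore
    service ceph stop osd.0
    ceph-osd -i 0 --flush-journal

    # repoint the journal symlink at the new device/partition, then
    ceph-osd -i 0 --mkjournal
    service ceph start osd.0

If the SSD is already dead there is nothing left to flush, which is why
the article falls back to recreating the journals and letting recovery
sort out the rest.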

Ceph will not acknowledge a client write before all journals (replica
size, 3 by default) have received the data, so losing one journal SSD
will NEVER result in an actual data loss.

Christian

On Fri, 20 May 2016 01:38:08 +0000 EP Komarla wrote:

> We are trying to assess if we are going to see a data loss if an SSD
> that is hosting journals for a few OSDs crashes. In our configuration,
> each SSD is partitioned into 5 chunks and each chunk is mapped as a
> journal drive for one OSD.
>
> What I understand from the Ceph documentation: "Consistency: Ceph OSD
> Daemons require a filesystem interface that guarantees atomic compound
> operations. Ceph OSD Daemons write a description of the operation to the
> journal and apply the operation to the filesystem. This enables atomic
> updates to an object (for example, placement group metadata). Every few
> seconds - between filestore max sync interval and filestore min sync
> interval - the Ceph OSD Daemon stops writes and synchronizes the journal
> with the filesystem, allowing Ceph OSD Daemons to trim operations from
> the journal and reuse the space. On failure, Ceph OSD Daemons replay the
> journal starting after the last synchronization operation."
>
> So, my question is: what happens if an SSD fails - am I going to lose
> all the data that has not been written/synchronized to the OSDs? In my
> case, am I going to lose data for all 5 OSDs, which would be bad? This
> is of concern to us. What are the options to prevent any data loss at
> all?
>
> Is it better to have the journals on the same hard drive, i.e., to have
> one journal per OSD and host it on the same hard drive? Of course,
> performance will not be as good as having an SSD for the OSD journal.
> In this case, I am thinking I will not lose data as there are secondary
> OSDs where data is replicated (we are using triple replication). Any
> thoughts? What other solutions have people adopted for data reliability
> and consistency to address the case I am mentioning?
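
For reference, the sync intervals the quoted documentation mentions are
ordinary ceph.conf options. The values below are their stock defaults as
far as I recall, so double-check against your release:

    [osd]
    filestore min sync interval = 0.01
    filestore max sync interval = 5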


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/