Re: [ceph-users] Failed Disk simulation question

2019-05-24 Thread solarflow99
I think a deep scrub would eventually catch this, right?
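
(If anyone wants to test that without waiting for the scrub interval, a
deep scrub can also be requested by hand; the OSD id and PG id below are
only example values, not ones from Alex's cluster:)

#  ceph osd deep-scrub 2        # ask osd.2 to deep-scrub its PGs
#  ceph pg deep-scrub 2.1f      # or deep-scrub a single placement group
#  ceph -w                      # watch for scrub errors / inconsistent PGs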


On Wed, May 22, 2019 at 2:56 AM Eugen Block  wrote:

> Hi Alex,
>
> > The cluster was idle at the time, being new and all.  I noticed
> > some disk-related errors in dmesg, but that was about it.
> > It looked to me like the failure went undetected for the next
> > 20-30 minutes: all OSDs were up and in, and health was OK.  The
> > OSD logs had no smoking gun either.
> > After 30 minutes, I restarted the OSD container and it failed to
> > start, as expected.
>
> if the cluster doesn't have to read from or write to specific OSDs
> (or sectors on that OSD), the failure won't be detected immediately.
> We had an issue last year where one of the SSDs (used for RocksDB
> and WAL) failed, but that was never reported. We only discovered it
> when we tried to migrate the LVM volume to a new device and got read
> errors.
>
> > Later on, I performed the same operation during an fio benchmark,
> > and the OSD failed immediately.
>
> This confirms our experience: if there's data to read or write on
> that disk, the failure will be detected.
> Please note that this was on a Luminous cluster; I don't know if and
> how Nautilus has improved at sensing disk failures.
>
> Regards,
> Eugen
>
>
> Quoting Alex Litvak:
>
> > Hello cephers,
> >
> > I know that a similar question was posted 5 years ago; however,
> > the answer was inconclusive for me.
> > I installed a new Nautilus 14.2.1 cluster and started pre-production
> > testing.  I followed a Red Hat document and simulated a soft disk
> > failure with
> >
> > #  echo 1 > /sys/block/sdc/device/delete
> >
> > The cluster was idle at the time, being new and all.  I noticed
> > some disk-related errors in dmesg, but that was about it.
> > It looked to me like the failure went undetected for the next
> > 20-30 minutes: all OSDs were up and in, and health was OK.  The
> > OSD logs had no smoking gun either.
> > After 30 minutes, I restarted the OSD container and it failed to
> > start, as expected.
> >
> > Later on, I performed the same operation during an fio benchmark,
> > and the OSD failed immediately.
> >
> > My question is:  should the disk problem have been detected quickly
> > enough even on an idle cluster? I thought Nautilus has the means to
> > sense a failure before intensive IO hits the disk.
> > Am I wrong to expect that?


Re: [ceph-users] Failed Disk simulation question

2019-05-22 Thread Eugen Block

Hi Alex,

The cluster was idle at the time, being new and all.  I noticed some
disk-related errors in dmesg, but that was about it.
It looked to me like the failure went undetected for the next 20-30
minutes: all OSDs were up and in, and health was OK.  The OSD logs had
no smoking gun either.
After 30 minutes, I restarted the OSD container and it failed to
start, as expected.


if the cluster doesn't have to read from or write to specific OSDs (or
sectors on that OSD), the failure won't be detected immediately. We had
an issue last year where one of the SSDs (used for RocksDB and WAL)
failed, but that was never reported. We only discovered it when we
tried to migrate the LVM volume to a new device and got read errors.
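
(Purely for illustration, a migration like the one that surfaced those
read errors is typically a plain LVM move; the device and VG names here
are placeholders, not the ones from that cluster:)

#  pvcreate /dev/sdX                  # prepare the replacement SSD
#  vgextend ceph-db-vg /dev/sdX       # add it to the VG holding the DB/WAL LVs
#  pvmove /dev/sdY /dev/sdX           # copy extents off the failing SSD;
                                      # reads from the old device happen here
#  vgreduce ceph-db-vg /dev/sdY       # drop the old device afterwards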


Later on, I performed the same operation during an fio benchmark, and
the OSD failed immediately.


This confirms our experience: if there's data to read or write on that
disk, the failure will be detected.
Please note that this was on a Luminous cluster; I don't know if and
how Nautilus has improved at sensing disk failures.
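
(For what it's worth, Nautilus does ship a device health-metrics and
failure-prediction facility based on SMART data; whether it would have
flagged a device that was simply deleted from sysfs is another question.
A rough sketch of the related commands, to be checked against the docs:)

#  ceph device ls                               # devices and the daemons using them
#  ceph device monitoring on                    # enable periodic SMART scraping
#  ceph device get-health-metrics <devid>       # dump collected SMART data
#  ceph device predict-life-expectancy <devid>  # query the prediction module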


Regards,
Eugen


Quoting Alex Litvak:


Hello cephers,

I know that a similar question was posted 5 years ago; however, the
answer was inconclusive for me.
I installed a new Nautilus 14.2.1 cluster and started pre-production
testing.  I followed a Red Hat document and simulated a soft disk
failure with


#  echo 1 > /sys/block/sdc/device/delete

The cluster was idle at the time, being new and all.  I noticed some
disk-related errors in dmesg, but that was about it.
It looked to me like the failure went undetected for the next 20-30
minutes: all OSDs were up and in, and health was OK.  The OSD logs had
no smoking gun either.
After 30 minutes, I restarted the OSD container and it failed to
start, as expected.


Later on, I performed the same operation during an fio benchmark, and
the OSD failed immediately.


My question is:  should the disk problem have been detected quickly
enough even on an idle cluster? I thought Nautilus has the means to
sense a failure before intensive IO hits the disk.

Am I wrong to expect that?








[ceph-users] Failed Disk simulation question

2019-05-21 Thread Alex Litvak

Hello cephers,

I know that a similar question was posted 5 years ago; however, the answer
was inconclusive for me.
I installed a new Nautilus 14.2.1 cluster and started pre-production testing.
I followed a Red Hat document and simulated a soft disk failure with

#  echo 1 > /sys/block/sdc/device/delete
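
(For anyone reproducing this, the cluster's view can be checked with the
usual commands; nothing here is specific to my cluster:)

#  dmesg | tail             # kernel side: the device removal and any I/O errors
#  ceph -s                  # overall health (stayed HEALTH_OK here)
#  ceph osd tree            # up/in state of the OSDs
#  ceph osd stat            # quick count of OSDs up and in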

The cluster was idle at the time, being new and all.  I noticed some
disk-related errors in dmesg, but that was about it.
It looked to me like the failure went undetected for the next 20-30 minutes:
all OSDs were up and in, and health was OK. The OSD logs had no smoking gun
either.
After 30 minutes, I restarted the OSD container and it failed to start, as
expected.

Later on, I performed the same operation during an fio benchmark, and the OSD
failed immediately.
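
(The benchmark itself was just a generic fio run; below is a minimal sketch
of that kind of workload, assuming fio was built with the rbd engine and a
throw-away pool and image exist. The names and parameters are placeholders,
not the exact job that was used:)

#  fio --name=osd-fail-test --ioengine=rbd --clientname=admin \
       --pool=testpool --rbdname=testimg \
       --rw=randwrite --bs=4k --iodepth=32 --direct=1 \
       --time_based --runtime=300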

My question is:  should the disk problem have been detected quickly enough
even on an idle cluster? I thought Nautilus has the means to sense a failure
before intensive IO hits the disk.
Am I wrong to expect that?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com