On Thu, 28 May 2015 10:32:18 +0200 Jan Schermer wrote:

> Can you check the capacitor reading on the S3700 with smartctl?
I suppose you mean this?
---
175 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 648 (2 2862)
---
Never mind that these are brand new.

> This drive has non-volatile cache which *should* get flushed when power
> is lost, depending on what the hardware does on reboot it might get
> flushed even when rebooting.
>
That would probably trigger an increase in the "unsafe shutdown count"
SMART value.
I will have to test that from a known starting point, since the current
values are likely from earlier tests and actual shutdowns.

I'd be surprised if a reboot would drop power to the drives, but it is a
possibility of course.
However I'm VERY unconvinced that this could result in data loss with the
SSDs in perfect CAPS health.

> I just got this drive for testing yesterday and it's a beast, but some
> things were peculiar - for example my fio benchmark slowed down (35K
> IOPS -> 5K IOPS) after several GB (random 5-40) written, and then it
> would creep back up over time, even under load. Disabling the write
> cache helps, no idea why.
>
I haven't seen that behavior with DC S3700s, but with 5xx ones and some
Samsung drives, yes.

Christian

> Z.
>
> > On 28 May 2015, at 09:22, Christian Balzer <[email protected]> wrote:
> >
> > Hello Greg,
> >
> > On Wed, 27 May 2015 22:53:43 -0700 Gregory Farnum wrote:
> >
> >> The description of the logging abruptly ending and the journal being
> >> bad really sounds like part of the disk is going back in time. I'm
> >> not sure if XFS internally is set up in such a way that something
> >> like losing part of its journal would allow that?
> >>
> > I'm special. ^o^
> > No XFS, EXT4. As stated in the original thread, below.
> > And the (OSD) journal is a raw partition on a DC S3700.
> >
> > And since there was at least a 30 second pause between the completion
> > of "/etc/init.d/ceph stop" and the issuing of the shutdown command,
> > the logging abruptly ending seems unlikely to be related to the
> > shutdown at all.
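One way to run that test from a known starting point: record the raw value of the unsafe-shutdown SMART attribute before the reboot and compare it afterwards. A minimal sketch; the attribute number (192, Unsafe_Shutdown_Count on Intel DC drives) and all values are assumptions, and a captured sample line stands in for live `smartctl -A /dev/sdX` output so the snippet is self-contained:

```shell
# Sketch: extract the raw value of SMART attribute 192 (Unsafe_Shutdown_Count)
# so it can be compared before and after a reboot. On a live system this
# line would come from:  smartctl -A /dev/sdX | grep '^192'
# Sample line used here so the snippet runs anywhere (values are made up):
line='192 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 13'
raw=$(echo "$line" | awk '{print $NF}')   # the raw value is the last field
echo "unsafe shutdowns: $raw"
```

If the number goes up across a plain reboot, the drive really did lose power.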
> >
> >> If any of the OSD developers have the time it's conceivable a copy
> >> of the OSD journal would be enlightening (if e.g. the header offsets
> >> are wrong but there are a bunch of valid journal entries), but this
> >> is two reports of this issue from you and none very similar from
> >> anybody else. I'm still betting on something in the software or
> >> hardware stack misbehaving. (There aren't that many people running
> >> Debian; there are lots of people running Ubuntu and we find bad XFS
> >> kernels there not infrequently; I think you're hitting something
> >> like that.)
> >>
> > There should be no file system involved with the raw partition SSD
> > journal, n'est-ce pas?
> >
> > The hardware is vastly different: the previous case was on an AMD
> > system with onboard SATA (SP5100), this one is a SM storage goat with
> > an LSI 3008.
> >
> > The only things they have in common are the Ceph version 0.80.7 (via
> > the Debian repository, not Ceph) and Debian Jessie as the OS with
> > kernel 3.16 (though there were minor updates to that between those
> > incidents, backported fixes).
> >
> > A copy of the journal would consist of the entire 10GB partition,
> > since we don't know where in the loop it was at the time, right?
> >
> > Christian
> >>
> >> On Sun, May 24, 2015 at 7:26 PM, Christian Balzer <[email protected]>
> >> wrote:
> >>>
> >>> Hello again (marvel at my elephantine memory and thread necromancy)
> >>>
> >>> Firstly, this happened again, details below.
> >>> Secondly, this time I had changed things to sysv-init AND did a
> >>> "/etc/init.d/ceph stop", which dutifully listed all OSDs as being
> >>> killed/stopped BEFORE rebooting the node.
> >>>
> >>> This is a completely new node with significantly different HW than
> >>> the example below.
> >>> But the same SW versions as before (Debian Jessie, Ceph 0.80.7).
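For reference, capturing the whole 10GB journal partition for a developer to inspect would just be a raw block copy. A sketch, demonstrated against a small scratch file so the commands run as-is; the real input path is an example taken from the logs below:

```shell
# Sketch: copy the whole journal partition, since we don't know where in
# the ring the write pointer was. Against the real device this would be:
#   dd if=/var/lib/ceph/osd/ceph-30/journal of=/tmp/journal.bin bs=4M
# Demonstrated on a 1 MiB scratch file so the commands run anywhere:
src=$(mktemp) && dst=$(mktemp)
dd if=/dev/zero of="$src" bs=1M count=1 2>/dev/null   # stand-in for the partition
dd if="$src" of="$dst" bs=1M 2>/dev/null              # byte-for-byte copy
stat -c %s "$dst"                                     # size of the copy in bytes
```

Reading the partition is non-destructive, so this is safe to do before any recovery attempt.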
> >>> And just like below/before, the logs for that OSD have nothing in
> >>> them indicating it shut down properly (no "journal flush done") and
> >>> when coming back on reboot we get the dreaded:
> >>> ---
> >>> 2015-05-25 10:32:55.439492 7f568aa157c0  1 journal _open /var/lib/ceph/osd/ceph-30/journal fd 23: 10000269312 bytes, block size 4096 bytes, directio = 1, aio = 1
> >>> 2015-05-25 10:32:55.439859 7f568aa157c0 -1 journal read_header error decoding journal header
> >>> 2015-05-25 10:32:55.439905 7f568aa157c0 -1 filestore(/var/lib/ceph/osd/ceph-30) mount failed to open journal /var/lib/ceph/osd/ceph-30/journal: (22) Invalid argument
> >>> 2015-05-25 10:32:55.936975 7f568aa157c0 -1 osd.30 0 OSD:init: unable to mount object store
> >>> ---
> >>>
> >>> I see nothing in the changelogs for 0.80.8 and .9 that seems
> >>> related to this, never mind that from the looks of it the
> >>> repository at Ceph has only Wheezy (bpo70) packages and Debian
> >>> Jessie is still stuck at 0.80.7 (Sid just went to .9 last week).
> >>>
> >>> I'm preserving the state of things as they are for a few days, so
> >>> if any developer would like a peek or more details, speak up now.
> >>>
> >>> I'd open an issue, but I don't have a reliable way to reproduce
> >>> this and even less desire to do so on this production cluster. ^_-
> >>>
> >>> Christian
> >>>
> >>> On Sat, 6 Dec 2014 12:48:25 +0900 Christian Balzer wrote:
> >>>
> >>>> On Fri, 5 Dec 2014 11:23:19 -0800 Gregory Farnum wrote:
> >>>>
> >>>>> On Thu, Dec 4, 2014 at 7:03 PM, Christian Balzer <[email protected]>
> >>>>> wrote:
> >>>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> This morning I decided to reboot a storage node (Debian Jessie,
> >>>>>> thus 3.16 kernel and Ceph 0.80.7, HDD OSDs with SSD journals)
> >>>>>> after applying some changes.
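Before recreating a journal that fails to decode, it can be worth eyeballing the header region the OSD rejected; a zeroed or obviously garbled start of the partition supports the "disk went back in time" theory. A sketch using a scratch file as a stand-in for the device, so it runs anywhere; against the real device the input would be the journal partition from the log above:

```shell
# Sketch: hex-dump the first bytes of the journal, where the header lives.
# A header the OSD rejects often shows up here as all zeros or obvious
# garbage. A scratch file stands in for the device; the real input would be
# /var/lib/ceph/osd/ceph-30/journal (this only reads, so it is safe).
dev=$(mktemp)
dd if=/dev/zero of="$dev" bs=4096 count=1 2>/dev/null   # stand-in: a zeroed header block
od -A d -t x1 "$dev" | head -n 2                        # offset + hex bytes
```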
> >>>>>>
> >>>>>> It came back up one OSD short, the last log lines before the
> >>>>>> reboot are:
> >>>>>> ---
> >>>>>> 2014-12-05 09:35:27.700330 7f87e789c700  2 -- 10.0.8.21:6823/29520 >> 10.0.8.22:0/5161 pipe(0x7f881b772580 sd=247 :6823 s=2 pgs=21 cs=1 l=1 c=0x7f881f469020).fault (0) Success
> >>>>>> 2014-12-05 09:35:27.700350 7f87f011d700 10 osd.4 pg_epoch: 293 pg[3.316( v 289'1347 (0'0,289'1347] local-les=289 n=8 ec=5 les/c 289/289 288/288/288) [8,4,16] r=1 lpr=288 pi=276-287/1 luod=0'0 crt=289'1345 lcod 289'1346 active] cancel_copy_ops
> >>>>>> ---
> >>>>>>
> >>>>>> Quite obviously it didn't complete its shutdown, so
> >>>>>> unsurprisingly we get:
> >>>>>> ---
> >>>>>> 2014-12-05 09:37:40.278128 7f218a7037c0  1 journal _open /var/lib/ceph/osd/ceph-4/journal fd 24: 10000269312 bytes, block size 4096 bytes, directio = 1, aio = 1
> >>>>>> 2014-12-05 09:37:40.278427 7f218a7037c0 -1 journal read_header error decoding journal header
> >>>>>> 2014-12-05 09:37:40.278479 7f218a7037c0 -1 filestore(/var/lib/ceph/osd/ceph-4) mount failed to open journal /var/lib/ceph/osd/ceph-4/journal: (22) Invalid argument
> >>>>>> 2014-12-05 09:37:40.776203 7f218a7037c0 -1 osd.4 0 OSD:init: unable to mount object store
> >>>>>> 2014-12-05 09:37:40.776223 7f218a7037c0 -1 ** ERROR: osd init failed: (22) Invalid argument
> >>>>>> ---
> >>>>>>
> >>>>>> Thankfully this isn't production yet and I was eventually able
> >>>>>> to recover the OSD by re-creating the journal ("ceph-osd -i 4
> >>>>>> --mkjournal"), but it leaves me with a rather bad taste in my
> >>>>>> mouth.
> >>>>>>
> >>>>>> So the pertinent questions would be:
> >>>>>>
> >>>>>> 1. What caused this?
> >>>>>> My bet is on the evil systemd just pulling the plug before the
> >>>>>> poor OSD had finished its shutdown job.
> >>>>>>
> >>>>>> 2. How to prevent it from happening again?
> >>>>>> Is there something the Ceph developers can do with regards to
> >>>>>> the init scripts? Or is this something to be brought up with the
> >>>>>> Debian maintainer?
> >>>>>> Debian is transitioning from sysv-init to systemd (booo!) with
> >>>>>> Jessie, but the OSDs still have a sysvinit magic file in their
> >>>>>> top directory. Could this have an effect on things?
> >>>>>>
> >>>>>> 3. Is it really that easy to trash your OSDs?
> >>>>>> In case a storage node crashes, am I to expect most if not all
> >>>>>> OSDs, or at least their journals, to require manual loving?
> >>>>>
> >>>>> So this "can't happen".
> >>>>
> >>>> Good thing you quoted that, as it clearly did. ^o^
> >>>>
> >>>> Now the question of how exactly remains to be answered.
> >>>>
> >>>>> Being force-killed definitely can't kill the OSD's disk state;
> >>>>> that's the whole point of the journaling.
> >>>>
> >>>> The other OSDs got to the point where they logged "journal flush
> >>>> done"; this one didn't. Coincidence? I think not.
> >>>>
> >>>> Totally agree about the point of journaling being to prevent this
> >>>> kind of situation, of course.
> >>>>
> >>>>> The error message indicates that the header written on disk is
> >>>>> nonsense to the OSD, which means that the local filesystem or
> >>>>> disk lost something somehow (assuming you haven't done something
> >>>>> silly like downgrading the software version it's running) and
> >>>>> doesn't know it (if there had been a read error the output would
> >>>>> be different).
> >>>>
> >>>> The journal is on an SSD, as stated.
> >>>> And before you ask, it's on an Intel DC S3700.
> >>>>
> >>>> This was created on 0.80.7 just a day before, so no version games.
> >>>>
> >>>>> I'd double-check your disk settings etc. just to be sure, and
> >>>>> check for known issues with xfs on Jessie.
> >>>>>
> >>>> I'm using ext4, but that shouldn't be an issue here to begin with,
> >>>> as the journal is a raw SSD partition.
> >>>>
> >>>> Christian
> >>>
> >>> --
> >>> Christian Balzer        Network/Systems Engineer
> >>> [email protected]           Global OnLine Japan/Fusion Communications
> >>> http://www.gol.com/

--
Christian Balzer        Network/Systems Engineer
[email protected]           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
