On Thu, 28 May 2015 10:32:18 +0200 Jan Schermer wrote:

> Can you check the capacitor reading on the S3700 with smartctl?
I suppose you mean this?
---
175 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 648 (2 2862)
---
Never mind that these are brand new.

> This drive has non-volatile cache which *should* get flushed when power
> is lost, depending on what the hardware does on reboot it might get
> flushed even when rebooting.
>
That would probably trigger an increase in the "unsafe shutdown count"
SMART value.
I will have to test that from a known starting point, since the current
values are likely from earlier tests and actual shutdowns.

I'd be surprised if a reboot would drop power to the drives, but it is a
possibility of course.
However I'm VERY unconvinced that this could result in data loss with the
SSDs in perfect CAPS health.

> I just got this drive for testing yesterday and it's a beast, but some
> things were peculiar - for example my fio benchmark slowed down (35K
> IOPS -> 5K IOPS) after several GB (random 5-40) written, and then it
> would creep back up over time, even under load. Disabling the write
> cache helps, no idea why.
>
I haven't seen that behavior with DC S3700s, but with 5xx ones and some
Samsung drives, yes.

Christian

> Z.
>
> > On 28 May 2015, at 09:22, Christian Balzer <[email protected]> wrote:
> >
> > Hello Greg,
> >
> > On Wed, 27 May 2015 22:53:43 -0700 Gregory Farnum wrote:
> >
> >> The description of the logging abruptly ending and the journal being
> >> bad really sounds like part of the disk is going back in time. I'm
> >> not sure if XFS internally is set up in such a way that something
> >> like losing part of its journal would allow that?
> >>
> > I'm special. ^o^
> > No XFS, EXT4. As stated in the original thread, below.
> > And the (OSD) journal is a raw partition on a DC S3700.
> >
> > And since there was at least a 30 second pause between the completion
> > of "/etc/init.d/ceph stop" and the issuing of the shutdown command,
> > the logging abruptly ending seems unlikely to be related to the
> > shutdown at all.
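One way to run that test from a known starting point: record the raw value of the unsafe-shutdown SMART attribute before the reboot and compare it afterwards. A minimal sketch; the attribute number (192, Unsafe_Shutdown_Count on Intel DC drives) and all values are assumptions, and a captured sample line stands in for live `smartctl -A /dev/sdX` output so the snippet is self-contained:

```shell
# Sketch: extract the raw value of SMART attribute 192 (Unsafe_Shutdown_Count)
# so it can be compared before and after a reboot. On a live system this
# line would come from:  smartctl -A /dev/sdX | grep '^192'
# Sample line used here so the snippet runs anywhere (values are made up):
line='192 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 13'
raw=$(echo "$line" | awk '{print $NF}')   # the raw value is the last field
echo "unsafe shutdowns: $raw"
```

If the number goes up across a plain reboot, the drive really did lose power.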
> >
> >> If any of the OSD developers have the time it's conceivable a copy
> >> of the OSD journal would be enlightening (if e.g. the header offsets
> >> are wrong but there are a bunch of valid journal entries), but this
> >> is two reports of this issue from you and none very similar from
> >> anybody else. I'm still betting on something in the software or
> >> hardware stack misbehaving. (There aren't that many people running
> >> Debian; there are lots of people running Ubuntu and we find bad XFS
> >> kernels there not infrequently; I think you're hitting something
> >> like that.)
> >>
> > There should be no file system involved with the raw partition SSD
> > journal, n'est-ce pas?
> >
> > The hardware is vastly different: the previous case was on an AMD
> > system with onboard SATA (SP5100), this one is a SM storage goat with
> > an LSI 3008.
> >
> > The only things they have in common are the Ceph version 0.80.7 (via
> > the Debian repository, not Ceph) and Debian Jessie as the OS with
> > kernel 3.16 (though there were minor updates to that between those
> > incidents, backported fixes).
> >
> > A copy of the journal would consist of the entire 10GB partition,
> > since we don't know where in the loop it was at the time, right?
> >
> > Christian
> >>
> >> On Sun, May 24, 2015 at 7:26 PM, Christian Balzer <[email protected]>
> >> wrote:
> >>>
> >>> Hello again (marvel at my elephantine memory and thread necromancy)
> >>>
> >>> Firstly, this happened again, details below.
> >>> Secondly, this time I had changed things to sysv-init AND did a
> >>> "/etc/init.d/ceph stop", which dutifully listed all OSDs as being
> >>> killed/stopped BEFORE rebooting the node.
> >>>
> >>> This is a completely new node with significantly different HW than
> >>> the example below.
> >>> But the same SW versions as before (Debian Jessie, Ceph 0.80.7).
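For reference, capturing the whole 10GB journal partition for a developer to inspect would just be a raw block copy. A sketch, demonstrated against a small scratch file so the commands run as-is; the real input path is an example taken from the logs below:

```shell
# Sketch: copy the whole journal partition, since we don't know where in
# the ring the write pointer was. Against the real device this would be:
#   dd if=/var/lib/ceph/osd/ceph-30/journal of=/tmp/journal.bin bs=4M
# Demonstrated on a 1 MiB scratch file so the commands run anywhere:
src=$(mktemp) && dst=$(mktemp)
dd if=/dev/zero of="$src" bs=1M count=1 2>/dev/null   # stand-in for the partition
dd if="$src" of="$dst" bs=1M 2>/dev/null              # byte-for-byte copy
stat -c %s "$dst"                                     # size of the copy in bytes
```

Reading the partition is non-destructive, so this is safe to do before any recovery attempt.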
> >>> And just like below/before, the logs for that OSD have nothing in
> >>> them indicating it shut down properly (no "journal flush done") and
> >>> when coming back on reboot we get the dreaded:
> >>> ---
> >>> 2015-05-25 10:32:55.439492 7f568aa157c0  1 journal _open /var/lib/ceph/osd/ceph-30/journal fd 23: 10000269312 bytes, block size 4096 bytes, directio = 1, aio = 1
> >>> 2015-05-25 10:32:55.439859 7f568aa157c0 -1 journal read_header error decoding journal header
> >>> 2015-05-25 10:32:55.439905 7f568aa157c0 -1 filestore(/var/lib/ceph/osd/ceph-30) mount failed to open journal /var/lib/ceph/osd/ceph-30/journal: (22) Invalid argument
> >>> 2015-05-25 10:32:55.936975 7f568aa157c0 -1 osd.30 0 OSD:init: unable to mount object store
> >>> ---
> >>>
> >>> I see nothing in the changelogs for 0.80.8 and .9 that seems
> >>> related to this, never mind that from the looks of it the
> >>> repository at Ceph has only Wheezy (bpo70) packages and Debian
> >>> Jessie is still stuck at 0.80.7 (Sid just went to .9 last week).
> >>>
> >>> I'm preserving the state of things as they are for a few days, so
> >>> if any developer would like a peek or more details, speak up now.
> >>>
> >>> I'd open an issue, but I don't have a reliable way to reproduce
> >>> this and even less desire to do so on this production cluster. ^_-
> >>>
> >>> Christian
> >>>
> >>> On Sat, 6 Dec 2014 12:48:25 +0900 Christian Balzer wrote:
> >>>
> >>>> On Fri, 5 Dec 2014 11:23:19 -0800 Gregory Farnum wrote:
> >>>>
> >>>>> On Thu, Dec 4, 2014 at 7:03 PM, Christian Balzer <[email protected]>
> >>>>> wrote:
> >>>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> This morning I decided to reboot a storage node (Debian Jessie,
> >>>>>> thus 3.16 kernel and Ceph 0.80.7, HDD OSDs with SSD journals)
> >>>>>> after applying some changes.
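Before recreating a journal that fails to decode, it can be worth eyeballing the header region the OSD rejected; a zeroed or obviously garbled start of the partition supports the "disk went back in time" theory. A sketch using a scratch file as a stand-in for the device, so it runs anywhere; against the real device the input would be the journal partition from the log above:

```shell
# Sketch: hex-dump the first bytes of the journal, where the header lives.
# A header the OSD rejects often shows up here as all zeros or obvious
# garbage. A scratch file stands in for the device; the real input would be
# /var/lib/ceph/osd/ceph-30/journal (this only reads, so it is safe).
dev=$(mktemp)
dd if=/dev/zero of="$dev" bs=4096 count=1 2>/dev/null   # stand-in: a zeroed header block
od -A d -t x1 "$dev" | head -n 2                        # offset + hex bytes
```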
> >>>>>>
> >>>>>> It came back up one OSD short, the last log lines before the
> >>>>>> reboot are:
> >>>>>> ---
> >>>>>> 2014-12-05 09:35:27.700330 7f87e789c700  2 -- 10.0.8.21:6823/29520 >> 10.0.8.22:0/5161 pipe(0x7f881b772580 sd=247 :6823 s=2 pgs=21 cs=1 l=1 c=0x7f881f469020).fault (0) Success
> >>>>>> 2014-12-05 09:35:27.700350 7f87f011d700 10 osd.4 pg_epoch: 293 pg[3.316( v 289'1347 (0'0,289'1347] local-les=289 n=8 ec=5 les/c 289/289 288/288/288) [8,4,16] r=1 lpr=288 pi=276-287/1 luod=0'0 crt=289'1345 lcod 289'1346 active] cancel_copy_ops
> >>>>>> ---
> >>>>>>
> >>>>>> Quite obviously it didn't complete its shutdown, so
> >>>>>> unsurprisingly we get:
> >>>>>> ---
> >>>>>> 2014-12-05 09:37:40.278128 7f218a7037c0  1 journal _open /var/lib/ceph/osd/ceph-4/journal fd 24: 10000269312 bytes, block size 4096 bytes, directio = 1, aio = 1
> >>>>>> 2014-12-05 09:37:40.278427 7f218a7037c0 -1 journal read_header error decoding journal header
> >>>>>> 2014-12-05 09:37:40.278479 7f218a7037c0 -1 filestore(/var/lib/ceph/osd/ceph-4) mount failed to open journal /var/lib/ceph/osd/ceph-4/journal: (22) Invalid argument
> >>>>>> 2014-12-05 09:37:40.776203 7f218a7037c0 -1 osd.4 0 OSD:init: unable to mount object store
> >>>>>> 2014-12-05 09:37:40.776223 7f218a7037c0 -1 ** ERROR: osd init failed: (22) Invalid argument
> >>>>>> ---
> >>>>>>
> >>>>>> Thankfully this isn't production yet and I was eventually able
> >>>>>> to recover the OSD by re-creating the journal ("ceph-osd -i 4
> >>>>>> --mkjournal"), but it leaves me with a rather bad taste in my
> >>>>>> mouth.
> >>>>>>
> >>>>>> So the pertinent questions would be:
> >>>>>>
> >>>>>> 1. What caused this?
> >>>>>> My bet is on the evil systemd just pulling the plug before the
> >>>>>> poor OSD had finished its shutdown job.
> >>>>>>
> >>>>>> 2. How to prevent it from happening again?
> >>>>>> Is there something the Ceph developers can do with regards to
> >>>>>> the init scripts? Or is this something to be brought up with the
> >>>>>> Debian maintainer?
> >>>>>> Debian is transitioning from sysv-init to systemd (booo!) with
> >>>>>> Jessie, but the OSDs still have a sysvinit magic file in their
> >>>>>> top directory. Could this have an effect on things?
> >>>>>>
> >>>>>> 3. Is it really that easy to trash your OSDs?
> >>>>>> In case a storage node crashes, am I to expect most if not all
> >>>>>> OSDs, or at least their journals, to require manual loving?
> >>>>>
> >>>>> So this "can't happen".
> >>>>
> >>>> Good thing you quoted that, as it clearly did. ^o^
> >>>>
> >>>> Now the question of how exactly remains to be answered.
> >>>>
> >>>>> Being force-killed definitely can't kill the OSD's disk state;
> >>>>> that's the whole point of the journaling.
> >>>>
> >>>> The other OSDs got to the point where they logged "journal flush
> >>>> done"; this one didn't. Coincidence? I think not.
> >>>>
> >>>> Totally agree about the point of journaling being to prevent this
> >>>> kind of situation, of course.
> >>>>
> >>>>> The error message indicates that the header written on disk is
> >>>>> nonsense to the OSD, which means that the local filesystem or
> >>>>> disk lost something somehow (assuming you haven't done something
> >>>>> silly like downgrading the software version it's running) and
> >>>>> doesn't know it (if there had been a read error the output would
> >>>>> be different).
> >>>>
> >>>> The journal is on an SSD, as stated.
> >>>> And before you ask, it's on an Intel DC S3700.
> >>>>
> >>>> This was created on 0.80.7 just a day before, so no version games.
> >>>>
> >>>>> I'd double-check your disk settings etc. just to be sure, and
> >>>>> check for known issues with xfs on Jessie.
> >>>>>
> >>>> I'm using ext4, but that shouldn't be an issue here to begin with,
> >>>> as the journal is a raw SSD partition.
> >>>>
> >>>> Christian
> >>>
> >>> --
> >>> Christian Balzer        Network/Systems Engineer
> >>> [email protected]           Global OnLine Japan/Fusion Communications
> >>> http://www.gol.com/

--
Christian Balzer        Network/Systems Engineer
[email protected]           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
