Re: [ceph-users] rocksdb: Corruption: missing start of fragmented record

2017-11-20 Thread Gregory Farnum
On Mon, Nov 20, 2017 at 9:27 AM, Michael Schmid  wrote:
> Gregory Farnum wrote:
>> Your hardware and configuration is very relevant.
>> [...]
>> I'd look at whether you have a writeback cache somewhere that isn't
>> reflecting ordering requirements, or if your disk passes a crash consistency
>> tester. (No, I don't know one off-hand. But many disks lie horribly even
>> about stuff like flushes.)
> It may certainly have had something to do with how I managed to end up with
> the broken rocksdb WAL log. Maybe this is not the best possible behavior
> when one simulates a crash or drive disconnect. Perhaps if I can
> get this OSD back in action, and the issue occurs entirely predictably on
> another test, I'll eventually start to see a pattern in where / how it happens
> & maybe even find out what hardware / configuration changes might be needed
> to prevent the WAL from corrupting. Perhaps.
>
> --
>
> However, my actual & immediate Ceph + ceph-users relevant problem with this
> is basically only that I cannot seem to figure out how one could deal with
> such an already broken rocksdb WAL log.
>
> 1. Ceph's tooling and rocksdb don't *appear* to be able to deal with this
> WAL file once it has been corrupted, certainly not with the commands that
> I tried.
> I initially had hoped for some tool to be able to do something - drop the
> log, revert to an earlier backup of a consistent db - any option like that
> that I might have missed. Judging by this ML so far, I'm going to guess
> there is no such thing? So the subsequent problem is:

The error is pretty clear: "Corruption: missing start of fragmented record(2)".
What that says to me is that rocksdb has a journal entry saying that
record *does* exist, but it's missing the opening block or something.
I.e., during an atomic write it (1) wrote down a lookaside block, (2)
flushed that block to disk, and (3) journaled that it had written the
block. But now on restart, it's finding out that (2) apparently didn't
happen.
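To make that ordering concrete, here is a minimal sketch (not Ceph or RocksDB code; the file layout and names are invented for illustration) of the write-ahead discipline described above, and of what replay sees if step (2) is silently dropped by a lying cache:

```python
import os, tempfile

def atomic_write(dirpath, name, data):
    block = os.path.join(dirpath, name + ".block")
    journal = os.path.join(dirpath, "journal")
    # (1) write the lookaside block
    with open(block, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # (2) force the block to stable storage
    # (3) only now journal that the block exists
    with open(journal, "ab") as f:
        f.write((name + "\n").encode())
        f.flush()
        os.fsync(f.fileno())

def replay(dirpath):
    # On restart, every journaled name must have a persisted block;
    # if (2) never actually reached the platter, this fails.
    recovered = []
    with open(os.path.join(dirpath, "journal")) as f:
        for name in f.read().splitlines():
            if not os.path.exists(os.path.join(dirpath, name + ".block")):
                raise IOError("corruption: journal references missing block " + name)
            recovered.append(name)
    return recovered

d = tempfile.mkdtemp()
atomic_write(d, "rec1", b"payload")
print(replay(d))  # ['rec1']
```

If the hardware honors fsync ordering, replay can never see a journal entry without its block; the corruption in this thread implies that ordering was violated somewhere.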

>
> 2. I do not know how I can get manual, filewise access to the rocksdb WAL
> logs. This may be immensely simple, but I simply don't know how.
> I don't have any indication that either 1. or 2. is failing due to hardware
> or configuration specifics (...beyond having this broken WAL log) so far.

Rocksdb may offer repair tools for this (I have no idea), but the
fundamental issue is that as far as the program can tell, the
underlying hardware lied, the disk state is corrupted, and it has no
idea what data it can trust or not at this point. Same with Ceph; the
OSD has no desire to believe anything a corrupted disk tells it since
that can break all of our invariants.
BlueStore is a custom block-device-managing system; we have a way to
mount and poke at it via FUSE, but that assumes the data on disk makes
sense. In this case, it doesn't (RocksDB stores the disk layout
metadata). Somebody more familiar with bluestore development may know
if there's a way to mount only the "BlueFS" portion that RocksDB
writes its own data to; if there is, it's just a bunch of .ldb files or
whatever, but those are again a custom data format that you'll need
rocksdb expertise to do anything with...
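For what it's worth, a hedged sketch of the extraction route: sufficiently recent builds of ceph-bluestore-tool include a "bluefs-export" command that copies the BlueFS files (RocksDB's CURRENT, MANIFEST-*, *.log and *.sst) out to an ordinary directory. I can't say whether the 12.2.0 build in this thread ships it, so check --help first; paths below are examples only.

```shell
OSD_PATH=/var/lib/ceph/osd/ceph-2
OUT_DIR=/mnt/bluefs-dump

if command -v ceph-bluestore-tool >/dev/null 2>&1; then
    # verify this build actually has the subcommand before relying on it
    ceph-bluestore-tool --help 2>&1 | grep -i bluefs || true
    ceph-bluestore-tool bluefs-export --path "$OSD_PATH" --out-dir "$OUT_DIR"
    ls "$OUT_DIR/db"    # the exported rocksdb directory, e.g. 001005.log
fi
```

Once the db/ directory sits on a plain filesystem, ordinary rocksdb tooling can at least be attempted against it.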

Toss this disk and let Ceph do its recovery thing. Look hard at what
your hardware configuration is doing to make sure it doesn't happen
again. *shrug*
-Greg

>
>> As you note, the WAL should be able to handle being incompletely-written
> Yes, I'd also have thought so? But it apparently just isn't able to deal
> with this log file corruption. Maybe it's not an extremely specific bug;
> maybe a lot of possible WAL corruptions throw a comparable error and
> prevent replay.
>
>> and both Ceph and RocksDB are designed to handle failures mid-write.
> As far as I can tell, not in this WAL log case, no. It would certainly be
> really interesting to see at this point whether just moving or deleting
> that WAL log allows everything to continue and the OSD to come online,
> and whether a subsequent scrub then fixes the entirety of this issue.
> Maybe everything is essentially fine apart from JUST the WAL log replay
> and maybe one or another bit of a page on the OSD.
>
>> That RocksDB *isn't* doing that here implies either 1) there's a fatal bug
>> in rocksdb
> Not so sure. Ultimately rocksdb does seem to throw a fairly indicative
> error: "db/001005.log: dropping 3225 bytes; Corruption: missing start of
> fragmented record(2)".
> Maybe they intend that users use a repair tool at
> https://github.com/facebook/rocksdb/wiki/RocksDB-Repairer . Or maybe it's a
> case for manual interaction with the file.
>
> But my point 2 (namely that I don't even understand how to get filewise
> access to rocksdb's files) has so far prevented me from trying either.
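For what it's worth, that Repairer is exposed through rocksdb's bundled `ldb` utility. A hedged sketch, assuming the db/ directory has first been copied out to a plain filesystem path (it will not work against a raw BlueStore device; the path below is invented):

```shell
DB_DIR=/mnt/bluefs-dump/db   # example: an already-extracted rocksdb directory

if command -v ldb >/dev/null 2>&1; then
    # RepairDB is best-effort: it tries to rebuild a consistent manifest
    # from whatever SST and log files it can still read. Data it cannot
    # place reportedly ends up under "$DB_DIR/lost/" for inspection.
    ldb repair --db="$DB_DIR"
fi
```

Whether the repaired db is then safe to hand back to an OSD is a separate question; the OSD's own invariants may still be violated.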
>
>
>
> Thanks for your input!
>
> -Michael
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rocksdb: Corruption: missing start of fragmented record

2017-11-19 Thread Gregory Farnum
On Thu, Nov 2, 2017 at 8:55 AM Michael  wrote:

> Christian Balzer wrote:
> > Your exact system configuration (HW, drives, controller, settings, etc)
> > would be interesting as I can think of plenty scenarios on how to corrupt
> > things that normally shouldn't be affected by such actions
> Oh, the hardware in question is consumer grade and not new. Some old i7
> machine. But my current guess is that the specific hardware is
> semi-unrelated.
>
> I think there probably is just a WAL log or entry in there that wasn't
> finished writing to disk, or that got corrupted. But WAL entries aren't
> something that should be assumed to be complete and correct after a
> failure, right? That's the nature of a WAL.
>
> If this can't be fixed automatically with some command, I would simply
> like to have a look at & tinker trivially with these DB files if
> possible (which I still haven't figured out a way to do).
>
>

Your hardware and configuration is very relevant. As you note, the WAL
should be able to handle being incompletely-written, and both Ceph and
RocksDB are designed to handle failures mid-write. That RocksDB *isn't*
doing that here implies either
1) there's a fatal bug in rocksdb, or
2) your hardware configuration is not honoring the storage consistency APIs
correctly.

Given past experience, (2) is far more likely than (1) is. The RocksDB
community may be able to give you more immediate guidance to what exactly
went wrong, but I'd look at whether you have a writeback cache somewhere
that isn't reflecting ordering requirements, or if your disk passes a crash
consistency tester. (No, I don't know one off-hand. But many disks lie
horribly even about stuff like flushes.)
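A hedged sketch of where such an inspection could start (device names are examples, and none of this is from the original thread; these only reveal cache settings, not whether the drive actually honors flushes):

```shell
# Query the volatile write cache on each disk; a writeback cache that
# ignores flush/FUA is the classic way to violate journal ordering.
for dev in /dev/sd?; do
    [ -e "$dev" ] || continue
    hdparm -W "$dev" 2>/dev/null || true          # drive's write-cache state (disable with -W0)
    smartctl -g wcache "$dev" 2>/dev/null || true # same query via smartmontools, where supported
done
# The kernel's view of each block queue: "write back" vs "write through"
cat /sys/block/*/queue/write_cache 2>/dev/null || true
```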
-Greg


>
>
> Christian Balzer wrote:
> > Now that bit is quite disconcerting, though you're one release behind the
> > curve and from what I read .2 has plenty more bug fixes coming.
> Fair point. I just tried with 12.2.1 (on pre-release Ubuntu bionic now).
>
> Doesn't change anything - fsck doesn't fix rocksdb, the bluestore won't
> mount, the OSD won't activate and the error is the same.
>
> Is there any fix in .2 that might address this, or do you just mean that
> in general there will be bug fixes?
>
>
> Thanks for your response!
>
> - Michael


Re: [ceph-users] rocksdb: Corruption: missing start of fragmented record

2017-11-13 Thread Konstantin Shalygin

Which isn't released yet, yes. I could try building the development
repository if you think that has a chance of resolving the issue?

For tests - yes...
This ML thread says that 12.2.2 should be based on commit
1071fdcf73faa387d0df18489ab7b0359a0c0afb.


Re: [ceph-users] rocksdb: Corruption: missing start of fragmented record

2017-11-13 Thread Michael

Konstantin Shalygin wrote:
> I think Christian talks about version 12.2.2, not 12.2.*

Which isn't released yet, yes. I could try building the development 
repository if you think that has a chance of resolving the issue?


Although I'd still like to know how I could theoretically get my hands 
on these rocksdb files manually, if anyone knows how to do that? I still 
have no idea how.


I also reported this as a bug last week, in case anyone has information 
or the same issue:

http://tracker.ceph.com/issues/22044


Re: [ceph-users] rocksdb: Corruption: missing start of fragmented record

2017-11-12 Thread Konstantin Shalygin

Fair point. I just tried with 12.2.1 (on pre-release Ubuntu bionic now).

Doesn't change anything - fsck doesn't fix rocksdb, the bluestore won't
mount, the OSD won't activate and the error is the same.

Is there any fix in .2 that might address this, or do you just mean that
in general there will be bug fixes?


I think Christian talks about version 12.2.2, not 12.2.*



Re: [ceph-users] rocksdb: Corruption: missing start of fragmented record

2017-11-01 Thread Michael

Christian Balzer wrote:

Your exact system configuration (HW, drives, controller, settings, etc)
would be interesting as I can think of plenty scenarios on how to corrupt
things that normally shouldn't be affected by such actions
Oh, the hardware in question is consumer grade and not new. Some old i7 
machine. But my current guess is that the specific hardware is 
semi-unrelated.


I think there probably is just a WAL log or entry in there that wasn't 
finished writing to disk, or that got corrupted. But WAL entries aren't 
something that should be assumed to be complete and correct after a 
failure, right? That's the nature of a WAL.


If this can't be fixed automatically with some command, I would simply 
like to have a look at & tinker trivially with these DB files if 
possible (which I still haven't figured out a way to do).



Christian Balzer wrote:

Now that bit is quite disconcerting, though you're one release behind the
curve and from what I read .2 has plenty more bug fixes coming.

Fair point. I just tried with 12.2.1 (on pre-release Ubuntu bionic now).

Doesn't change anything - fsck doesn't fix rocksdb, the bluestore won't 
mount, the OSD won't activate and the error is the same.


Is there any fix in .2 that might address this, or do you just mean that 
in general there will be bug fixes?



Thanks for your response!

- Michael


Re: [ceph-users] rocksdb: Corruption: missing start of fragmented record

2017-11-01 Thread Christian Balzer

Hello,

On Wed, 1 Nov 2017 09:30:06 +0100 Michael wrote:

> Hello everyone,
> 
> I've conducted some crash tests (unplugging drives, the machine, 

Your exact system configuration (HW, drives, controller, settings, etc)
would be interesting as I can think of plenty scenarios on how to corrupt
things that normally shouldn't be affected by such actions.

> terminating and restarting ceph systemd services) with Ceph 12.2.0 on

Now that bit is quite disconcerting, though you're one release behind the
curve and from what I read .2 has plenty more bug fixes coming.
 
Christian

> Ubuntu and quite easily managed to corrupt what appears to be rocksdb's 
> log replay on a bluestore OSD:
> 
> # ceph-bluestore-tool fsck  --path /var/lib/ceph/osd/ceph-2/
> [...]
> 4 rocksdb: 
> [/build/ceph-pKGC1D/ceph-12.2.0/src/rocksdb/db/version_set.cc:2859] 
> Recovered from manifest file:db/MANIFEST-000975 
> succeeded,manifest_file_number is 975, next_file_number is 1008, 
> last_sequence is 51965907, log_number is 0,prev_log_number is 
> 0,max_column_family is 0
> 4 rocksdb: 
> [/build/ceph-pKGC1D/ceph-12.2.0/src/rocksdb/db/version_set.cc:2867] 
> Column family [default] (ID 0), log number is 1005
> 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1509298585082794, "job": 1, 
> "event": "recovery_started", "log_files": [1003, 1005]}
> 4 rocksdb: 
> [/build/ceph-pKGC1D/ceph-12.2.0/src/rocksdb/db/db_impl_open.cc:482] 
> Recovering log #1003 mode 0
> 4 rocksdb: 
> [/build/ceph-pKGC1D/ceph-12.2.0/src/rocksdb/db/db_impl_open.cc:482] 
> Recovering log #1005 mode 0
> 3 rocksdb: 
> [/build/ceph-pKGC1D/ceph-12.2.0/src/rocksdb/db/db_impl_open.cc:424] 
> db/001005.log: dropping 3225 bytes; Corruption: missing start of 
> fragmented record(2)
> 4 rocksdb: 
> [/build/ceph-pKGC1D/ceph-12.2.0/src/rocksdb/db/db_impl.cc:217] Shutdown: 
> canceling all background work
> 4 rocksdb: 
> [/build/ceph-pKGC1D/ceph-12.2.0/src/rocksdb/db/db_impl.cc:343] Shutdown 
> complete
> -1 rocksdb: Corruption: missing start of fragmented record(2)
> -1 bluestore(/var/lib/ceph/osd/ceph-2/) _open_db erroring opening db:
> 1 bluefs umount
> 1 bdev(0x557f5b6a4240 /var/lib/ceph/osd/ceph-2//block) close
> 
> If I understand this right, rocksdb is just trying to replay WAL-type 
> logs, of which presumably "001005.log" is corrupted. It then throws an 
> error that stops everything.
> 
> I did try to mount the bluestore, as I was assuming that would probably 
> be where I'd find rocksdb's files somewhere, but that also doesn't seem 
> possible:
> 
> #ceph-objectstore-tool --op fsck --data-path /var/lib/ceph/osd/ceph-2/ 
> --mountpoint /mnt/bluestore-repair/
> fsck failed: (5) Input/output error
> # ceph-objectstore-tool --op fuse --data-path /var/lib/ceph/osd/ceph-2 
> --mountpoint /mnt/bluestore-repair/
> Mount failed with '(5) Input/output error'
> # ceph-objectstore-tool --op fuse --force --skip-journal-replay 
> --data-path /var/lib/ceph/osd/ceph-2 --mountpoint /mnt/bluestore-repair/
> Mount failed with '(5) Input/output error'
> 
> Adding --debug shows the ultimate culprit is just the above rocksdb 
> error again.
> 
> Q: Is there some way in which I can tell rocksdb to truncate or delete / 
> skip the respective log entries? Or can I get access to rocksdb('s 
> files) in some other way to just manipulate it or delete corrupted WAL 
> files manually?
> 
> -Michael
> 


-- 
Christian Balzer            Network/Systems Engineer
ch...@gol.com               Rakuten Communications


[ceph-users] rocksdb: Corruption: missing start of fragmented record

2017-11-01 Thread Michael

Hello everyone,

I've conducted some crash tests (unplugging drives, the machine, 
terminating and restarting ceph systemd services) with Ceph 12.2.0 on 
Ubuntu and quite easily managed to corrupt what appears to be rocksdb's 
log replay on a bluestore OSD:


# ceph-bluestore-tool fsck  --path /var/lib/ceph/osd/ceph-2/
[...]
4 rocksdb: 
[/build/ceph-pKGC1D/ceph-12.2.0/src/rocksdb/db/version_set.cc:2859] 
Recovered from manifest file:db/MANIFEST-000975 
succeeded,manifest_file_number is 975, next_file_number is 1008, 
last_sequence is 51965907, log_number is 0,prev_log_number is 
0,max_column_family is 0
4 rocksdb: 
[/build/ceph-pKGC1D/ceph-12.2.0/src/rocksdb/db/version_set.cc:2867] 
Column family [default] (ID 0), log number is 1005
4 rocksdb: EVENT_LOG_v1 {"time_micros": 1509298585082794, "job": 1, 
"event": "recovery_started", "log_files": [1003, 1005]}
4 rocksdb: 
[/build/ceph-pKGC1D/ceph-12.2.0/src/rocksdb/db/db_impl_open.cc:482] 
Recovering log #1003 mode 0
4 rocksdb: 
[/build/ceph-pKGC1D/ceph-12.2.0/src/rocksdb/db/db_impl_open.cc:482] 
Recovering log #1005 mode 0
3 rocksdb: 
[/build/ceph-pKGC1D/ceph-12.2.0/src/rocksdb/db/db_impl_open.cc:424] 
db/001005.log: dropping 3225 bytes; Corruption: missing start of 
fragmented record(2)
4 rocksdb: 
[/build/ceph-pKGC1D/ceph-12.2.0/src/rocksdb/db/db_impl.cc:217] Shutdown: 
canceling all background work
4 rocksdb: 
[/build/ceph-pKGC1D/ceph-12.2.0/src/rocksdb/db/db_impl.cc:343] Shutdown 
complete

-1 rocksdb: Corruption: missing start of fragmented record(2)
-1 bluestore(/var/lib/ceph/osd/ceph-2/) _open_db erroring opening db:
1 bluefs umount
1 bdev(0x557f5b6a4240 /var/lib/ceph/osd/ceph-2//block) close

If I understand this right, rocksdb is just trying to replay WAL-type 
logs, of which presumably "001005.log" is corrupted. It then throws an 
error that stops everything.
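The "fragmented record" in the message refers to RocksDB's WAL format, where a record too large for one log block is written as FIRST/MIDDLE/LAST fragments that replay must stitch back together. A toy sketch (deliberately simplified and invented; not the real on-disk layout) of what the reader does when the FIRST fragment never made it to disk:

```python
# Fragment types, loosely modeled on rocksdb's log record types.
FULL, FIRST, MIDDLE, LAST = 1, 2, 3, 4

def replay(fragments):
    """Reassemble (type, bytes) fragments into whole records.

    A MIDDLE or LAST fragment with no preceding FIRST means the start
    of the record is gone; like rocksdb, we drop those bytes and keep
    count rather than guess at the record's contents.
    """
    records, partial, dropped = [], None, 0
    for ftype, data in fragments:
        if ftype == FULL:
            records.append(data)
        elif ftype == FIRST:
            partial = data
        elif ftype in (MIDDLE, LAST):
            if partial is None:
                dropped += len(data)  # "missing start of fragmented record"
                continue
            partial += data
            if ftype == LAST:
                records.append(partial)
                partial = None
    return records, dropped

# An intact log replays cleanly:
recs, drop = replay([(FIRST, b"ab"), (LAST, b"cd"), (FULL, b"x")])
# A log whose head fragment was lost drops bytes, as in the error above:
recs2, drop2 = replay([(MIDDLE, b"lost"), (LAST, b"tail")])
```

Real rocksdb treats this as corruption and (by default) refuses to open the db, which is exactly the behavior seen here; the "dropping 3225 bytes" in the log is the analogue of the `dropped` counter.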


I did try to mount the bluestore, as I was assuming that would probably 
be where I'd find rocksdb's files somewhere, but that also doesn't seem 
possible:


#ceph-objectstore-tool --op fsck --data-path /var/lib/ceph/osd/ceph-2/ 
--mountpoint /mnt/bluestore-repair/

fsck failed: (5) Input/output error
# ceph-objectstore-tool --op fuse --data-path /var/lib/ceph/osd/ceph-2 
--mountpoint /mnt/bluestore-repair/

Mount failed with '(5) Input/output error'
# ceph-objectstore-tool --op fuse --force --skip-journal-replay 
--data-path /var/lib/ceph/osd/ceph-2 --mountpoint /mnt/bluestore-repair/

Mount failed with '(5) Input/output error'

Adding --debug shows the ultimate culprit is just the above rocksdb 
error again.


Q: Is there some way in which I can tell rocksdb to truncate or delete / 
skip the respective log entries? Or can I get access to rocksdb('s 
files) in some other way to just manipulate it or delete corrupted WAL 
files manually?


-Michael
