On Thu, May 25, 2017 at 8:39 AM Stuart Harland <s.harl...@livelinktechnology.net> wrote:

> Has no-one any idea about this? If needed I can produce more information
> or diagnostics on request. I find it hard to believe that we are the only
> people experiencing this, and thus far we have lost about 40 OSDs to
> corruption due to this.
>
> Regards
>
> Stuart Harland
>
>
>
> On 24 May 2017, at 10:32, Stuart Harland <s.harl...@livelinktechnology.net>
> wrote:
>
> Hello
>
> I think I’m running into a bug that is described at
> http://tracker.ceph.com/issues/14213 for Hammer.
>
> However, I’m running the latest version of Jewel, 10.2.7, although I’m in
> the middle of upgrading the cluster (from 10.2.5). At first the crashes hit
> only a couple of nodes, but now the problem seems to be more pervasive.
>
> I have seen this issue with osd_map_cache_size set to 20 as well as 500,
> which I increased to try and compensate for it.
>
> My two questions are:
>
> 1) Is this fixed, and if so, in which version?
>
The only person who has reported anything like this recently was working on
a FreeBSD port. Other than the one tracker bug you found, errors like this
are usually the result of failing disks, buggy local filesystems, or
incorrect configuration (such as turning off write barriers).
I assume you didn't just upgrade from a pre-Jewel release that might have
been susceptible to that tracker issue.
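If it helps, this is roughly what I'd look at first on one of the affected
hosts; device names and paths here are only examples, so adjust them to your
layout:

  # SMART status of the disk backing the OSD (example device)
  smartctl -H -a /dev/sdb

  # any kernel-level I/O or XFS errors since boot
  dmesg | egrep -i 'xfs|i/o error|medium error'

  # make sure none of the OSD filesystems are mounted with barriers off
  mount | grep /var/lib/ceph/osd | grep nobarrier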



> 2) Is there a way to recover the damaged OSD metadata? I really don’t
> want to keep rebuilding large numbers of disks over something that seems
> arbitrary.
>
>
I saw somewhere (check the list archives?) that you may be able to get
around it by removing just the PG which is causing the crash, assuming it
has replicas elsewhere.
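I haven't tried this against your exact situation, so treat it as a sketch,
but with the OSD daemon stopped it would look roughly like the following
(pgid and paths are taken from your log; export the PG first so you have a
copy to fall back on):

  systemctl stop ceph-osd@1908          # or however you manage the daemon

  # keep a copy of the PG before touching it
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/txc1-1908 \
      --journal-path /var/lib/ceph/osd/txc1-1908/journal \
      --pgid 11.3f5a --op export --file /root/11.3f5a.export

  # remove it so the OSD can start; the data should backfill from replicas
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/txc1-1908 \
      --journal-path /var/lib/ceph/osd/txc1-1908/journal \
      --pgid 11.3f5a --op remove

Again, that only makes sense if the PG really is healthy on its other
replicas.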

But more generally, you want to figure out how this is happening. Either
you've got disk state which was previously broken and undetected (which, if
you've been running 10.2.5 on all your OSDs, I don't think is possible), or
you've experienced recent failures which are unlikely to be Ceph software
bugs. (They might be! But you'd be the only one to report them anywhere I
can see.)
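Since you're mid-upgrade it's also worth double-checking what the daemons
are actually running and what the cluster itself is complaining about,
along these lines (exact form may differ slightly on your version):

  ceph tell osd.* version    # every OSD should report 10.2.5 or 10.2.7
  ceph health detail         # anything inconsistent beyond the crashed OSDs?
  ceph osd tree | grep down  # which OSDs are currently down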
-Greg


>
>
> SEEK_HOLE is disabled via 'filestore seek data hole' config option
>    -31> 2017-05-24 10:23:10.152349 7f24035e2800  0 genericfilestorebackend(/var/lib/ceph/osd/txc1-1908) detect_features: splice is supported
>    -30> 2017-05-24 10:23:10.182065 7f24035e2800  0 genericfilestorebackend(/var/lib/ceph/osd/txc1-1908) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
>    -29> 2017-05-24 10:23:10.182112 7f24035e2800  0 xfsfilestorebackend(/var/lib/ceph/osd/txc1-1908) detect_feature: extsize is disabled by conf
>    -28> 2017-05-24 10:23:10.182839 7f24035e2800  1 leveldb: Recovering log
> #23079
>    -27> 2017-05-24 10:23:10.284173 7f24035e2800  1 leveldb: Delete type=0
> #23079
>
>    -26> 2017-05-24 10:23:10.284223 7f24035e2800  1 leveldb: Delete type=3
> #23078
>
>    -25> 2017-05-24 10:23:10.284807 7f24035e2800  0 filestore(/var/lib/ceph/osd/txc1-1908) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
>    -24> 2017-05-24 10:23:10.285581 7f24035e2800  2 journal open /var/lib/ceph/osd/txc1-1908/journal fsid 8dada68b-0d1c-4f2a-bc96-1d861577bc98 fs_op_seq 20363902
>    -23> 2017-05-24 10:23:10.289523 7f24035e2800  1 journal _open
> /var/lib/ceph/osd/txc1-1908/journal fd 18: 5367660544 bytes, block
> size 4096 bytes, directio = 1, aio = 1
>    -22> 2017-05-24 10:23:10.293733 7f24035e2800  2 journal open advancing
> committed_seq 20363681 to fs op_seq 20363902
>    -21> 2017-05-24 10:23:10.293743 7f24035e2800  2 journal read_entry --
> not readable
>    -20> 2017-05-24 10:23:10.293744 7f24035e2800  2 journal read_entry --
> not readable
>    -19> 2017-05-24 10:23:10.293745 7f24035e2800  3 journal journal_replay:
> end of journal, done.
>    -18> 2017-05-24 10:23:10.297605 7f24035e2800  1 journal _open
> /var/lib/ceph/osd/txc1-1908/journal fd 18: 5367660544 bytes, block
> size 4096 bytes, directio = 1, aio = 1
>    -17> 2017-05-24 10:23:10.298470 7f24035e2800  1
> filestore(/var/lib/ceph/osd/txc1-1908) upgrade
>    -16> 2017-05-24 10:23:10.298509 7f24035e2800  2 osd.1908 0 boot
>    -15> 2017-05-24 10:23:10.300096 7f24035e2800  1 <cls>
> cls/replica_log/cls_replica_log.cc:141: Loaded replica log class!
>    -14> 2017-05-24 10:23:10.300384 7f24035e2800  1 <cls>
> cls/user/cls_user.cc:375: Loaded user class!
>    -13> 2017-05-24 10:23:10.300617 7f24035e2800  0 <cls>
> cls/hello/cls_hello.cc:305: loading cls_hello
>    -12> 2017-05-24 10:23:10.303748 7f24035e2800  1 <cls>
> cls/refcount/cls_refcount.cc:232: Loaded refcount class!
>    -11> 2017-05-24 10:23:10.304120 7f24035e2800  1 <cls>
> cls/version/cls_version.cc:228: Loaded version class!
>    -10> 2017-05-24 10:23:10.304439 7f24035e2800  1 <cls>
> cls/log/cls_log.cc:317: Loaded log class!
>     -9> 2017-05-24 10:23:10.307437 7f24035e2800  1 <cls>
> cls/rgw/cls_rgw.cc:3359: Loaded rgw class!
>     -8> 2017-05-24 10:23:10.307768 7f24035e2800  1 <cls>
> cls/timeindex/cls_timeindex.cc:259: Loaded timeindex class!
>     -7> 2017-05-24 10:23:10.307927 7f24035e2800  0 <cls>
> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
>     -6> 2017-05-24 10:23:10.308086 7f24035e2800  1 <cls>
> cls/statelog/cls_statelog.cc:306: Loaded log class!
>     -5> 2017-05-24 10:23:10.315241 7f24035e2800  0 osd.1908 863035 crush map has features 2234490552320, adjusting msgr requires for clients
>     -4> 2017-05-24 10:23:10.315258 7f24035e2800  0 osd.1908 863035 crush map has features 2234490552320 was 8705, adjusting msgr requires for mons
>     -3> 2017-05-24 10:23:10.315267 7f24035e2800  0 osd.1908 863035 crush map has features 2234490552320, adjusting msgr requires for osds
>     -2> 2017-05-24 10:23:10.441444 7f24035e2800  0 osd.1908 863035 load_pgs
>     -1> 2017-05-24 10:23:10.442608 7f24035e2800 -1 osd.1908 863035
> load_pgs: have pgid 11.3f5a at epoch 863078, but missing map.  Crashing.
>      0> 2017-05-24 10:23:10.444151 7f24035e2800 -1 osd/OSD.cc: In function 'void OSD::load_pgs()' thread 7f24035e2800 time 2017-05-24 10:23:10.442617
> osd/OSD.cc: 3189: FAILED assert(0 == "Missing map in load_pgs")
>
>  ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x8b) [0x55d1874be6db]
>  2: (OSD::load_pgs()+0x1f9b) [0x55d186e6a26b]
>  3: (OSD::init()+0x1f74) [0x55d186e7aec4]
>  4: (main()+0x29d1) [0x55d186de1d71]
>  5: (__libc_start_main()+0xf5) [0x7f24004fdf45]
>  6: (()+0x356a47) [0x55d186e2aa47]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> Regards
>
> Stuart Harland
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
