On Thu, Jun 7, 2018 at 4:31 PM Alfredo Deza <ad...@redhat.com> wrote:
>
> On Thu, Jun 7, 2018 at 10:23 AM, Dan van der Ster <d...@vanderster.com> wrote:
> > Hi all,
> >
> > We have an intermittent issue where bluestore osds sometimes fail to
> > start after a reboot.
> > The osds all fail the same way [see 2], failing to open the superblock.
> > One one particular host, there are 24 osds and 4 SSDs partitioned for
> > the block.db's. The affected non-starting OSDs all have block.db on
> > the same ssd (/dev/sdaa).
> >
> > The osds are all running 12.2.5 on latest centos 7.5 and were created
> > by ceph-volume lvm, e.g. see [1].
> >
> > This seems like a permissions or similar issue related to the
> > ceph-volume tooling.
> > Any clues how to debug this further?
>
> There are useful logs in both /var/log/ceph/ceph-volume.log and
> /var/log/ceph-volume-systemd.log
>
> This is odd because the OSD is attempting to start, so the logs will
> just mention everything was done accordingly and then the OSD was
> started at the end of the (successful) setup.
>

ceph-volume.log:
[2018-06-07 16:32:58,265][ceph_volume.devices.lvm.activate][DEBUG ]
Found block device (osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5)
with encryption: False
[2018-06-07 16:32:58,266][ceph_volume.process][INFO  ] Running
command: /usr/sbin/blkid -t
PARTUUID="3381a121-1c1b-4e45-a986-c1871c363edc" -o device
[2018-06-07 16:32:58,386][ceph_volume.process][INFO  ] stdout /dev/sdaa1
[2018-06-07 16:32:58,387][ceph_volume.devices.lvm.activate][DEBUG ]
Found block device (osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5)
with encryption: False
[2018-06-07 16:32:58,387][ceph_volume.process][INFO  ] Running
command: ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev
/dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
--path /var/lib/ceph/osd/ceph-48
[2018-06-07 16:32:58,441][ceph_volume.process][INFO  ] Running
command: ln -snf
/dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
/var/lib/ceph/osd/ceph-48/block
[2018-06-07 16:32:58,445][ceph_volume.process][INFO  ] Running
command: chown -R ceph:ceph /dev/dm-4
[2018-06-07 16:32:58,448][ceph_volume.process][INFO  ] Running
command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-48
[2018-06-07 16:32:58,451][ceph_volume.process][INFO  ] Running
command: ln -snf /dev/sdaa1 /var/lib/ceph/osd/ceph-48/block.db
[2018-06-07 16:32:58,453][ceph_volume.process][INFO  ] Running
command: chown -R ceph:ceph /dev/sdaa1
[2018-06-07 16:32:58,456][ceph_volume.process][INFO  ] Running
command: systemctl enable
ceph-volume@lvm-48-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
[2018-06-07 16:32:58,580][ceph_volume.process][INFO  ] Running
command: systemctl start ceph-osd@48


ceph-volume-systemd.log reports nothing for attempts at ceph-volume
lvm activate. But if we start the ceph-volume unit manually we get:

[2018-06-07 16:38:16,952][systemd][INFO  ] raw systemd input received:
lvm-48-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
[2018-06-07 16:38:16,952][systemd][INFO  ] parsed sub-command: lvm,
extra data: 48-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
[2018-06-07 16:38:16,977][ceph_volume.process][INFO  ] Running
command: ceph-volume lvm trigger
48-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
[2018-06-07 16:38:17,961][ceph_volume.process][INFO  ] stdout Running
command: ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev
/dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
--path /var/lib/ceph/osd/ceph-48
Running command: ln -snf
/dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
/var/lib/ceph/osd/ceph-48/block
Running command: chown -R ceph:ceph /dev/dm-4
Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-48
Running command: ln -snf /dev/sdaa1 /var/lib/ceph/osd/ceph-48/block.db
Running command: chown -R ceph:ceph /dev/sdaa1
Running command: systemctl enable
ceph-volume@lvm-48-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
Running command: systemctl start ceph-osd@48
--> ceph-volume lvm activate successful for osd ID: 48
[2018-06-07 16:38:17,968][systemd][INFO  ] successfully trggered
activation for: 48-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5


Everything *looks* correct... but the OSDs always fail
`bluefs_types.h: 54: FAILED assert(pos <= end)`

-- dan






>
> >
> > Thanks!
> >
> > Dan
> >
> > [1]
> >
> > ====== osd.48 ======
> >
> >   [block]    
> > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> >
> >       type                      block
> >       osd id                    48
> >       cluster fsid              dd535a7e-4647-4bee-853d-f34112615f81
> >       cluster name              ceph
> >       osd fsid                  99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> >       db device                 /dev/sdaa1
> >       encrypted                 0
> >       db uuid                   3381a121-1c1b-4e45-a986-c1871c363edc
> >       cephx lockbox secret
> >       block uuid                FQkRxS-No7X-ajkP-5L3N-K22a-IXg6-QLceZC
> >       block device
> > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> >       crush device class        None
> >
> >   [  db]    /dev/sdaa1
> >
> >       PARTUUID                  3381a121-1c1b-4e45-a986-c1871c363edc
> >
> >
> >
> > [2]
> >    -11> 2018-06-07 16:12:16.138407 7fba30fb4d80  1 -- - start start
> >    -10> 2018-06-07 16:12:16.138516 7fba30fb4d80  1
> > bluestore(/var/lib/ceph/osd/ceph-48) _mount path /var/lib/ceph/os
> > d/ceph-48
> >     -9> 2018-06-07 16:12:16.138801 7fba30fb4d80  1 bdev create path
> > /var/lib/ceph/osd/ceph-48/block type kernel
> >     -8> 2018-06-07 16:12:16.138808 7fba30fb4d80  1 bdev(0x55eb46433a00
> > /var/lib/ceph/osd/ceph-48/block) open path /v
> > ar/lib/ceph/osd/ceph-48/block
> >     -7> 2018-06-07 16:12:16.138999 7fba30fb4d80  1 bdev(0x55eb46433a00
> > /var/lib/ceph/osd/ceph-48/block) open size 60
> > 01172414464 (0x57541c00000, 5589 GB) block_size 4096 (4096 B) rotational
> >     -6> 2018-06-07 16:12:16.139188 7fba30fb4d80  1
> > bluestore(/var/lib/ceph/osd/ceph-48) _set_cache_sizes cache_size
> > 134217728 meta 0.01 kv 0.99 data 0
> >     -5> 2018-06-07 16:12:16.139275 7fba30fb4d80  1 bdev create path
> > /var/lib/ceph/osd/ceph-48/block type kernel
> >     -4> 2018-06-07 16:12:16.139281 7fba30fb4d80  1 bdev(0x55eb46433c00
> > /var/lib/ceph/osd/ceph-48/block) open path /v
> > ar/lib/ceph/osd/ceph-48/block
> >     -3> 2018-06-07 16:12:16.139454 7fba30fb4d80  1 bdev(0x55eb46433c00
> > /var/lib/ceph/osd/ceph-48/block) open size 60
> > 01172414464 (0x57541c00000, 5589 GB) block_size 4096 (4096 B) rotational
> >     -2> 2018-06-07 16:12:16.139464 7fba30fb4d80  1 bluefs
> > add_block_device bdev 1 path /var/lib/ceph/osd/ceph-48/blo
> > ck size 5589 GB
> >     -1> 2018-06-07 16:12:16.139510 7fba30fb4d80  1 bluefs mount
> >      0> 2018-06-07 16:12:16.142930 7fba30fb4d80 -1
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILA
> > BLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/o
> > s/bluestore/bluefs_types.h: In function 'static void
> > bluefs_fnode_t::_denc_finish(ceph::buffer::ptr::iterator&, __u8
> > *, __u8*, char**, uint32_t*)' thread 7fba30fb4d80 time 2018-06-07
> > 16:12:16.139666
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/os/bluestore/bluefs_types.h:
> > 54: FAILED assert(pos <= end)
> >
> >  ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a)
> > luminous (stable)
> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x110) [0x55eb3b597780]
> >  2: (bluefs_super_t::decode(ceph::buffer::list::iterator&)+0x776)
> > [0x55eb3b52db36]
> >  3: (BlueFS::_open_super()+0xfe) [0x55eb3b50cede]
> >  4: (BlueFS::mount()+0xe3) [0x55eb3b5250c3]
> >  5: (BlueStore::_open_db(bool)+0x173d) [0x55eb3b43ebcd]
> >  6: (BlueStore::_mount(bool)+0x40e) [0x55eb3b47025e]
> >  7: (OSD::init()+0x3bd) [0x55eb3b02a1cd]
> >  8: (main()+0x2d07) [0x55eb3af2f977]
> >  9: (__libc_start_main()+0xf5) [0x7fba2d47b445]
> >  10: (()+0x4b7033) [0x55eb3afce033]
> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> > needed to interpret this.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Reply via email to