On Thu, Jun 7, 2018 at 6:58 PM Alfredo Deza <ad...@redhat.com> wrote:
>
> On Thu, Jun 7, 2018 at 12:09 PM, Sage Weil <sw...@redhat.com> wrote:
> > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> >> On Thu, Jun 7, 2018 at 5:36 PM Dan van der Ster <d...@vanderster.com> 
> >> wrote:
> >> >
> >> > On Thu, Jun 7, 2018 at 5:34 PM Sage Weil <sw...@redhat.com> wrote:
> >> > >
> >> > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> >> > > > On Thu, Jun 7, 2018 at 4:41 PM Sage Weil <sw...@redhat.com> wrote:
> >> > > > >
> >> > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> >> > > > > > On Thu, Jun 7, 2018 at 4:33 PM Sage Weil <sw...@redhat.com> 
> >> > > > > > wrote:
> >> > > > > > >
> >> > > > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
> >> > > > > > > > Hi all,
> >> > > > > > > >
> >> > > > > > > > We have an intermittent issue where bluestore OSDs sometimes
> >> > > > > > > > fail to start after a reboot.
> >> > > > > > > > The OSDs all fail the same way [see 2], failing to open the
> >> > > > > > > > superblock.
> >> > > > > > > > On one particular host, there are 24 OSDs and 4 SSDs
> >> > > > > > > > partitioned for the block.db's. The affected non-starting
> >> > > > > > > > OSDs all have block.db on the same SSD (/dev/sdaa).
> >> > > > > > > >
> >> > > > > > > > The OSDs are all running 12.2.5 on the latest CentOS 7.5 and
> >> > > > > > > > were created by ceph-volume lvm, e.g. see [1].
> >> > > > > > > >
> >> > > > > > > > This seems like a permissions issue, or something similar
> >> > > > > > > > related to the ceph-volume tooling.
> >> > > > > > > > Any clues on how to debug this further?
> >> > > > > > >
> >> > > > > > > I take it the OSDs start up if you try again?
> >> > > > > >
> >> > > > > > Hey.
> >> > > > > > No, they don't. For example, we've run `ceph-volume lvm
> >> > > > > > activate 48 99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5` several
> >> > > > > > times, and it's the same mount failure every time.
> >> > > > >
> >> > > > > That sounds like a bluefs bug then, not a ceph-volume issue.
> >> > > > > Can you try to start the OSD with logging enabled?  (debug
> >> > > > > bluefs = 20, debug bluestore = 20)
> >> > > > >
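
[Note for anyone digging through the archives: one way to capture such a
log is to bump those two options and run the OSD in the foreground, e.g.

    ceph-osd -f -i 48 --setuser ceph --setgroup ceph \
        --debug-bluefs 20 --debug-bluestore 20

or set "debug bluefs = 20" and "debug bluestore = 20" under [osd] in
ceph.conf. This is only a sketch of the idea, not necessarily the exact
invocation used here.]
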
> >> > > >
> >> > > > Here: https://pastebin.com/TJXZhfcY
> >> > > >
> >> > > > Is it supposed to print something about the block.db at some 
> >> > > > point????
> >> > >
> >> > > Can you dump the bluefs superblock for me?
> >> > >
> >> > > dd if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
> >> > > hexdump -C /tmp/foo
> >> > >
> >> >
> >> > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# dd if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
> >> > 1+0 records in
> >> > 1+0 records out
> >> > 4096 bytes (4.1 kB) copied, 0.000320003 s, 12.8 MB/s
> >> > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# hexdump -C /tmp/foo
> >> > 00000000  01 01 5d 00 00 00 11 fb  be 4d 43 31 4a b5 a4 cb  |..]......MC1J...|
> >> > 00000010  99 be b7 da 72 ca 99 fd  8e 36 fc 4d 4b bc 83 d9  |....r....6.MK...|
> >> > 00000020  f5 e6 11 cd e4 b5 1d 00  00 00 00 00 00 00 00 10  |................|
> >> > 00000030  00 00 01 01 2b 00 00 00  01 80 80 40 00 00 00 00  |....+......@....|
> >> > 00000040  00 00 00 00 00 02 00 00  00 01 01 07 00 00 00 eb  |................|
> >> > 00000050  b2 00 00 83 08 01 01 01  07 00 00 00 cb b2 00 00  |................|
> >> > 00000060  83 20 01 61 6d 07 be 00  00 00 00 00 00 00 00 00  |. .am...........|
> >> > 00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> >> > *
> >> > 00001000
> >> >
> >> >
> >>
> >> Wait, we found something!!!
> >>
> >> In the first 4 KB of the block device we found the block.db path
> >> pointing at the wrong device (/dev/sdc1 instead of /dev/sdaa1):
> >>
> >> 00000130  6b 35 79 2b 67 3d 3d 0d  00 00 00 70 61 74 68 5f  |k5y+g==....path_|
> >> 00000140  62 6c 6f 63 6b 2e 64 62  09 00 00 00 2f 64 65 76  |block.db..../dev|
> >> 00000150  2f 73 64 63 31 05 00 00  00 72 65 61 64 79 05 00  |/sdc1....ready..|
> >> 00000160  00 00 72 65 61 64 79 06  00 00 00 77 68 6f 61 6d  |..ready....whoam|
> >> 00000170  69 02 00 00 00 34 38 eb  c2 d7 d6 00 00 00 00 00  |i....48.........|
> >> 00000180  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> >>
> >> It is similarly wrong for another broken OSD, osd.53 (block.db is
> >> /dev/sdc2 instead of /dev/sdaa2).
> >> And for the OSDs that are running, that block.db path is correct!
> >>
> >> So... the block.db device path is persisted in the block header? But
> >> after a reboot that device can get a new name (sd* naming is famously
> >> unstable). ceph-volume creates a symlink to the correct db device,
> >> but it seems it isn't used?
> >
> > Aha, yes.. the bluestore startup code looks for the value in the
> > superblock before the one in the directory.
> >
> > We can (1) reverse that order, and/or (2) make ceph-volume use a
> > stable path for the device name when creating the bluestore, and/or
> > (3) use ceph-bluestore-tool set-label-key to fix it when it doesn't
> > match (this would repair old superblocks, permanently if we use the
> > stable path name).
>
> ceph-volume does not require a stable/persistent path name at all. For
> partitions we store the partuuid, and we *always* make sure that we
> have the right device by querying blkid.
>

ceph-disk didn't require stable path names either, so it's good to
hear that this behaviour is still there in ceph-volume lvm.
It's been a bit unclear how to map the concept of a partitioned SSD
for filestore journals onto the bluestore world.
So, as I understand it now, what we did is correct and fully supported?
(See the parted routine earlier in the thread...)
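
Side note, for anyone else hitting this: if I understand option (3)
correctly, the stale labels could presumably be repaired by hand roughly
like this (an untested sketch, run with the OSD stopped; the paths and
db partuuid are the ones reported for osd.48 earlier in this thread):

    # inspect the label; path_block.db is the field that ended up
    # pointing at the stale /dev/sdc1
    ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-48/block

    # rewrite it with a stable by-partuuid path instead of /dev/sdaa1
    ceph-bluestore-tool set-label-key --dev /var/lib/ceph/osd/ceph-48/block \
        -k path_block.db -v /dev/disk/by-partuuid/3381a121-1c1b-4e45-a986-c1871c363edc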

Thanks!

Dan


> In addition to that, in the case of bluestore, we go over each device
> and ensure that whatever link bluestore created is corrected
> before attempting to start the OSD [0].
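
(If I'm reading [0] correctly, the net effect of that step is roughly the
following, using osd.48's db partuuid from above; a simplification, not
the literal code:

    ln -snf "$(blkid -o device -t PARTUUID=3381a121-1c1b-4e45-a986-c1871c363edc)" \
        /var/lib/ceph/osd/ceph-48/block.db
    chown -h ceph:ceph /var/lib/ceph/osd/ceph-48/block.db
)
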
>
> IMO bluestore should do #1, because this is already solved in the
> ceph-volume code (we knew device names could change), but #2 and #3
> are OK to help with this issue today.
>
> Another option would be to avoid using a partition for block.db
> and just use an LV.
>
> [0] 
> https://github.com/ceph/ceph/blob/master/src/ceph-volume/ceph_volume/devices/lvm/activate.py#L155-L168
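
(In case it helps anyone reading the archives: as I understand the
LV-based suggestion, a fresh OSD laid out that way would look roughly
like this; the VG/LV names and sizes are made up for illustration:

    # carve a db LV out of the SSD instead of using a raw partition
    vgcreate ceph-db-ssd /dev/sdaa
    lvcreate -L 30G -n db-48 ceph-db-ssd

    # then reference it as vg/lv when creating the OSD
    ceph-volume lvm create --bluestore --data /dev/sda --block.db ceph-db-ssd/db-48
)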
>
>
> >
> > sage
> >
> >
> >>
> >> ...
> >> Dan & Teo
> >>
> >> >
> >> > -- dan
> >> >
> >> > > Thanks!
> >> > > sage
> >> > >
> >> > > >
> >> > > > Here's the osd dir:
> >> > > >
> >> > > > # ls -l /var/lib/ceph/osd/ceph-48/
> >> > > > total 24
> >> > > > lrwxrwxrwx. 1 ceph ceph 93 Jun  7 16:46 block -> /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> >> > > > lrwxrwxrwx. 1 root root 10 Jun  7 16:46 block.db -> /dev/sdaa1
> >> > > > -rw-------. 1 ceph ceph 37 Jun  7 16:46 ceph_fsid
> >> > > > -rw-------. 1 ceph ceph 37 Jun  7 16:46 fsid
> >> > > > -rw-------. 1 ceph ceph 56 Jun  7 16:46 keyring
> >> > > > -rw-------. 1 ceph ceph  6 Jun  7 16:46 ready
> >> > > > -rw-------. 1 ceph ceph 10 Jun  7 16:46 type
> >> > > > -rw-------. 1 ceph ceph  3 Jun  7 16:46 whoami
> >> > > >
> >> > > > # ls -l /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> >> > > > lrwxrwxrwx. 1 root root 7 Jun  7 16:46 /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5 -> ../dm-4
> >> > > >
> >> > > > # ls -l /dev/dm-4
> >> > > > brw-rw----. 1 ceph ceph 253, 4 Jun  7 16:46 /dev/dm-4
> >> > > >
> >> > > >
> >> > > >   --- Logical volume ---
> >> > > >   LV Path                /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> >> > > >   LV Name                osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> >> > > >   VG Name                ceph-34f24306-d90c-49ff-bafb-2657a6a18010
> >> > > >   LV UUID                FQkRxS-No7X-ajkP-5L3N-K22a-IXg6-QLceZC
> >> > > >   LV Write Access        read/write
> >> > > >   LV Creation host, time p06253939y61826.cern.ch, 2018-03-15 10:57:37 +0100
> >> > > >   LV Status              available
> >> > > >   # open                 0
> >> > > >   LV Size                <5.46 TiB
> >> > > >   Current LE             1430791
> >> > > >   Segments               1
> >> > > >   Allocation             inherit
> >> > > >   Read ahead sectors     auto
> >> > > >   - currently set to     256
> >> > > >   Block device           253:4
> >> > > >
> >> > > >   --- Physical volume ---
> >> > > >   PV Name               /dev/sda
> >> > > >   VG Name               ceph-34f24306-d90c-49ff-bafb-2657a6a18010
> >> > > >   PV Size               <5.46 TiB / not usable <2.59 MiB
> >> > > >   Allocatable           yes (but full)
> >> > > >   PE Size               4.00 MiB
> >> > > >   Total PE              1430791
> >> > > >   Free PE               0
> >> > > >   Allocated PE          1430791
> >> > > >   PV UUID               WP0Z7C-ejSh-fpSa-a73N-H2Hz-yC78-qBezcI
> >> > > >
> >> > > > (sorry for wall o' lvm)
> >> > > >
> >> > > > -- dan
> >> > > >
> >> > > > > Thanks!
> >> > > > > sage
> >> > > > >
> >> > > > >
> >> > > > > > -- dan
> >> > > > > >
> >> > > > > >
> >> > > > > > >
> >> > > > > > > sage
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > >
> >> > > > > > > > Thanks!
> >> > > > > > > >
> >> > > > > > > > Dan
> >> > > > > > > >
> >> > > > > > > > [1]
> >> > > > > > > >
> >> > > > > > > > ====== osd.48 ======
> >> > > > > > > >
> >> > > > > > > >   [block]    /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> >> > > > > > > >
> >> > > > > > > >       type                      block
> >> > > > > > > >       osd id                    48
> >> > > > > > > >       cluster fsid              dd535a7e-4647-4bee-853d-f34112615f81
> >> > > > > > > >       cluster name              ceph
> >> > > > > > > >       osd fsid                  99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> >> > > > > > > >       db device                 /dev/sdaa1
> >> > > > > > > >       encrypted                 0
> >> > > > > > > >       db uuid                   3381a121-1c1b-4e45-a986-c1871c363edc
> >> > > > > > > >       cephx lockbox secret
> >> > > > > > > >       block uuid                FQkRxS-No7X-ajkP-5L3N-K22a-IXg6-QLceZC
> >> > > > > > > >       block device              /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
> >> > > > > > > >       crush device class        None
> >> > > > > > > >
> >> > > > > > > >   [  db]    /dev/sdaa1
> >> > > > > > > >
> >> > > > > > > >       PARTUUID                  3381a121-1c1b-4e45-a986-c1871c363edc
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > [2]
> >> > > > > > > >    -11> 2018-06-07 16:12:16.138407 7fba30fb4d80  1 -- - start start
> >> > > > > > > >    -10> 2018-06-07 16:12:16.138516 7fba30fb4d80  1 bluestore(/var/lib/ceph/osd/ceph-48) _mount path /var/lib/ceph/osd/ceph-48
> >> > > > > > > >     -9> 2018-06-07 16:12:16.138801 7fba30fb4d80  1 bdev create path /var/lib/ceph/osd/ceph-48/block type kernel
> >> > > > > > > >     -8> 2018-06-07 16:12:16.138808 7fba30fb4d80  1 bdev(0x55eb46433a00 /var/lib/ceph/osd/ceph-48/block) open path /var/lib/ceph/osd/ceph-48/block
> >> > > > > > > >     -7> 2018-06-07 16:12:16.138999 7fba30fb4d80  1 bdev(0x55eb46433a00 /var/lib/ceph/osd/ceph-48/block) open size 6001172414464 (0x57541c00000, 5589 GB) block_size 4096 (4096 B) rotational
> >> > > > > > > >     -6> 2018-06-07 16:12:16.139188 7fba30fb4d80  1 bluestore(/var/lib/ceph/osd/ceph-48) _set_cache_sizes cache_size 134217728 meta 0.01 kv 0.99 data 0
> >> > > > > > > >     -5> 2018-06-07 16:12:16.139275 7fba30fb4d80  1 bdev create path /var/lib/ceph/osd/ceph-48/block type kernel
> >> > > > > > > >     -4> 2018-06-07 16:12:16.139281 7fba30fb4d80  1 bdev(0x55eb46433c00 /var/lib/ceph/osd/ceph-48/block) open path /var/lib/ceph/osd/ceph-48/block
> >> > > > > > > >     -3> 2018-06-07 16:12:16.139454 7fba30fb4d80  1 bdev(0x55eb46433c00 /var/lib/ceph/osd/ceph-48/block) open size 6001172414464 (0x57541c00000, 5589 GB) block_size 4096 (4096 B) rotational
> >> > > > > > > >     -2> 2018-06-07 16:12:16.139464 7fba30fb4d80  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-48/block size 5589 GB
> >> > > > > > > >     -1> 2018-06-07 16:12:16.139510 7fba30fb4d80  1 bluefs mount
> >> > > > > > > >      0> 2018-06-07 16:12:16.142930 7fba30fb4d80 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/os/bluestore/bluefs_types.h: In function 'static void bluefs_fnode_t::_denc_finish(ceph::buffer::ptr::iterator&, __u8*, __u8*, char**, uint32_t*)' thread 7fba30fb4d80 time 2018-06-07 16:12:16.139666
> >> > > > > > > > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/os/bluestore/bluefs_types.h: 54: FAILED assert(pos <= end)
> >> > > > > > > >
> >> > > > > > > >  ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
> >> > > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55eb3b597780]
> >> > > > > > > >  2: (bluefs_super_t::decode(ceph::buffer::list::iterator&)+0x776) [0x55eb3b52db36]
> >> > > > > > > >  3: (BlueFS::_open_super()+0xfe) [0x55eb3b50cede]
> >> > > > > > > >  4: (BlueFS::mount()+0xe3) [0x55eb3b5250c3]
> >> > > > > > > >  5: (BlueStore::_open_db(bool)+0x173d) [0x55eb3b43ebcd]
> >> > > > > > > >  6: (BlueStore::_mount(bool)+0x40e) [0x55eb3b47025e]
> >> > > > > > > >  7: (OSD::init()+0x3bd) [0x55eb3b02a1cd]
> >> > > > > > > >  8: (main()+0x2d07) [0x55eb3af2f977]
> >> > > > > > > >  9: (__libc_start_main()+0xf5) [0x7fba2d47b445]
> >> > > > > > > >  10: (()+0x4b7033) [0x55eb3afce033]
> >> > > > > > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >> > > >
> >> > > >
> >>
> >>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
