Hi,

we're trying to give ceph a try on our compute cluster. Initial stress
tests passed without problems, but over the weekend a couple of cosd
processes died. Now access to the ceph mount point blocks, and
mounting the ceph dir fails with

        mount: 192.168.1.141:6789,192.168.1.145:6789,192.168.1.150:6789:/: 
can't read superblock

Attempts to restart the cosd on the affected storage nodes fail with

        # /usr/local/bin/cosd -f -i 6 -c /etc/ceph/ceph.conf
         ** WARNING: Ceph is still under heavy development, and is only 
suitable for **
         **          testing and review.  Do not trust it with important data.  
     **
        starting osd6 at 0.0.0.0:6800/2685 osd_data /var/ceph/osd6 
/var/ceph/osd6/journal
        terminate called after throwing an instance of 'std::bad_alloc'
          what():  std::bad_alloc
        Aborted

Stracing the cosd process shows that it calls mmap() with a nonsensical
value for the "length" parameter (fd = -1 is expected here, since the
mapping is MAP_ANONYMOUS):

        mmap(NULL, 18446744073709436928, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)

I briefly looked at the source code and noticed that raw_mmap_pages()
in include/buffer.h seems to call mmap() with an unsigned int
rather than with a size_t as the second (length) parameter. Since

        18446744073709436928 = 0xfffffffffffe4000

this looks like an integer overflow. But maybe it is just uninitialized
garbage.

I've tried the v0.20.2 and the testing branch of the ceph git
repo. Both versions of cosd show the same behaviour.

Our ceph file system is 5.5T large; we have 7 cosds, 3 cmons and 3 cmds.
See the ceph.conf below for details.

Any idea how to get back the data? If you need further debugging info,
don't hesitate to ask.

Thanks
Andre
---

[global]
        ; enable secure authentication
        auth supported = cephx
        osd journal size = 100    ; measured in MB 

; You need at least one monitor. You need at least three if you want to
; tolerate any node failures. Always create an odd number.
[mon]
        mon data = /var/ceph/mon$id
        ; some minimal logging (just message traffic) to aid debugging
        debug ms = 1
[mon0]
        host = node141
        mon addr = 192.168.1.141:6789
[mon1]
        host = node145
        mon addr = 192.168.1.145:6789
[mon2]
        host = node150
        mon addr = 192.168.1.150:6789

; You need at least one mds. Define two to get a standby.
[mds]
        ; where the mds keeps its secret encryption keys
        keyring = /var/ceph/keyring.$name
[mds0]
        host = node141
[mds1]
        host = node145
[mds2]
        host = node150

; osd
;  You need at least one.  Two if you want data to be replicated.
;  Define as many as you like.
[osd]
        ; This is where the btrfs volume will be mounted.
        osd data = /var/ceph/osd$id

        ; Ideally, make this a separate disk or partition.  A few GB
        ; is usually enough; more if you have fast disks.  You can use
        ; a file under the osd data dir if need be
        ; (e.g. /data/osd$id/journal), but it will be slower than a
        ; separate disk or partition.
        osd journal = /var/ceph/osd$id/journal

[osd0]
        host = node141
[osd1]
        host = node145
[osd2]
        host = node150
[osd3]
        host = node146
[osd4]
        host = node147
[osd5]
        host = node149
[osd6]
        host = node142
-- 
The only person who always got his work done by Friday was Robinson Crusoe
