Re: OSD dies after seconds

2013-02-14 Thread Jesus Cuenca
I upgraded to ceph 0.56-3 but the problem persists...

The OSD starts, but after a second it exits:

2013-02-14 12:18:34.504391 7fae613ea760 10 journal _open journal is
not a block device, NOT checking disk write cache on
'/var/lib/ceph/osd/ceph-0/journal'
2013-02-14 12:18:34.504400 7fae613ea760  1 journal _open
/var/lib/ceph/osd/ceph-0/journal fd 17: 1048576 bytes, block size
4096 bytes, directio = 1, aio = 0
2013-02-14 12:18:34.504458 7fae613ea760 10 journal journal_start
2013-02-14 12:18:34.504506 7fae5d3c6700 10 journal write_thread_entry start
2013-02-14 12:18:34.504515 7fae5d3c6700 20 journal write_thread_entry
going to sleep
2013-02-14 12:18:34.504706 7fae5cbc5700 10 journal
write_finish_thread_entry enter
2013-02-14 12:18:34.504716 7fae5cbc5700 20 journal
write_finish_thread_entry sleeping
2013-02-14 12:18:34.504893 7fae567fc700 20
filestore(/var/lib/ceph/osd/ceph-0) flusher_entry start
2013-02-14 12:18:34.504903 7fae567fc700 20
filestore(/var/lib/ceph/osd/ceph-0) flusher_entry sleeping
2013-02-14 12:18:34.505013 7fae613ea760  5
filestore(/var/lib/ceph/osd/ceph-0) umount /var/lib/ceph/osd/ceph-0
2013-02-14 12:18:34.505036 7fae567fc700 20
filestore(/var/lib/ceph/osd/ceph-0) flusher_entry awoke
2013-02-14 12:18:34.505044 7fae567fc700 20
filestore(/var/lib/ceph/osd/ceph-0) flusher_entry finish
2013-02-14 12:18:34.505113 7fae5dbc7700 20
filestore(/var/lib/ceph/osd/ceph-0) sync_entry force_sync set
2013-02-14 12:18:34.505129 7fae5dbc7700 10 journal commit_start
max_applied_seq 2, open_ops 0
2013-02-14 12:18:34.505136 7fae5dbc7700 10 journal commit_start
blocked, all open_ops have completed
2013-02-14 12:18:34.505138 7fae5dbc7700 10 journal commit_start nothing to do
2013-02-14 12:18:34.505141 7fae5dbc7700 10 journal commit_start
2013-02-14 12:18:34.505506 7fae613ea760 10 journal journal_stop
2013-02-14 12:18:34.505698 7fae613ea760  1 journal close
/var/lib/ceph/osd/ceph-0/journal
2013-02-14 12:18:34.505787 7fae5d3c6700 20 journal write_thread_entry woke up
2013-02-14 12:18:34.505796 7fae5d3c6700 10 journal write_thread_entry finish
2013-02-14 12:18:34.505845 7fae5cbc5700 10 journal
write_finish_thread_entry exit
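
Since the daemon exits on its own rather than crashing, running it in
the foreground and checking the exit status can confirm whether this
is a deliberate clean shutdown; a minimal sketch, reusing the -d
foreground flag from the debug runs below:

 ceph-osd -i 0 -d
 echo "ceph-osd exit status: $?"

A zero status would point to an orderly shutdown path rather than an
abort.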


On Wed, Feb 13, 2013 at 6:28 PM, Jesus Cuenca jcue...@cnb.csic.es wrote:
 thanks for the fast answer.

 no, it does not segfault:

 gdb --args /usr/local/bin/ceph-osd -i 0
 ...
 (gdb) run
 Starting program: /usr/local/bin/ceph-osd -i 0
 [Thread debugging using libthread_db enabled]
 [New Thread 0x75fce700 (LWP 8920)]
 starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0
 /var/lib/ceph/osd/ceph-0/journal
 [Thread 0x75fce700 (LWP 8920) exited]

 Program exited normally.
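
 Since the process exits normally, a gdb catchpoint on the exit
 syscall can show where the clean shutdown is initiated; a minimal
 sketch, assuming a gdb build with syscall catchpoint support:

 (gdb) catch syscall exit_group
 (gdb) run
 (gdb) bt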

 --



 On Wed, Feb 13, 2013 at 6:21 PM, Sage Weil s...@inktank.com wrote:
 On Wed, 13 Feb 2013, Jesus Cuenca wrote:
 Hi,

 I'm setting up a small ceph 0.56.2 cluster on 3 64-bit Debian 6
 servers with kernel 3.7.2.

 This might be

 http://tracker.ceph.com/issues/3595

 which is a problem with google perftools (which we use by default):
 the version in squeeze is buggy.  This doesn't seem to affect all
 squeeze users.
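
 One way to rule that out (a sketch, not a confirmed fix; the
 --without-tcmalloc switch is from ceph's autotools build of this era)
 is to rebuild without linking tcmalloc:

 ./configure --without-tcmalloc
 make && make install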

 Does it segfault?

 sage


OSD dies after seconds

2013-02-13 Thread Jesus Cuenca
Hi,

I'm setting up a small ceph 0.56.2 cluster on 3 64-bit Debian 6
servers with kernel 3.7.2.

My problem is that the OSDs die. First I try to start one with the init script:

 /etc/init.d/ceph start osd.0
...
starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0
/var/lib/ceph/osd/ceph-0/journal

 ps -ef | grep ceph
(No ceph-osd process)
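
When a daemon vanishes like this, the OSD log usually records why; a
quick check, assuming the default log path for this packaging:

 tail -n 50 /var/log/ceph/ceph-osd.0.log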

I then run with debugging:

 ceph-osd -i 0 --debug_ms 20 --debug_osd 20 --debug_filestore 20 --debug_journal 20 -d
starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0
/var/lib/ceph/osd/ceph-0/journal
2013-02-13 18:04:40.351830 7fe98cd8a760 10 -- :/0 rank.bind :/0
2013-02-13 18:04:40.351895 7fe98cd8a760 10 accepter.accepter.bind
2013-02-13 18:04:40.351910 7fe98cd8a760 10 accepter.accepter.bind
bound on random port 0.0.0.0:6800/0
2013-02-13 18:04:40.351919 7fe98cd8a760 10 accepter.accepter.bind
bound to 0.0.0.0:6800/0
2013-02-13 18:04:40.351930 7fe98cd8a760  1 accepter.accepter.bind
my_inst.addr is 0.0.0.0:6800/8438 need_addr=1
2013-02-13 18:04:40.351935 7fe98cd8a760 10 -- :/0 rank.bind :/0
2013-02-13 18:04:40.351938 7fe98cd8a760 10 accepter.accepter.bind
2013-02-13 18:04:40.351943 7fe98cd8a760 10 accepter.accepter.bind
bound on random port 0.0.0.0:6801/0
2013-02-13 18:04:40.351946 7fe98cd8a760 10 accepter.accepter.bind
bound to 0.0.0.0:6801/0
2013-02-13 18:04:40.351952 7fe98cd8a760  1 accepter.accepter.bind
my_inst.addr is 0.0.0.0:6801/8438 need_addr=1
2013-02-13 18:04:40.351959 7fe98cd8a760 10 -- :/0 rank.bind :/0
2013-02-13 18:04:40.351961 7fe98cd8a760 10 accepter.accepter.bind
2013-02-13 18:04:40.351966 7fe98cd8a760 10 accepter.accepter.bind
bound on random port 0.0.0.0:6802/0
2013-02-13 18:04:40.351969 7fe98cd8a760 10 accepter.accepter.bind
bound to 0.0.0.0:6802/0
2013-02-13 18:04:40.351975 7fe98cd8a760  1 accepter.accepter.bind
my_inst.addr is 0.0.0.0:6802/8438 need_addr=1
2013-02-13 18:04:40.352636 7fe98cd8a760  5
filestore(/var/lib/ceph/osd/ceph-0) basedir /var/lib/ceph/osd/ceph-0
journal /var/lib/ceph/osd/ceph-0/journal
2013-02-13 18:04:40.352664 7fe98cd8a760 10
filestore(/var/lib/ceph/osd/ceph-0) mount fsid is
0ab92be4-3b42-47bc-bd88-b0e11da5b450
2013-02-13 18:04:40.426222 7fe98cd8a760  0
filestore(/var/lib/ceph/osd/ceph-0) mount FIEMAP ioctl is supported
and appears to work
2013-02-13 18:04:40.426234 7fe98cd8a760  0
filestore(/var/lib/ceph/osd/ceph-0) mount FIEMAP ioctl is disabled via
'filestore fiemap' config option
2013-02-13 18:04:40.426567 7fe98cd8a760  0
filestore(/var/lib/ceph/osd/ceph-0) mount did NOT detect btrfs
2013-02-13 18:04:40.426575 7fe98cd8a760  0
filestore(/var/lib/ceph/osd/ceph-0) mount syncfs(2) syscall not
supported
2013-02-13 18:04:40.426630 7fe98cd8a760  0
filestore(/var/lib/ceph/osd/ceph-0) mount no syncfs(2), must use
sync(2).
2013-02-13 18:04:40.426631 7fe98cd8a760  0
filestore(/var/lib/ceph/osd/ceph-0) mount WARNING: multiple ceph-osd
daemons on the same host will be slow
2013-02-13 18:04:40.426701 7fe98cd8a760  0
filestore(/var/lib/ceph/osd/ceph-0) mount found snaps 
2013-02-13 18:04:40.426719 7fe98cd8a760  5
filestore(/var/lib/ceph/osd/ceph-0) mount op_seq is 2
2013-02-13 18:04:40.515151 7fe98cd8a760 20 filestore (init)dbobjectmap: seq is 1
2013-02-13 18:04:40.515217 7fe98cd8a760 10
filestore(/var/lib/ceph/osd/ceph-0) open_journal at
/var/lib/ceph/osd/ceph-0/journal
2013-02-13 18:04:40.515243 7fe98cd8a760  0
filestore(/var/lib/ceph/osd/ceph-0) mount: enabling WRITEAHEAD journal
mode: btrfs not detected
2013-02-13 18:04:40.515252 7fe98cd8a760 10
filestore(/var/lib/ceph/osd/ceph-0) list_collections
2013-02-13 18:04:40.515352 7fe98cd8a760 10 journal journal_replay fs op_seq 2
2013-02-13 18:04:40.515359 7fe98cd8a760  2 journal open
/var/lib/ceph/osd/ceph-0/journal fsid
0ab92be4-3b42-47bc-bd88-b0e11da5b450 fs_op_seq 2
2013-02-13 18:04:40.515373 7fe98cd8a760 10 journal _open journal is
not a block device, NOT checking disk write cache on
'/var/lib/ceph/osd/ceph-0/journal'
2013-02-13 18:04:40.515385 7fe98cd8a760  1 journal _open
/var/lib/ceph/osd/ceph-0/journal fd 17: 1048576 bytes, block size
4096 bytes, directio = 1, aio = 0
2013-02-13 18:04:40.515393 7fe98cd8a760 10 journal read_header
2013-02-13 18:04:40.515409 7fe98cd8a760 10 journal header: block_size
4096 alignment 4096 max_size 1048576
2013-02-13 18:04:40.515411 7fe98cd8a760 10 journal header: start 4096
2013-02-13 18:04:40.515412 7fe98cd8a760 10 journal  write_pos 4096
2013-02-13 18:04:40.515415 7fe98cd8a760 10 journal open header.fsid =
0ab92be4-3b42-47bc-bd88-b0e11da5b450
2013-02-13 18:04:40.515434 7fe98cd8a760  2 journal read_entry 4096 :
seq 2 424 bytes
2013-02-13 18:04:40.515439 7fe98cd8a760  2 journal read_entry 8192 :
bad header magic, end of journal
2013-02-13 18:04:40.515443 7fe98cd8a760 10 journal open reached end of journal.
2013-02-13 18:04:40.515446 7fe98cd8a760  2 journal read_entry 8192 :
bad header magic, end of journal
2013-02-13 18:04:40.515447 7fe98cd8a760  3 journal journal_replay: end
of journal, done.
2013-02-13 18:04:40.515444 7fe989567700 20