On Wed, 2011-03-02 at 19:26 -0700, Colin McCabe wrote:
> Hi Jim,
>
> We have seen this problem before. The usual suspect is the OOM
> killer (grep for "out of memory" in syslog).
> Unfortunately, SIGKILL is uncatchable, and that's what the OOM killer sends.
>
> Another problem that can prevent core files from being generated is
> bad ulimit -c settings or a bad setting for core_pattern and friends.
> Another problem I run into a lot is that the partition I'm writing
> core files to fills up.
>
> If none of that works, it's possible that someone is calling exit()
> somewhere. You can attach gdb to the process and put a breakpoint on
> exit() to see if this is going on. There's a lot of "your foo is not
> bar enough, I hate your config, exit(1)" type code that gets executed
> while the daemon is starting up. It sounds like you should be past
> that point, though.
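
As a rough sketch of the core-file checks mentioned above, the two
settings can be read like this on Linux. The helper names
core_dumps_enabled and read_core_pattern are made up for illustration
and are not from the Ceph tree; the code inspects the calling process,
not an already-running daemon.

```cpp
#include <fstream>
#include <string>
#include <sys/resource.h>

// Hypothetical helper: true if the soft RLIMIT_CORE of the calling
// process allows a non-empty core file (what `ulimit -c` reports).
bool core_dumps_enabled() {
    struct rlimit lim{};
    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        return false;  // treat a failed query as "not enabled"
    }
    return lim.rlim_cur != 0;
}

// Hypothetical helper: returns the kernel's core_pattern setting
// (Linux-specific path), or an empty string if it cannot be read.
std::string read_core_pattern() {
    std::ifstream in("/proc/sys/kernel/core_pattern");
    std::string line;
    std::getline(in, line);
    return line;
}
```

For a daemon that is already running, the equivalent limit shows up
in /proc/<pid>/limits.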

I've finally gotten a little info, using a variant of your gdb idea:
I waited until many of the OSD instances had died, then attached gdb
to several that were left and waited.

Two of them died the same way, like this:

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7fd7888c8940 (LWP 28693)]
0x00007fd7a9b82f2b in sendmsg () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007fd7a9b82f2b in sendmsg () from /lib64/libpthread.so.0
#1  0x0000000000672e0b in SimpleMessenger::Pipe::do_sendmsg (
    this=0x7fd799b67c20, sd=13, msg=0x7fd7888c7f20, len=251237, more=false)
    at msg/SimpleMessenger.cc:1994
#2  0x00000000006739d3 in SimpleMessenger::Pipe::write_message (
    this=0x7fd799b67c20, m=0x7fd79b2dcb70) at msg/SimpleMessenger.cc:2217
#3  0x000000000067e74a in SimpleMessenger::Pipe::writer (this=0x7fd799b67c20)
    at msg/SimpleMessenger.cc:1734
#4  0x000000000066fa2b in SimpleMessenger::Pipe::Writer::entry (
    this=0x7fd799b67e70) at msg/SimpleMessenger.h:204
#5  0x000000000068282e in Thread::_entry_func (arg=0x7fd799b67e70)
    at ./common/Thread.h:41
#6  0x00007fd7a9b7b73d in start_thread (arg=<value optimized out>)
    at pthread_create.c:301
#7  0x00007fd7a8a91f6d in clone () from /lib64/libc.so.6
(gdb)

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7f1aed7f3940 (LWP 28726)]
0x00007f1b01238f2b in sendmsg () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f1b01238f2b in sendmsg () from /lib64/libpthread.so.0
#1  0x0000000000672e0b in SimpleMessenger::Pipe::do_sendmsg (
    this=0x7f1af15c94d0, sd=114, msg=0x7f1aed7f2f20, len=126728, more=false)
    at msg/SimpleMessenger.cc:1994
#2  0x00000000006739d3 in SimpleMessenger::Pipe::write_message (
    this=0x7f1af15c94d0, m=0x23d3010) at msg/SimpleMessenger.cc:2217
#3  0x000000000067e74a in SimpleMessenger::Pipe::writer (this=0x7f1af15c94d0)
    at msg/SimpleMessenger.cc:1734
#4  0x000000000066fa2b in SimpleMessenger::Pipe::Writer::entry (
    this=0x7f1af15c9720) at msg/SimpleMessenger.h:204
#5  0x000000000068282e in Thread::_entry_func (arg=0x7f1af15c9720)
    at ./common/Thread.h:41
#6  0x00007f1b0123173d in start_thread (arg=<value optimized out>)
    at pthread_create.c:301
#7  0x00007f1b00147f6d in clone () from /lib64/libc.so.6

The third also got

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7f531fefe940 (LWP 28700)]
0x00007f533ffeaf2b in sendmsg () from /lib64/libpthread.so.0
(gdb)

but something was a little different, and I didn't get a
backtrace from it.
-- Jim
>
> Colin
>