Got it again - however, the stack is exactly the same, with no symbols -
the debuginfo didn't resolve. Do I need to do something to enable that?
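For reference, gdb finds separate debug files by the binary's ELF build-id under /usr/lib/debug, so a debuginfo package for even a slightly different build fails silently. A small sketch of that path mapping (the id below is made up for illustration; the real one comes from `eu-readelf -n /usr/bin/radosgw`):

```shell
# Sketch: the debug-file path gdb derives from an ELF build-id.
# This id is illustrative, not the real radosgw build-id.
id=752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd
echo "/usr/lib/debug/.build-id/${id:0:2}/${id:2}.debug"
```

If the file that command names is missing, the installed ceph-debuginfo doesn't match the running radosgw.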

The server is running with 'debug ms=10' this time, so there is a bit more spew:

   -14> 2016-04-27 21:59:58.811919 7f9e817fa700  1 -- 10.30.1.8:0/3291985349
--> 10.30.2.13:6805/27519 -- osd_op(client.44936150.0:223
obj_delete_at_hint.0000000055 [call timeindex.list] 10.2c88dbcf
ack+read+known_if_redirected e100564) v6 -- ?+0 0x7f9f140dc5f0 con
0x7f9f1410ed10
   -13> 2016-04-27 21:59:58.812039 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349
>> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914
cs=1 l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
   -12> 2016-04-27 21:59:58.812096 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349
>> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914
cs=1 l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
   -11> 2016-04-27 21:59:58.814343 7f9e3f96a700 10 -- 10.30.1.8:0/3291985349
>> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914
cs=1 l=1 c=0x7f9f1410ed10).reader wants 211 from dispatch throttler
0/104857600
   -10> 2016-04-27 21:59:58.814375 7f9e3f96a700 10 -- 10.30.1.8:0/3291985349
>> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914
cs=1 l=1 c=0x7f9f1410ed10).aborted = 0
    -9> 2016-04-27 21:59:58.814405 7f9e3f96a700 10 -- 10.30.1.8:0/3291985349
>> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914
cs=1 l=1 c=0x7f9f1410ed10).reader got message 2 0x7f9ec0009250
osd_op_reply(223 obj_delete_at_hint.0000000055 [call] v0'0 uv1448004 ondisk
= 0) v6
    -8> 2016-04-27 21:59:58.814428 7f9e3f96a700  1 -- 10.30.1.8:0/3291985349
<== osd.6 10.30.2.13:6805/27519 2 ==== osd_op_reply(223
obj_delete_at_hint.0000000055 [call] v0'0 uv1448004 ondisk = 0) v6 ====
196+0+15 (3849172018 0 2149983739) 0x7f9ec0009250 con 0x7f9f1410ed10
    -7> 2016-04-27 21:59:58.814472 7f9e3f96a700 10 -- 10.30.1.8:0/3291985349
dispatch_throttle_release 211 to dispatch throttler 211/104857600
    -6> 2016-04-27 21:59:58.814470 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349
>> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914
cs=1 l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
    -5> 2016-04-27 21:59:58.814511 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349
>> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914
cs=1 l=1 c=0x7f9f1410ed10).write_ack 2
    -4> 2016-04-27 21:59:58.814528 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349
>> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914
cs=1 l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
    -3> 2016-04-27 21:59:58.814607 7f9e817fa700  1 -- 10.30.1.8:0/3291985349
--> 10.30.2.13:6805/27519 -- osd_op(client.44936150.0:224
obj_delete_at_hint.0000000055 [call lock.unlock] 10.2c88dbcf
ondisk+write+known_if_redirected e100564) v6 -- ?+0 0x7f9f140dc5f0 con
0x7f9f1410ed10
    -2> 2016-04-27 21:59:58.814718 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349
>> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914
cs=1 l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
    -1> 2016-04-27 21:59:58.814778 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349
>> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914
cs=1 l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
     0> 2016-04-27 21:59:58.826494 7f9e7e7f4700 -1 *** Caught signal
(Segmentation fault) **
 in thread 7f9e7e7f4700

 ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
 1: (()+0x30b0a2) [0x7fa11c5030a2]
 2: (()+0xf100) [0x7fa1183fe100]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.

--- logging levels ---
<snip>
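For what it's worth, in a frame like `(()+0x30b0a2) [0x7fa11c5030a2]` the empty `()` means the symbol didn't resolve, the offset is relative to the module's load base, and the bracketed value is the absolute address. The arithmetic can be sketched as follows (the load base here is back-derived from those two numbers, not read from the log):

```shell
# Sketch: relating the absolute frame address to the module-relative offset.
# base is hypothetical; it would really come from /proc/<pid>/maps or the core.
frame=$(( 0x7fa11c5030a2 ))
base=$((  0x7fa11c1f8000 ))
printf 'offset = 0x%x\n' $(( frame - base ))   # -> offset = 0x30b0a2
```

With the real base from the core, addr2line or gdb can turn that offset into a symbol once the matching debuginfo is installed.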


On Wed, Apr 27, 2016 at 9:39 PM, Ben Hines <[email protected]> wrote:

> Yes, CentOS 7.2. It happened twice in a row, both times shortly after a
> restart, so I expect I'll be able to reproduce it. However, I've now tried
> a bunch of times and it's not happening again.
>
> In any case, I have glibc + ceph-debuginfo installed, so we can get more
> info if it does happen.
>
> thanks!
>
> On Wed, Apr 27, 2016 at 8:40 PM, Brad Hubbard <[email protected]> wrote:
>
>> ----- Original Message -----
>> > From: "Karol Mroz" <[email protected]>
>> > To: "Ben Hines" <[email protected]>
>> > Cc: "ceph-users" <[email protected]>
>> > Sent: Wednesday, 27 April, 2016 7:06:56 PM
>> > Subject: Re: [ceph-users] radosgw crash - Infernalis
>> >
>> > On Tue, Apr 26, 2016 at 10:17:31PM -0700, Ben Hines wrote:
>> > [...]
>> > > --> 10.30.1.6:6800/10350 -- osd_op(client.44852756.0:79
>> > > default.42048218.<redacted> [getxattrs,stat,read 0~524288] 12.aa730416
>> > > ack+read+known_if_redirected e100207) v6 -- ?+0 0x7f49c41880b0 con
>> > > 0x7f49c4145eb0
>> > >      0> 2016-04-26 22:07:59.685615 7f49a07f0700 -1 *** Caught signal
>> > > (Segmentation fault) **
>> > >  in thread 7f49a07f0700
>> > >
>> > >  ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
>> > >  1: (()+0x30b0a2) [0x7f4c4907f0a2]
>> > >  2: (()+0xf100) [0x7f4c44f7a100]
>> > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> > > needed to interpret this.
>> >
>> > Hi Ben,
>> >
>> > I sense a pretty badly corrupted stack. From the radosgw-9.2.1 binary
>> > (obtained from a downloaded rpm):
>> >
>> > 000000000030a810 <_Z13pidfile_writePK11md_config_t@@Base>:
>> > ...
>> >   30b09d:       e8 0e 40 e4 ff          callq  14f0b0 <backtrace@plt>
>> >   30b0a2:       4c 89 ef                mov    %r13,%rdi
>> >   -------
>> > ...
>> >
>> > So either we tripped the backtrace() code from pidfile_write() _or_ we
>> > can't trust the stack. From the log snippet, it looks like we're far
>> > past the point at which we would write a pidfile to disk (i.e. at
>> > process start during global_init()). Rather, we're actually handling a
>> > request and outputting some bit of debug message via MOSDOp::print()
>> > and beyond...
>>
>> It would help to know what binary this is and what OS.
>>
>> We know the offset into the binary is 0x30b0a2, but we don't know which
>> function that falls in yet AFAICT. Karol, how did you arrive at
>> pidfile_write? Purely from the offset? I'm not sure that would be
>> reliable...
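[For reference, bracketing an offset between symbol start addresses - the usual way sorted nm output is read - can be sketched like this. The two-entry symbol table is made up, with only the first address taken from the disassembly quoted above; real input would be `nm --defined-only /usr/bin/radosgw | sort`.]

```shell
# Sketch: pick the last symbol whose start address is <= the crash offset.
# Zero-padded lowercase hex strings compare correctly as plain text in awk.
printf '%s\n' \
  '000000000030a810 pidfile_write' \
  '000000000030c000 some_later_symbol' |
awk -v addr=000000000030b0a2 '
  $1 <= addr { best = $2 }
  END { print "candidate: " best }
'
```

This only names a candidate; if the stack is corrupt (or the offset belongs to a different module), the bracketed symbol can be wrong, which is the concern raised above.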
>>
>> This is a segfault, so the address in the frame where we crashed should
>> be the exact instruction that faulted. I don't believe a mov from one
>> register to another that does not involve a dereference ((%r13) as
>> opposed to %r13) can cause a segfault, so I don't think we are on the
>> right instruction - but then, as you say, the stack may be corrupt.
>>
>> >
>> > Is this something you're able to easily reproduce? More logs with
>> > higher log levels would be helpful... a coredump with radosgw compiled
>> > with -g would be excellent :)
>>
>> Agreed, although if this is an rpm-based system it should be sufficient
>> to run the following.
>>
>> # debuginfo-install ceph glibc
>>
>> That may give us the name of the function depending on where we are (if
>> we are in a library it may require the debuginfo for that library to be
>> installed).
>>
>> Karol is right that a coredump would be a good idea in this case and will
>> give
>> us maximum information about the issue you are seeing.
>>
>> Cheers,
>> Brad
>>
>> >
>> > --
>> > Regards,
>> > Karol
>> >
>> > _______________________________________________
>> > ceph-users mailing list
>> > [email protected]
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
>
>