Interestingly, the down OSD does not get marked out after 5 minutes.
That is probably already fixed by http://tracker.ceph.com/issues/4822.
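
For reference, the 5-minute window is mon_osd_down_out_interval (300 seconds
by default), so it is worth confirming what the monitor is actually using.
A minimal sketch, assuming the usual admin socket path and a mon named "a":

    # show the value the running mon is using
    ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok config show | grep mon_osd_down_out_interval

    # or set it explicitly in ceph.conf under [mon]
    # mon osd down out interval = 300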


On Tue, Apr 30, 2013 at 11:42 AM, Travis Rhoden <[email protected]> wrote:

> Hi Sam,
>
> I was prepared to write in and say that the problem had gone away.  I
> tried restarting several OSDs last night in the hopes of capturing the
> problem on an OSD that hadn't failed yet, but didn't have any luck.  So I
> did indeed re-create the cluster from scratch (using mkcephfs), and what do
> you know -- everything worked.  I got everything in a nice stable state,
> then decided to do a full cluster restart, just to be sure.  Sure enough,
> one OSD failed to come up, and has the same stack trace.  So I believe I
> have the log you want -- just from the OSD that failed, right?
>
> Question -- any feeling for what parts of the log you need?  It's 688MB
> uncompressed (two hours!), so I'd like to be able to trim some off for you
> before making it available.  Do you only need/want the part from after the
> OSD was restarted?  Or perhaps the corruption happens on OSD shutdown and
> you need some before that?  If you are fine with that large of a file, I
> can just make that available too.  Let me know.
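>
> For trimming, I'm thinking something like the following to keep only the
> lines from the restart onward and then compress it (the timestamp and file
> name here are just placeholders):
>
>     awk '$1" "$2 >= "2013-04-30 09:00:00"' ceph-osd.3.log > ceph-osd.3.trimmed.log
>     gzip ceph-osd.3.trimmed.log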
>
>  - Travis
>
>
> On Mon, Apr 29, 2013 at 6:26 PM, Travis Rhoden <[email protected]> wrote:
>
>> Hi Sam,
>>
>> No problem, I'll leave that debugging turned up high, and do a mkcephfs
>> from scratch and see what happens.  Not sure if it will happen again or
>> not.  =)
>>
>> Thanks again.
>>
>>  - Travis
>>
>>
>> On Mon, Apr 29, 2013 at 5:51 PM, Samuel Just <[email protected]> wrote:
>>
>>> Hmm, I need logging from when the corruption happened.  If this is
>>> reproducible, can you enable that logging on a clean osd (or better, a
>>> clean cluster) until the assert occurs?
>>> -Sam
>>>
>>> On Mon, Apr 29, 2013 at 2:45 PM, Travis Rhoden <[email protected]>
>>> wrote:
>>> > Also, I can note that it does not take a full cluster restart to
>>> > trigger this.  If I just restart an OSD that was up/in previously, the
>>> > same error can happen (though not every time).  So restarting OSDs for
>>> > me is a bit like Russian roulette.  =)  Even though restarting an OSD
>>> > does not always result in the error, it seems that once it happens that
>>> > OSD is gone for good.  No amount of restarting has brought any of the
>>> > dead ones back.
>>> >
>>> > I'd really like to get to the bottom of it.  Let me know if I can do
>>> > anything to help.
>>> >
>>> > I may also have to try completely wiping/rebuilding to see if I can
>>> > make this thing usable.
>>> >
>>> >
>>> > On Mon, Apr 29, 2013 at 2:38 PM, Travis Rhoden <[email protected]>
>>> wrote:
>>> >>
>>> >> Hi Sam,
>>> >>
>>> >> Thanks for being willing to take a look.
>>> >>
>>> >> I applied the debug settings on one host that has 3 out of 3 OSDs with
>>> >> this problem.  Then I tried to start them up.  Here are the resulting
>>> >> logs:
>>> >>
>>> >> https://dl.dropboxusercontent.com/u/23122069/cephlogs.tgz
>>> >>
>>> >>  - Travis
>>> >>
>>> >>
>>> >> On Mon, Apr 29, 2013 at 1:04 PM, Samuel Just <[email protected]>
>>> wrote:
>>> >>>
>>> >>> You appear to be missing pg metadata for some reason.  If you can
>>> >>> reproduce it with
>>> >>> debug osd = 20
>>> >>> debug filestore = 20
>>> >>> debug ms = 1
>>> >>> on all of the OSDs, I should be able to track it down.
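>>> >>>
>>> >>> For example, in ceph.conf on each OSD host, something like:
>>> >>>
>>> >>>     [osd]
>>> >>>         debug osd = 20
>>> >>>         debug filestore = 20
>>> >>>         debug ms = 1
>>> >>>
>>> >>> and then restart the OSDs so the logging is in effect from startup.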
>>> >>>
>>> >>> I created a bug: #4855.
>>> >>>
>>> >>> Thanks!
>>> >>> -Sam
>>> >>>
>>> >>> On Mon, Apr 29, 2013 at 9:52 AM, Travis Rhoden <[email protected]>
>>> wrote:
>>> >>> > Thanks Greg.
>>> >>> >
>>> >>> > I quit playing with it because every time I restarted the cluster
>>> >>> > (service ceph -a restart), I lost more OSDs.  The first time it was
>>> >>> > 1, the second time 10, the third time 13...  All 13 down OSDs show
>>> >>> > the same stack trace.
>>> >>> >
>>> >>> >  - Travis
>>> >>> >
>>> >>> >
>>> >>> >> On Mon, Apr 29, 2013 at 11:56 AM, Gregory Farnum <[email protected]>
>>> >>> >> wrote:
>>> >>> >>
>>> >>> >> This sounds vaguely familiar to me, and I see
>>> >>> >> http://tracker.ceph.com/issues/4052, which is marked as "Can't
>>> >>> >> reproduce" — I think maybe this is fixed in "next" and "master",
>>> >>> >> but I'm not sure.  For more than that I'd have to defer to Sage or
>>> >>> >> Sam.
>>> >>> >> -Greg
>>> >>> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>> >>> >>
>>> >>> >>
>>> >>> >> On Sat, Apr 27, 2013 at 6:43 PM, Travis Rhoden <[email protected]>
>>> >>> >> wrote:
>>> >>> >> > Hey folks,
>>> >>> >> >
>>> >>> >> > I'm helping put together a new test/experimental cluster, and hit
>>> >>> >> > this today when bringing the cluster up for the first time (using
>>> >>> >> > mkcephfs).
>>> >>> >> >
>>> >>> >> > After doing the normal "service ceph -a start", I noticed one OSD
>>> >>> >> > was down, and a lot of PGs were stuck creating.  I tried
>>> >>> >> > restarting the down OSD, but it would not come up.  It always had
>>> >>> >> > this error:
>>> >>> >> >
>>> >>> >> >     -1> 2013-04-27 18:11:56.179804 b6fcd000  2 osd.1 0 boot
>>> >>> >> >      0> 2013-04-27 18:11:56.402161 b6fcd000 -1 osd/PG.cc: In
>>> >>> >> > function 'static epoch_t PG::peek_map_epoch(ObjectStore*, coll_t,
>>> >>> >> > hobject_t&, ceph::bufferlist*)' thread b6fcd000 time 2013-04-27
>>> >>> >> > 18:11:56.399089
>>> >>> >> > osd/PG.cc: 2556: FAILED assert(values.size() == 1)
>>> >>> >> >
>>> >>> >> >  ceph version 0.60-401-g17a3859
>>> >>> >> > (17a38593d60f5f29b9b66c13c0aaa759762c6d04)
>>> >>> >> >  1: (PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
>>> >>> >> > ceph::buffer::list*)+0x1ad) [0x2c3c0a]
>>> >>> >> >  2: (OSD::load_pgs()+0x357) [0x28cba0]
>>> >>> >> >  3: (OSD::init()+0x741) [0x290a16]
>>> >>> >> >  4: (main()+0x1427) [0x2155c0]
>>> >>> >> >  5: (__libc_start_main()+0x99) [0xb69bcf42]
>>> >>> >> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>> >>> >> > needed to interpret this.
>>> >>> >> >
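>>> >>> >> > In case it helps with decoding the trace, I can also generate the
>>> >>> >> > objdump output mentioned in that NOTE with something like this
>>> >>> >> > (assuming the osd binary on these boxes is /usr/bin/ceph-osd):
>>> >>> >> >
>>> >>> >> >     objdump -rdS /usr/bin/ceph-osd > ceph-osd.objdump.txt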
>>> >>> >> >
>>> >>> >> > I then did a full cluster restart, and now I have ten OSDs down
>>> >>> >> > -- each showing the same exception/failed assert.
>>> >>> >> >
>>> >>> >> > Anybody seen this?
>>> >>> >> >
>>> >>> >> > I know I'm running a weird version -- it's compiled from source,
>>> >>> >> > and was provided to me.  The OSDs are all on ARM, and the mon is
>>> >>> >> > x86_64.  Just looking to see if anyone has seen this particular
>>> >>> >> > stack trace of load_pgs()/peek_map_epoch() before....
>>> >>> >> >
>>> >>> >> >  - Travis
>>> >>> >> >
>>> >>> >> > _______________________________________________
>>> >>> >> > ceph-users mailing list
>>> >>> >> > [email protected]
>>> >>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >>> >> >
>>> >>> >
>>> >>> >
>>> >>
>>> >>
>>> >
>>>
>>
>>
>