I'm getting the same issue with one of my OSDs.

Calculating dependencies... done!
[ebuild   R   ~] app-arch/snappy-1.1.0  USE="-static-libs" 0 kB
[ebuild   R   ~] dev-libs/leveldb-1.9.0-r5  USE="snappy -static-libs" 0 kB
[ebuild   R   ~] sys-cluster/ceph-0.60-r1  USE="-debug -fuse -gtk -libatomic -radosgw -static-libs -tcmalloc" 0 kB

Below is my log:
https://docs.google.com/file/d/0BwQnRodV8Actd2NQT25FSnA2cjg/edit?usp=sharing

thanks
mr.npp


On Tue, Apr 30, 2013 at 9:17 AM, Travis Rhoden <[email protected]> wrote:

> On the OSD node:
>
> root@cepha0:~# lsb_release -a
> No LSB modules are available.
> Distributor ID:    Ubuntu
> Description:    Ubuntu 12.10
> Release:    12.10
> Codename:    quantal
> root@cepha0:~# dpkg -l "*leveldb*"
> Desired=Unknown/Install/Remove/Purge/Hold
> | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
> |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
> ||/ Name               Version                  Architecture  Description
> +++-==================-========================-=============-==============================
> ii  libleveldb1:armhf  0+20120530.gitdd0d562-2  armhf         fast key-value storage library
> root@cepha0:~# uname -a
> Linux cepha0 3.5.0-27-highbank #46-Ubuntu SMP Mon Mar 25 23:19:40 UTC 2013
> armv7l armv7l armv7l GNU/Linux
>
>
> On the MON node:
> # lsb_release -a
> No LSB modules are available.
> Distributor ID:    Ubuntu
> Description:    Ubuntu 12.10
> Release:    12.10
> Codename:    quantal
> # uname -a
> Linux  3.5.0-27-generic #46-Ubuntu SMP Mon Mar 25 19:58:17 UTC 2013 x86_64
> x86_64 x86_64 GNU/Linux
> # dpkg -l "*leveldb*"
> Desired=Unknown/Install/Remove/Purge/Hold
> | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
> |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
> ||/ Name                  Version                  Architecture  Description
> +++-=====================-========================-=============-==================================================
> un  leveldb-doc           <none>                                 (no description available)
> ii  libleveldb-dev:amd64  0+20120530.gitdd0d562-2  amd64         fast key-value storage library (development files)
> ii  libleveldb1:amd64     0+20120530.gitdd0d562-2  amd64         fast key-value storage library
>
>
> On Tue, Apr 30, 2013 at 12:11 PM, Samuel Just <[email protected]>wrote:
>
>> What version of leveldb is installed?  Ubuntu/version?
>> -Sam
>>
>> On Tue, Apr 30, 2013 at 8:50 AM, Travis Rhoden <[email protected]> wrote:
>> > Interestingly, the down OSD does not get marked out after 5 minutes.
>> > Probably that is already fixed by http://tracker.ceph.com/issues/4822.
>> >
>> >
>> > On Tue, Apr 30, 2013 at 11:42 AM, Travis Rhoden <[email protected]>
>> wrote:
>> >>
>> >> Hi Sam,
>> >>
>> >> I was prepared to write in and say that the problem had gone away.  I
>> >> tried restarting several OSDs last night in the hopes of capturing the
>> >> problem on an OSD that hadn't failed yet, but didn't have any luck.
>>  So I
>> >> did indeed re-create the cluster from scratch (using mkcephfs), and
>> what do
>> >> you know -- everything worked.  I got everything in a nice stable
>> state,
>> >> then decided to do a full cluster restart, just to be sure.  Sure
>> enough,
>> >> one OSD failed to come up, and has the same stack trace.  So I believe
>> I
>> >> have the log you want -- just from the OSD that failed, right?
>> >>
>> >> Question -- any feeling for what parts of the log you need?  It's 688MB
>> >> uncompressed (two hours!), so I'd like to be able to trim some off for
>> you
>> >> before making it available.  Do you only need/want the part from after
>> the
>> >> OSD was restarted?  Or perhaps the corruption happens on OSD shutdown
>> and
>> >> you need some before that?  If you are fine with that large of a file,
>> I can
>> >> just make that available too.  Let me know.
>> >>
>> >>  - Travis
>> >>
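[On trimming the log: a small shell sketch like the following could cut the file down to everything from the restart timestamp onward. The function name, timestamp, and paths here are placeholders of mine, not values from the thread.]

```shell
# Sketch: extract just the window of a huge OSD log from the restart onward.
# Prints every line from the first one containing the timestamp to end-of-file.
# The timestamp and file names below are placeholders -- adjust for the real log.
trim_from() {   # usage: trim_from 'YYYY-MM-DD HH:MM' /path/to/ceph-osd.N.log
    awk -v ts="$1" 'found || index($0, ts) { found = 1; print }' "$2"
}
# e.g.: trim_from '2013-04-27 18:11' /var/log/ceph/ceph-osd.1.log | gzip -9 > osd.1.log.gz
```

Plain substring matching on the timestamp is enough here because each ceph log line starts with the date, and everything after the first match is kept regardless of format.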
>> >>
>> >> On Mon, Apr 29, 2013 at 6:26 PM, Travis Rhoden <[email protected]>
>> wrote:
>> >>>
>> >>> Hi Sam,
>> >>>
>> >>> No problem, I'll leave that debugging turned up high, and do a
>> mkcephfs
>> >>> from scratch and see what happens.  Not sure if it will happen again
>> or not.
>> >>> =)
>> >>>
>> >>> Thanks again.
>> >>>
>> >>>  - Travis
>> >>>
>> >>>
>> >>> On Mon, Apr 29, 2013 at 5:51 PM, Samuel Just <[email protected]>
>> >>> wrote:
>> >>>>
>> >>>> Hmm, I need logging from when the corruption happened.  If this is
>> >>>> reproducible, can you enable that logging on a clean osd (or better,
>> a
>> >>>> clean cluster) until the assert occurs?
>> >>>> -Sam
>> >>>>
>> >>>> On Mon, Apr 29, 2013 at 2:45 PM, Travis Rhoden <[email protected]>
>> >>>> wrote:
>> >>>> > Also, I can note that it does not take a full cluster restart to
>> >>>> > trigger
>> >>>> > this.  If I just restart an OSD that was up/in previously, the same
>> >>>> > error
>> >>>> > can happen (though not every time).  So restarting OSDs for me is
>> a
>> >>>> > bit
>> >>>> > like Russian roulette.  =)  Even though restarting an OSD may not
>> >>>> > always
>> >>>> > result in the error, it seems that once it happens, that OSD is gone
>> >>>> > for
>> >>>> > good.  No amount of restart has brought any of the dead ones back.
>> >>>> >
>> >>>> > I'd really like to get to the bottom of it.  Let me know if I can
>> do
>> >>>> > anything to help.
>> >>>> >
>> >>>> > I may also have to try completely wiping/rebuilding to see if I can
>> >>>> > make
>> >>>> > this thing usable.
>> >>>> >
>> >>>> >
>> >>>> > On Mon, Apr 29, 2013 at 2:38 PM, Travis Rhoden <[email protected]>
>> >>>> > wrote:
>> >>>> >>
>> >>>> >> Hi Sam,
>> >>>> >>
>> >>>> >> Thanks for being willing to take a look.
>> >>>> >>
>> >>>> >> I applied the debug settings on one host that has 3 out of 3 OSDs
>> >>>> >> with this
>> >>>> >> problem.  Then tried to start them up.  Here are the resulting
>> logs:
>> >>>> >>
>> >>>> >> https://dl.dropboxusercontent.com/u/23122069/cephlogs.tgz
>> >>>> >>
>> >>>> >>  - Travis
>> >>>> >>
>> >>>> >>
>> >>>> >> On Mon, Apr 29, 2013 at 1:04 PM, Samuel Just <
>> [email protected]>
>> >>>> >> wrote:
>> >>>> >>>
>> >>>> >>> You appear to be missing pg metadata for some reason.  If you can
>> >>>> >>> reproduce it with
>> >>>> >>> debug osd = 20
>> >>>> >>> debug filestore = 20
>> >>>> >>> debug ms = 1
>> >>>> >>> on all of the OSDs, I should be able to track it down.
>> >>>> >>>
>> >>>> >>> I created a bug: #4855.
>> >>>> >>>
>> >>>> >>> Thanks!
>> >>>> >>> -Sam
>> >>>> >>>
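[For anyone reproducing this, Sam's three debug lines go into ceph.conf as a fragment like the following -- putting them under [osd] is my assumption; [global] would also work. Restart the OSDs after editing so the settings take effect.]

```ini
[osd]
    debug osd = 20
    debug filestore = 20
    debug ms = 1
```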
>> >>>> >>> On Mon, Apr 29, 2013 at 9:52 AM, Travis Rhoden <
>> [email protected]>
>> >>>> >>> wrote:
>> >>>> >>> > Thanks Greg.
>> >>>> >>> >
>> >>>> >>> > I quit playing with it because every time I restarted the
>> cluster
>> >>>> >>> > (service
>> >>>> >>> > ceph -a restart), I lost more OSDs...  The first time it was 1, the
>> >>>> >>> > 2nd time 10, the 3rd time
>> >>>> >>> > 13...  All 13 down OSDs show the same stack trace.
>> >>>> >>> >
>> >>>> >>> >  - Travis
>> >>>> >>> >
>> >>>> >>> >
>> >>>> >>> > On Mon, Apr 29, 2013 at 11:56 AM, Gregory Farnum
>> >>>> >>> > <[email protected]>
>> >>>> >>> > wrote:
>> >>>> >>> >>
>> >>>> >>> >> This sounds vaguely familiar to me, and I see
>> >>>> >>> >> http://tracker.ceph.com/issues/4052, which is marked as
>> "Can't
>> >>>> >>> >> reproduce" — I think maybe this is fixed in "next" and
>> "master",
>> >>>> >>> >> but
>> >>>> >>> >> I'm not sure. For more than that I'd have to defer to Sage or
>> >>>> >>> >> Sam.
>> >>>> >>> >> -Greg
>> >>>> >>> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
>> >>>> >>> >>
>> >>>> >>> >>
>> >>>> >>> >> On Sat, Apr 27, 2013 at 6:43 PM, Travis Rhoden
>> >>>> >>> >> <[email protected]>
>> >>>> >>> >> wrote:
>> >>>> >>> >> > Hey folks,
>> >>>> >>> >> >
>> >>>> >>> >> > I'm helping put together a new test/experimental cluster,
>> and
>> >>>> >>> >> > hit
>> >>>> >>> >> > this
>> >>>> >>> >> > today
>> >>>> >>> >> > when bringing the cluster up for the first time (using
>> >>>> >>> >> > mkcephfs).
>> >>>> >>> >> >
>> >>>> >>> >> > After doing the normal "service ceph -a start", I noticed
>> one
>> >>>> >>> >> > OSD
>> >>>> >>> >> > was
>> >>>> >>> >> > down,
>> >>>> >>> >> > and a lot of PGs were stuck creating.  I tried restarting
>> the
>> >>>> >>> >> > down
>> >>>> >>> >> > OSD,
>> >>>> >>> >> > but
>> >>>> >>> >> > it wouldn't come up.  It always had this error:
>> >>>> >>> >> >
>> >>>> >>> >> >     -1> 2013-04-27 18:11:56.179804 b6fcd000  2 osd.1 0 boot
>> >>>> >>> >> >      0> 2013-04-27 18:11:56.402161 b6fcd000 -1 osd/PG.cc: In function
>> >>>> >>> >> > 'static epoch_t PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
>> >>>> >>> >> > ceph::bufferlist*)' thread b6fcd000 time 2013-04-27 18:11:56.399089
>> >>>> >>> >> > osd/PG.cc: 2556: FAILED assert(values.size() == 1)
>> >>>> >>> >> >
>> >>>> >>> >> >  ceph version 0.60-401-g17a3859 (17a38593d60f5f29b9b66c13c0aaa759762c6d04)
>> >>>> >>> >> >  1: (PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&, ceph::buffer::list*)+0x1ad) [0x2c3c0a]
>> >>>> >>> >> >  2: (OSD::load_pgs()+0x357) [0x28cba0]
>> >>>> >>> >> >  3: (OSD::init()+0x741) [0x290a16]
>> >>>> >>> >> >  4: (main()+0x1427) [0x2155c0]
>> >>>> >>> >> >  5: (__libc_start_main()+0x99) [0xb69bcf42]
>> >>>> >>> >> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>> >>>> >>> >> >
>> >>>> >>> >> >
>> >>>> >>> >> > I then did a full cluster restart, and now I have ten OSDs
>> down
>> >>>> >>> >> > --
>> >>>> >>> >> > each
>> >>>> >>> >> > showing the same exception/failed assert.
>> >>>> >>> >> >
>> >>>> >>> >> > Anybody seen this?
>> >>>> >>> >> >
>> >>>> >>> >> > I know I'm running a weird version -- it's compiled from
>> >>>> >>> >> > source, and
>> >>>> >>> >> > was
>> >>>> >>> >> > provided to me.  The OSDs are all on ARM, and the mon is
>> >>>> >>> >> > x86_64.
>> >>>> >>> >> > Just
>> >>>> >>> >> > looking to see if anyone has seen this particular stack
>> trace
>> >>>> >>> >> > of
>> >>>> >>> >> > load_pgs()/peek_map_epoch() before....
>> >>>> >>> >> >
>> >>>> >>> >> >  - Travis
>> >>>> >>> >> >
>> >>>> >>> >> > _______________________________________________
>> >>>> >>> >> > ceph-users mailing list
>> >>>> >>> >> > [email protected]
>> >>>> >>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>>> >>> >> >
>> >>>> >>> >
>> >>>> >>> >
>> >>>> >>
>> >>>> >>
>> >>>> >
>> >>>
>> >>>
>> >>
>> >
>>
>
>