I'm getting the same issue with one of my OSDs.

Calculating dependencies... done!
[ebuild   R   ~] app-arch/snappy-1.1.0  USE="-static-libs" 0 kB
[ebuild   R   ~] dev-libs/leveldb-1.9.0-r5  USE="snappy -static-libs" 0 kB
[ebuild   R   ~] sys-cluster/ceph-0.60-r1  USE="-debug -fuse -gtk -libatomic -radosgw -static-libs -tcmalloc" 0 kB
Below is my log:
https://docs.google.com/file/d/0BwQnRodV8Actd2NQT25FSnA2cjg/edit?usp=sharing

thanks,
mr.npp

On Tue, Apr 30, 2013 at 9:17 AM, Travis Rhoden <[email protected]> wrote:
> On the OSD node:
>
> root@cepha0:~# lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:    Ubuntu 12.10
> Release:        12.10
> Codename:       quantal
>
> root@cepha0:~# dpkg -l "*leveldb*"
> Desired=Unknown/Install/Remove/Purge/Hold
> | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
> |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
> ||/ Name               Version                  Architecture  Description
> +++-==================-========================-=============-==============================
> ii  libleveldb1:armhf  0+20120530.gitdd0d562-2  armhf         fast key-value storage library
>
> root@cepha0:~# uname -a
> Linux cepha0 3.5.0-27-highbank #46-Ubuntu SMP Mon Mar 25 23:19:40 UTC 2013 armv7l armv7l armv7l GNU/Linux
>
> On the MON node:
>
> # lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:    Ubuntu 12.10
> Release:        12.10
> Codename:       quantal
>
> # uname -a
> Linux 3.5.0-27-generic #46-Ubuntu SMP Mon Mar 25 19:58:17 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>
> # dpkg -l "*leveldb*"
> Desired=Unknown/Install/Remove/Purge/Hold
> | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
> |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
> ||/ Name                 Version                  Architecture  Description
> +++-====================-========================-=============-==============================
> un  leveldb-doc          <none>                                 (no description available)
> ii  libleveldb-dev:amd64 0+20120530.gitdd0d562-2  amd64         fast key-value storage library (development files)
> ii  libleveldb1:amd64    0+20120530.gitdd0d562-2  amd64         fast key-value storage library
>
> On Tue, Apr 30, 2013 at 12:11 PM, Samuel Just <[email protected]> wrote:
>> What version of leveldb is installed? Ubuntu/version?
>> -Sam
>>
>> On Tue, Apr 30, 2013 at 8:50 AM, Travis Rhoden <[email protected]> wrote:
>>> Interestingly, the down OSD does not get marked out after 5 minutes.
>>> Probably that is already fixed by http://tracker.ceph.com/issues/4822.
>>>
>>> On Tue, Apr 30, 2013 at 11:42 AM, Travis Rhoden <[email protected]> wrote:
>>>> Hi Sam,
>>>>
>>>> I was prepared to write in and say that the problem had gone away. I
>>>> tried restarting several OSDs last night in the hopes of capturing the
>>>> problem on an OSD that hadn't failed yet, but didn't have any luck. So
>>>> I did indeed re-create the cluster from scratch (using mkcephfs), and
>>>> what do you know -- everything worked. I got everything in a nice
>>>> stable state, then decided to do a full cluster restart, just to be
>>>> sure. Sure enough, one OSD failed to come up, and has the same stack
>>>> trace.
>>>> So I believe I have the log you want -- just from the OSD that failed,
>>>> right?
>>>>
>>>> Question -- any feeling for what parts of the log you need? It's 688MB
>>>> uncompressed (two hours!), so I'd like to be able to trim some off for
>>>> you before making it available. Do you only need/want the part from
>>>> after the OSD was restarted? Or perhaps the corruption happens on OSD
>>>> shutdown and you need some from before that? If you are fine with that
>>>> large of a file, I can just make that available too. Let me know.
>>>>
>>>> - Travis
>>>>
>>>> On Mon, Apr 29, 2013 at 6:26 PM, Travis Rhoden <[email protected]> wrote:
>>>>> Hi Sam,
>>>>>
>>>>> No problem, I'll leave that debugging turned up high, do a mkcephfs
>>>>> from scratch, and see what happens. Not sure if it will happen again
>>>>> or not. =)
>>>>>
>>>>> Thanks again.
>>>>>
>>>>> - Travis
>>>>>
>>>>> On Mon, Apr 29, 2013 at 5:51 PM, Samuel Just <[email protected]> wrote:
>>>>>> Hmm, I need logging from when the corruption happened. If this is
>>>>>> reproducible, can you enable that logging on a clean osd (or better,
>>>>>> a clean cluster) until the assert occurs?
>>>>>> -Sam
>>>>>>
>>>>>> On Mon, Apr 29, 2013 at 2:45 PM, Travis Rhoden <[email protected]> wrote:
>>>>>>> Also, I can note that it does not take a full cluster restart to
>>>>>>> trigger this. If I just restart an OSD that was up/in previously,
>>>>>>> the same error can happen (though not every time). So restarting
>>>>>>> OSDs for me is a bit like Russian roulette. =) Even though
>>>>>>> restarting an OSD may not always result in the error, it seems that
>>>>>>> once it happens, that OSD is gone for good. No amount of restarting
>>>>>>> has brought any of the dead ones back.
>>>>>>>
>>>>>>> I'd really like to get to the bottom of it.
>>>>>>> Let me know if I can do anything to help.
>>>>>>>
>>>>>>> I may also have to try completely wiping/rebuilding to see if I can
>>>>>>> make this thing usable.
>>>>>>>
>>>>>>> On Mon, Apr 29, 2013 at 2:38 PM, Travis Rhoden <[email protected]> wrote:
>>>>>>>> Hi Sam,
>>>>>>>>
>>>>>>>> Thanks for being willing to take a look.
>>>>>>>>
>>>>>>>> I applied the debug settings on one host that has 3 out of 3 OSDs
>>>>>>>> with this problem, then tried to start them up. Here are the
>>>>>>>> resulting logs:
>>>>>>>>
>>>>>>>> https://dl.dropboxusercontent.com/u/23122069/cephlogs.tgz
>>>>>>>>
>>>>>>>> - Travis
>>>>>>>>
>>>>>>>> On Mon, Apr 29, 2013 at 1:04 PM, Samuel Just <[email protected]> wrote:
>>>>>>>>> You appear to be missing pg metadata for some reason. If you can
>>>>>>>>> reproduce it with
>>>>>>>>>   debug osd = 20
>>>>>>>>>   debug filestore = 20
>>>>>>>>>   debug ms = 1
>>>>>>>>> on all of the OSDs, I should be able to track it down.
>>>>>>>>>
>>>>>>>>> I created a bug: #4855.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>> -Sam
>>>>>>>>>
>>>>>>>>> On Mon, Apr 29, 2013 at 9:52 AM, Travis Rhoden <[email protected]> wrote:
>>>>>>>>>> Thanks Greg.
>>>>>>>>>>
>>>>>>>>>> I quit playing with it because every time I restarted the cluster
>>>>>>>>>> (service ceph -a restart), I lost more OSDs. First time it was 1,
>>>>>>>>>> 2nd time 10, 3rd time 13... All 13 down OSDs show the same stack
>>>>>>>>>> trace.
>>>>>>>>>> - Travis
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 29, 2013 at 11:56 AM, Gregory Farnum <[email protected]> wrote:
>>>>>>>>>>> This sounds vaguely familiar to me, and I see
>>>>>>>>>>> http://tracker.ceph.com/issues/4052, which is marked as "Can't
>>>>>>>>>>> reproduce" — I think maybe this is fixed in "next" and "master",
>>>>>>>>>>> but I'm not sure. For more than that I'd have to defer to Sage
>>>>>>>>>>> or Sam.
>>>>>>>>>>> -Greg
>>>>>>>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Apr 27, 2013 at 6:43 PM, Travis Rhoden <[email protected]> wrote:
>>>>>>>>>>>> Hey folks,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm helping put together a new test/experimental cluster, and
>>>>>>>>>>>> hit this today when bringing the cluster up for the first time
>>>>>>>>>>>> (using mkcephfs).
>>>>>>>>>>>>
>>>>>>>>>>>> After doing the normal "service ceph -a start", I noticed one
>>>>>>>>>>>> OSD was down, and a lot of PGs were stuck creating. I tried
>>>>>>>>>>>> restarting the down OSD, but it wouldn't come up.
>>>>>>>>>>>> It always had this error:
>>>>>>>>>>>>
>>>>>>>>>>>>     -1> 2013-04-27 18:11:56.179804 b6fcd000  2 osd.1 0 boot
>>>>>>>>>>>>      0> 2013-04-27 18:11:56.402161 b6fcd000 -1 osd/PG.cc: In function
>>>>>>>>>>>> 'static epoch_t PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
>>>>>>>>>>>> ceph::bufferlist*)' thread b6fcd000 time 2013-04-27 18:11:56.399089
>>>>>>>>>>>> osd/PG.cc: 2556: FAILED assert(values.size() == 1)
>>>>>>>>>>>>
>>>>>>>>>>>> ceph version 0.60-401-g17a3859 (17a38593d60f5f29b9b66c13c0aaa759762c6d04)
>>>>>>>>>>>> 1: (PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&, ceph::buffer::list*)+0x1ad) [0x2c3c0a]
>>>>>>>>>>>> 2: (OSD::load_pgs()+0x357) [0x28cba0]
>>>>>>>>>>>> 3: (OSD::init()+0x741) [0x290a16]
>>>>>>>>>>>> 4: (main()+0x1427) [0x2155c0]
>>>>>>>>>>>> 5: (__libc_start_main()+0x99) [0xb69bcf42]
>>>>>>>>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>`, is
>>>>>>>>>>>> needed to interpret this.
>>>>>>>>>>>>
>>>>>>>>>>>> I then did a full cluster restart, and now I have ten OSDs down --
>>>>>>>>>>>> each showing the same exception/failed assert.
>>>>>>>>>>>>
>>>>>>>>>>>> Anybody seen this?
>>>>>>>>>>>>
>>>>>>>>>>>> I know I'm running a weird version -- it's compiled from source,
>>>>>>>>>>>> and was provided to me. The OSDs are all on ARM, and the mon is
>>>>>>>>>>>> x86_64. Just looking to see if anyone has seen this particular
>>>>>>>>>>>> stack trace of load_pgs()/peek_map_epoch() before....
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> ceph-users mailing list
>>>>>>>>>>>> [email protected]
>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
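For reference, the debug settings Sam asked for above belong in ceph.conf before starting the OSDs. A minimal sketch, assuming placement under the [osd] section (they could equally go under [global] to cover all daemons):

```ini
; Verbose logging requested for diagnosing the peek_map_epoch assert.
; 20 is the most verbose level; expect very large logs (688MB in two
; hours in the thread above), so revert these once the failure is captured.
[osd]
    debug osd = 20
    debug filestore = 20
    debug ms = 1
```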
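On Travis's question about trimming the 688MB log before sharing it: one option is to keep only the portion written after the most recent daemon start, since the OSD logs a "ceph version ..." banner at startup (visible in the trace above). A hedged sketch, demonstrated on a small fake log; the real file path (e.g. /var/log/ceph/ceph-osd.1.log) is an assumption to adjust for your setup:

```shell
# Build a small stand-in for an OSD log.
printf '%s\n' \
    '2013-04-27 10:00:00 old run line' \
    'ceph version 0.60-401-g17a3859 (earlier start)' \
    '2013-04-27 12:00:00 more old lines' \
    'ceph version 0.60-401-g17a3859 (last start)' \
    '2013-04-27 18:11:56 osd.1 0 boot' \
    'FAILED assert(values.size() == 1)' > fake-osd.log

# tac reverses the file, sed prints up to and including the first startup
# banner it sees (i.e. the last one in file order), and the second tac
# restores the order -- so we keep everything from the last restart
# onward without loading the whole file into memory.
tac fake-osd.log | sed '/ceph version /q' | tac > trimmed-osd.log
cat trimmed-osd.log
```

If the corruption happens at shutdown rather than startup, the cutoff would need to be the previous banner instead; when in doubt, compress the full log (gzip typically shrinks these logs considerably) and share it whole.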
