Hi Sam,

No problem, I'll leave that debugging turned up high, do a mkcephfs from scratch, and see what happens. Not sure if it will happen again or not. =)
Thanks again.

- Travis

On Mon, Apr 29, 2013 at 5:51 PM, Samuel Just <[email protected]> wrote:
> Hmm, I need logging from when the corruption happened. If this is
> reproducible, can you enable that logging on a clean osd (or better, a
> clean cluster) until the assert occurs?
> -Sam
>
> On Mon, Apr 29, 2013 at 2:45 PM, Travis Rhoden <[email protected]> wrote:
> > Also, I can note that it does not take a full cluster restart to trigger
> > this. If I just restart an OSD that was up/in previously, the same error
> > can happen (though not every time). So restarting OSDs for me is a bit
> > like Russian roulette. =) Even though restarting an OSD may not always
> > result in the error, it seems that once it happens, that OSD is gone for
> > good. No amount of restarting has brought any of the dead ones back.
> >
> > I'd really like to get to the bottom of it. Let me know if I can do
> > anything to help.
> >
> > I may also have to try completely wiping/rebuilding to see if I can make
> > this thing usable.
> >
> > On Mon, Apr 29, 2013 at 2:38 PM, Travis Rhoden <[email protected]> wrote:
> >> Hi Sam,
> >>
> >> Thanks for being willing to take a look.
> >>
> >> I applied the debug settings on one host where 3 out of 3 OSDs have this
> >> problem, then tried to start them up. Here are the resulting logs:
> >>
> >> https://dl.dropboxusercontent.com/u/23122069/cephlogs.tgz
> >>
> >> - Travis
> >>
> >> On Mon, Apr 29, 2013 at 1:04 PM, Samuel Just <[email protected]> wrote:
> >>> You appear to be missing pg metadata for some reason. If you can
> >>> reproduce it with
> >>> debug osd = 20
> >>> debug filestore = 20
> >>> debug ms = 1
> >>> on all of the OSDs, I should be able to track it down.
> >>>
> >>> I created a bug: #4855.
> >>>
> >>> Thanks!
> >>> -Sam
> >>>
> >>> On Mon, Apr 29, 2013 at 9:52 AM, Travis Rhoden <[email protected]> wrote:
> >>> > Thanks Greg.
> >>> >
> >>> > I quit playing with it because every time I restarted the cluster
> >>> > (service ceph -a restart), I lost more OSDs. The first time it was 1,
> >>> > the 2nd time 10, the 3rd time 13... All 13 down OSDs show the same
> >>> > stack trace.
> >>> >
> >>> > - Travis
> >>> >
> >>> > On Mon, Apr 29, 2013 at 11:56 AM, Gregory Farnum <[email protected]>
> >>> > wrote:
> >>> >> This sounds vaguely familiar to me, and I see
> >>> >> http://tracker.ceph.com/issues/4052, which is marked as "Can't
> >>> >> reproduce" — I think maybe this is fixed in "next" and "master", but
> >>> >> I'm not sure. For more than that I'd have to defer to Sage or Sam.
> >>> >> -Greg
> >>> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
> >>> >>
> >>> >> On Sat, Apr 27, 2013 at 6:43 PM, Travis Rhoden <[email protected]> wrote:
> >>> >> > Hey folks,
> >>> >> >
> >>> >> > I'm helping put together a new test/experimental cluster, and hit
> >>> >> > this today when bringing the cluster up for the first time (using
> >>> >> > mkcephfs).
> >>> >> >
> >>> >> > After doing the normal "service ceph -a start", I noticed one OSD
> >>> >> > was down, and a lot of PGs were stuck creating. I tried restarting
> >>> >> > the down OSD, but it would not come up.
> >>> >> > It always had this error:
> >>> >> >
> >>> >> > -1> 2013-04-27 18:11:56.179804 b6fcd000 2 osd.1 0 boot
> >>> >> > 0> 2013-04-27 18:11:56.402161 b6fcd000 -1 osd/PG.cc: In function
> >>> >> > 'static epoch_t PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
> >>> >> > ceph::bufferlist*)' thread b6fcd000 time 2013-04-27 18:11:56.399089
> >>> >> > osd/PG.cc: 2556: FAILED assert(values.size() == 1)
> >>> >> >
> >>> >> > ceph version 0.60-401-g17a3859 (17a38593d60f5f29b9b66c13c0aaa759762c6d04)
> >>> >> > 1: (PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
> >>> >> > ceph::buffer::list*)+0x1ad) [0x2c3c0a]
> >>> >> > 2: (OSD::load_pgs()+0x357) [0x28cba0]
> >>> >> > 3: (OSD::init()+0x741) [0x290a16]
> >>> >> > 4: (main()+0x1427) [0x2155c0]
> >>> >> > 5: (__libc_start_main()+0x99) [0xb69bcf42]
> >>> >> > NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> >>> >> > needed to interpret this.
> >>> >> >
> >>> >> > I then did a full cluster restart, and now I have ten OSDs down --
> >>> >> > each showing the same exception/failed assert.
> >>> >> >
> >>> >> > Anybody seen this?
> >>> >> >
> >>> >> > I know I'm running a weird version -- it's compiled from source, and
> >>> >> > was provided to me. The OSDs are all on ARM, and the mon is x86_64.
> >>> >> > Just looking to see if anyone has seen this particular stack trace of
> >>> >> > load_pgs()/peek_map_epoch() before...
> >>> >> >
> >>> >> > - Travis
> >>> >> >
> >>> >> > _______________________________________________
> >>> >> > ceph-users mailing list
> >>> >> > [email protected]
> >>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>> >
> >>> >
> >>
> >>
> >
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
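[Editor's note: for anyone reproducing this, the debug options Sam lists above are ordinary ceph.conf settings. A minimal sketch of how they would typically be placed (putting them under [osd] so they apply to all OSDs on the host is standard ceph.conf convention, not something spelled out in the thread):

```ini
# /etc/ceph/ceph.conf on each OSD host -- restart the OSDs afterward.
# These levels are very verbose; expect the logs to grow quickly.
[osd]
debug osd = 20
debug filestore = 20
debug ms = 1
```

In releases of that era the same levels could usually also be injected into a running daemon with `ceph tell osd.N injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'`, though since this assert fires during OSD startup, the settings need to be in the conf file before the restart that triggers it.]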
