Interestingly, the down OSD does not get marked out after 5 minutes. That is probably already fixed by http://tracker.ceph.com/issues/4822.
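[For context: the 5-minute window mentioned above corresponds to the monitor option "mon osd down out interval", which defaults to 300 seconds. A minimal ceph.conf fragment stating the default explicitly (the value shown is the default, not a recommendation) would look like:]

```ini
[mon]
    ; OSDs that stay down for this many seconds are automatically marked out
    mon osd down out interval = 300
```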
On Tue, Apr 30, 2013 at 11:42 AM, Travis Rhoden <[email protected]> wrote:
> Hi Sam,
>
> I was prepared to write in and say that the problem had gone away. I
> tried restarting several OSDs last night in the hopes of capturing the
> problem on an OSD that hadn't failed yet, but didn't have any luck. So I
> did indeed re-create the cluster from scratch (using mkcephfs), and what
> do you know -- everything worked. I got everything in a nice stable
> state, then decided to do a full cluster restart, just to be sure. Sure
> enough, one OSD failed to come up, and has the same stack trace. So I
> believe I have the log you want -- just from the OSD that failed, right?
>
> Question -- any feeling for what parts of the log you need? It's 688MB
> uncompressed (two hours!), so I'd like to be able to trim some off for
> you before making it available. Do you only need/want the part from
> after the OSD was restarted? Or perhaps the corruption happens on OSD
> shutdown and you need some from before that? If you are fine with that
> large a file, I can just make that available too. Let me know.
>
> - Travis
>
> On Mon, Apr 29, 2013 at 6:26 PM, Travis Rhoden <[email protected]> wrote:
>> Hi Sam,
>>
>> No problem, I'll leave that debugging turned up high, and do a mkcephfs
>> from scratch and see what happens. Not sure if it will happen again or
>> not. =)
>>
>> Thanks again.
>>
>> - Travis
>>
>> On Mon, Apr 29, 2013 at 5:51 PM, Samuel Just <[email protected]> wrote:
>>> Hmm, I need logging from when the corruption happened. If this is
>>> reproducible, can you enable that logging on a clean osd (or better, a
>>> clean cluster) until the assert occurs?
>>> -Sam
>>>
>>> On Mon, Apr 29, 2013 at 2:45 PM, Travis Rhoden <[email protected]> wrote:
>>>> Also, I can note that it does not take a full cluster restart to
>>>> trigger this. If I just restart an OSD that was up/in previously, the
>>>> same error can happen (though not every time).
>>>> So restarting OSDs for me is a bit like Russian roulette. =) Even
>>>> though restarting an OSD may not always result in the error, it seems
>>>> that once it happens that OSD is gone for good. No amount of
>>>> restarting has brought any of the dead ones back.
>>>>
>>>> I'd really like to get to the bottom of it. Let me know if I can do
>>>> anything to help.
>>>>
>>>> I may also have to try completely wiping/rebuilding to see if I can
>>>> make this thing usable.
>>>>
>>>> On Mon, Apr 29, 2013 at 2:38 PM, Travis Rhoden <[email protected]> wrote:
>>>>> Hi Sam,
>>>>>
>>>>> Thanks for being willing to take a look.
>>>>>
>>>>> I applied the debug settings on one host that has 3 out of 3 OSDs
>>>>> with this problem, then tried to start them up. Here are the
>>>>> resulting logs:
>>>>>
>>>>> https://dl.dropboxusercontent.com/u/23122069/cephlogs.tgz
>>>>>
>>>>> - Travis
>>>>>
>>>>> On Mon, Apr 29, 2013 at 1:04 PM, Samuel Just <[email protected]> wrote:
>>>>>> You appear to be missing pg metadata for some reason. If you can
>>>>>> reproduce it with
>>>>>>   debug osd = 20
>>>>>>   debug filestore = 20
>>>>>>   debug ms = 1
>>>>>> on all of the OSDs, I should be able to track it down.
>>>>>>
>>>>>> I created a bug: #4855.
>>>>>>
>>>>>> Thanks!
>>>>>> -Sam
>>>>>>
>>>>>> On Mon, Apr 29, 2013 at 9:52 AM, Travis Rhoden <[email protected]> wrote:
>>>>>>> Thanks Greg.
>>>>>>>
>>>>>>> I quit playing with it because every time I restarted the cluster
>>>>>>> (service ceph -a restart), I lost more OSDs. The first time it was
>>>>>>> 1, the 2nd time 10, the 3rd time 13... All 13 down OSDs show the
>>>>>>> same stack trace.
>>>>>>>
>>>>>>> - Travis
>>>>>>>
>>>>>>> On Mon, Apr 29, 2013 at 11:56 AM, Gregory Farnum <[email protected]> wrote:
>>>>>>>> This sounds vaguely familiar to me, and I see
>>>>>>>> http://tracker.ceph.com/issues/4052, which is marked as "Can't
>>>>>>>> reproduce" — I think maybe this is fixed in "next" and "master",
>>>>>>>> but I'm not sure. For more than that I'd have to defer to Sage or
>>>>>>>> Sam.
>>>>>>>> -Greg
>>>>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>>>>
>>>>>>>> On Sat, Apr 27, 2013 at 6:43 PM, Travis Rhoden <[email protected]> wrote:
>>>>>>>>> Hey folks,
>>>>>>>>>
>>>>>>>>> I'm helping put together a new test/experimental cluster, and hit
>>>>>>>>> this today when bringing the cluster up for the first time (using
>>>>>>>>> mkcephfs).
>>>>>>>>>
>>>>>>>>> After doing the normal "service ceph -a start", I noticed one OSD
>>>>>>>>> was down, and a lot of PGs were stuck creating. I tried
>>>>>>>>> restarting the down OSD, but it would not come up.
>>>>>>>>> It always had this error:
>>>>>>>>>
>>>>>>>>>     -1> 2013-04-27 18:11:56.179804 b6fcd000  2 osd.1 0 boot
>>>>>>>>>      0> 2013-04-27 18:11:56.402161 b6fcd000 -1 osd/PG.cc: In
>>>>>>>>> function 'static epoch_t PG::peek_map_epoch(ObjectStore*, coll_t,
>>>>>>>>> hobject_t&, ceph::bufferlist*)' thread b6fcd000 time 2013-04-27
>>>>>>>>> 18:11:56.399089
>>>>>>>>> osd/PG.cc: 2556: FAILED assert(values.size() == 1)
>>>>>>>>>
>>>>>>>>> ceph version 0.60-401-g17a3859 (17a38593d60f5f29b9b66c13c0aaa759762c6d04)
>>>>>>>>> 1: (PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
>>>>>>>>> ceph::buffer::list*)+0x1ad) [0x2c3c0a]
>>>>>>>>> 2: (OSD::load_pgs()+0x357) [0x28cba0]
>>>>>>>>> 3: (OSD::init()+0x741) [0x290a16]
>>>>>>>>> 4: (main()+0x1427) [0x2155c0]
>>>>>>>>> 5: (__libc_start_main()+0x99) [0xb69bcf42]
>>>>>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>>>>>>>> needed to interpret this.
>>>>>>>>>
>>>>>>>>> I then did a full cluster restart, and now I have ten OSDs down --
>>>>>>>>> each showing the same exception/failed assert.
>>>>>>>>>
>>>>>>>>> Anybody seen this?
>>>>>>>>>
>>>>>>>>> I know I'm running a weird version -- it's compiled from source,
>>>>>>>>> and was provided to me. The OSDs are all on ARM, and the mon is
>>>>>>>>> x86_64. Just looking to see if anyone has seen this particular
>>>>>>>>> stack trace of load_pgs()/peek_map_epoch() before....
>>>>>>>>>
>>>>>>>>> - Travis
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> ceph-users mailing list
>>>>>>>>> [email protected]
>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
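[For anyone hitting this later: the assert fires because at startup the OSD reads back the single metadata value holding each PG's map epoch, and that entry is missing. A rough Python sketch of the invariant behind "FAILED assert(values.size() == 1)"; the names and the dict-backed store are illustrative, not Ceph's actual C++ code:]

```python
def peek_map_epoch(store, pg_id, key="_epoch"):
    # Stand-in for the omap lookup the OSD does during load_pgs():
    # fetch the metadata entries recorded for this PG.
    values = {k: v for k, v in store.get(pg_id, {}).items() if k == key}
    # The failing check: exactly one epoch record must exist for the PG.
    # Zero results here is what "missing pg metadata" means in the thread.
    assert len(values) == 1, "missing pg metadata for %s" % pg_id
    return values[key]
```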
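[On the question of trimming the 688MB log: since each OSD log line starts with a timestamp, one way to keep only the post-restart portion is a lexicographic comparison in awk; the cutoff timestamp and filenames below are placeholders, not values from the thread:]

```shell
# Keep only log lines whose leading "date time" fields sort at or after
# the restart time; ISO-style timestamps compare correctly as strings.
awk '$1 " " $2 >= "2013-04-30 09:00:00"' ceph-osd.1.log > trimmed.log
```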
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
