Re: osd crash when deep-scrubbing
Jiaying Ren <...gmail.com> writes:
> Hi, cephers:
>
> I've encountered a problem that a pg is stuck in inconsistent status:
> ...
> The ceph version is v0.80.9; manually executing `ceph pg deep-scrub 3.d70`
> would also cause an osd crash.
>
> Any ideas? Or did I miss some logs necessary for further investigation?

I have met a problem when running the 'ceph pg deep-scrub' command: it also
caused an osd crash. I eventually found that some sectors of the disk had
become corrupted, so please check the dmesg output for disk errors.
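A minimal sketch of the disk-health checks being suggested here (the device
name is an example, and smartctl assumes smartmontools is installed):

    # look for medium/sector errors reported by the kernel
    dmesg | grep -iE 'error|sector|i/o'
    # check reallocated/pending sector counts on the suspect drive
    smartctl -a /dev/sdb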
osd crash when deep-scrubbing
Hi, cephers:

I've encountered a problem that a pg is stuck in inconsistent status:

$ ceph -s
    cluster 27d39faa-48ae-4356-a8e3-19d5b81e179e
     health HEALTH_ERR 1 pgs inconsistent; 34 near full osd(s); 1 scrub errors; noout flag(s) set
     monmap e4: 3 mons at {server-61.0..x.in=10.8.0.61:6789/0,server-62.0..x.in=10.8.0.62:6789/0,server-63.0..x.in=10.8.0.63:6789/0}, election epoch 6706, quorum 0,1,2 server-61.0..x.in,server-62.0..x.in,server-63.0..x.in
     osdmap e87808: 180 osds: 180 up, 180 in
            flags noout
      pgmap v29322850: 35026 pgs, 15 pools, 27768 GB data, 1905 kobjects
            83575 GB used, 114 TB / 196 TB avail
               35025 active+clean
                   1 active+clean+inconsistent
  client io 120 kB/s rd, 216 MB/s wr, 6398 op/s

The `pg repair` command doesn't work, so I manually repaired an inconsistent
object (pool size is 3; I removed the copy that differed from the other two).
After that the pg is still in inconsistent status:

$ ceph pg dump | grep active+clean+inconsistent
dumped all in format plain
3.d70   290     0       0       0       4600869888      3050    3050    stale+active+clean+inconsistent 2015-10-18 13:05:43.320451      87798'7631234   87798:10758311  [131,119,132]   131     [131,119,132]   131     85161'7599152   2015-10-16 14:34:21.283303      85161'7599152   2015-10-16 14:34:21.283303

And after I restarted osd.131, the primary osd (osd.131) crashed. The backtrace:

 1: /usr/bin/ceph-osd() [0x9c6de1]
 2: (()+0xf790) [0x7f384b6b8790]
 3: (gsignal()+0x35) [0x7f384a58a625]
 4: (abort()+0x175) [0x7f384a58be05]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7f384ae44a5d]
 6: (()+0xbcbe6) [0x7f384ae42be6]
 7: (()+0xbcc13) [0x7f384ae42c13]
 8: (()+0xbcd0e) [0x7f384ae42d0e]
 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x13e) [0x9cd0de]
 10: (object_info_t::decode(ceph::buffer::list::iterator&)+0x81) [0x7dfaf1]
 11: (PG::_scan_snaps(ScrubMap&)+0x394) [0x84b8c4]
 12: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x27b) [0x84cdab]
 13: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x5c4) [0x85c1b4]
 14: (PG::scrub(ThreadPool::TPHandle&)+0x181) [0x85d691]
 15: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x1c) [0x6737cc]
 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x53d) [0x9e05dd]
 17: (ThreadPool::WorkThread::entry()+0x10) [0x9e1760]
 18: (()+0x7a51) [0x7f384b6b0a51]
 19: (clone()+0x6d) [0x7f384a6409ad]

The ceph version is v0.80.9; manually executing `ceph pg deep-scrub 3.d70`
would also cause an osd crash.

Any ideas? Or did I miss some logs necessary for further investigation?

Thx.

--
Best Regards!
Jiaying Ren (mikulely)
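For anyone hitting something similar: before removing a divergent replica by
hand, it may be worth comparing all three copies on disk first. A rough
sketch, where the object name and paths are placeholders rather than values
from this thread:

    # confirm which osds hold the pg
    $ ceph pg map 3.d70
    # on each of osd.131, osd.119, osd.132: locate the object's backing file
    $ cd /var/lib/ceph/osd/ceph-131/current
    $ find 3.d70_head -name '*OBJECTNAME*'
    # compare sizes, checksums, and xattrs across the three copies
    $ md5sum <path-from-find>
    $ getfattr -d <path-from-find>
    # move (don't delete) the odd one out, then re-run: ceph pg repair 3.d70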
Re: osd crash with object store set to newstore
Hi Sage,

Did you get a chance to look at the crash?

Regards
Srikanth

On Wed, Jun 3, 2015 at 1:38 PM, Srikanth Madugundi
<srikanth.madugu...@gmail.com> wrote:
> Hi Sage,
>
> I saw the crash again; here is the output after adding the debug message
> from wip-newstore-debuglist:
>
>  -31> 2015-06-03 20:28:18.864496 7fd95976b700 -1 newstore(/var/lib/ceph/osd/ceph-19) start is -1/0//0/0 ... k is --.7fff..!!!.
>
> Here is the id of the file I posted:
>
> ceph-post-file: ddfcf940-8c13-4913-a7b9-436c1a7d0804
>
> Let me know if you need anything else.
>
> Regards
> Srikanth
Re: osd crash with object store set to newstore
On Fri, 5 Jun 2015, Srikanth Madugundi wrote:
> Hi Sage,
>
> Did you get a chance to look at the crash?

Not yet--I am still focusing on getting wip-temp (and other newstore
prerequisite code) working before turning back to newstore.  I'll look at
this once I get back to newstore... hopefully in the next week or so!

sage
Re: osd crash with object store set to newstore
Hi Sage,

I saw the crash again; here is the output after adding the debug message from
wip-newstore-debuglist:

 -31> 2015-06-03 20:28:18.864496 7fd95976b700 -1 newstore(/var/lib/ceph/osd/ceph-19) start is -1/0//0/0 ... k is --.7fff..!!!.

Here is the id of the file I posted:

ceph-post-file: ddfcf940-8c13-4913-a7b9-436c1a7d0804

Let me know if you need anything else.

Regards
Srikanth

On Mon, Jun 1, 2015 at 10:25 PM, Srikanth Madugundi
<srikanth.madugu...@gmail.com> wrote:
> Hi Sage,
>
> Unfortunately I purged the cluster yesterday and restarted the backfill
> tool. I did not see the osd crash yet on the cluster. I am monitoring the
> OSDs and will update you once I see the crash.
>
> With the new backfill run I have reduced the rps by half; not sure if this
> is the reason for not seeing the crash yet.
>
> Regards
> Srikanth
Re: osd crash with object store set to newstore
I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash
with that branch with 'debug newstore = 20' and send us the log?  (You can
just do 'ceph-post-file filename'.)

Thanks!
sage

On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
> Hi Sage,
>
> The assertion failed at line 1639; here is the log message:
>
> 2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In function 'virtual int NewStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*, ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174
> os/newstore/NewStore.cc: 1639: FAILED assert(k >= start_key && k < end_key)
>
> Just before the crash, here are the debug statements printed by the method
> (collection_list_partial):
>
> 2015-05-30 22:49:23.607232 7f1681934700 15 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head start -1/0//0/0 min/max 1024/1024 snap head
> 2015-05-30 22:49:23.607251 7f1681934700 20 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range --.7fb4.. to --.7fb4.0800. and --.804b.. to --.804b.0800. start -1/0//0/0
>
> Regards
> Srikanth
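A sketch of one way to carry out what Sage asks for here, mirroring the
injectargs form used elsewhere in these threads (the osd id and log path are
examples):

    # raise newstore debugging on the affected osd
    # (or set 'debug newstore = 20' under [osd] in ceph.conf and restart)
    ceph tell osd.19 injectargs '--debug_newstore 20'
    # ...reproduce the crash, then upload the log for the developers:
    ceph-post-file /var/log/ceph/ceph-osd.19.log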
osd crash with object store set to newstore
Hi Sage and all,

I built ceph from the wip-newstore branch on RHEL7 and am running performance
tests to compare with filestore. After a few hours of running the tests the
osd daemons started to crash. Here is the stack trace; the osd crashes
immediately after the restart, so I could not get the osd up and running.

ceph version eb8e22893f44979613738dfcdd40dada2b513118 (eb8e22893f44979613738dfcdd40dada2b513118)
 1: /usr/bin/ceph-osd() [0xb84652]
 2: (()+0xf130) [0x7f915f84f130]
 3: (gsignal()+0x39) [0x7f915e2695c9]
 4: (abort()+0x148) [0x7f915e26acd8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
 6: (()+0x5e946) [0x7f915eb6b946]
 7: (()+0x5e973) [0x7f915eb6b973]
 8: (()+0x5eb9f) [0x7f915eb6bb9f]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xc84c5a]
 10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x13c9) [0xa08639]
 11: (PGBackend::objects_list_partial(hobject_t const&, int, int, snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*, hobject_t*)+0x352) [0x918a02]
 12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
 13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1eb) [0x8cd06b]
 14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x68a) [0x85dbea]
 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed) [0x6c3f5d]
 16: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
 19: (()+0x7df3) [0x7f915f847df3]
 20: (clone()+0x6d) [0x7f915e32a01d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Please let me know the cause of this crash. When this crash happens I notice
that two osds on separate machines are down. I can bring one osd up, but
restarting the other osd causes both OSDs to crash. My understanding is the
crash seems to happen when two OSDs try to communicate and replicate a
particular PG.

Regards
Srikanth
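Since the trace ends with the usual NOTE about needing the binary to
interpret it, here is a minimal sketch of resolving one of the raw addresses
(the address is frame 10 above; this assumes the ceph-osd debug symbols are
installed):

    # full annotated disassembly, as the NOTE suggests:
    objdump -rdS /usr/bin/ceph-osd > ceph-osd.asm
    # or resolve a single frame to a demangled function and file:line
    addr2line -Cfe /usr/bin/ceph-osd 0xa08639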
Re: osd crash with object store set to newstore
Hi Sage,

The assertion failed at line 1639; here is the log message:

2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In function 'virtual int NewStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*, ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174
os/newstore/NewStore.cc: 1639: FAILED assert(k >= start_key && k < end_key)

Just before the crash, here are the debug statements printed by the method
(collection_list_partial):

2015-05-30 22:49:23.607232 7f1681934700 15 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head start -1/0//0/0 min/max 1024/1024 snap head
2015-05-30 22:49:23.607251 7f1681934700 20 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range --.7fb4.. to --.7fb4.0800. and --.804b.. to --.804b.0800. start -1/0//0/0

Regards
Srikanth

On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil <s...@newdream.net> wrote:
> Can you include the log lines that precede the dump above?  In particular,
> there should be a line that tells you what assertion failed in what
> function and at what line number.  I haven't seen this crash so I'm not
> sure offhand what it is.
>
> Thanks!
> sage
Re: osd crash with object store set to newstore
On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
> Hi Sage and all,
>
> I built ceph from the wip-newstore branch on RHEL7 and am running
> performance tests to compare with filestore. After a few hours of running
> the tests the osd daemons started to crash. Here is the stack trace; the
> osd crashes immediately after the restart, so I could not get the osd up
> and running.
>
> ceph version eb8e22893f44979613738dfcdd40dada2b513118 (eb8e22893f44979613738dfcdd40dada2b513118)
>  1: /usr/bin/ceph-osd() [0xb84652]
> ...
>  20: (clone()+0x6d) [0x7f915e32a01d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> Please let me know the cause of this crash. When this crash happens I
> notice that two osds on separate machines are down. I can bring one osd up,
> but restarting the other osd causes both OSDs to crash. My understanding is
> the crash seems to happen when two OSDs try to communicate and replicate a
> particular PG.

Can you include the log lines that precede the dump above?  In particular,
there should be a line that tells you what assertion failed in what function
and at what line number.  I haven't seen this crash so I'm not sure offhand
what it is.

Thanks!
sage
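A quick sketch of how one might grab those preceding lines from the osd log
(the osd id is an example):

    # show each assert line plus the twenty log lines leading up to it
    grep -B 20 'FAILED assert' /var/log/ceph/ceph-osd.7.log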
Re: osd crash with object store set to newstore
Hi Sage,

Unfortunately I purged the cluster yesterday and restarted the backfill tool.
I did not see the osd crash yet on the cluster. I am monitoring the OSDs and
will update you once I see the crash.

With the new backfill run I have reduced the rps by half; not sure if this is
the reason for not seeing the crash yet.

Regards
Srikanth

On Mon, Jun 1, 2015 at 10:06 PM, Sage Weil <s...@newdream.net> wrote:
> I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash
> with that branch with 'debug newstore = 20' and send us the log?  (You can
> just do 'ceph-post-file filename'.)
>
> Thanks!
> sage
OSD Crash for xattr _ absent issue.
Hi, Samuel & Sage

In our current production environment, we see osd crashes caused by data
inconsistency when reading the "_" xattr. This is described in
http://tracker.ceph.com/issues/10117, and I also found a two-year-old issue
describing the same bug: http://tracker.ceph.com/issues/3676.

I think there is an apparent flaw in the related code. Could you help review
my last comment describing the way to fix the bug? I prefer the second way:
we just delete the object if we can't get the "_" xattr, instead of crashing
the osd; the object has two other replicas, which can serve the client's
requests. The next time a self-healing process (scrub, deep scrub) runs, the
object can be recovered from its peers.

Because I am not so proficient with the source code, I don't know whether
this way of repairing has any other side effects on the ceph cluster. If you
have any idea about the bug, please feel free to let me know.

Thanks
Wenjunh
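For context, the "_" xattr in question is stored as user.ceph._ on the
object's backing file in the osd data directory, so its presence can be
checked directly; a sketch, with a placeholder path:

    # prints the attribute if present, errors out if it is missing
    getfattr -n user.ceph._ /var/lib/ceph/osd/ceph-0/current/<pgid>_head/<object-file>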
Bobtail to dumpling (was: OSD crash during repair)
On Fri, Sep 06, 2013 at 08:21:07AM -0700, Sage Weil wrote:
> On Fri, 6 Sep 2013, Chris Dunlop wrote:
>> On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote:
>>> Also, you should upgrade to dumpling. :)
>>
>> I've been considering it. It was initially a little scary with the various
>> issues that were cropping up but that all seems to have quietened down. Of
>> course I'd like my cluster to be clean before attempting an upgrade!
>
> Definitely. Let us know how it goes! :)

Upgraded, directly from bobtail to dumpling.

Well, that was a mite more traumatic than I expected. I had two issues, both
my fault...

Firstly, I didn't realise I should have restarted the osds one at a time
rather than doing 'service ceph restart' on each host quickly in succession.
Restarting them all at once meant everything was offline whilst PGs were
upgrading.

Secondly, whilst I saw the 'osd crush update on start' issue in the release
notes, and checked that my crush map hostnames match the actual hostnames, I
have two separate pools (for fast SAS vs bulk SATA disks) and I stupidly only
noticed the one which matched, but not the other which didn't match. So on
restart all the osds moved into the one pool, and started rebalancing.

The two issues at the same time produced quite the adrenaline rush! :-)

My current crush configuration is below (host b2 is recently added and I
haven't added it into the pools yet). Is there a better/recommended way of
using the crush map to support separate pools to avoid setting 'osd crush
update on start = false'? It doesn't seem that I can use the same 'host'
names under the separate 'sas' and 'default' roots?

Cheers,
Chris

--
# ceph osd tree
# id    weight  type name               up/down reweight
-8      2       root sas
-7      2           rack sas-rack-1
-5      1               host b4-sas
4       0.5                 osd.4       up      1
5       0.5                 osd.5       up      1
-6      1               host b5-sas
2       0.5                 osd.2       up      1
3       0.5                 osd.3       up      1
-1      12.66   root default
-3      8           rack unknownrack
-2      4               host b4
0       2                   osd.0       up      1
7       2                   osd.7       up      1
-4      4               host b5
1       2                   osd.1       up      1
6       2                   osd.6       up      1
-9      4.66            host b2
10      1.82                osd.10      up      1
11      1.82                osd.11      up      1
8       0.51                osd.8       up      1
9       0.51                osd.9       up      1

--
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host b4 {
        id -2           # do not change unnecessarily
        # weight 4.000
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 2.000
        item osd.7 weight 2.000
}
host b5 {
        id -4           # do not change unnecessarily
        # weight 4.000
        alg straw
        hash 0  # rjenkins1
        item osd.1 weight 2.000
        item osd.6 weight 2.000
}
rack unknownrack {
        id -3           # do not change unnecessarily
        # weight 8.000
        alg straw
        hash 0  # rjenkins1
        item b4 weight 4.000
        item b5 weight 4.000
}
host b2 {
        id -9           # do not change unnecessarily
        # weight 4.660
        alg straw
        hash 0  # rjenkins1
        item osd.10 weight 1.820
        item osd.11 weight 1.820
        item osd.8 weight 0.510
        item osd.9 weight 0.510
}
root default {
        id -1           # do not change unnecessarily
        # weight 12.660
        alg straw
        hash 0  # rjenkins1
        item unknownrack weight 8.000
        item b2 weight 4.660
}
host b4-sas {
        id -5           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.4 weight 0.500
        item osd.5 weight 0.500
}
host b5-sas {
        id -6           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.2 weight 0.500
        item osd.3 weight 0.500
}
rack sas-rack-1 {
        id -7           # do not change unnecessarily
        # weight 2.000
        alg straw
        hash 0  # rjenkins1
        item b4-sas weight 1.000
        item b5-sas weight 1.000
}
root sas {
        id -8           #
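A sketch of the one-at-a-time restart that avoids the first issue described
above (bobtail/dumpling-era sysvinit syntax; osd ids are examples):

    # restart each osd on a host in turn, letting the cluster settle between
    for id in 0 7; do
        service ceph restart osd.$id
        # wait until health returns to HEALTH_OK before the next osd
        while ! ceph health | grep -q HEALTH_OK; do sleep 10; done
    done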
Re: Bobtail to dumpling (was: OSD crash during repair)
On Wed, 11 Sep 2013, Chris Dunlop wrote:
> Upgraded, directly from bobtail to dumpling.
>
> Well, that was a mite more traumatic than I expected. I had two issues,
> both my fault...
>
> Firstly, I didn't realise I should have restarted the osds one at a time
> rather than doing 'service ceph restart' on each host quickly in
> succession. Restarting them all at once meant everything was offline whilst
> PGs were upgrading.
>
> Secondly, whilst I saw the 'osd crush update on start' issue in the release
> notes, and checked that my crush map hostnames match the actual hostnames,
> I have two separate pools (for fast SAS vs bulk SATA disks) and I stupidly
> only noticed the one which matched, but not the other which didn't match.
> So on restart all the osds moved into the one pool, and started
> rebalancing.
>
> The two issues at the same time produced quite the adrenaline rush! :-)

I can imagine!

> My current crush configuration is below (host b2 is recently added and I
> haven't added it into the pools yet). Is there a better/recommended way of
> using the crush map to support separate pools to avoid setting 'osd crush
> update on start = false'? It doesn't seem that I can use the same 'host'
> names under the separate 'sas' and 'default' roots?

For now we don't have a better solution than setting 'osd crush update on
start = false'.  Sorry!  I'm guessing that it is pretty uncommon for disks to
switch hosts, at least. :/

We could come up with a 'standard' way of structuring these sorts of maps
with prefixes or suffixes on the bucket names; I'm open to suggestions.

However, I'm also wondering if we should take the next step at the same time
and embed another dimension in the CRUSH tree so that CRUSH itself
understands that it is host=b4 (say) but it is only looking at the sas or ssd
items.  This would (help) allow rules along the lines of "pick 3 hosts;
choose the ssd from the first and sas disks from the other two."  I'm not
convinced that is an especially good idea for most users, but it's probably
worth considering.

sage
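For reference, the workaround being discussed is an osd setting in ceph.conf;
a minimal sketch of how it is typically set on each osd host:

    [osd]
        osd crush update on start = false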
Re: OSD crash during repair
On Fri, 6 Sep 2013, Chris Dunlop wrote:
> On Fri, Sep 06, 2013 at 01:12:21PM +1000, Chris Dunlop wrote:
>> On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote:
>>> On Fri, 6 Sep 2013, Chris Dunlop wrote:
>>>> Hi Sage,
>>>>
>>>> Does this answer your question?
>>>>
>>>> 2013-09-06 09:30:19.813811 7f0ae8cbc700  0 log [INF] : applying configuration change: internal_safe_to_start_threads = 'true'
>>>> 2013-09-06 09:33:28.303658 7f0ae94bd700  0 log [ERR] : 2.12 osd.7: soid 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 extra attr _, extra attr snapset
>>>> 2013-09-06 09:33:28.303685 7f0ae94bd700  0 log [ERR] : repair 2.12 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 no 'snapset' attr
>>>> 2013-09-06 09:34:45.138468 7f0ae94bd700  0 log [ERR] : 2.12 repair stat mismatch, got 2722/2723 objects, 339/339 clones, 11307104768/11311299072 bytes.
>>>> 2013-09-06 09:34:45.142215 7f0ae94bd700  0 log [ERR] : 2.12 repair 0 missing, 1 inconsistent objects
>>>> 2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal (Aborted) **
>>>>
>>>> I've just attached the full 'debug_osd 0/10' log to the bug report.
>>>
>>> This suggests to me that the object on osd.6 is missing those xattrs; can
>>> you confirm with getfattr -d in osd.6's data directory?
>>
>> I haven't yet wrapped my head around how to translate an oid like those
>> above into an underlying file system object. What directory should I be
>> looking at?
>
> Found it:
>
> b5# cd /var/lib/ceph/osd/ceph-6/current
> b5# find 2.12* | grep -i 17d9b.2ae8944a.1e11
> 2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2
> b5# getfattr -d 2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2
> ...crickets...
>
> vs.
>
> b4# cd /var/lib/ceph/osd/ceph-7/current
> b4# getfattr -d 2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2
> # file: 2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2
> user.ceph._=0sCgjhBANBACByYi4wLjE3ZDliLjJhZTg5NDRhLjAwMDAwMDAwMWUxMf7/EnqYVgAAAgAEAxACAP8AAEInCgAAuEsAAEEnCgAAuEsAAAICFQgTmwEAAHD1AgAAQAAAyY4dUpjCTSACAhUAAABCJwoAALhL
> user.ceph.snapset=0sAgIZAAABAA==
>
>>> If that is indeed the case, you should be able to move the object out of
>>> the way (don't delete it, just in case) and then do the repair.  The
>>> osd.6 should recover by copying the object from osd.7 (which has the
>>> needed xattrs).  Bobtail is smart enough to recover missing objects but
>>> not to recover just missing xattrs.
>>
>> Do you want me to hold off on any repairs to allow tracking down the
>> crash, or is the current code sufficiently different that there's little
>> point?
>
> Repaired!
>
> ...but why does it take multiple rounds?

Excellent!

It's because the first round repairs the object, but doesn't take its own
change into account when verifying/recalculating the PG stats (object count,
byte sum).  The second pass just fixes up that arithmetic.

sage

> b5# mv 2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2 ..
> b5# ceph pg repair 2.12
> b5# while ceph -s | grep -q scrubbing; do sleep 60; done
> b5# tail /var/log/ceph/ceph-osd.6.log
> 2013-09-06 15:02:13.751160 7f6ccc5ae700  0 log [ERR] : 2.12 osd.6 missing 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2
> 2013-09-06 15:04:15.286711 7f6ccc5ae700  0 log [ERR] : 2.12 repair stat mismatch, got 2723/2724 objects, 339/339 clones, 11311299072/11315493376 bytes.
> 2013-09-06 15:04:15.286766 7f6ccc5ae700  0 log [ERR] : 2.12 repair 1 missing, 0 inconsistent objects
> 2013-09-06 15:04:15.286823 7f6ccc5ae700  0 log [ERR] : 2.12 repair 2 errors, 2 fixed
> 2013-09-06 15:04:20.778377 7f6ccc5ae700  0 log [ERR] : 2.12 scrub stat mismatch, got 2724/2723 objects, 339/339 clones, 11315493376/11311299072 bytes.
> 2013-09-06 15:04:20.778383 7f6ccc5ae700  0 log [ERR] : 2.12 scrub 1 errors
> b5# ceph pg dump | grep inconsistent
> 2.12    2723    0       0       0       11311299072     159103  159103  active+clean+inconsistent       2013-09-06 15:04:20.778413      20121'690883    20128'7941893   [6,7]   [6,7]   20121'690883    2013-09-06 15:04:20.778387      20121'690883    2013-09-06 15:04:15.286835
> b5# ceph pg repair 2.12
> b5# while ceph -s | grep -q scrubbing; do sleep 60; done
> b5# tail /var/log/ceph/ceph-osd.6.log
> 2013-09-06 15:07:30.461959 7f6ccc5ae700  0 log [ERR] : 2.12 repair stat mismatch, got 2724/2723 objects, 339/339 clones, 11315493376/11311299072 bytes.
> 2013-09-06 15:07:30.461991 7f6ccc5ae700  0 log [ERR] : 2.12 repair 1 errors, 1 fixed
> b5# ceph pg dump | grep inconsistent
> 2.12    2724    0       0       0       11315493376     159580  159580  active+clean+inconsistent      2013-09-06 15:07:30.462039      20129'690886    20128'7942171   [6,7]   [6,7]   20129'690886
Re: OSD crash during repair
On Fri, 6 Sep 2013, Chris Dunlop wrote:
> On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote:
>> On Fri, 6 Sep 2013, Chris Dunlop wrote:
>>> Hi Sage,
>>>
>>> Does this answer your question?
>>>
>>> 2013-09-06 09:30:19.813811 7f0ae8cbc700  0 log [INF] : applying configuration change: internal_safe_to_start_threads = 'true'
>>> 2013-09-06 09:33:28.303658 7f0ae94bd700  0 log [ERR] : 2.12 osd.7: soid 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 extra attr _, extra attr snapset
>>> 2013-09-06 09:33:28.303685 7f0ae94bd700  0 log [ERR] : repair 2.12 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 no 'snapset' attr
>>> 2013-09-06 09:34:45.138468 7f0ae94bd700  0 log [ERR] : 2.12 repair stat mismatch, got 2722/2723 objects, 339/339 clones, 11307104768/11311299072 bytes.
>>> 2013-09-06 09:34:45.142215 7f0ae94bd700  0 log [ERR] : 2.12 repair 0 missing, 1 inconsistent objects
>>> 2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal (Aborted) **
>>>
>>> I've just attached the full 'debug_osd 0/10' log to the bug report.
>>
>> This suggests to me that the object on osd.6 is missing those xattrs; can
>> you confirm with getfattr -d in osd.6's data directory?
>
> I haven't yet wrapped my head around how to translate an oid like those
> above into an underlying file system object. What directory should I be
> looking at?

It's the osd.6 data directory (maybe /var/lib/ceph/osd/ceph-6, or whatever
you configured), /current/$pgid_head/.../*rb.0.17d9b.2ae8944a.1e11*.  In your
case $pgid is 2.12.  Do a

 find . | grep rb.0.17d9b.2ae8944a.1e11

and you will see it pop up (with head in there along with some other stuff).

 getfattr -d $file

to confirm the user.ceph._ and user.ceph.snapset xattrs are missing.  I would
also confirm that they are present on the same file in osd.7's data
directory.  Maybe do a sanity check to make sure the objects otherwise look
like they match (file size, md5sum, etc.).  Assuming osd.7 doesn't look
obviously wrong (e.g., 0 bytes or something), rename the bad osd.6 copy out
of the way and let repair recover it for you.  Note that you might have to do
repair twice to make the pg stats numbers reflect the just-repaired object.

>> If that is indeed the case, you should be able to move the object out of
>> the way (don't delete it, just in case) and then do the repair.  The osd.6
>> should recover by copying the object from osd.7 (which has the needed
>> xattrs).  Bobtail is smart enough to recover missing objects but not to
>> recover just missing xattrs.
>
> Do you want me to hold off on any repairs to allow tracking down the crash,
> or is the current code sufficiently different that there's little point?

There is little point with bobtail.

>> Also, you should upgrade to dumpling. :)
>
> I've been considering it. It was initially a little scary with the various
> issues that were cropping up but that all seems to have quietened down. Of
> course I'd like my cluster to be clean before attempting an upgrade!

Definitely.  Let us know how it goes! :)

sage
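Condensing those steps into a single checklist (osd ids and the object name
are the ones from this thread; paths may differ on other setups):

    # on osd.6 (the suspect copy):
    cd /var/lib/ceph/osd/ceph-6/current
    find . | grep rb.0.17d9b.2ae8944a.1e11
    getfattr -d <file-from-find>   # expect user.ceph._ and user.ceph.snapset
    md5sum <file-from-find>
    # repeat on osd.7 and compare size/checksum, then move (don't delete)
    # the bad osd.6 copy aside and run: ceph pg repair 2.12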
OSD crash during repair
G'day,

I'm getting an OSD crash on 0.56.7-1~bpo70+1 whilst trying to repair an OSD:
http://tracker.ceph.com/issues/6233

ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33)
 1: /usr/bin/ceph-osd() [0x8530a2]
 2: (()+0xf030) [0x7f541ca39030]
 3: (gsignal()+0x35) [0x7f541b132475]
 4: (abort()+0x180) [0x7f541b1356f0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f541b98789d]
 6: (()+0x63996) [0x7f541b985996]
 7: (()+0x639c3) [0x7f541b9859c3]
 8: (()+0x63bee) [0x7f541b985bee]
 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x127) [0x8fa9a7]
 10: (object_info_t::decode(ceph::buffer::list::iterator&)+0x29) [0x95b579]
 11: (object_info_t::object_info_t(ceph::buffer::list&)+0x180) [0x695ec0]
 12: (PG::repair_object(hobject_t const&, ScrubMap::object*, int, int)+0xc7) [0x7646b7]
 13: (PG::scrub_process_inconsistent()+0x9bd) [0x76534d]
 14: (PG::scrub_finish()+0x4f) [0x76587f]
 15: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x10d6) [0x76cb96]
 16: (PG::scrub(ThreadPool::TPHandle&)+0x138) [0x76d7e8]
 17: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0xf) [0x70515f]
 18: (ThreadPool::worker(ThreadPool::WorkThread*)+0x992) [0x8f0542]
 19: (ThreadPool::WorkThread::entry()+0x10) [0x8f14d0]
 20: (()+0x6b50) [0x7f541ca30b50]
 21: (clone()+0x6d) [0x7f541b1daa7d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

This occurs as a result of:

# ceph pg dump | grep inconsistent
2.12    2723    0       0       0       11311299072     159189  159189  active+clean+inconsistent       2013-09-06 09:35:47.512119      20117'690441    20120'7914185   [6,7]   [6,7]   20021'675967    2013-09-03 15:58:12.459188      19384'665404    2013-08-28 12:42:07.490877
# ceph pg repair 2.12

Looking at PG::repair_object per line 12 of the backtrace, I can see a
dout(10) which should tell me the problem object:

src/osd/PG.cc:

void PG::repair_object(const hobject_t &soid, ScrubMap::object *po,
                       int bad_peer, int ok_peer)
{
  dout(10) << "repair_object " << soid
           << " bad_peer osd." << bad_peer
           << " ok_peer osd." << ok_peer << dendl;
  ...
}

The 'ceph pg dump' output above tells me the primary osd is '6', so I can
increase the logging level to 10 on osd.6 to get the debug output, and repair
again:

# ceph osd tell 6 injectargs '--debug_osd 0/10'
# ceph pg repair 2.12

I get the same OSD crash, but this time it logs the dout from above, which
shows the problem object:

    -1> 2013-09-06 09:34:45.142224 7f0ae94bd700 10 osd.6 pg_epoch: 20117 pg[2.12( v 20117'690441 (20117'689440,20117'690441] local-les=20115 n=2722 ec=1 les/c 20115/20115 20108/20112/20112) [6,7] r=0 lpr=20112 mlcod 20117'690440 active+scrubbing+deep+repair] repair_object 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 bad_peer osd.7 ok_peer osd.6
     0> 2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal (Aborted) **

So... Firstly, is anyone interested in further investigating the problem to
fix the crash behaviour? And, what's the best way to fix the pool?

Cheers,
Chris
Re: OSD crash during repair
Hi Chris, What is the inconsistency that scrub reports in the log? My guess is that the simplest way to resolve this is to remove whichever copy you decide is invalid, but it depends on what the inconstency it is trying/failing to repair is. Thanks! sage On Fri, 6 Sep 2013, Chris Dunlop wrote: G'day, I'm getting an OSD crash on 0.56.7-1~bpo70+1 whilst trying to repair an OSD: http://tracker.ceph.com/issues/6233 ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33) 1: /usr/bin/ceph-osd() [0x8530a2] 2: (()+0xf030) [0x7f541ca39030] 3: (gsignal()+0x35) [0x7f541b132475] 4: (abort()+0x180) [0x7f541b1356f0] 5: (_gnu_cxx::_verbose_terminate_handler()+0x11d) [0x7f541b98789d] 6: (()+0x63996) [0x7f541b985996] 7: (()+0x639c3) [0x7f541b9859c3] 8: (()+0x63bee) [0x7f541b985bee] 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x127) [0x8fa9a7] 10: (object_info_t::decode(ceph::buffer::list::iterator)+0x29) [0x95b579] 11: (object_info_t::object_info_t(ceph::buffer::list)+0x180) [0x695ec0] 12: (PG::repair_object(hobject_t const, ScrubMap::object*, int, int)+0xc7) [0x7646b7] 13: (PG::scrub_process_inconsistent()+0x9bd) [0x76534d] 14: (PG::scrub_finish()+0x4f) [0x76587f] 15: (PG::chunky_scrub(ThreadPool::TPHandle)+0x10d6) [0x76cb96] 16: (PG::scrub(ThreadPool::TPHandle)+0x138) [0x76d7e8] 17: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle)+0xf) [0x70515f] 18: (ThreadPool::worker(ThreadPool::WorkThread*)+0x992) [0x8f0542] 19: (ThreadPool::WorkThread::entry()+0x10) [0x8f14d0] 20: (()+0x6b50) [0x7f541ca30b50] 21: (clone()+0x6d) [0x7f541b1daa7d] NOTE: a copy of the executable, or `objdump -rdS lt;executablegt;` is needed to interpret this. This occurs as a result of: # ceph pg dump | grep inconsistent 2.1227230 0 0 11311299072 159189 159189 active+clean+inconsistent 2013-09-06 09:35:47.512119 20117'690441 20120'7914185 [6,7] [6,7] 20021'6759672013-09-03 15:58:12.459188 19384'6654042013-08-28 12:42:07.490877 # ceph pg repair 2.12 Looking at PG::repair_object per line 12 of the backtrace, I can see a dout(10) which should tell me the problem object: src/osd/PG.cc: void PG::repair_object(const hobject_t soid, ScrubMap::object *po, int bad_peer, int ok_peer) { dout(10) repair_object soid bad_peer osd. bad_peer ok_peer osd. ok_peer dendl; ... } The 'ceph pg dump' output above tells me the primary osd is '6', so I can increase the logging level to 10 on osd.6 to get the debug output, and repair again: # ceph osd tell 6 injectargs '--debug_osd 0/10' # ceph pg repair 2.12 I get the same OSD crash, but this time it logs the dout from above, which shows the problem object: -1 2013-09-06 09:34:45.142224 7f0ae94bd700 10 osd.6 pg_epoch: 20117 pg[2.12( v 20117'690441 (20117'689440,20117'690441] local-les=20115 n=2722 ec=1 les/c 20115/20115 20108/20112/20112) [6,7] r=0 lpr=20112 mlcod 20117'690440 active+scrubbing+deep+repair] repair_object 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 bad_peer osd.7 ok_peer osd.6 0 2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal (Aborted) ** So... Firstly, is anyone interested in further investigating the problem to fix the crash behaviour? And, what's the best way to fix the pool? 
Cheers,

Chris
Re: OSD crash during repair
Hi Sage,

Does this answer your question?

2013-09-06 09:30:19.813811 7f0ae8cbc700  0 log [INF] : applying configuration change: internal_safe_to_start_threads = 'true'
2013-09-06 09:33:28.303658 7f0ae94bd700  0 log [ERR] : 2.12 osd.7: soid 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 extra attr _, extra attr snapset
2013-09-06 09:33:28.303685 7f0ae94bd700  0 log [ERR] : repair 2.12 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 no 'snapset' attr
2013-09-06 09:34:45.138468 7f0ae94bd700  0 log [ERR] : 2.12 repair stat mismatch, got 2722/2723 objects, 339/339 clones, 11307104768/11311299072 bytes.
2013-09-06 09:34:45.142215 7f0ae94bd700  0 log [ERR] : 2.12 repair 0 missing, 1 inconsistent objects
2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal (Aborted) **

I've just attached the full 'debug_osd 0/10' log to the bug report.

Thanks,

Chris

On Thu, Sep 05, 2013 at 07:38:47PM -0700, Sage Weil wrote:
> What is the inconsistency that scrub reports in the log? My guess is
> that the simplest way to resolve this is to remove whichever copy you
> decide is invalid, but it depends on what the inconsistency it is
> trying/failing to repair is.
> [...]
Re: OSD crash during repair
On Fri, 6 Sep 2013, Chris Dunlop wrote:
> Does this answer your question?
>
> 2013-09-06 09:33:28.303658 7f0ae94bd700 0 log [ERR] : 2.12 osd.7: soid 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 extra attr _, extra attr snapset
> 2013-09-06 09:33:28.303685 7f0ae94bd700 0 log [ERR] : repair 2.12 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 no 'snapset' attr
> [...]
> I've just attached the full 'debug_osd 0/10' log to the bug report.

This suggests to me that the object on osd.6 is missing those xattrs; can you confirm with getfattr -d on the object in osd.6's data directory?

If that is indeed the case, you should be able to move the object out of the way (don't delete it, just in case) and then do the repair. The osd.6 should recover by copying the object from osd.7 (which has the needed xattrs). Bobtail is smart enough to recover missing objects but not to recover just missing xattrs.

Also, you should upgrade to dumpling. :)

sage
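For reference, a minimal sketch of the check-and-move-aside procedure Sage describes (paths are assumptions based on this thread; the find pattern is just one way to locate the filestore's hashed DIR_* path for the object):

# on the osd.6 host:
$ cd /var/lib/ceph/osd/ceph-6/current
$ OBJ=$(find 2.12_head -name '*17d9b.2ae8944a*1e11*')
$ getfattr -d "$OBJ"      # the bad copy should show no user.ceph._ / user.ceph.snapset attrs
$ mv "$OBJ" ~/            # move aside, don't delete
$ ceph pg repair 2.12     # recovery should copy the object back from osd.7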
Re: OSD crash during repair
On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote:
> This suggests to me that the object on osd.6 is missing those xattrs;
> can you confirm with getfattr -d on the object in osd.6's data
> directory?

I haven't yet wrapped my head around how to translate an oid like those above into an underlying file system object. What directory should I be looking at?

> If that is indeed the case, you should be able to move the object out
> of the way (don't delete it, just in case) and then do the repair. The
> osd.6 should recover by copying the object from osd.7 (which has the
> needed xattrs). Bobtail is smart enough to recover missing objects but
> not to recover just missing xattrs.

Do you want me to hold off on any repairs to allow tracking down the crash, or is the current code sufficiently different that there's little point?

> Also, you should upgrade to dumpling. :)

I've been considering it. It was initially a little scary with the various issues that were cropping up but that all seems to have quietened down. Of course I'd like my cluster to be clean before attempting an upgrade!

Chris
Re: OSD crash during repair
On Fri, Sep 06, 2013 at 01:12:21PM +1000, Chris Dunlop wrote:
> On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote:
>> This suggests to me that the object on osd.6 is missing those xattrs;
>> can you confirm with getfattr -d on the object in osd.6's data
>> directory?
>
> I haven't yet wrapped my head around how to translate an oid like those
> above into an underlying file system object. What directory should I be
> looking at?

Found it:

b5# cd /var/lib/ceph/osd/ceph-6/current
b5# find 2.12* | grep -i 17d9b.2ae8944a.1e11
2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2
b5# getfattr -d 2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2
...crickets...

vs.

b4# cd /var/lib/ceph/osd/ceph-7/current
b4# getfattr -d 2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2
# file: 2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2
user.ceph._=0sCgjhBANBACByYi4wLjE3ZDliLjJhZTg5NDRhLjAwMDAwMDAwMWUxMf7/EnqYVgAAAgAEAxACAP8AAEInCgAAuEsAAEEnCgAAuEsAAAICFQgTmwEAAHD1AgAAQAAAyY4dUpjCTSACAhUAAABCJwoAALhL
user.ceph.snapset=0sAgIZAAABAA==

>> If that is indeed the case, you should be able to move the object out
>> of the way (don't delete it, just in case) and then do the repair.
>> [...]
>
> Do you want me to hold off on any repairs to allow tracking down the
> crash, or is the current code sufficiently different that there's
> little point?

Repaired! ...but why does it take multiple rounds?

b5# mv 2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2 ..
b5# ceph pg repair 2.12
b5# while ceph -s | grep -q scrubbing; do sleep 60; done
b5# tail /var/log/ceph/ceph-osd.6.log
2013-09-06 15:02:13.751160 7f6ccc5ae700 0 log [ERR] : 2.12 osd.6 missing 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2
2013-09-06 15:04:15.286711 7f6ccc5ae700 0 log [ERR] : 2.12 repair stat mismatch, got 2723/2724 objects, 339/339 clones, 11311299072/11315493376 bytes.
2013-09-06 15:04:15.286766 7f6ccc5ae700 0 log [ERR] : 2.12 repair 1 missing, 0 inconsistent objects
2013-09-06 15:04:15.286823 7f6ccc5ae700 0 log [ERR] : 2.12 repair 2 errors, 2 fixed
2013-09-06 15:04:20.778377 7f6ccc5ae700 0 log [ERR] : 2.12 scrub stat mismatch, got 2724/2723 objects, 339/339 clones, 11315493376/11311299072 bytes.
2013-09-06 15:04:20.778383 7f6ccc5ae700 0 log [ERR] : 2.12 scrub 1 errors

b5# ceph pg dump | grep inconsistent
2.12 2723 0 0 0 11311299072 159103 159103 active+clean+inconsistent 2013-09-06 15:04:20.778413 20121'690883 20128'7941893 [6,7] [6,7] 20121'690883 2013-09-06 15:04:20.778387 20121'690883 2013-09-06 15:04:15.286835

b5# ceph pg repair 2.12
b5# while ceph -s | grep -q scrubbing; do sleep 60; done
b5# tail /var/log/ceph/ceph-osd.6.log
2013-09-06 15:07:30.461959 7f6ccc5ae700 0 log [ERR] : 2.12 repair stat mismatch, got 2724/2723 objects, 339/339 clones, 11315493376/11311299072 bytes.
2013-09-06 15:07:30.461991 7f6ccc5ae700 0 log [ERR] : 2.12 repair 1 errors, 1 fixed

b5# ceph pg dump | grep inconsistent
2.12 2724 0 0 0 11315493376 159580 159580 active+clean+inconsistent 2013-09-06 15:07:30.462039 20129'690886 20128'7942171 [6,7] [6,7] 20129'690886 2013-09-06 15:07:30.461995 20129'690886 2013-09-06 15:07:30.461995

b5# ceph pg repair 2.12
b5# while ceph -s | grep -q scrubbing; do sleep 60; done
b5# tail /var/log/ceph/ceph-osd.6.log
2013-09-06 15:09:36.993049 7f6ccc5ae700 0 log [INF] : 2.12 repair ok, 0 fixed

# ceph pg dump | grep inconsistent
...crickets...

Chris
OSD crash upon pool creation
Hello,

Using db2bb270e93ed44f9252d65d1d4c9b36875d0ea5 I observed some disaster-like behaviour after a ``pool create'' command - every osd daemon in the cluster will die at least once (some will crash several times in a row after being brought back). Please take a look at the backtraces (almost identical) below. Issue #5637 is created in the tracker.

Thanks!

http://xdel.ru/downloads/poolcreate.txt.gz
http://xdel.ru/downloads/poolcreate2.txt.gz
OSD Crash
I had one of my OSDs crash yesterday. I'm using ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5). The part of the log file where the crash happened is attached. Not really sure what led up to it, but I did get an alert from my server monitor telling me my swap space got really low around the time it crashed.

The OSD reconnected after restarting the service. Currently, I'm waiting patiently as 1 of my 400 pgs gets out of active+clean+scrubbing status.

Dave Spano
Optogenics
Systems Administrator

-17 2013-03-03 13:02:13.478152 7f5d5a9b5700 5 --OSD::tracker-- reqid: client.13039.0:6860359, seq: 5393222, time: 2013-03-03 13:02:13.478134, event: write_thread_in_journal_buffer, request: osd_sub_op(client.13039.0:6860359 3.0 a10c17c8/rb.0.2dd7.16d28c4f.002f/head//3 [] v 411'1980074 snapset=0=[]:[] snapc=0=[]) v7
-16 2013-03-03 13:02:13.478153 7f5d559ab700 1 -- 192.168.3.11:6801/4500 --> osd.1 192.168.3.12:6802/2467 -- osd_sub_op_reply(client.14000.1:570700 0.16 5e01a96/13797f2./head//0 [] ondisk, result = 0) v1 -- ?+0 0xc45cc80
-15 2013-03-03 13:02:13.478184 7f5d5a9b5700 5 --OSD::tracker-- reqid: client.14000.1:570701, seq: 5393223, time: 2013-03-03 13:02:13.478184, event: write_thread_in_journal_buffer, request: osd_sub_op(client.14000.1:570701 0.22 40dccca2/11164ca.0002/head//0 [] v 411'447369 snapset=0=[]:[] snapc=0=[]) v7
-14 2013-03-03 13:02:13.478209 7f5d5a9b5700 5 --OSD::tracker-- reqid: client.11755.0:2625658, seq: 5393225, time: 2013-03-03 13:02:13.478209, event: write_thread_in_journal_buffer, request: osd_sub_op(client.11755.0:2625658 3.7 2cb006a7/rb.0.2ea4.614c277f.103d/head//3 [] v 411'6095529 snapset=0=[]:[] snapc=0=[]) v7
-13 2013-03-03 13:02:13.478234 7f5d5a9b5700 5 --OSD::tracker-- reqid: client.11755.0:2625659, seq: 5393226, time: 2013-03-03 13:02:13.478234, event: write_thread_in_journal_buffer, request: osd_sub_op(client.11755.0:2625659 3.7 2cb006a7/rb.0.2ea4.614c277f.103d/head//3 [] v 411'6095530 snapset=0=[]:[] snapc=0=[]) v7
-12 2013-03-03 13:02:13.484696 7f5d549a9700 1 -- 192.168.3.11:6800/4500 <== client.11755 192.168.1.64:0/1062411 90128 ==== ping v1 ==== 0+0+0 (0 0 0) 0xff4e000 con 0x307a6e0
-11 2013-03-03 13:02:13.489457 7f5d4f99f700 5 --OSD::tracker-- reqid: client.11755.0:2625660, seq: 5393227, time: 2013-03-03 13:02:13.489457, event: started, request: osd_sub_op(client.11755.0:2625660 3.7 2cb006a7/rb.0.2ea4.614c277f.103d/head//3 [] v 411'6095531 snapset=0=[]:[] snapc=0=[]) v7
-10 2013-03-03 13:02:13.489503 7f5d4f99f700 5 --OSD::tracker-- reqid: client.11755.0:2625660, seq: 5393227, time: 2013-03-03 13:02:13.489503, event: commit_queued_for_journal_write, request: osd_sub_op(client.11755.0:2625660 3.7 2cb006a7/rb.0.2ea4.614c277f.103d/head//3 [] v 411'6095531 snapset=0=[]:[] snapc=0=[]) v7
-9 2013-03-03 13:02:13.571632 7f5d501a0700 5 --OSD::tracker-- reqid: client.11755.0:2625657, seq: 5393224, time: 2013-03-03 13:02:13.571631, event: started, request: osd_op(client.11755.0:2625657 rb.0.2ea4.614c277f.003d [write 1253376~4096] 3.c7bd6ff1) v4
-8 2013-03-03 13:02:13.571661 7f5d501a0700 5 --OSD::tracker-- reqid: client.11755.0:2625657, seq: 5393224, time: 2013-03-03 13:02:13.571661, event: started, request: osd_op(client.11755.0:2625657 rb.0.2ea4.614c277f.003d [write 1253376~4096] 3.c7bd6ff1) v4
-7 2013-03-03 13:02:13.571733 7f5d501a0700 5 --OSD::tracker-- reqid: client.11755.0:2625657, seq: 5393224, time: 2013-03-03 13:02:13.571733, event: waiting for subops from [1], request: osd_op(client.11755.0:2625657 rb.0.2ea4.614c277f.003d [write 1253376~4096] 3.c7bd6ff1) v4
-6 2013-03-03 13:02:13.598028 7f5d5a9b5700 5 --OSD::tracker-- reqid: client.13039.0:6860359, seq: 5393222, time: 2013-03-03 13:02:13.598027, event: journaled_completion_queued, request: osd_sub_op(client.13039.0:6860359 3.0 a10c17c8/rb.0.2dd7.16d28c4f.002f/head//3 [] v 411'1980074 snapset=0=[]:[] snapc=0=[]) v7
-5 2013-03-03 13:02:13.598061 7f5d5a9b5700 5 --OSD::tracker-- reqid: client.14000.1:570701, seq: 5393223, time: 2013-03-03 13:02:13.598061, event: journaled_completion_queued, request: osd_sub_op(client.14000.1:570701 0.22 40dccca2/11164ca.0002/head//0 [] v 411'447369 snapset=0=[]:[] snapc=0=[]) v7
-4 2013-03-03 13:02:13.598081 7f5d5a9b5700 5 --OSD::tracker-- reqid: client.11755.0:2625658, seq: 5393225, time: 2013-03-03 13:02:13.598081, event: journaled_completion_queued, request: osd_sub_op(client.11755.0:2625658 3.7 2cb006a7/rb.0.2ea4.614c277f.103d/head//3 [] v 411'6095529 snapset=0=[]:[] snapc=0=[]) v7
-3 2013-03-03 13:02:13.598098 7f5d5a9b5700 5 --OSD::tracker-- reqid: client.11755.0:2625659, seq: 5393226, time: 2013-03-03 13:02:13.598098, event: journaled_completion_queued, request: osd_sub_op(client.11755.0:2625659 3.7 2cb006a7/rb.0.2ea4.614c277f.103d/head//3 [] v 411'6095530 snapset=0=[]:[] snapc=0=[]) v7
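Given the low-swap alert, one generic check worth doing (my suggestion, not from the original post) is whether the kernel OOM killer took the osd out around that time:

$ dmesg | grep -i -E 'out of memory|oom|killed process'
$ grep -i oom /var/log/syslog    # or /var/log/messages, depending on distro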
Re: OSD crash, ceph version 0.56.1
On Wed, Jan 9, 2013 at 4:38 PM, Sage Weil <s...@inktank.com> wrote:
> On Wed, 9 Jan 2013, Ian Pye wrote:
>> Hi, Every time I try and bring up an OSD, it crashes and I get the
>> following: error (121) Remote I/O error not handled on operation 20
>
> This error code (EREMOTEIO) is not used by Ceph. What fs are you using?
> Which kernel version? Anything else unusual happen with your hardware
> recently that might have wreaked havoc on your underlying fs?

3.7.1 kernel with XFS. It's a demo-box from a vendor, so should be brand new. I'm going to say it's a disk error, given the following:

mkfs.xfs: read failed: Input/output error

Interestingly, running an osd with btrfs worked fine on the same disk.

Thanks for the help,

Ian

> sage
>
>> The cluster is new and only has a little bit of data on it. Any ideas
>> what is going on? Does Remote I/O mean a network error? Full log below:
>>
>> -9 2013-01-10 00:00:20.182237 7f2ddde8f910 0 filestore(/mnt/dist_j/ceph) error (121) Remote I/O error not handled on operation 20 (12.0.0, or op 0, counting from 0)
>> -8 2013-01-10 00:00:20.182275 7f2ddde8f910 0 filestore(/mnt/dist_j/ceph) unexpected error code
>> -7 2013-01-10 00:00:20.182285 7f2ddde8f910 0 filestore(/mnt/dist_j/ceph) transaction dump:
>> { "ops": [
>>   { "op_num": 0, "op_name": "mkcoll", "collection": "0.2c0_head"},
>>   { "op_num": 1, "op_name": "collection_setattr", "collection": "0.2c0_head", "name": "info", "length": 5},
>>   { "op_num": 2, "op_name": "truncate", "collection": "meta", "oid": "a04c46e9\/pginfo_0.2c0\/0\/\/-1", "offset": 0},
>>   { "op_num": 3, "op_name": "write", "collection": "meta", "oid": "a04c46e9\/pginfo_0.2c0\/0\/\/-1", "length": 531, "offset": 0, "bufferlist length": 531},
>>   { "op_num": 4, "op_name": "remove", "collection": "meta", "oid": "1f9ede85\/pglog_0.2c0\/0\/\/-1"},
>>   { "op_num": 5, "op_name": "write", "collection": "meta", "oid": "1f9ede85\/pglog_0.2c0\/0\/\/-1", "length": 0, "offset": 0, "bufferlist length": 0},
>>   { "op_num": 6, "op_name": "collection_setattr", "collection": "0.2c0_head", "name": "ondisklog", "length": 34},
>>   { "op_num": 7, "op_name": "nop"}]}
>> -6 2013-01-10 00:00:20.183085 7f2dd5e7f910 10 monclient: _send_mon_message to mon.a at 108.162.209.120:6789/0
>> -5 2013-01-10 00:00:20.183108 7f2dd5e7f910 1 -- 108.162.209.120:6834/6359 --> 108.162.209.120:6789/0 -- osd_pgtemp(e22 {0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]} v22) v1 -- ?+0 0x5b15600 con 0x34629a0
>> -4 2013-01-10 00:00:20.183772 7f2dd6680910 10 monclient: _send_mon_message to mon.a at 108.162.209.120:6789/0
>> -3 2013-01-10 00:00:20.183797 7f2dd6680910 1 -- 108.162.209.120:6834/6359 --> 108.162.209.120:6789/0 -- osd_pgtemp(e22 {0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]} v22) v1 -- ?+0 0x5f75600 con 0x34629a0
>> -2 2013-01-10 00:00:20.184315 7f2dd5e7f910 10 monclient: _send_mon_message to mon.a at 108.162.209.120:6789/0
>> -1 2013-01-10 00:00:20.184338 7f2dd5e7f910 1 -- 108.162.209.120:6834/6359 --> 108.162.209.120:6789/0 -- osd_pgtemp(e22 {0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]} v22) v1 -- ?+0 0x5b15400 con 0x34629a0
>> 0 2013-01-10 00:00:20.184755 7f2ddde8f910 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)' thread 7f2ddde8f910 time 2013-01-10 00:00:20.182422
>> os/FileStore.cc: 2681: FAILED assert(0 == "unexpected error")
>>
>> ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
>> 1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int)+0x90a) [0x73e14a]
>> 2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x4c) [0x7455dc]
>> 3: (FileStore::_do_op(FileStore::OpSequencer*)+0xab) [0x72428b]
>> 4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x894feb]
>> 5: (ThreadPool::WorkThread::entry()+0x10) [0x8977d0]
>> 6: /lib/libpthread.so.0 [0x7f2de6d087aa]
>> 7: (clone()+0x6d) [0x7f2de518159d]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> --- logging levels ---
>> 0/ 5 none
>> 0/ 1 lockdep
>> 0/ 1 context
>> 1/ 1 crush
>> 1/ 5 mds
>> 1/ 5 mds_balancer
>> 1/ 5 mds_locker
>> 1/ 5 mds_log
>> 1/ 5 mds_log_expire
>> 1/ 5 mds_migrator
>> 0/ 1 buffer
>> 0/ 1 timer
>> 0/ 1 filer
>> 0/ 1 striper
>> 0/ 1 objecter
>> 0/ 5 rados
>> 0/ 5 rbd
>> 0/ 5 journaler
>> 0/ 5 objectcacher
>> 0/ 5
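Since the conclusion here is a failing disk, two generic checks that would back it up (a hedged sketch; /dev/sdX is a placeholder and smartmontools is assumed installed):

$ smartctl -a /dev/sdX | grep -i -E 'reallocated|pending|uncorrectable'
$ dd if=/dev/sdX of=/dev/null bs=1M    # full read pass; I/O errors show up here and in dmesg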
osd crash after reboot
Hello list,

after a reboot of my node i see this on all OSDs of this node after the reboot:

2012-12-14 09:03:20.393224 7f8e652f8780 -1 osd/OSD.cc: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f8e652f8780 time 2012-12-14 09:03:20.392528
osd/OSD.cc: 4385: FAILED assert(_get_map_bl(epoch, bl))

ceph version 0.55-239-gc951c27 (c951c270a42b94b6f269992c9001d90f70a2b824)
1: (OSDService::get_map(unsigned int)+0x918) [0x607f78]
2: (OSD::load_pgs()+0x13ed) [0x6168ad]
3: (OSD::init()+0xaff) [0x617a5f]
4: (main()+0x2de6) [0x55a416]
5: (__libc_start_main()+0xfd) [0x7f8e63093c8d]
6: /usr/bin/ceph-osd() [0x557269]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
-29 2012-12-14 09:03:20.266349 7f8e652f8780 5 asok(0x285c000) register_command perfcounters_dump hook 0x2850010
-28 2012-12-14 09:03:20.266366 7f8e652f8780 5 asok(0x285c000) register_command 1 hook 0x2850010
-27 2012-12-14 09:03:20.266369 7f8e652f8780 5 asok(0x285c000) register_command perf dump hook 0x2850010
-26 2012-12-14 09:03:20.266379 7f8e652f8780 5 asok(0x285c000) register_command perfcounters_schema hook 0x2850010
-25 2012-12-14 09:03:20.266383 7f8e652f8780 5 asok(0x285c000) register_command 2 hook 0x2850010
-24 2012-12-14 09:03:20.266386 7f8e652f8780 5 asok(0x285c000) register_command perf schema hook 0x2850010
-23 2012-12-14 09:03:20.266389 7f8e652f8780 5 asok(0x285c000) register_command config show hook 0x2850010
-22 2012-12-14 09:03:20.266392 7f8e652f8780 5 asok(0x285c000) register_command config set hook 0x2850010
-21 2012-12-14 09:03:20.266396 7f8e652f8780 5 asok(0x285c000) register_command log flush hook 0x2850010
-20 2012-12-14 09:03:20.266398 7f8e652f8780 5 asok(0x285c000) register_command log dump hook 0x2850010
-19 2012-12-14 09:03:20.266401 7f8e652f8780 5 asok(0x285c000) register_command log reopen hook 0x2850010
-18 2012-12-14 09:03:20.267686 7f8e652f8780 0 ceph version 0.55-239-gc951c27 (c951c270a42b94b6f269992c9001d90f70a2b824), process ceph-osd, pid 7212
-17 2012-12-14 09:03:20.268738 7f8e652f8780 1 finished global_init_daemonize
-16 2012-12-14 09:03:20.275957 7f8e652f8780 0 filestore(/ceph/osd.1/) mount FIEMAP ioctl is supported and appears to work
-15 2012-12-14 09:03:20.275968 7f8e652f8780 0 filestore(/ceph/osd.1/) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
-14 2012-12-14 09:03:20.276177 7f8e652f8780 0 filestore(/ceph/osd.1/) mount did NOT detect btrfs
-13 2012-12-14 09:03:20.277051 7f8e652f8780 0 filestore(/ceph/osd.1/) mount syscall(__NR_syncfs, fd) fully supported
-12 2012-12-14 09:03:20.277585 7f8e652f8780 0 filestore(/ceph/osd.1/) mount found snaps
-11 2012-12-14 09:03:20.278899 7f8e652f8780 0 filestore(/ceph/osd.1/) mount: enabling WRITEAHEAD journal mode: btrfs not detected
-10 2012-12-14 09:03:20.290745 7f8e652f8780 0 journal kernel version is 3.6.10
-9 2012-12-14 09:03:20.320728 7f8e652f8780 0 journal kernel version is 3.6.10
-8 2012-12-14 09:03:20.328381 7f8e652f8780 0 filestore(/ceph/osd.1/) mount FIEMAP ioctl is supported and appears to work
-7 2012-12-14 09:03:20.328391 7f8e652f8780 0 filestore(/ceph/osd.1/) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
-6 2012-12-14 09:03:20.328574 7f8e652f8780 0 filestore(/ceph/osd.1/) mount did NOT detect btrfs
-5 2012-12-14 09:03:20.329579 7f8e652f8780 0 filestore(/ceph/osd.1/) mount syscall(__NR_syncfs, fd) fully supported
-4 2012-12-14 09:03:20.329612 7f8e652f8780 0 filestore(/ceph/osd.1/) mount found snaps
-3 2012-12-14 09:03:20.330786 7f8e652f8780 0 filestore(/ceph/osd.1/) mount: enabling WRITEAHEAD journal mode: btrfs not detected
-2 2012-12-14 09:03:20.340711 7f8e652f8780 0 journal kernel version is 3.6.10
-1 2012-12-14 09:03:20.370707 7f8e652f8780 0 journal kernel version is 3.6.10
0 2012-12-14 09:03:20.393224 7f8e652f8780 -1 osd/OSD.cc: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f8e652f8780 time 2012-12-14 09:03:20.392528
osd/OSD.cc: 4385: FAILED assert(_get_map_bl(epoch, bl))

ceph version 0.55-239-gc951c27 (c951c270a42b94b6f269992c9001d90f70a2b824)
1: (OSDService::get_map(unsigned int)+0x918) [0x607f78]
2: (OSD::load_pgs()+0x13ed) [0x6168ad]
3: (OSD::init()+0xaff) [0x617a5f]
4: (main()+0x2de6) [0x55a416]
5: (__libc_start_main()+0xfd) [0x7f8e63093c8d]
6: /usr/bin/ceph-osd() [0x557269]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Stefan
Re: osd crash after reboot
same log more verbose:

11 ec=10 les/c 3307/3307 3306/3306/3306) [] r=0 lpr=0 lcod 0'0 mlcod 0'0 inactive] read_log done
-11 2012-12-14 09:17:50.648572 7fb6e0d6b780 10 osd.3 pg_epoch: 3996 pg[3.44b( v 3988'3969 (1379'2968,3988'3969] local-les=3307 n=11 ec=10 les/c 3307/3307 3306/3306/3306) [3,12] r=0 lpr=0 lcod 0'0 mlcod 0'0 inactive] handle_loaded
-10 2012-12-14 09:17:50.648581 7fb6e0d6b780 20 osd.3 pg_epoch: 3996 pg[3.44b( v 3988'3969 (1379'2968,3988'3969] local-les=3307 n=11 ec=10 les/c 3307/3307 3306/3306/3306) [3,12] r=0 lpr=0 lcod 0'0 mlcod 0'0 inactive] exit Initial 0.015080 0 0.00
-9 2012-12-14 09:17:50.648591 7fb6e0d6b780 20 osd.3 pg_epoch: 3996 pg[3.44b( v 3988'3969 (1379'2968,3988'3969] local-les=3307 n=11 ec=10 les/c 3307/3307 3306/3306/3306) [3,12] r=0 lpr=0 lcod 0'0 mlcod 0'0 inactive] enter Reset
-8 2012-12-14 09:17:50.648599 7fb6e0d6b780 20 osd.3 pg_epoch: 3996 pg[3.44b( v 3988'3969 (1379'2968,3988'3969] local-les=3307 n=11 ec=10 les/c 3307/3307 3306/3306/3306) [3,12] r=0 lpr=0 lcod 0'0 mlcod 0'0 inactive] set_last_peering_reset 3996
-7 2012-12-14 09:17:50.648609 7fb6e0d6b780 10 osd.3 4233 load_pgs loaded pg[3.44b( v 3988'3969 (1379'2968,3988'3969] local-les=3307 n=11 ec=10 les/c 3307/3307 3306/3306/3306) [3,12] r=0 lpr=3996 lcod 0'0 mlcod 0'0 inactive] log(1379'2968,3988'3969]
-6 2012-12-14 09:17:50.648649 7fb6e0d6b780 15 filestore(/ceph/osd.3/) collection_getattr /ceph/osd.3//current/0.1_head 'info'
-5 2012-12-14 09:17:50.648664 7fb6e0d6b780 10 filestore(/ceph/osd.3/) collection_getattr /ceph/osd.3//current/0.1_head 'info' = 5
-4 2012-12-14 09:17:50.648672 7fb6e0d6b780 20 osd.3 0 get_map 3316 - loading and decoding 0x2943e00
-3 2012-12-14 09:17:50.648678 7fb6e0d6b780 15 filestore(/ceph/osd.3/) read meta/a09ec88/osdmap.3316/0//-1 0~0
-2 2012-12-14 09:17:50.648705 7fb6e0d6b780 10 filestore(/ceph/osd.3/) error opening file /ceph/osd.3//current/meta/DIR_8/DIR_8/osdmap.3316__0_0A09EC88__none with flags=0 and mode=0: (2) No such file or directory
-1 2012-12-14 09:17:50.648722 7fb6e0d6b780 10 filestore(/ceph/osd.3/) FileStore::read(meta/a09ec88/osdmap.3316/0//-1) open error: (2) No such file or directory
0 2012-12-14 09:17:50.649586 7fb6e0d6b780 -1 osd/OSD.cc: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fb6e0d6b780 time 2012-12-14 09:17:50.648733
osd/OSD.cc: 4385: FAILED assert(_get_map_bl(epoch, bl))

ceph version 0.55-239-gc951c27 (c951c270a42b94b6f269992c9001d90f70a2b824)
1: (OSDService::get_map(unsigned int)+0x918) [0x607f78]
2: (OSD::load_pgs()+0x13ed) [0x6168ad]
3: (OSD::init()+0xaff) [0x617a5f]
4: (main()+0x2de6) [0x55a416]
5: (__libc_start_main()+0xfd) [0x7fb6deb06c8d]
6: /usr/bin/ceph-osd() [0x557269]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/20 journaler
0/ 5 objectcacher
0/ 5 client
0/20 osd
0/ 0 optracker
0/ 0 objclass
0/20 filestore
0/20 journal
0/ 0 ms
1/ 5 mon
0/ 0 monc
0/ 5 paxos
0/ 0 tp
0/ 0 auth
1/ 5 crypto
0/ 0 finisher
0/ 0 heartbeatmap
0/ 0 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
0/ 0 asok
0/ 0 throttle
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10
max_new 1000
log_file /var/log/ceph/ceph-osd.3.log
--- end dump of recent events ---

2012-12-14 09:17:50.714676 7fb6e0d6b780 -1 *** Caught signal (Aborted) ** in thread 7fb6e0d6b780

ceph version 0.55-239-gc951c27 (c951c270a42b94b6f269992c9001d90f70a2b824)
1: /usr/bin/ceph-osd() [0x7a1889]
2: (()+0xeff0) [0x7fb6e0750ff0]
3: (gsignal()+0x35) [0x7fb6deb1a1b5]
4: (abort()+0x180) [0x7fb6deb1cfc0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fb6df3aedc5]
6: (()+0xcb166) [0x7fb6df3ad166]
7: (()+0xcb193) [0x7fb6df3ad193]
8: (()+0xcb28e) [0x7fb6df3ad28e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x805659]
10: (OSDService::get_map(unsigned int)+0x918) [0x607f78]
11: (OSD::load_pgs()+0x13ed) [0x6168ad]
12: (OSD::init()+0xaff) [0x617a5f]
13: (main()+0x2de6) [0x55a416]
14: (__libc_start_main()+0xfd) [0x7fb6deb06c8d]
15: /usr/bin/ceph-osd() [0x557269]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
0 2012-12-14 09:17:50.714676 7fb6e0d6b780 -1 *** Caught signal (Aborted) ** in thread 7fb6e0d6b780

ceph version 0.55-239-gc951c27 (c951c270a42b94b6f269992c9001d90f70a2b824)
1: /usr/bin/ceph-osd() [0x7a1889]
2: (()+0xeff0) [0x7fb6e0750ff0]
3: (gsignal()+0x35) [0x7fb6deb1a1b5]
4: (abort()+0x180) [0x7fb6deb1cfc0]
5:
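The verbose log shows exactly which file the osd wants and cannot find, so this can be confirmed directly on disk (paths taken from the log above):

$ ls -l /ceph/osd.3/current/meta/DIR_8/DIR_8/osdmap.3316__0_0A09EC88__none
$ find /ceph/osd.3/current/meta -name 'osdmap.3316*'    # is any copy of that epoch present at all?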
Re: osd crash after reboot
On 12/14/2012 10:14 AM, Stefan Priebe wrote:
> One more IMPORTANT note. This might happen due to the fact that a disk
> was missing (disk failure) after the reboot. fstab and mountpoint are
> working with UUIDs so they match, but the journal block device:
>
> osd journal = /dev/sde1
>
> didn't match anymore - as the numbers got renumbered due to the failed
> disk. Is there a way to use some kind of UUIDs here too for the journal?

You should be able to use /dev/disk/by-uuid/* instead. That should give you a stable view of the filesystems.

Regards,
Dennis
Re: osd crash after reboot
On 12/14/2012 08:52 AM, Dennis Jacobfeuerborn wrote:
> On 12/14/2012 10:14 AM, Stefan Priebe wrote:
>> [...] Is there a way to use some kind of UUIDs here too for the
>> journal?
>
> You should be able to use /dev/disk/by-uuid/* instead. That should give
> you a stable view of the filesystems.

I often map partitions to something in /dev/disk/by-partlabel and use those in my ceph.conf files. That way disks can be remapped behind the scenes and the ceph configuration doesn't have to change even if disks get replaced.

> Regards,
> Dennis
Re: osd crash after reboot
Hello Dennis,

On 14.12.2012 15:52, Dennis Jacobfeuerborn wrote:
>> didn't match anymore - as the numbers got renumbered due to the failed
>> disk. Is there a way to use some kind of UUIDs here too for the
>> journal?
>
> You should be able to use /dev/disk/by-uuid/* instead. That should give
> you a stable view of the filesystems.

Good idea, but only partitions carrying a filesystem are listed there with UUIDs. When the journal uses the partition directly (no filesystem) it does not have a UUID. But this reminded me of /dev/disk/by-id and that works fine. I'm now using the wwn number.

Greets,
Stefan
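For reference, a minimal sketch of the by-id approach (the wwn value below is a placeholder; udev appends -partN suffixes for partitions):

$ ls -l /dev/disk/by-id/ | grep wwn

# ceph.conf
[osd.1]
    osd journal = /dev/disk/by-id/wwn-0x5000c50012345678-part1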
Re: osd crash after reboot
Hi Stefan,

Here's what I often do when I have a journal and data partition sharing a disk:

sudo parted -s -a optimal /dev/$DEV mklabel gpt
sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-journal 0% 10G
sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-data 10G 100%

Mark

On 12/14/2012 09:11 AM, Stefan Priebe - Profihost AG wrote:
> Hi Mark,
>
> but how do i set a label for a partition without a FS, like the journal
> blockdev?
>
> On 14.12.2012 16:01, Mark Nelson wrote:
>> I often map partitions to something in /dev/disk/by-partlabel and use
>> those in my ceph.conf files. [...]
>
> Greets,
> Stefan
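To close the loop on how those names get consumed (a sketch; the osd id and names are assumptions): GPT partition names show up as udev symlinks under /dev/disk/by-partlabel/, which can then go straight into ceph.conf:

$ ls /dev/disk/by-partlabel/
osd-device-0-data  osd-device-0-journal

# ceph.conf
[osd.0]
    osd journal = /dev/disk/by-partlabel/osd-device-0-journal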
Re: osd crash after reboot
Hi Mark,

On 14.12.2012 16:20, Mark Nelson wrote:
> sudo parted -s -a optimal /dev/$DEV mklabel gpt
> sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-journal 0% 10G
> sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-data 10G 100%

My disks are gpt too and i'm also using parted. But i don't want to recreate my partitions. I haven't seen a way in parted to set such a label later.

Greets,
Stefan
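For what it's worth, parted can name an existing GPT partition in place via its 'name' command, so the partitions should not need recreating (device and partition number are placeholders; verify against your parted version before running):

$ sudo parted /dev/sde name 1 osd-device-1-journal
$ ls /dev/disk/by-partlabel/    # the symlink appears once udev re-reads the disk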
Re: osd crash after reboot
Hello Mark,

On 14.12.2012 16:20, Mark Nelson wrote:
> sudo parted -s -a optimal /dev/$DEV mklabel gpt
> sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-journal 0% 10G
> sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-data 10G 100%

Isn't that the part-type you're using?

mkpart part-type start-mb end-mb

I like your idea and i think it's a good one, but i want to know why this works. part-type isn't a FS label...

Greets,
Stefan
Re: osd crash after reboot
On Fri, 14 Dec 2012, Stefan Priebe wrote:
> One more IMPORTANT note. This might happen due to the fact that a disk
> was missing (disk failure) after the reboot. fstab and mountpoint are
> working with UUIDs so they match, but the journal block device:
>
> osd journal = /dev/sde1
>
> didn't match anymore - as the numbers got renumbered due to the failed
> disk. Is there a way to use some kind of UUIDs here too for the journal?

I think others have addressed the uuid question, but one note: The ceph-osd process has an internal uuid/fingerprint on the journal and data dir, and will refuse to start if they don't match.

sage

> On 14.12.2012 09:22, Stefan Priebe wrote:
>> same log more verbose:
>> [...]
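A quick way to look at the data-dir half of that fingerprint (a sketch; default paths assumed, and the --get-journal-fsid flag may not exist in this era's binaries - treat it as an assumption):

$ cat /var/lib/ceph/osd/ceph-3/fsid
$ ceph-osd -i 3 --get-journal-fsid    # should print the same uuid if journal and data dir belong together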
Re: osd crash after reboot
Hi Sage,

this was just an idea and i need to fix MY uuid problem. But then the crash is still a problem of ceph. Have you looked into my log?

On 14.12.2012 20:42, Sage Weil wrote:
> On Fri, 14 Dec 2012, Stefan Priebe wrote:
>> [...] Is there a way to use some kind of UUIDs here too for the
>> journal?
>
> I think others have addressed the uuid question, but one note: The
> ceph-osd process has an internal uuid/fingerprint on the journal and
> data dir, and will refuse to start if they don't match.

Stefan
Re: OSD crash on 0.48.2argonaut
On 11/14/2012 11:31 PM, eric_yh_c...@wiwynn.com wrote:
> Dear All: I met this issue on one of the osd nodes. Is this a known
> issue? Thanks!
>
> ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
> [...]
> 10: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x693) [0x530f83]
> [...]

The log of the crashed osd should show which assert actually failed. It could be this bug, but I can't tell without knowing which assert was triggered:

http://tracker.newdream.net/issues/2956

Josh
OSD crash on 0.48.2argonaut
Dear All:

I met this issue on one of the osd nodes. Is this a known issue? Thanks!

ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
1: /usr/bin/ceph-osd() [0x6edaba]
2: (()+0xfcb0) [0x7f08b112dcb0]
3: (gsignal()+0x35) [0x7f08afd09445]
4: (abort()+0x17b) [0x7f08afd0cbab]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f08b065769d]
6: (()+0xb5846) [0x7f08b0655846]
7: (()+0xb5873) [0x7f08b0655873]
8: (()+0xb596e) [0x7f08b065596e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1de) [0x7a82fe]
10: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x693) [0x530f83]
11: (ReplicatedPG::repop_ack(ReplicatedPG::RepGather*, int, int, int, eversion_t)+0x159) [0x531ac9]
12: (ReplicatedPG::sub_op_modify_reply(std::tr1::shared_ptr<OpRequest>)+0x15c) [0x53251c]
13: (ReplicatedPG::do_sub_op_reply(std::tr1::shared_ptr<OpRequest>)+0x81) [0x54d241]
14: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x1e3) [0x600883]
15: (OSD::dequeue_op(PG*)+0x238) [0x5bfaf8]
16: (ThreadPool::worker()+0x4d5) [0x79f835]
17: (ThreadPool::WorkThread::entry()+0xd) [0x5d87cd]
18: (()+0x7e9a) [0x7f08b1125e9a]
19: (clone()+0x6d) [0x7f08afdc54bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Re: osd crash in ReplicatedPG::add_object_context_to_pg_stat(ReplicatedPG::ObjectContext*, pg_stat_t*)
Do you have a coredump for the crash? Can you reproduce the crash with:

debug filestore = 20
debug osd = 20

and post the logs?

As far as the incomplete pg goes, can you post the output of

ceph pg <pgid> query

where <pgid> is the pgid of the incomplete pg (e.g. 1.34)?

Thanks
-Sam

On Thu, Oct 11, 2012 at 3:17 PM, Yann Dupont <yann.dup...@univ-nantes.fr> wrote:
> Hello everybody. I'm currently having a problem with 1 of my OSDs,
> crashing with this trace:
>
> ceph version 0.52 (commit:e48859474c4944d4ff201ddc9f5fd400e8898173)
> 1: /usr/bin/ceph-osd() [0x737879]
> 2: (()+0xf030) [0x7f43f0af0030]
> 3: (ReplicatedPG::add_object_context_to_pg_stat(ReplicatedPG::ObjectContext*, pg_stat_t*)+0x292) [0x555262]
> [...]
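Formatted out, Sam's two requests look like this (pgid 1.34 is his example; the injectargs form seen earlier in these threads also works if a restart is unwanted):

# ceph.conf, then restart the osd:
[osd]
    debug osd = 20
    debug filestore = 20

# or at runtime:
$ ceph osd tell 0 injectargs '--debug-osd 20 --debug-filestore 20'

$ ceph pg 1.34 query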
osd crash in ReplicatedPG::add_object_context_to_pg_stat(ReplicatedPG::ObjectContext*, pg_stat_t*)
Hello everybody.

I'm currently having a problem with 1 of my OSDs, crashing with this trace:

ceph version 0.52 (commit:e48859474c4944d4ff201ddc9f5fd400e8898173)
1: /usr/bin/ceph-osd() [0x737879]
2: (()+0xf030) [0x7f43f0af0030]
3: (ReplicatedPG::add_object_context_to_pg_stat(ReplicatedPG::ObjectContext*, pg_stat_t*)+0x292) [0x555262]
4: (ReplicatedPG::recover_backfill(int)+0x1c1a) [0x55c93a]
5: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*)+0x26a) [0x563c1a]
6: (OSD::do_recovery(PG*)+0x39d) [0x5d3c9d]
7: (OSD::RecoveryWQ::_process(PG*)+0xd) [0x6119fd]
8: (ThreadPool::worker()+0x82b) [0x7c176b]
9: (ThreadPool::WorkThread::entry()+0xd) [0x5f609d]
10: (()+0x6b50) [0x7f43f0ae7b50]
11: (clone()+0x6d) [0x7f43ef81b78d]

Restarting gives the same message after some seconds. I've been watching the bug tracker but I don't see anything related.

Some information: kernel is 3.6.1, with standard debian packages from ceph.com. My ceph cluster was running well and stable on 6 osds since june (3 datacenters, 2 with 2 nodes, 1 with 4 nodes, a replication of 2, and adjusted weights to try to balance data evenly). Began with the then-up-to-date version, then 0.48, 49, 50, 51... Data store is on XFS.

I'm currently in the process of growing my ceph from 6 nodes to 12 nodes. 11 nodes are currently in ceph, for a 130 TB total. Declaring the new osds was OK, and the data moved quite ok. (In fact I had some OSD crashes - not definitive, the osds restart ok - maybe related to an error in my new nodes' network configuration that I discovered after. More on that later; I can find the traces, but I'm not sure it's related.)

When ceph was finally stable again, with HEALTH_OK, I decided to reweight the osds (that was tuesday). The operation went quite OK, but near the end (0.085% left), 1 of my OSDs crashed, and won't start again. More problematic, with this osd down, I have 1 incomplete PG:

ceph -s
health HEALTH_WARN 86 pgs backfill; 231 pgs degraded; 4 pgs down; 15 pgs incomplete; 4 pgs peering; 134 pgs recovering; 19 pgs stuck inactive; 455 pgs stuck unclean; recovery 2122878/23181946 degraded (9.157%); 2321/11590973 unfound (0.020%); 1 near full osd(s)
monmap e1: 3 mons at {chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0}, election epoch 20, quorum 0,1,2 chichibu,glenesk,karuizawa
osdmap e13184: 11 osds: 10 up, 10 in
pgmap v2399093: 1728 pgs: 165 active, 1270 active+clean, 8 active+recovering+degraded, 41 active+recovering+degraded+remapped+backfill, 4 down+peering, 137 active+degraded, 3 active+clean+scrubbing, 15 incomplete, 40 active+recovering, 45 active+recovering+degraded+backfill; 44119 GB data, 84824 GB used, 37643 GB / 119 TB avail; 2122878/23181946 degraded (9.157%); 2321/11590973 unfound (0.020%)
mdsmap e321: 1/1/1 up {0=karuizawa=up:active}, 2 up:standby

How is it possible, as I have a replication of 2? Is it a known problem?

Cheers,
OSD-crash on 0.48.1argonaut, error void ReplicatedPG::recover_got(hobject_t, eversion_t) not seen on list
Hi all,

after adding a new node into our ceph-cluster yesterday, we had a crash of one OSD. I have found this kind of message in the bugtracker as being solved (http://tracker.newdream.net/issues/2075); I will update that one for my convenience and attach the according log (as this is a production site, no more verbose debug is available, sorry).

Other than that, everything went almost smoothly, except the annoying slow requests, which are hopefully not only fixed in 0.51, ... when do we expect the next stable, btw?

The replication was fast, due to an SSD-cached LSI controller, 4 OSDs per node, one per HDD; 1Gbit was completely saturated, time for the next step towards 10Gbit ;)

Regards,

Oliver.

--
Oliver Francke
filoo GmbH
Moltkestraße 25a, Gütersloh
HRB4355 AG Gütersloh
Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz
Follow us on Twitter: http://twitter.com/filoogmbh
Re: OSD crash
Hi,

Almost always one or more osds die when doing overlapped recovery - e.g. adding a new crushmap and then removing some newly added osds from the cluster a few minutes later, during the remap, or injecting two slightly different crushmaps within a short time (while preserving at least one of the replicas online, of course). It seems the osds are dying from an excessive amount of operations in the queue: under a normal test, e.g. rados, iowait does not break the one percent barrier, but during recovery it may rise up to ten percent (2108 w/ cache, disks split as R0 each).

#0 0x7f62f193a445 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x7f62f193db9b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x7f62f2236665 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x7f62f2234796 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x7f62f22347c3 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x7f62f22349ee in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00844e11 in ceph::__ceph_assert_fail(char const*, char const*, int, char const*) ()
#7 0x0073148f in FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int) ()
#8 0x0073484e in FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long) ()
#9 0x0070c680 in FileStore::_do_op(FileStore::OpSequencer*) ()
#10 0x0083ce01 in ThreadPool::worker() ()
#11 0x006823ed in ThreadPool::WorkThread::entry() ()
#12 0x7f62f345ee9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x7f62f19f64cd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#14 0x in ?? ()

ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)

On Sun, Aug 26, 2012 at 8:52 PM, Andrey Korolyov and...@xdel.ru wrote:
> During recovery, the following crash happens (similar to http://tracker.newdream.net/issues/2126, which was marked resolved long ago): http://xdel.ru/downloads/ceph-log/osd-2012-08-26.txt
>
> On Sat, Aug 25, 2012 at 12:30 PM, Andrey Korolyov and...@xdel.ru wrote:
>> On Thu, Aug 23, 2012 at 4:09 AM, Gregory Farnum g...@inktank.com wrote:
>>> The tcmalloc backtrace on the OSD suggests this may be unrelated, but what's the fd limit on your monitor process? You may be approaching that limit if you've got 500 OSDs and a similar number of clients.
>>
>> Thanks! I didn't measure the number of connections, having assumed one connection per client; raising the limit did the trick. The previously mentioned qemu-kvm zombie is not related to rbd itself - it can be produced by destroying a libvirt domain that is in the saving state, or vice versa, so I'll put a workaround on this. Right now I am facing a different problem - osds dying silently, i.e. not leaving a core; I'll check the logs in the next testing phase.
>>
>> [the quoted Aug 22-23 exchange and the tcmalloc osd backtrace are trimmed here - they appear in full further down the thread]
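For context, the crushmap injection being described follows this cycle - a sketch; the file names are placeholders:

$ ceph osd getcrushmap -o crush.bin      # fetch the compiled map
$ crushtool -d crush.bin -o crush.txt    # decompile to text
$ vi crush.txt                           # edit buckets/weights
$ crushtool -c crush.txt -o crush.new    # recompile
$ ceph osd setcrushmap -i crush.new      # inject; every affected PG starts remapping

Injecting a second, slightly different map before the remap from the first one settles is exactly the overlapped recovery described above.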
Re: OSD crash
On Tue, 4 Sep 2012, Andrey Korolyov wrote:
> Hi,
> Almost always one or more osds die when doing overlapped recovery [...] It seems the osds are dying from an excessive amount of operations in the queue [...]
>
> #0 0x7f62f193a445 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> [...]
> #6 0x00844e11 in ceph::__ceph_assert_fail(char const*, char const*, int, char const*) ()
> #7 0x0073148f in FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int) ()

Can you install debug symbols to see what line number this is (e.g. apt-get install ceph-dbg), or check in the log file to see what the assert failure is?

Thanks!
sage

> [remainder of the backtrace and the earlier quoted messages trimmed - they appear in full in the previous and following messages of this thread]
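What Sage is asking for, as a sketch under Debian (the package name is real; the core path is a placeholder, and the stripped /usr/bin/ceph-osd may not resolve symbols on its own, in which case the debug binary under /usr/lib/debug is the one to point the tools at):

$ apt-get install ceph-dbg
$ gdb /usr/bin/ceph-osd /path/to/core
(gdb) bt          # frames now show source file and line
$ addr2line -e /usr/lib/debug/usr/bin/ceph-osd 0x73148f    # resolve a single address from a logged backtrace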
Re: OSD crash
On Thu, 23 Aug 2012, Andrey Korolyov wrote:
> Hi,
> today during a heavy test a pair of osds and one mon died, resulting in a hard lockup of some kvm processes - they became unresponsive and were killed, leaving zombie processes ([kvm] <defunct>). The entire cluster contains sixteen osds on eight nodes and three mons, on the first and last nodes and on a vm outside the cluster.
>
> osd bt:
> #0 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4
> (gdb) bt
> #0 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4
> #1 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from /usr/lib/libtcmalloc.so.4
> #2 0x7fc37d4a2287 in tc_delete () from /usr/lib/libtcmalloc.so.4
> #3 0x008b1224 in _M_dispose (__a=..., this=0x6266d80) at /usr/include/c++/4.7/bits/basic_string.h:246
> #4 ~basic_string (this=0x7fc3736639d0, __in_chrg=<optimized out>) at /usr/include/c++/4.7/bits/basic_string.h:536
> #5 ~basic_stringbuf (this=0x7fc373663988, __in_chrg=<optimized out>) at /usr/include/c++/4.7/sstream:60
> #6 ~basic_ostringstream (this=0x7fc373663980, __in_chrg=<optimized out>, __vtt_parm=<optimized out>) at /usr/include/c++/4.7/sstream:439
> #7 pretty_version_to_str () at common/version.cc:40
> #8 0x00791630 in ceph::BackTrace::print (this=0x7fc373663d10, out=...) at common/BackTrace.cc:19
> #9 0x0078f450 in handle_fatal_signal (signum=11) at global/signal_handler.cc:91
> #10 <signal handler called>
> #11 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4
> #12 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from /usr/lib/libtcmalloc.so.4
> #13 0x7fc37d49eb97 in tc_free () from /usr/lib/libtcmalloc.so.4
> #14 0x7fc37d1c6670 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #15 0x7fc37d1c4796 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #16 0x7fc37d1c47c3 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #17 0x7fc37d1c49ee in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #18 0x00844e11 in ceph::__ceph_assert_fail (assertion=0x90c01c "0 == \"unexpected error\"", file=<optimized out>, line=3007, func=0x90ef80 "unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)") at common/assert.cc:77

This means it got an unexpected error when talking to the file system. If you look in the osd log, it may tell you what that was. (It may not--there isn't usually the other tcmalloc stuff triggered from the assert handler.)

What happens if you restart that ceph-osd daemon?

sage

> #19 0x0073148f in FileStore::_do_transaction (this=this@entry=0x2cde000, t=..., op_seq=op_seq@entry=429545, trans_num=trans_num@entry=0) at os/FileStore.cc:3007
> #20 0x0073484e in FileStore::do_transactions (this=0x2cde000, tls=..., op_seq=429545) at os/FileStore.cc:2436
> #21 0x0070c680 in FileStore::_do_op (this=0x2cde000, osr=<optimized out>) at os/FileStore.cc:2259
> #22 0x0083ce01 in ThreadPool::worker (this=0x2cde828) at common/WorkQueue.cc:54
> #23 0x006823ed in ThreadPool::WorkThread::entry (this=<optimized out>) at ./common/WorkQueue.h:126
> #24 0x7fc37e3eee9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
> #25 0x7fc37c9864cd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #26 0x in ?? ()
>
> mon bt was exactly the same as in http://tracker.newdream.net/issues/2762
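A sketch of where to look for that underlying error, assuming the default Debian log location; the assert banner in the log normally reads 'FAILED assert(0 == "unexpected error")' and the failing syscall's errno is usually logged just above it:

$ grep -n -B 20 'FAILED assert' /var/log/ceph/osd.3.log | egrep -i 'error|errno|enospc|eio'
$ dmesg | tail -n 50     # btrfs/XFS and block-layer errors surface here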
Re: OSD crash
On Thu, Aug 23, 2012 at 2:33 AM, Sage Weil s...@inktank.com wrote:
> On Thu, 23 Aug 2012, Andrey Korolyov wrote:
>> Hi,
>> today during a heavy test a pair of osds and one mon died, resulting in a hard lockup of some kvm processes - they became unresponsive and were killed, leaving zombie processes ([kvm] <defunct>). The entire cluster contains sixteen osds on eight nodes and three mons, on the first and last nodes and on a vm outside the cluster.
>> [osd backtrace trimmed - quoted in full in the previous message]
>
> This means it got an unexpected error when talking to the file system. If you look in the osd log, it may tell you what that was. (It may not--there isn't usually the other tcmalloc stuff triggered from the assert handler.)
>
> What happens if you restart that ceph-osd daemon?
> sage

Unfortunately I had completely disabled logs during the test, so there is no hint about the assert failure. The main problem has been revealed - the created VMs were pointed at one monitor instead of the set of three, so there may be some unusual effects (btw, the crashed mon isn't one of those above, but a neighbor of the crashed osds on the first node).

After an IPMI reset the node came back fine and the cluster behavior seems to be okay - the stuck kvm I/O somehow prevented even module load/unload on this node, so I finally decided to do a hard reset. Although I'm running almost generic wheezy, glibc was updated to 2.15; maybe that is why my trace appeared for the first time ever. I'm almost sure the fs did not trigger this crash, and I mainly suspect the stuck kvm processes.

I'll rerun the test under the same conditions tomorrow (~500 vms pointed at one mon and very high I/O, but with osd logging).

> [quoted frames #19-#26 and the mon backtrace reference trimmed - see the previous message]
Re: OSD crash
The tcmalloc backtrace on the OSD suggests this may be unrelated, but what's the fd limit on your monitor process? You may be approaching that limit if you've got 500 OSDs and a similar number of clients.

On Wed, Aug 22, 2012 at 6:55 PM, Andrey Korolyov and...@xdel.ru wrote:
> On Thu, Aug 23, 2012 at 2:33 AM, Sage Weil s...@inktank.com wrote:
>> [quoted crash report, osd backtrace and replies trimmed - they appear in full in the previous messages]
>
> Unfortunately I had completely disabled logs during the test, so there is no hint about the assert failure. The main problem has been revealed - the created VMs were pointed at one monitor instead of the set of three [...]
>
> I'll rerun the test under the same conditions tomorrow (~500 vms pointed at one mon and very high I/O, but with osd logging).
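To check Greg's fd-limit theory, a sketch (the pid lookup and the limit value are examples, not from the thread):

$ cat /proc/$(pidof ceph-mon)/limits | grep 'open files'   # the ceiling
$ ls /proc/$(pidof ceph-mon)/fd | wc -l                    # descriptors actually in use

The ceiling can be raised at daemon start from ceph.conf:

[global]
        max open files = 131072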
Re: domino-style OSD crash
Le 09/07/2012 19:14, Samuel Just a écrit :
> Can you restart the node that failed to complete the upgrade with

Well, it's a little bit complicated; I now run those nodes with XFS, and I have long-running jobs on them right now, so I can't stop the ceph cluster at the moment. As I've kept the original broken btrfs volumes, I tried this morning to run the old osds in parallel, using the $cluster variable. I only have partial success: I tried using different ports for the mons, but ceph wants to use the old mon map. I can edit it (epoch 1), but it seems to use 'latest' instead, the format isn't compatible with monmaptool, and I don't know how to inject the modified map into a non-running cluster. Anyway, the osd seems to start fine, and I can reproduce the bug:

> debug filestore = 20
> debug osd = 20

I've put it in [global], is that sufficient?

> and post the log after an hour or so of running? The upgrade process might legitimately take a while.
> -Sam

Only 15 minutes in, but ceph-osd is consuming lots of cpu, and strace shows lots of pread. Here is the log:

[..]
2012-07-10 11:33:29.560052 7f3e615ac780 0 filestore(/CEPH-PROD/data/osd.1) mount syncfs(2) syscall not support by glibc
2012-07-10 11:33:29.560062 7f3e615ac780 0 filestore(/CEPH-PROD/data/osd.1) mount no syncfs(2), but the btrfs SYNC ioctl will suffice
2012-07-10 11:33:29.560172 7f3e615ac780 -1 filestore(/CEPH-PROD/data/osd.1) FileStore::mount : stale version stamp detected: 2. Proceeding, do_update is set, performing disk format upgrade.
2012-07-10 11:33:29.560233 7f3e615ac780 0 filestore(/CEPH-PROD/data/osd.1) mount found snaps 3744666,3746725
2012-07-10 11:33:29.560263 7f3e615ac780 10 filestore(/CEPH-PROD/data/osd.1) current/ seq was 3746725
2012-07-10 11:33:29.560267 7f3e615ac780 10 filestore(/CEPH-PROD/data/osd.1) most recent snap from 3744666,3746725 is 3746725
2012-07-10 11:33:29.560280 7f3e615ac780 10 filestore(/CEPH-PROD/data/osd.1) mount rolling back to consistent snap 3746725
2012-07-10 11:33:29.839281 7f3e615ac780 5 filestore(/CEPH-PROD/data/osd.1) mount op_seq is 3746725

... and nothing more. I'll let it run for 3 hours. If I get another message, I'll let you know.

Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr
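For reference, Sam's requested logging as it would sit in ceph.conf; [global] works, [osd] is the narrower scope:

[osd]
        debug osd = 20
        debug filestore = 20

It can also be injected into a running daemon without a restart - a sketch, assuming argonaut-era syntax and osd id 1 as a placeholder:

$ ceph osd tell 1 injectargs '--debug-osd 20 --debug-filestore 20'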
Re: domino-style OSD crash
On Tue, Jul 10, 2012 at 2:46 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
> As I've kept the original broken btrfs volumes, I tried this morning to run the old osds in parallel, using the $cluster variable. I only have partial success.

The cluster mechanism was never intended for moving existing osds to other clusters. Trying that might not be a good idea.
Re: domino-style OSD crash
Le 10/07/2012 17:56, Tommi Virtanen a écrit :
> On Tue, Jul 10, 2012 at 2:46 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
>> As I've kept the original broken btrfs volumes, I tried this morning to run the old osds in parallel, using the $cluster variable. I only have partial success.
> The cluster mechanism was never intended for moving existing osds to other clusters. Trying that might not be a good idea.

Ok, good to know. I saw that the leftover maps could lead to problems, but in two words, what are the other associated risks? Basically, if I use two distinct config files, with different non-overlapping paths, and different ports for OSD, MDS and MON, do we basically have two distinct and independent instances?

By the way, is using two mon instances with different ports supported?

Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr
Re: domino-style OSD crash
On Tue, Jul 10, 2012 at 9:39 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
>> The cluster mechanism was never intended for moving existing osds to other clusters. Trying that might not be a good idea.
> Ok, good to know. I saw that the leftover maps could lead to problems, but in two words, what are the other associated risks? Basically, if I use two distinct config files, with different non-overlapping paths, and different ports for OSD, MDS and MON, do we basically have two distinct and independent instances?

Fundamentally, it comes down to this: the two clusters will still have the same fsid, and you won't be isolated from configuration errors or leftover state (such as the monmap) in any way. There's a high chance that your let's-poke-around-and-debug cluster wrecks your healthy cluster.

> By the way, is using two mon instances with different ports supported?

Monitors are identified by ip:port. You can have multiple bind to the same IP address, as long as they get separate ports. Naturally, this practically means giving up on high availability.
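A sketch of what the second-cluster mon could look like; the config path, data path and port 6790 are assumptions, only the same-IP/different-port rule comes from the thread:

# /etc/ceph/old.conf
[mon.a]
        host = chichibu
        mon addr = 172.20.14.130:6790     # 6789 stays with the production mon
        mon data = /CEPH-PROD/mon.$id

$ ceph-mon -c /etc/ceph/old.conf -i a
$ ceph -c /etc/ceph/old.conf -s           # talk to the old cluster explicitly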
Re: domino-style OSD crash
Le 10/07/2012 19:11, Tommi Virtanen a écrit :
> Fundamentally, it comes down to this: the two clusters will still have the same fsid, and you won't be isolated from configuration errors or

Ah, I understand. This is not the case; see:

root@chichibu:~# cat /CEPH/data/osd.0/fsid
f00139fe-478e-4c50-80e2-f7cb359100d4
root@chichibu:~# cat /CEPH-PROD/data/osd.0/fsid
43afd025-330e-4aa8-9324-3e9b0afce794

(/CEPH-PROD is the old btrfs volume; /CEPH is the new xfs volume, completely redone and reformatted with mkcephfs.) The volumes are totally independent; if you want the gory details:

root@chichibu:~# lvs
LV         VG            Attr   LSize   Origin Snap% Move Log Copy% Convert
ceph-osd   LocalDisk     -wi-a- 225,00g
mon-btrfs  LocalDisk     -wi-ao  10,00g
mon-xfs    LocalDisk     -wi-ao  10,00g
data       ceph-chichibu -wi-ao   5,00t   <- OLD btrfs, mounted on /CEPH-PROD
datax      ceph-chichibu -wi-ao   4,50t   <- NEW xfs, mounted on /CEPH

> leftover state (such as the monmap) in any way. There's a high chance that your let's-poke-around-and-debug cluster wrecks your healthy cluster.

Yes, I understand the risk.

>> By the way, is using two mon instances with different ports supported?
> Monitors are identified by ip:port. You can have multiple bind to the same IP address, as long as they get separate ports. Naturally, this practically means giving up on high availability.

The idea is not just having 2 mons. I'll still use 3 different machines for the mons, but with 2 mon instances on each: one for the current ceph, the other for the old ceph. 2x3 mons.

Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr
Re: domino-style OSD crash
On Tue, Jul 10, 2012 at 10:36 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
>> Fundamentally, it comes down to this: the two clusters will still have the same fsid, and you won't be isolated from configuration errors or
> (/CEPH-PROD is the old btrfs volume; /CEPH is the new xfs volume, completely redone and reformatted with mkcephfs.) The volumes are totally independent.

Ahh, you re-created the monitors too. That changes things: then you have a new random fsid. I understood you to have only re-mkfsed the osds.

Doing it like that, your real worry is just the remembered state of monmaps, osdmaps etc. If the daemons accidentally talk to the wrong cluster, the fsid *should* protect you from damage; they should get rejected. Similarly, if you use cephx authentication, the keys won't match either.

>> Naturally, this practically means giving up on high availability.
> The idea is not just having 2 mons. I'll still use 3 different machines for the mons, but with 2 mon instances on each: one for the current ceph, the other for the old ceph. 2x3 mons.

That should be perfectly doable.
Re: domino-style OSD crash
Can you restart the node that failed to complete the upgrade with

debug filestore = 20
debug osd = 20

and post the log after an hour or so of running? The upgrade process might legitimately take a while.
-Sam

On Sat, Jul 7, 2012 at 1:19 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
> Le 06/07/2012 19:01, Gregory Farnum a écrit :
> [quoted exchange trimmed - the identical message appears in full further down this thread: disk usage below 50% on 5 TB btrfs volumes with lzo compression, four independent raid arrays in four locations, and no correlation between the failed nodes and the arrays]
Re: domino-style OSD crash
On Wed, Jul 4, 2012 at 1:06 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
> Well, I probably wasn't clear enough. I talked about a crashed FS, but I was talking about ceph. The underlying FS (btrfs in that case) of 1 node (and only one) has PROBABLY crashed in the past, causing corruption in the ceph data on this node, and then the subsequent crash of other nodes. RIGHT now btrfs on this node is OK. I can access the filesystem without errors.

But the LevelDB isn't. Its contents got corrupted, somehow, somewhere, and it really is up to the LevelDB library to tolerate those errors; we have a simple get/put interface we use, and LevelDB is triggering an internal error.

> One node had a problem with btrfs, leading first to kernel problems, probably corruption (on disk / in memory maybe?), and ultimately to a kernel oops. Before that ultimate kernel oops, bad data was transmitted to other (sane) nodes, leading to ceph-osd crashes on those nodes.

The LevelDB binary contents are not transferred over to other nodes; this kind of corruption would not spread over the Ceph clustering mechanisms. It's more likely that you have 4 independently corrupted LevelDBs. Something in the workload Ceph runs makes that corruption quite likely.

The information here isn't enough to say whether the cause of the corruption is btrfs or LevelDB, but the recovery needs to be handled by LevelDB -- and upstream is working on making it more robust: http://code.google.com/p/leveldb/issues/detail?id=97
Re: domino-style OSD crash
Le 09/07/2012 19:43, Tommi Virtanen a écrit :
> On Wed, Jul 4, 2012 at 1:06 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
>> Well, I probably wasn't clear enough. I talked about a crashed FS, but I was talking about ceph. The underlying FS (btrfs in that case) of 1 node (and only one) has PROBABLY crashed in the past, causing corruption in the ceph data on this node, and then the subsequent crash of other nodes. RIGHT now btrfs on this node is OK. I can access the filesystem without errors.
> But the LevelDB isn't. Its contents got corrupted, somehow, somewhere, and it really is up to the LevelDB library to tolerate those errors; we have a simple get/put interface we use, and LevelDB is triggering an internal error.

Yes, understood.

>> One node had a problem with btrfs, leading first to kernel problems, probably corruption (on disk / in memory maybe?), and ultimately to a kernel oops. Before that ultimate kernel oops, bad data was transmitted to other (sane) nodes, leading to ceph-osd crashes on those nodes.
> The LevelDB binary contents are not transferred over to other nodes;

Ok, thanks for the clarification.

> this kind of corruption would not spread over the Ceph clustering mechanisms. It's more likely that you have 4 independently corrupted LevelDBs. Something in the workload Ceph runs makes that corruption quite likely.

Very likely: since I reformatted my nodes with XFS I haven't had problems so far.

> The information here isn't enough to say whether the cause of the corruption is btrfs or LevelDB, but the recovery needs to be handled by LevelDB -- and upstream is working on making it more robust: http://code.google.com/p/leveldb/issues/detail?id=97

Yes, I saw this. It's very important. Sometimes, s... happens. Given the size ceph volumes can reach, having a tool to restart damaged nodes (for whatever reason) is a must.

Thanks for the time you took to answer. It's much clearer for me now.

Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr
Re: domino-style OSD crash
On Mon, Jul 9, 2012 at 12:05 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
>> The information here isn't enough to say whether the cause of the corruption is btrfs or LevelDB, but the recovery needs to be handled by LevelDB -- and upstream is working on making it more robust: http://code.google.com/p/leveldb/issues/detail?id=97
> Yes, I saw this. It's very important. Sometimes, s... happens. Given the size ceph volumes can reach, having a tool to restart damaged nodes (for whatever reason) is a must. Thanks for the time you took to answer. It's much clearer for me now.

If it doesn't recover, you re-format the disk and thereby throw away the contents. Not really all that different from handling hardware failure. That's why we have replication.
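The throw-it-away path for a single osd, as a hedged sketch with argonaut-era commands (osd id 1, the mount point and the device are placeholders):

$ ceph osd lost 1 --yes-i-really-mean-it     # declare its data gone so peering can proceed
$ umount /CEPH/data/osd.1
$ mkfs.xfs -f /dev/<device> && mount /dev/<device> /CEPH/data/osd.1
$ ceph-osd -i 1 --mkfs --mkjournal           # re-initialize the object store
$ service ceph start osd.1                   # replication backfills it from the surviving copies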
Re: domino-style OSD crash
Le 06/07/2012 19:01, Gregory Farnum a écrit :
> On Fri, Jul 6, 2012 at 12:19 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
>> Le 05/07/2012 23:32, Gregory Farnum a écrit : [...]
>> ok, so as all nodes were identical, I probably hit a btrfs bug (like an erroneous out-of-space) at more or less the same time. And when 1 osd was out, OH, I didn't finish the sentence... When 1 osd was out, the missing data was copied onto other nodes, probably accelerating the btrfs problem on those nodes (I suspect erroneous out-of-space conditions)
> Ah. How full are/were the disks?

The OSD nodes were below 50% (all are 5 TB volumes):

osd.0 : 31%
osd.1 : 31%
osd.2 : 39%
osd.3 : 65%
no osd.4 :)
osd.5 : 35%
osd.6 : 60%
osd.7 : 42%
osd.8 : 34%

All the volumes were using btrfs with lzo compression.

[...]

> Oh, interesting. Are the broken nodes all on the same set of arrays?
>> No. There are 4 completely independent raid arrays, in 4 different locations. They are similar (same brand and model, but slightly different disks, and 1 different firmware); all arrays are multipathed. I don't think the raid array is the problem. We've used those particular models for 2-3 years, and in the logs I don't see any problem that could be caused by the storage itself (like scsi or multipath errors)
> I must have misunderstood then. What did you mean by 1 Array for 2 OSD nodes?

I have 8 osd nodes, in 4 different locations (several km away). In each location I have 2 nodes and 1 raid array. In each location, the raid array has 16 2 TB disks and 2 controllers with 4x 8 Gb FC channels each. The 16 disks are organized in RAID 5 (8 disks for one set, 7 disks for the other). Each raid set is primarily attached to 1 controller, and each osd node in the location has access to the controller via 2 distinct paths.

There was no correlation between the failed nodes and the raid arrays.

Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr
Re: domino-style OSD crash
Le 05/07/2012 23:32, Gregory Farnum a écrit : [...]

>> ok, so as all nodes were identical, I probably hit a btrfs bug (like an erroneous out-of-space) at more or less the same time. And when 1 osd was out,

OH, I didn't finish the sentence... When 1 osd was out, the missing data was copied onto other nodes, probably accelerating the btrfs problem on those nodes (I suspect erroneous out-of-space conditions).

I've reformatted the OSDs with xfs. Performance is slightly worse for the moment (well, it depends on the workload, and maybe the lack of syncfs is to blame), but at least I hope to have the storage layer rock-solid. BTW, I've managed to keep the faulty btrfs volumes.

[...]

> I wonder if maybe there's a confounding factor here — are all your nodes similar to each other,

Yes. I designed the cluster that way. All nodes are identical hardware (PowerEdge M610, 10G Intel ethernet + Emulex fibre channel attached to storage; 1 array for 2 OSD nodes, 1 controller dedicated to each OSD).

> Oh, interesting. Are the broken nodes all on the same set of arrays?

No. There are 4 completely independent raid arrays, in 4 different locations. They are similar (same brand and model, but slightly different disks, and 1 different firmware); all arrays are multipathed. I don't think the raid array is the problem. We've used those particular models for 2-3 years, and in the logs I don't see any problem that could be caused by the storage itself (like scsi or multipath errors).

Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr
Re: domino-style OSD crash
On Fri, Jul 6, 2012 at 12:19 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
> Le 05/07/2012 23:32, Gregory Farnum a écrit : [...]
>> ok, so as all nodes were identical, I probably hit a btrfs bug (like an erroneous out-of-space) at more or less the same time. And when 1 osd was out,
> OH, I didn't finish the sentence... When 1 osd was out, the missing data was copied onto other nodes, probably accelerating the btrfs problem on those nodes (I suspect erroneous out-of-space conditions)

Ah. How full are/were the disks?

> I've reformatted the OSDs with xfs. Performance is slightly worse for the moment (well, it depends on the workload, and maybe the lack of syncfs is to blame), but at least I hope to have the storage layer rock-solid. BTW, I've managed to keep the faulty btrfs volumes.
> [...]
>> I wonder if maybe there's a confounding factor here — are all your nodes similar to each other,
> Yes. I designed the cluster that way. All nodes are identical hardware (PowerEdge M610, 10G Intel ethernet + Emulex fibre channel attached to storage; 1 array for 2 OSD nodes, 1 controller dedicated to each OSD)
>> Oh, interesting. Are the broken nodes all on the same set of arrays?
> No. There are 4 completely independent raid arrays, in 4 different locations. They are similar (same brand and model, but slightly different disks, and 1 different firmware); all arrays are multipathed. I don't think the raid array is the problem. We've used those particular models for 2-3 years, and in the logs I don't see any problem that could be caused by the storage itself (like scsi or multipath errors)

I must have misunderstood then. What did you mean by 1 Array for 2 OSD nodes?
Re: domino-style OSD crash
On Wed, Jul 4, 2012 at 10:53 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
> Le 04/07/2012 18:21, Gregory Farnum a écrit :
> [earlier quoted exchange trimmed - it appears in full in the messages below]
>> I wonder if maybe there's a confounding factor here — are all your nodes similar to each other,
> Yes. I designed the cluster that way. All nodes are identical hardware (PowerEdge M610, 10G Intel ethernet + Emulex fibre channel attached to storage; 1 array for 2 OSD nodes, 1 controller dedicated to each OSD)

Oh, interesting. Are the broken nodes all on the same set of arrays?

>> or are they running on different kinds of hardware? How did you do your Ceph upgrades? What's ceph -s display when the cluster is running as best it can?
> Ceph was running 0.47.2 at that time (the debian package for ceph). After the crash I couldn't restart all the nodes. Tried 0.47.3 and now 0.48, without success. Nothing particular for the upgrades: ceph is broken for the moment, so just apt-get upgrade with the new version. ceph -s shows:
> [ceph -s output trimmed - quoted in full in the message below]
Re: domino-style OSD crash
Le 03/07/2012 23:38, Tommi Virtanen a écrit :
> On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
>> In the case I could repair it, do you think the crashed FS as it is right now would be valuable for you, for future reference, as I saw you can't reproduce the problem? I can make an archive (or a btrfs dump?), but it will be quite big.
> At this point, it's more about the upstream developers (of btrfs etc) than us; we're on good terms with them but not experts on the on-disk format(s). You might want to send an email to the relevant mailing lists before wiping the disks.

Well, I probably wasn't clear enough. I talked about a crashed FS, but I was talking about ceph. The underlying FS (btrfs in that case) of 1 node (and only one) has PROBABLY crashed in the past, causing corruption in the ceph data on this node, and then the subsequent crash of other nodes. RIGHT now btrfs on this node is OK; I can access the filesystem without errors.

For the moment, out of 8 nodes, 4 refuse to restart. One of the 4 was the crashed node; the 3 others didn't have problems with the underlying fs as far as I can tell. So I think the scenario is: one node had a problem with btrfs, leading first to kernel problems, probably corruption (on disk / in memory maybe?), and ultimately to a kernel oops. Before that ultimate kernel oops, bad data was transmitted to the other (sane) nodes, leading to ceph-osd crashes on those nodes.

If you think this scenario is highly improbable in real life (that is, btrfs will probably be fixed for good, and then corruption can't happen), that's ok. But I wonder whether this scenario can be triggered by other problems, with bad data transmitted to other sane nodes (power outage, out-of-memory conditions, disk full... for example). That's why I offered you a crashed ceph volume image (I shouldn't have talked about a crashed fs, sorry for the confusion).

Talking about btrfs, there are a lot of fixes in btrfs between 3.4 and 3.5rc. After the crash, I couldn't mount the btrfs volume; with 3.5rc I can, and there is no sign of problems on it. That doesn't mean the data is safe there, but I think it's a sign that at least some bugs have been corrected in the btrfs code.

Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr
Re: domino-style OSD crash
On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:
> Le 03/07/2012 23:38, Tommi Virtanen a écrit :
> [quoted exchange about archiving the broken volumes trimmed - see the previous message]
>
> Well, I probably wasn't clear enough. I talked about a crashed FS, but I was talking about ceph. The underlying FS (btrfs in that case) of 1 node (and only one) has PROBABLY crashed in the past, causing corruption in the ceph data on this node, and then the subsequent crash of other nodes. RIGHT now btrfs on this node is OK; I can access the filesystem without errors.
>
> For the moment, out of 8 nodes, 4 refuse to restart. [...] So I think the scenario is: one node had a problem with btrfs, leading first to kernel problems, probably corruption (on disk / in memory maybe?), and ultimately to a kernel oops. Before that ultimate kernel oops, bad data was transmitted to the other (sane) nodes, leading to ceph-osd crashes on those nodes.

I don't think that's actually possible — the OSDs all do quite a lot of interpretation between what they get off the wire and what goes on disk. What you've got here are 4 corrupted LevelDB databases, and we pretty much can't do that through the interfaces we have. :/

> If you think this scenario is highly improbable in real life [...] That's why I offered you a crashed ceph volume image (I shouldn't have talked about a crashed fs, sorry for the confusion).

I appreciate the offer, but I don't think this will help much — it's a disk state managed by somebody else, not our logical state, which has broken. If we could figure out how that state got broken that'd be good, but a ceph image won't really help in doing so.

I wonder if maybe there's a confounding factor here — are all your nodes similar to each other, or are they running on different kinds of hardware? How did you do your Ceph upgrades? What's ceph -s display when the cluster is running as best it can?
-Greg
Re: domino-style OSD crash
Le 04/07/2012 18:21, Gregory Farnum a écrit : On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote: Le 03/07/2012 23:38, Tommi Virtanen a écrit : On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr (mailto:yann.dup...@univ-nantes.fr) wrote: In the case I could repair, do you think a crashed FS as it is right now is valuable for you, for future reference , as I saw you can't reproduce the problem ? I can make an archive (or a btrfs dump ?), but it will be quite big. At this point, it's more about the upstream developers (of btrfs etc) than us; we're on good terms with them but not experts on the on-disk format(s). You might want to send an email to the relevant mailing lists before wiping the disks. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org (mailto:majord...@vger.kernel.org) More majordomo info at http://vger.kernel.org/majordomo-info.html Well, I probably wasn't clear enough. I talked about crashed FS, but i was talking about ceph. The underlying FS (btrfs in that case) of 1 node (and only one) has PROBABLY crashed in the past, causing corruption in ceph data on this node, and then the subsequent crash of other nodes. RIGHT now btrfs on this node is OK. I can access the filesystem without errors. For the moment, on 8 nodes, 4 refuse to restart . 1 of the 4 nodes was the crashed node , the 3 others didn't had broblem with the underlying fs as far as I can tell. So I think the scenario is : One node had problem with btrfs, leading first to kernel problem , probably corruption (in disk/ in memory maybe ?) ,and ultimately to a kernel oops. Before that ultimate kernel oops, bad data has been transmitted to other (sane) nodes, leading to ceph-osd crash on thoses nodes. I don't think that's actually possible — the OSDs all do quite a lot of interpretation between what they get off the wire and what goes on disk. What you've got here are 4 corrupted LevelDB databases, and we pretty much can't do that through the interfaces we have. :/ ok, so as all nodes were identical, I probably have hit a btrfs bug (like a erroneous out of space ) in more or less the same time. And when 1 osd was out, If you think this scenario is highly improbable in real life (that is, btrfs will probably be fixed for good, and then, corruption can't happen), it's ok. But I wonder if this scenario can be triggered with other problem, and bad data can be transmitted to other sane nodes (power outage, out of memory condition, disk full... for example) That's why I proposed you a crashed ceph volume image (I shouldn't have talked about a crashed fs, sorry for the confusion) I appreciate the offer, but I don't think this will help much — it's a disk state managed by somebody else, not our logical state, which has broken. If we could figure out how that state got broken that'd be good, but a ceph image won't really help in doing so. ok, no problem. I'll restart from scratch, freshly formated. I wonder if maybe there's a confounding factor here — are all your nodes similar to each other, Yes. I designed the cluster that way. All nodes are identical hardware (powerEdge M610, 10G intel ethernet + emulex fibre channel attached to storage (1 Array for 2 OSD nodes, 1 controller dedicated for each OSD) or are they running on different kinds of hardware? How did you do your Ceph upgrades? What's ceph -s display when the cluster is running as best it can? Ceph was running 0.47.2 at that time - (debian package for ceph). 
After the crash I couldn't restart all the nodes. I tried 0.47.3 and now 0.48 without success. Nothing particular for upgrades; because ceph is broken for the moment, it was just apt-get upgrade with the new version. ceph -s shows this:

root@label5:~# ceph -s
   health HEALTH_WARN 260 pgs degraded; 793 pgs down; 785 pgs peering; 32 pgs recovering; 96 pgs stale; 793 pgs stuck inactive; 96 pgs stuck stale; 1092 pgs stuck unclean; recovery 267286/2491140 degraded (10.729%); 1814/1245570 unfound (0.146%)
   monmap e1: 3 mons at {chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0}, election epoch 12, quorum 0,1,2 chichibu,glenesk,karuizawa
   osdmap e2404: 8 osds: 3 up, 3 in
   pgmap v173701: 1728 pgs: 604 active+clean, 8 down, 5 active+recovering+remapped, 32 active+clean+replay, 11 active+recovering+degraded, 25 active+remapped, 710 down+peering, 222 active+degraded, 7 stale+active+recovering+degraded, 61 stale+down+peering, 20 stale+active+degraded, 6 down+remapped+peering, 8 stale+down+remapped+peering, 9 active+recovering; 4786 GB data, 7495 GB used, 7280 GB / 15360 GB avail; 267286/2491140 degraded (10.729%); 1814/1245570 unfound (0.146%)
   mdsmap e172: 1/1/1 up {0=karuizawa=up:replay}, 2 up:standby

BTW, after the 0.48 upgrade there was a disk format conversion. 1 of the 4 surviving OSDs didn't
Re: domino-style OSD crash
On Tue, Jul 3, 2012 at 1:40 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote: Upgraded the kernel to 3.5.0-rc4 + some patches; btrfs seems OK right now. Tried to restart osd with 0.47.3, then the next branch, and today with 0.48. 4 of 8 nodes fail with the same message:

ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
1: /usr/bin/ceph-osd() [0x701929]
...
13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*, leveldb::Slice const&) const+0x4d) [0x6e811d]

That looks like http://tracker.newdream.net/issues/2563 and the best we have for that ticket is "looks like you have a corrupted leveldb file". Is this reproducible with a freshly mkfs'ed data partition?
Re: domino-style OSD crash
On 03/07/2012 21:42, Tommi Virtanen wrote: [...] That looks like http://tracker.newdream.net/issues/2563 and the best we have for that ticket is "looks like you have a corrupted leveldb file". Is this reproducible with a freshly mkfs'ed data partition?

Probably not. I have multiple data volumes on each node (I was planning xfs vs ext4 vs btrfs benchmarks before being ill) and those nodes start OK with another data partition. It's very probable that there is corruption somewhere, due to a kernel bug, probably triggered by btrfs. Issue 2563 is probably the same. I'd like to restart those nodes without formatting them, not because the data is valuable, but because if the same thing happens in production, a method similar to fsck for the node could be of great value. I saw the method to check the leveldb. Will try tomorrow, no guarantees.

If I can repair it, do you think the crashed FS as it is right now is valuable to you, for future reference, since I saw you can't reproduce the problem? I can make an archive (or a btrfs dump?), but it will be quite big.

Cheers,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr
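For anyone attempting the same rescue: leveldb itself ships a best-effort repair entry point. The sketch below is my illustration, not a Ceph tool; the path argument (e.g. a copy of the omap leveldb directory under the OSD data dir) and the build line are assumptions, and it should be run with the OSD stopped, ideally against a copy of the store.

// leveldb_repair.cc -- best-effort repair of a possibly corrupted leveldb
// store (e.g. a copy of an OSD's omap directory). Illustration only.
// Build (assumption): g++ leveldb_repair.cc -o leveldb_repair -lleveldb
#include <iostream>
#include <leveldb/db.h>       // declares leveldb::RepairDB and leveldb::DB::Open
#include <leveldb/options.h>

int main(int argc, char** argv) {
  if (argc != 2) {
    std::cerr << "usage: " << argv[0] << " <leveldb-dir>" << std::endl;
    return 1;
  }
  leveldb::Options options;
  // RepairDB salvages what it can and moves unreadable files aside;
  // it is best-effort, so recovery of every key is not guaranteed.
  leveldb::Status s = leveldb::RepairDB(argv[1], options);
  if (!s.ok()) {
    std::cerr << "repair failed: " << s.ToString() << std::endl;
    return 1;
  }
  // Verify the store opens cleanly afterwards.
  leveldb::DB* db = NULL;
  s = leveldb::DB::Open(options, argv[1], &db);
  if (!s.ok()) {
    std::cerr << "store still broken: " << s.ToString() << std::endl;
    return 1;
  }
  std::cout << "store opens cleanly after repair" << std::endl;
  delete db;
  return 0;
}

Whether a store repaired this way is safe to put back under a running OSD is a separate question; treating the repaired copy as a forensic artifact rather than live data is the safer assumption.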
Re: domino-style OSD crash
On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote: If I can repair it, do you think the crashed FS as it is right now is valuable to you, for future reference, since I saw you can't reproduce the problem? I can make an archive (or a btrfs dump?), but it will be quite big.

At this point, it's more about the upstream developers (of btrfs etc) than us; we're on good terms with them but not experts on the on-disk format(s). You might want to send an email to the relevant mailing lists before wiping the disks.
Re: Should an OSD crash when journal device is out of space?
Hey guys, thanks for the problem report. I've created an issue to track it at http://tracker.newdream.net/issues/2687. It looks like we just assume that if you're using a file, you've got enough space for it. It shouldn't be a big deal to at least do some startup checks which will fail gracefully.
-Greg

On Wed, Jun 20, 2012 at 1:57 PM, Matthew Roy imjustmatt...@gmail.com wrote: I hit this a couple of times and wondered the same thing. Why does the OSD need to bail when it runs out of journal space?

On Wed, Jun 20, 2012 at 3:56 PM, Travis Rhoden trho...@gmail.com wrote: Not sure if this is a bug or not. It was definitely user error -- but since the OSD process bailed, figured I would report it. [...]
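A graceful startup check of the kind Greg describes could be as simple as comparing the configured journal size against free space on the journal's filesystem. Below is a minimal sketch under that assumption — my illustration, not the actual FileJournal code; the 5000 MB figure and the /tmpfs path mirror the report quoted below.

// journal_space_check.cc -- fail at startup instead of asserting mid-write.
// Illustration of the suggested check, not Ceph's actual implementation.
#include <sys/statvfs.h>
#include <cstdint>
#include <cstdio>
#include <iostream>
#include <string>

// True if the filesystem holding `dir` has at least `need` bytes available.
// (Space already occupied by an existing journal file is ignored for brevity.)
static bool journal_fits(const std::string& dir, uint64_t need) {
  struct statvfs vfs;
  if (statvfs(dir.c_str(), &vfs) != 0) {
    perror("statvfs");
    return false;
  }
  uint64_t avail = static_cast<uint64_t>(vfs.f_bavail) * vfs.f_frsize;
  return avail >= need;
}

int main() {
  const uint64_t journal_bytes = 5000ULL << 20;  // "osd journal size = 5000" (MB)
  if (!journal_fits("/tmpfs", journal_bytes)) {
    std::cerr << "journal filesystem too small for configured journal; "
              << "refusing to start" << std::endl;
    return 1;  // graceful exit rather than FAILED assert(0) in do_write
  }
  std::cout << "journal fits" << std::endl;
  return 0;
}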
Re: reproducible osd crash
THANKS a lot. This fixes it. I've merged your branch into next and I wasn't able to trigger the osd crash again. So please include this in 0.48.

Greets,
Stefan

On 26.06.2012 20:01, Sam Just wrote: Stefan, sorry for the delay, I think I've found the problem. Could you give wip_ms_handle_reset_race a try? -Sam [...]
Re: reproducible osd crash
On Wed, 27 Jun 2012, Stefan Priebe - Profihost AG wrote: THANKS a lot. This fixes it. I've merged your branch into next and I wasn't able to trigger the osd crash again. So please include this in 0.48.

Excellent. Thanks for testing! This is now in next.

sage

[...]
Re: reproducible osd crash
On Mon, Jun 25, 2012 at 10:48 PM, Stefan Priebe s.pri...@profihost.ag wrote: Strange, I just copied /core.hostname and /usr/bin/ceph-osd; no idea how this can happen. For building I use the provided Debian scripts.

Perhaps you upgraded the debs but did not restart the daemons? That would make the on-disk executable with that name not match the in-memory one.
Re: reproducible osd crash
On 26.06.2012 18:05, Tommi Virtanen wrote: On Mon, Jun 25, 2012 at 10:48 PM, Stefan Priebe s.pri...@profihost.ag wrote: Strange, I just copied /core.hostname and /usr/bin/ceph-osd; no idea how this can happen. For building I use the provided Debian scripts.

Perhaps you upgraded the debs but did not restart the daemons? That would make the on-disk executable with that name not match the in-memory one.

No, I reboot after each upgrade ;-) Right now I'm waiting for an FS fix (xfs or btrfs) and I will then reproduce the issue.

Stefan
Re: reproducible osd crash
Stefan, sorry for the delay, I think I've found the problem. Could you give wip_ms_handle_reset_race a try?
-Sam

On Tue, Jun 26, 2012 at 9:47 AM, Stefan Priebe s.pri...@profihost.ag wrote: [...] No, I reboot after each upgrade ;-) Right now I'm waiting for an FS fix (xfs or btrfs) and I will then reproduce the issue. Stefan
Re: reproducible osd crash
I've yet to make the core match the binary.

On Jun 22, 2012, at 11:32 PM, Stefan Priebe s.pri...@profihost.ag wrote: Thanks, did you find anything? [...]
Re: reproducible osd crash
Thanks, yes, it is from the next branch.

On 23.06.2012 02:26, Dan Mick wrote: The ceph-osd binary you sent claims to be version 0.47.2-521-g88c762, which is not quite 0.47.3. You can get the version by running the binary with -v, or (in my case) by examining strings in the binary. I'm retrieving that version to analyze the core dump. [...]
Re: reproducible osd crash
Thanks, did you find anything?

On 23.06.2012 01:59, Sam Just wrote: I am still looking into the logs. -Sam

On Fri, Jun 22, 2012 at 3:56 PM, Dan Mick dan.m...@inktank.com wrote: Stefan, I'm looking at your logs and coredump now. [...]
Re: reproducible osd crash
I'm still able to crash the ceph cluster while doing a lot of random I/O and then shutting down the KVM.

Stefan

On 21.06.2012 21:57, Stefan Priebe wrote: OK, I discovered this time that all osds had the same disk usage before the crash. After starting the osd again I got this one: [...] So instead of 1,5GB, osd 30 now uses 23G. [...]
Re: reproducible osd crash
Stefan, I'm looking at your logs and coredump now.

On 06/21/2012 11:43 PM, Stefan Priebe wrote: Does anybody have an idea? This is right now a showstopper for me. [...]
Re: reproducible osd crash
I am still looking into the logs.
-Sam

On Fri, Jun 22, 2012 at 3:56 PM, Dan Mick dan.m...@inktank.com wrote: Stefan, I'm looking at your logs and coredump now. [...]
Re: reproducible osd crash
The ceph-osd binary you sent claims to be version 0.47.2-521-g88c762, which is not quite 0.47.3. You can get the version by running the binary with -v, or (in my case) by examining strings in the binary. I'm retrieving that version to analyze the core dump.

On 06/21/2012 11:43 PM, Stefan Priebe wrote: Does anybody have an idea? This is right now a showstopper for me. [...]
reproducible osd crash
Hello list,

I'm able to reproducibly crash osd daemons. How I can reproduce:

Kernel: 3.5.0-rc3
Ceph: 0.47.3
FS: btrfs
Journal: 2GB tmpfs per OSD
OSD: 3x servers with 4x Intel SSD OSDs each
10GbE network
rbd_cache_max_age: 2.0
rbd_cache_size: 33554432
Disk is set to writeback.

Start a KVM VM via PXE with the disk attached in writeback mode. Then run the randwrite stress more than 2 times. In my case, mostly OSD 22 crashes.

# fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; halt

Strangely, exactly THIS OSD also has the most log entries:
64K ceph-osd.20.log
64K ceph-osd.21.log
1,3M ceph-osd.22.log
64K ceph-osd.23.log
But all OSDs are set to debug osd = 20.

dmesg shows:
ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp 7fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]

I uploaded the following files:
priebe_fio_randwrite_ceph-osd.21.log.bz2 = OSD which was OK and didn't crash
priebe_fio_randwrite_ceph-osd.22.log.bz2 = Log from the crashed OSD
priebe_fio_randwrite_core.ssdstor001.27204.bz2 = Core dump
priebe_fio_randwrite_ceph-osd.bz2 = osd binary

Stefan
Re: reproducible osd crash
When I now start the OSD again, it seems to hang forever. Load goes up to 200 and I/O waits rise from 0% to 20%.

On 21.06.2012 14:55, Stefan Priebe - Profihost AG wrote: [...]
Re: reproducible osd crash
Another strange thing: why does THIS OSD have 24GB and the others just 650MB?

/dev/sdb1 224G 654M 214G 1% /srv/osd.20
/dev/sdc1 224G 638M 214G 1% /srv/osd.21
/dev/sdd1 224G 24G 190G 12% /srv/osd.22
/dev/sde1 224G 607M 214G 1% /srv/osd.23

When I now start the OSD again, it seems to hang forever. Load goes up to 200 and I/O waits rise from 0% to 20%.

On 21.06.2012 14:55, Stefan Priebe - Profihost AG wrote: [...]
Re: reproducible osd crash
Mhm, is this normal (ceph health is NOW OK again)?

/dev/sdb1 224G 655M 214G 1% /srv/osd.20
/dev/sdc1 224G 640M 214G 1% /srv/osd.21
/dev/sdd1 224G 34G 181G 16% /srv/osd.22
/dev/sde1 224G 608M 214G 1% /srv/osd.23

Why does one OSD have so much more used space than the others? On my other OSD nodes all have around 600MB-700MB. Even when I reformat /dev/sdd1, after the backfill it has 34GB again?

Stefan

On 21.06.2012 15:13, Stefan Priebe - Profihost AG wrote: Another strange thing: why does THIS OSD have 24GB and the others just 650MB? [...]
Re: reproducible osd crash
OK, I discovered this time that all osds had the same disk usage before the crash. After starting the osd again I got this one:

/dev/sdb1 224G 23G 191G 11% /srv/osd.30
/dev/sdc1 224G 1,5G 213G 1% /srv/osd.31
/dev/sdd1 224G 1,5G 213G 1% /srv/osd.32
/dev/sde1 224G 1,6G 213G 1% /srv/osd.33

So instead of 1,5GB, osd 30 now uses 23G.

Stefan

On 21.06.2012 15:23, Stefan Priebe - Profihost AG wrote: Mhm, is this normal (ceph health is NOW OK again)? [...]
Should an OSD crash when journal device is out of space?
Not sure if this is a bug or not. It was definitely user error -- but since the OSD process bailed, figured I would report it.

I had /tmpfs mounted with 2.5GB of space:
tmpfs on /tmpfs type tmpfs (rw,size=2560m)

Then I decided to increase my journal size to 5G, but forgot to increase the limit on /tmpfs. =)
osd journal size = 5000

Predictably, things didn't go well when I ran a rados bench that filled up the journal. I'm not sure if such a case can be handled more gracefully:

-4 2012-06-20 12:39:36.648773 7fc042a5f780 1 journal _open /tmpfs/osd.2.journal fd 30: 524288 bytes, block size 4096 bytes, directio = 0, aio = 0
-3 2012-06-20 12:42:23.179164 7fc02e1ad700 1 CephxAuthorizeHandler::verify_authorizer isvalid=1
-2 2012-06-20 12:42:46.643205 7fc0396cf700 -1 journal FileJournal::write_bl : write_fd failed: (28) No space left on device
-1 2012-06-20 12:42:46.643245 7fc0396cf700 -1 journal FileJournal::do_write: write_bl(pos=2678079488) failed
0 2012-06-20 12:42:46.676991 7fc0396cf700 -1 os/FileJournal.cc: In function 'void FileJournal::do_write(ceph::bufferlist&)' thread 7fc0396cf700 time 2012-06-20 12:42:46.643315
os/FileJournal.cc: 994: FAILED assert(0)

ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
1: (FileJournal::do_write(ceph::buffer::list&)+0xe22) [0x653082]
2: (FileJournal::write_thread_entry()+0x735) [0x659545]
3: (FileJournal::Writer::entry()+0xd) [0x5de41d]
4: (()+0x7e9a) [0x7fc042434e9a]
5: (clone()+0x6d) [0x7fc0409e94bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- end dump of recent events ---

2012-06-20 12:42:46.693963 7fc0396cf700 -1 *** Caught signal (Aborted) ** in thread 7fc0396cf700
ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
1: /usr/bin/ceph-osd() [0x6eb32a]
2: (()+0xfcb0) [0x7fc04243ccb0]
3: (gsignal()+0x35) [0x7fc04092d445]
4: (abort()+0x17b) [0x7fc040930bab]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc04127b69d]
6: (()+0xb5846) [0x7fc041279846]
7: (()+0xb5873) [0x7fc041279873]
8: (()+0xb596e) [0x7fc04127996e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x282) [0x79dd02]
10: (FileJournal::do_write(ceph::buffer::list&)+0xe22) [0x653082]
11: (FileJournal::write_thread_entry()+0x735) [0x659545]
12: (FileJournal::Writer::entry()+0xd) [0x5de41d]
13: (()+0x7e9a) [0x7fc042434e9a]
14: (clone()+0x6d) [0x7fc0409e94bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
0 2012-06-20 12:42:46.693963 7fc0396cf700 -1 *** Caught signal (Aborted) ** in thread 7fc0396cf700 [...]
--- end dump of recent events ---
Re: Should an OSD crash when journal device is out of space?
I hit this a couple of times and wondered the same thing. Why does the OSD need to bail when it runs out of journal space?

On Wed, Jun 20, 2012 at 3:56 PM, Travis Rhoden trho...@gmail.com wrote: Not sure if this is a bug or not. It was definitely user error -- but since the OSD process bailed, figured I would report it. [...]
Re: OSD crash
On 17.06.2012 23:16, Sage Weil wrote: Hi Stefan, I opened http://tracker.newdream.net/issues/2599 to track this, but the dump strangely does not include the ceph version or commit sha1. What version were you running?

Sorry, that was my build system: it accidentally removed the .git dir while building, so the version string couldn't be compiled in. It was 5efaa8d7799347dfae38333b1fd6e1a87dc76b28.

Stefan
OSD crash
Hi,

today I got another osd crash ;-( Strangely, the osd logs are all empty. It seems logrotate hasn't reloaded the daemons, but I still have the core dump file. What's next?

Stefan
Re: OSD crash
And another crash again ;-(

0 2012-06-16 15:31:32.524369 7fd8935c4700 -1 ./common/Mutex.h: In function 'void Mutex::Lock(bool)' thread 7fd8935c4700 time 2012-06-16 15:31:32.522446
./common/Mutex.h: 110: FAILED assert(r == 0)

ceph version (commit:)
1: /usr/bin/ceph-osd() [0x51a07d]
2: (ReplicatedPG::C_OSD_OndiskWriteUnlock::finish(int)+0x2a) [0x579c5a]
3: (FileStore::_finish_op(FileStore::OpSequencer*)+0x2e4) [0x684374]
4: (ThreadPool::worker()+0xbb7) [0x7bc087]
5: (ThreadPool::WorkThread::entry()+0xd) [0x5f144d]
6: (()+0x68ca) [0x7fd89db3a8ca]
7: (clone()+0x6d) [0x7fd89c1bec0d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- end dump of recent events ---

2012-06-16 15:31:32.531567 7fd8935c4700 -1 *** Caught signal (Aborted) ** in thread 7fd8935c4700
ceph version (commit:)
1: /usr/bin/ceph-osd() [0x70e4b9]
2: (()+0xeff0) [0x7fd89db42ff0]
3: (gsignal()+0x35) [0x7fd89c121225]
4: (abort()+0x180) [0x7fd89c124030]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fd89c9b5dc5]
6: (()+0xcb166) [0x7fd89c9b4166]
7: (()+0xcb193) [0x7fd89c9b4193]
8: (()+0xcb28e) [0x7fd89c9b428e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x940) [0x78af20]
10: /usr/bin/ceph-osd() [0x51a07d]
11: (ReplicatedPG::C_OSD_OndiskWriteUnlock::finish(int)+0x2a) [0x579c5a]
12: (FileStore::_finish_op(FileStore::OpSequencer*)+0x2e4) [0x684374]
13: (ThreadPool::worker()+0xbb7) [0x7bc087]
14: (ThreadPool::WorkThread::entry()+0xd) [0x5f144d]
15: (()+0x68ca) [0x7fd89db3a8ca]
16: (clone()+0x6d) [0x7fd89c1bec0d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
0 2012-06-16 15:31:32.531567 7fd8935c4700 -1 *** Caught signal (Aborted) ** in thread 7fd8935c4700 [...]
--- end dump of recent events ---

On 16.06.2012 14:57, Stefan Priebe wrote: Hi, today I got another osd crash ;-( [...]
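For readers wondering how assert(r == 0) in Mutex::Lock can fire at all: Ceph's Mutex wraps pthread_mutex_lock and asserts that the call returned 0. The sketch below is my stripped-down illustration of that pattern, not the actual common/Mutex.h; a nonzero return typically means the mutex's memory is no longer valid, e.g. the object holding it was already destroyed, which fits a crash inside a completion callback like the one in the trace above.

// mutex_assert.cc -- why "FAILED assert(r == 0)" can fire in a Lock().
// Simplified illustration of the wrapper pattern, not Ceph's real class.
// Build: g++ mutex_assert.cc -o mutex_assert -lpthread
#include <pthread.h>
#include <cassert>

class Mutex {
  pthread_mutex_t m;
 public:
  Mutex() { pthread_mutex_init(&m, NULL); }
  ~Mutex() { pthread_mutex_destroy(&m); }
  void Lock() {
    int r = pthread_mutex_lock(&m);
    // pthread_mutex_lock returns nonzero (e.g. EINVAL) if the mutex was
    // destroyed or its memory was trampled -- one way a use-after-free of
    // the owning object shows up as an assert here rather than a segfault.
    assert(r == 0);
  }
  void Unlock() {
    int r = pthread_mutex_unlock(&m);
    assert(r == 0);
  }
};

int main() {
  Mutex mtx;
  mtx.Lock();    // succeeds on a healthy mutex
  mtx.Unlock();
  return 0;
}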
domino-style OSD crash
Hello,

Besides the performance inconsistency (see the other thread titled "poor OSD performance using kernel 3.4"), where I promised some tests (will run this afternoon), we tried this weekend to stress test ceph, making backups with bacula on an rbd volume of 15T (8 osd nodes, using 8 physical machines).

Results: it worked like a charm for two days, apart from btrfs warning messages; then OSDs began to crash one after another, 'domino style'. This morning, only 2 OSDs of 8 are left.

1 of the physical machines was in kernel oops state - nothing was remote logged, I don't know what happened, and there was no clear stack message. I suspect btrfs, but I have no proof. This node (OSD.7) seems to have been the first one to crash, which generated reconstruction between OSDs and then led to the cascading osd crashes. The other physical machines are still up, but with no osd running.

Here are some traces found in the osd logs:

-3 2012-06-03 12:43:32.524671 7ff1352b8700 0 log [WRN] : slow request 30.506952 seconds old, received at 2012-06-03 12:43:01.997386: osd_sub_op(osd.0.0:1842628 2.57 ea8d5657/label5_17606_object7068/head [push] v 191'628 snapset=0=[]:[] snapc=0=[]) v6 currently queued for pg
-2 2012-06-03 12:44:32.869852 7ff1352b8700 0 log [WRN] : 1 slow requests, 1 included below; oldest blocked for 30.073136 secs
-1 2012-06-03 12:44:32.869886 7ff1352b8700 0 log [WRN] : slow request 30.073136 seconds old, received at 2012-06-03 12:44:02.796651: osd_sub_op(osd.6.0:1837430 2.59 97e62059/rb.0.1.000a2cdf/head [push] v 1438'9416 snapset=0=[]:[] snapc=0=[]) v6 currently started
0 2012-06-03 12:55:33.088034 7ff1237f6700 -1 *** Caught signal (Aborted) ** in thread 7ff1237f6700

ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
1: /usr/bin/ceph-osd() [0x708ea9]
2: (()+0xeff0) [0x7ff13af2cff0]
3: (gsignal()+0x35) [0x7ff13950b1b5]
4: (abort()+0x180) [0x7ff13950dfc0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7ff139d9fdc5]
6: (()+0xcb166) [0x7ff139d9e166]
7: (()+0xcb193) [0x7ff139d9e193]
8: (()+0xcb28e) [0x7ff139d9e28e]
9: (std::__throw_length_error(char const*)+0x67) [0x7ff139d39307]
10: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x72) [0x7ff139d7ab42]
11: (()+0xa8565) [0x7ff139d7b565]
12: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, unsigned long, std::allocator<char> const&)+0x1b) [0x7ff139d7b7ab]
13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*, leveldb::Slice const&) const+0x4d) [0x6ef69d]
14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&)+0x9f) [0x6fdd9f]
15: (leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)+0x4d3) [0x6eaba3]
16: (leveldb::DBImpl::BackgroundCompaction()+0x222) [0x6ebb02]
17: (leveldb::DBImpl::BackgroundCall()+0x68) [0x6ec378]
18: /usr/bin/ceph-osd() [0x704981]
19: (()+0x68ca) [0x7ff13af248ca]
20: (clone()+0x6d) [0x7ff1395a892d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

2 OSDs exhibit similar traces.
Four other OSDs had traces like this one:

-5> 2012-06-03 13:31:39.393489 7f74fd9c7700 -1 osd.3 1513 heartbeat_check: no reply from osd.5 since 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:19.393488)
-4> 2012-06-03 13:31:40.393689 7f74fd9c7700 -1 osd.3 1513 heartbeat_check: no reply from osd.5 since 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:20.393687)
-3> 2012-06-03 13:31:41.402873 7f74fd9c7700 -1 osd.3 1513 heartbeat_check: no reply from osd.5 since 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:21.402872)
-2> 2012-06-03 13:31:42.363270 7f74f08ac700 -1 osd.3 1513 heartbeat_check: no reply from osd.5 since 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:22.363269)
-1> 2012-06-03 13:31:42.416968 7f74fd9c7700 -1 osd.3 1513 heartbeat_check: no reply from osd.5 since 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:22.416966)
0> 2012-06-03 13:36:48.147020 7f74f58b6700 -1 osd/PG.cc: In function 'void PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, int)' thread 7f74f58b6700 time 2012-06-03 13:36:48.100157
osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
1: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, int)+0x1eae) [0x649cce]
2: (PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec const&)+0x2b1) [0x649fc1]
3: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x203) [0x660343]
4: (boost
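For readers decoding the assert in that last trace: merge_log() refuses to merge two PG logs whose version ranges do not overlap, since a gap would mean lost history. A hedged sketch of just that invariant, with a simplified stand-in for Ceph's eversion_t (illustrative only, not the actual osd/PG.cc code):

#include <cassert>
#include <cstdint>

struct eversion_t {            // simplified stand-in for Ceph's eversion_t
  uint64_t epoch = 0, version = 0;
  bool operator>=(const eversion_t &o) const {
    return epoch != o.epoch ? epoch > o.epoch : version >= o.version;
  }
};

struct pg_log_sketch {         // hypothetical minimal pg log
  eversion_t tail, head;       // oldest and newest entry recorded
};

void merge_log_sketch(pg_log_sketch &log, const pg_log_sketch &olog) {
  // The two logs must share at least one version; otherwise there is a
  // hole in the history and a plain merge would silently lose updates.
  assert(log.head >= olog.tail && olog.head >= log.tail);
  // ... actual merging elided ...
}

Hitting it means the local log and the incoming authoritative log no longer share any version -- exactly what you would expect once on-disk state has been corrupted further down the stack.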
Re: domino-style OSD crash
On Mon, Jun 4, 2012 at 1:44 AM, Yann Dupont <yann.dup...@univ-nantes.fr> wrote:
> Results: it worked like a charm for two days, apart from btrfs warning
> messages; then the OSDs began to crash one after another, 'domino style'.

Sorry to hear that. Reading through your message, there seem to be
several problems; whether they are because of the same root cause, I
can't tell. Quick triage to benefit the other devs:

#1: kernel crash, no details available

> One of the physical machines was in a kernel oops state - nothing was remote

#2: leveldb corruption? may be memory corruption that started
elsewhere.. Sam, does this look like the leveldb issue you saw?

> [push] v 1438'9416 snapset=0=[]:[] snapc=0=[]) v6 currently started
> 0> 2012-06-03 12:55:33.088034 7ff1237f6700 -1 *** Caught signal (Aborted) **
...
> 13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*, leveldb::Slice const&) const+0x4d) [0x6ef69d]
> 14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&)+0x9f) [0x6fdd9f]

#3: PG::merge_log assertion while recovering from the above; Sam, any ideas?

> 0> 2012-06-03 13:36:48.147020 7f74f58b6700 -1 osd/PG.cc: In function 'void PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, int)' thread 7f74f58b6700 time 2012-06-03 13:36:48.100157
> osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

#4: unknown btrfs warnings, there should be an actual message above
this traceback; believed fixed in latest kernel

> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479278] [<ffffffffa026fca5>] ? btrfs_orphan_commit_root+0x105/0x110 [btrfs]
> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479328] [<ffffffffa026965a>] ? commit_fs_roots.isra.22+0xaa/0x170 [btrfs]
> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479379] [<ffffffffa02bc9a0>] ? btrfs_scrub_pause+0xf0/0x100 [btrfs]
> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479415] [<ffffffffa026a6f1>] ? btrfs_commit_transaction+0x521/0x9d0 [btrfs]
> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479460] [<ffffffff8105a9f0>] ? add_wait_queue+0x60/0x60
> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479493] [<ffffffffa026aba0>] ? btrfs_commit_transaction+0x9d0/0x9d0 [btrfs]
> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479543] [<ffffffffa026abb1>] ? do_async_commit+0x11/0x20 [btrfs]
> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479572]
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: domino-style OSD crash
Can you send the osd logs? The merge_log crashes are probably fixable
if I can see the logs.

The leveldb crash is almost certainly a result of memory corruption.

Thanks
-Sam

On Mon, Jun 4, 2012 at 9:16 AM, Tommi Virtanen <t...@inktank.com> wrote:
> Sorry to hear that. Reading through your message, there seem to be
> several problems; whether they are because of the same root cause, I
> can't tell. Quick triage to benefit the other devs:
> [...]
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
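For anyone reproducing this later: the logs Sam is asking for are far more useful with verbose debugging enabled before the crash. A hedged example of the kind of [osd] settings commonly suggested on this list at the time (exact levels are a judgment call, and they are very chatty):

[osd]
    debug osd = 20
    debug filestore = 20
    debug journal = 20
    debug ms = 1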
Re: domino-style OSD crash
This is probably the same/similar to
http://tracker.newdream.net/issues/2462, no? There's a log there,
though I've no idea how helpful it is.

On Monday, June 4, 2012 at 10:40 AM, Sam Just wrote:
> Can you send the osd logs? The merge_log crashes are probably fixable
> if I can see the logs.
>
> The leveldb crash is almost certainly a result of memory corruption.
>
> Thanks
> -Sam
> [...]
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Problem after ceph-osd crash
On Mon, 20 Feb 2012, Oliver Francke wrote:
> Hi Sage,
>
> On 02/20/2012 06:41 PM, Sage Weil wrote:
> > On Mon, 20 Feb 2012, Oliver Francke wrote:
> > > Hi,
> > >
> > > we are just in trouble after some mess with trying to include a new
> > > OSD-node into our cluster. We get some weird
> > >
> > > libceph: corrupt inc osdmap epoch 880 off 102 (c9001db8990a of c9001db898a4-c9001db89dae)
> >
> > I just retested the kernel client against the new server code and I
> > don't see this. If you can pull the osdmap/880 file from the monitor
> > data directory (soon, please, the monitor will delete it once things
> > fully recover and move on) I can see what the data looks like.
> >
> > > on the console. The whole system is in a state ala:
> > >
> > > 2012-02-20 17:56:27.585295 pg v942504: 2046 pgs: 1348 active+clean,
> > > 43 active+recovering+degraded+remapped+backfill, 218 active+recovering,
> > > 437 active+recovering+remapped+backfill; 1950 GB data, 3734 GB used,
> > > 26059 GB / 29794 GB avail; 272914/1349073 degraded (20.230%)
> > >
> > > and sometimes the ceph-osd on node0 is crashing. At the moment of
> > > writing, the degradation continues to shrink down below 20%.
> >
> > How did ceph-osd crash? Is there a dump in the log?
>
> 'course I will provide all logs, uhm, a bit later, we are busy to start
> all VM's, and handle first customer-tickets right now ;-)
>
> To be most complete for the collection, would you be so kind to give a
> list of all necessary kern.log, osdX.log etc.?

I think just the crashed osd log will be enough. It looks like the rest
of the cluster is recovering ok...

Are the VMs running on top of the kernel rbd client, or KVM+librbd?

sage

> Thnx for the fast reaction,
>
> Oliver.
>
> > sage
> >
> > > Any clues?
> > >
> > > Thnx in @vance,
> > >
> > > Oliver.
> > >
> > > --
> > > Oliver Francke
> > > filoo GmbH
> > > Moltkestraße 25a
> > > 0 Gütersloh
> > > HRB4355 AG Gütersloh
> > > Managing directors: S.Grewing | J.Rehpöhler | C.Kunz
> > > Follow us on Twitter: http://twitter.com/filoogmbh
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
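For later readers: the incremental map file Sage is asking for lives only in the monitor's data directory, but the full map for the same epoch can usually be fetched and decoded with the stock tools as a cross-check. A hedged sketch (paths follow the "mon data" setting in ceph.conf and vary by setup; flags may differ between versions):

$ ceph osd getmap 880 -o /tmp/osdmap.880   # full map for epoch 880
$ osdmaptool --print /tmp/osdmap.880       # decode it; a failure here points
                                           # at genuine map corruption
$ ls $mon_data/osdmap/880                  # the raw incremental-map file
                                           # ($mon_data = mon data path)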
Re: osd crash during resync
Hi Sage,

I uploaded the osd.0 log as well.
http://85.214.49.87/ceph/20120124/osd.0.log.bz2

-martin

On 25.01.2012 23:08, Sage Weil wrote:
> Hi Martin,
>
> On Tue, 24 Jan 2012, Martin Mailand wrote:
> > Hi,
> > today I tried the btrfs patch mentioned on the btrfs ml. Therefore I
> > rebooted osd.0 with a new kernel and created a new btrfs on the osd.0,
> > then I took the osd.0 into the cluster. During the resync of osd.0,
> > osd.2 and osd.3 crashed. I am not sure if the crashes happened because
> > I played with osd.0, or if they are bugs.
> [...]
>
> Can you post the log for osd.0 too?
>
> Thanks!
> sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: osd crash during resync
Hi Martin,

On Tue, 24 Jan 2012, Martin Mailand wrote:
> Hi,
> today I tried the btrfs patch mentioned on the btrfs ml. Therefore I
> rebooted osd.0 with a new kernel and created a new btrfs on the osd.0,
> then I took the osd.0 into the cluster. During the resync of osd.0,
> osd.2 and osd.3 crashed. I am not sure if the crashes happened because
> I played with osd.0, or if they are bugs.
>
> osd.2
> -rw------- 1 root root 1.1G 2012-01-24 12:19 core-ceph-osd-1000-1327403927-s-brick-002
> log:
> 2012-01-24 12:15:45.563135 7f1fdd42c700 log [INF] : 2.a restarting backfill on osd.0 from (185'113859,185'113859] 0//0 to 196'114038
> osd/PG.cc: In function 'void PG::finish_recovery_op(const hobject_t&, bool)', in thread '7f1fdab26700'
> osd/PG.cc: 1553: FAILED assert(recovery_ops_active > 0)
>
> -rw------- 1 root root 758M 2012-01-24 15:58 core-ceph-osd-20755-1327417128-s-brick-002

Can you post the log for osd.0 too?

Thanks!
sage

> log:
> 2012-01-24 15:58:48.356892 7fe26acbf700 osd.2 379 pg[2.ff( v 379'286211 lc 202'286160 (185'285159,379'286211] n=112 ec=1 les/c 379/310 373/376/376) [2,1] r=0 lpr=376 rops=1 mlcod 202'286160 active m=6] * oi->watcher: client.4478 cookie=1
> osd/ReplicatedPG.cc: In function 'void ReplicatedPG::populate_obc_watchers(ReplicatedPG::ObjectContext*)', in thread '7fe26fdca700'
> osd/ReplicatedPG.cc: 3199: FAILED assert(obc->watchers.size() == 0)
> osd/ReplicatedPG.cc: In function 'void ReplicatedPG::populate_obc_watchers(ReplicatedPG::ObjectContext*)', in thread '7fe26fdca700'
>
> http://85.214.49.87/ceph/20120124/osd.2.log.bz2
>
> osd.3
> -rw------- 1 root root 986M 2012-01-24 12:24 core-ceph-osd-962-1327404263-s-brick-003
> log:
> 2012-01-24 12:15:50.241321 7f30c8fde700 log [INF] : 2.2e restarting backfill on osd.0 from (185'338312,185'338312] 0//0 to 196'339910
> 2012-01-24 12:21:48.420242 7f30c5ed7700 log [INF] : 2.9d scrub ok
> osd/PG.cc: In function 'void PG::activate(ObjectStore::Transaction&, std::list<Context*>&, std::map<int, std::map<pg_t, PG::Query> >&, std::map<int, MOSDPGInfo*>*)', in thread '7f30c8fde700'
>
> http://85.214.49.87/ceph/20120124/osd.3.log.bz2
>
> -martin
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: osd crash during resync
On Tue, Jan 24, 2012 at 10:48 AM, Martin Mailand <mar...@tuxadero.com> wrote:
> Hi,
> today I tried the btrfs patch mentioned on the btrfs ml. Therefore I
> rebooted osd.0 with a new kernel and created a new btrfs on the osd.0,
> then I took the osd.0 into the cluster. During the resync of osd.0,
> osd.2 and osd.3 crashed. I am not sure if the crashes happened because
> I played with osd.0, or if they are bugs.

These are OSD-level issues not caused by btrfs, so your new kernel
definitely didn't do it. It's probably fallout from the backfill
changes that got merged in last week. I created new bugs to track them:
http://tracker.newdream.net/issues/1982 (1983, 1984).

Sam and Josh are going wild on some other issues that we've turned up,
and these have been added to the queue; they'll be picked up as soon as
somebody qualified can get to them. :)
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: osd crash during resync
Hi Greg,

ok, do you guys still need the core files, or could I delete them?

-martin

On 24.01.2012 22:13, Gregory Farnum wrote:
> On Tue, Jan 24, 2012 at 10:48 AM, Martin Mailand <mar...@tuxadero.com> wrote:
> > [...]
>
> These are OSD-level issues not caused by btrfs, so your new kernel
> definitely didn't do it. It's probably fallout from the backfill
> changes that got merged in last week. I created new bugs to track them:
> http://tracker.newdream.net/issues/1982 (1983, 1984).
> [...]
> -Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: osd crash during resync
On Tue, Jan 24, 2012 at 1:22 PM, Martin Mailand <mar...@tuxadero.com> wrote:
> Hi Greg,
> ok, do you guys still need the core files, or could I delete them?

Sam thinks probably not, since we have the backtraces and the
logs... thanks for asking, though! :)
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
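For the archive: the reason core files sometimes stay interesting even once backtraces are in hand is that they also hold per-thread state and variable values. A hedged example of squeezing more out of one of the cores mentioned above (the binary path is assumed; adjust to where your ceph-osd lives):

$ gdb /usr/bin/ceph-osd core-ceph-osd-1000-1327403927-s-brick-002
(gdb) thread apply all bt    # backtraces for every thread, not just the crasher
(gdb) frame 2                # select a frame from the crashing thread's trace
(gdb) info locals            # inspect the variables the log lines can't show
(gdb) print *this            # e.g. dump the object inside a C++ method frame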
Re: OSD crash
This is an interesting one -- the invariant that the assert is checking
isn't too complicated (that the object lives on the RecoveryWQ's queue),
and it seems to hold everywhere the RecoveryWQ is called. And the
functions modifying the queue are always called under the workqueue
lock, and do maintenance if the xlist::item is on a different list.

Which makes me think that the problem must come from conflating the
RecoveryWQ lock and the PG lock in the few places that modify the
PG::recovery_item directly, rather than via RecoveryWQ functions.
Anybody more familiar than me with this have ideas?

Fyodor, based on the time stamps and output you've given us, I assume
you don't have more detailed logs?
-Greg

On Thu, May 26, 2011 at 5:12 PM, Fyodor Ustinov <u...@ufm.su> wrote:
> Hi!
>
> 2011-05-27 02:35:22.046798 7fa8ff058700 journal check_for_full at 837623808 : JOURNAL FULL 837623808 >= 147455 (max_size 996147200 start 837771264)
> 2011-05-27 02:35:23.479379 7fa8f7f49700 journal throttle: waited for bytes
> 2011-05-27 02:35:34.730418 7fa8ff058700 journal check_for_full at 836984832 : JOURNAL FULL 836984832 >= 638975 (max_size 996147200 start 837623808)
> 2011-05-27 02:35:36.050384 7fa8f7f49700 journal throttle: waited for bytes
> 2011-05-27 02:35:47.226789 7fa8ff058700 journal check_for_full at 836882432 : JOURNAL FULL 836882432 >= 102399 (max_size 996147200 start 836984832)
> 2011-05-27 02:35:48.937259 7fa8f874a700 journal throttle: waited for bytes
> 2011-05-27 02:35:59.985040 7fa8ff058700 journal check_for_full at 836685824 : JOURNAL FULL 836685824 >= 196607 (max_size 996147200 start 836882432)
> 2011-05-27 02:36:01.654955 7fa8f874a700 journal throttle: waited for bytes
> 2011-05-27 02:36:12.362896 7fa8ff058700 journal check_for_full at 835723264 : JOURNAL FULL 835723264 >= 962559 (max_size 996147200 start 836685824)
> 2011-05-27 02:36:14.375435 7fa8f7f49700 journal throttle: waited for bytes
>
> ./include/xlist.h: In function 'void xlist<T>::remove(xlist<T>::item*) [with T = PG*]', in thread '0x7fa8f7748700'
> ./include/xlist.h: 107: FAILED assert(i->_list == this)
> ceph version 0.28.1 (commit:d66c6ca19bbde3c363b135b66072de44e67c6632)
> 1: (xlist<PG*>::pop_front()+0xbb) [0x54f28b]
> 2: (OSD::RecoveryWQ::_dequeue()+0x73) [0x56bcc3]
> 3: (ThreadPool::worker()+0x10a) [0x65799a]
> 4: (ThreadPool::WorkThread::entry()+0xd) [0x548c8d]
> 5: (()+0x6d8c) [0x7fa904294d8c]
> 6: (clone()+0x6d) [0x7fa90314704d]
> ceph version 0.28.1 (commit:d66c6ca19bbde3c363b135b66072de44e67c6632)
> 1: (xlist<PG*>::pop_front()+0xbb) [0x54f28b]
> 2: (OSD::RecoveryWQ::_dequeue()+0x73) [0x56bcc3]
> 3: (ThreadPool::worker()+0x10a) [0x65799a]
> 4: (ThreadPool::WorkThread::entry()+0xd) [0x548c8d]
> 5: (()+0x6d8c) [0x7fa904294d8c]
> 6: (clone()+0x6d) [0x7fa90314704d]
> *** Caught signal (Aborted) ** in thread 0x7fa8f7748700
> ceph version 0.28.1 (commit:d66c6ca19bbde3c363b135b66072de44e67c6632)
> 1: /usr/bin/cosd() [0x6729f9]
> 2: (()+0xfc60) [0x7fa90429dc60]
> 3: (gsignal()+0x35) [0x7fa903094d05]
> 4: (abort()+0x186) [0x7fa903098ab6]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fa90394b6dd]
> 6: (()+0xb9926) [0x7fa903949926]
> 7: (()+0xb9953) [0x7fa903949953]
> 8: (()+0xb9a5e) [0x7fa903949a5e]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x362) [0x655e32]
> 10: (xlist<PG*>::pop_front()+0xbb) [0x54f28b]
> 11: (OSD::RecoveryWQ::_dequeue()+0x73) [0x56bcc3]
> 12: (ThreadPool::worker()+0x10a) [0x65799a]
> 13: (ThreadPool::WorkThread::entry()+0xd) [0x548c8d]
> 14: (()+0x6d8c) [0x7fa904294d8c]
> 15: (clone()+0x6d) [0x7fa90314704d]
>
> WBR,
> Fyodor.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
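To make the invariant concrete, a hedged and much-simplified sketch of the idea behind include/xlist.h (not the verbatim source): each item remembers which list owns it, and remove() asserts that ownership still points at the list doing the removing:

#include <cassert>

// Simplified illustration only -- not the real include/xlist.h.
template <typename T>
class xlist {
public:
  struct item {
    T obj;
    xlist *_list = nullptr;       // which xlist currently owns this item
    explicit item(T o) : obj(o) {}
  };

  void push_back(item *i) {
    assert(i->_list == nullptr);  // may only live on one list at a time
    i->_list = this;
    // ... link into the doubly linked list ...
  }

  void remove(item *i) {
    assert(i->_list == this);     // the assert that fired in _dequeue()
    // ... unlink ...
    i->_list = nullptr;
  }
};

If one thread moves a PG's recovery_item onto another list while holding only the PG lock, a worker calling pop_front() under the workqueue lock sees _list pointing elsewhere and trips the assert -- which is Greg's conflation theory in code form.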
Re: OSD crash
On 05/27/2011 06:16 PM, Gregory Farnum wrote:
> This is an interesting one -- the invariant that the assert is checking
> isn't too complicated (that the object lives on the RecoveryWQ's queue),
> and it seems to hold everywhere the RecoveryWQ is called.
> [...]
> Fyodor, based on the time stamps and output you've given us, I assume
> you don't have more detailed logs?
> -Greg

Greg, I got this crash again. Let me tell you the configuration and what is happening:

Configuration:
6 osd servers: 4G RAM, 4*1T hdd (mdadmed to raid0), 2*1G etherchannel ethernet, Ubuntu server 11.04/64 with kernel 2.6.39 (hand compiled).
mon+mds server: 24G RAM, the same OS.

On each OSD the journal is placed on a 1G tmpfs. OSD data - on xfs in this case.

Configuration file:

[global]
    max open files = 131072
    log file = /var/log/ceph/$name.log
    pid file = /var/run/ceph/$name.pid
[mon]
    mon data = /mfs/mon$id
[mon.0]
    mon addr = 10.5.51.230:6789
[mds]
    keyring = /mfs/mds/keyring.$name
[mds.0]
    host = mds0
[osd]
    osd data = /$name
    osd journal = /journal/$name
    osd journal size = 950
    journal dio = false
[osd.0]
    host = osd0
    cluster addr = 10.5.51.10
    public addr = 10.5.51.140
[osd.1]
    host = osd1
    cluster addr = 10.5.51.11
    public addr = 10.5.51.141
[osd.2]
    host = osd2
    cluster addr = 10.5.51.12
    public addr = 10.5.51.142
[osd.3]
    host = osd3
    cluster addr = 10.5.51.13
    public addr = 10.5.51.143
[osd.4]
    host = osd4
    cluster addr = 10.5.51.14
    public addr = 10.5.51.144
[osd.5]
    host = osd5
    cluster addr = 10.5.51.15
    public addr = 10.5.51.145

What happened: osd2 crashed and was rebooted; the osd data and journal were created from scratch by `cosd --mkfs -i 2 --monmap /tmp/monmap`, and the server was started. Additionally, writeahead journal mode is explicitly enabled on osd2, but I think that's not important in this case.

Well, the server started rebalancing:

2011-05-27 15:12:49.323558 7f3b69de5740 ceph version 0.28.1 (commit: d66c6ca19bbde3c363b135b66072de44e67c6632), process cosd,
pid 1694
2011-05-27 15:12:49.325331 7f3b69de5740 filestore(/osd.2) mount FIEMAP ioctl is NOT supported
2011-05-27 15:12:49.325378 7f3b69de5740 filestore(/osd.2) mount did NOT detect btrfs
2011-05-27 15:12:49.325467 7f3b69de5740 filestore(/osd.2) mount found snaps
2011-05-27 15:12:49.325512 7f3b69de5740 filestore(/osd.2) mount: WRITEAHEAD journal mode explicitly enabled in conf
2011-05-27 15:12:49.325526 7f3b69de5740 filestore(/osd.2) mount WARNING: not btrfs or ext3; data may be lost
2011-05-27 15:12:49.325606 7f3b69de5740 journal _open /journal/osd.2 fd 11: 996147200 bytes, block size 4096 bytes, directio = 0
2011-05-27 15:12:49.325641 7f3b69de5740 journal read_entry 4096 : seq 1 203 bytes
2011-05-27 15:12:49.325698 7f3b69de5740 journal _open /journal/osd.2 fd 11: 996147200 bytes, block size 4096 bytes, directio = 0
2011-05-27 15:12:49.544716 7f3b59656700 -- 10.5.51.12:6801/1694 >> 10.5.51.14:6801/5070 pipe(0x1239d20 sd=27 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2011-05-27 15:12:49.544798 7f3b59c5c700 -- 10.5.51.12:6801/1694 >> 10.5.51.13:6801/5165 pipe(0x104b950 sd=14 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2011-05-27 15:12:49.544864 7f3b59757700 -- 10.5.51.12:6801/1694 >> 10.5.51.15:6801/1574 pipe(0x11e7cd0 sd=16 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2011-05-27 15:12:49.544909 7f3b59959700 -- 10.5.51.12:6801/1694 >> 10.5.51.10:6801/6148 pipe(0x11d7d30 sd=15 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2011-05-27 15:13:23.015637 7f3b64579700 journal check_for_full at 66404352 : JOURNAL FULL 66404352 >= 851967 (max_size 996147200 start 67256320)
2011-05-27 15:13:25.586081 7f3b5dc6b700 journal throttle: waited for bytes
2011-05-27 15:13:25.601789 7f3b5d46a700 journal throttle: waited for bytes
[...]

and after 2 hours:

2011-05-27 17:30:21.355034 7f3b64579700 journal check_for_full at 415199232 : JOURNAL FULL 415199232 >= 778239 (max_size 996147200 start 415977472)
2011-05-27 17:30:23.441445 7f3b5d46a700 journal throttle: waited for bytes
2011-05-27 17:30:36.362877 7f3b64579700 journal check_for_full at 414326784 : JOURNAL FULL 414326784 >= 872447 (max_size 996147200 start 415199232)
2011-05-27 17:30:38.391372 7f3b5d46a700 journal throttle: waited for bytes
2011-05-27 17:30:50.373936 7f3b64579700 journal check_for_full at 414314496 : JOURNAL FULL 414314496 >= 12287 (max_size 996147200
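As an aside for anyone puzzled by the JOURNAL FULL lines: they are plain ring-buffer arithmetic, and the numbers above check out. A hedged sketch of the room computation (illustrative, not the actual FileJournal code; note the one reserved byte, so that write position == start can only mean empty, never full):

#include <cstdint>

// room left in a ring buffer of size max_size, with live data from
// 'start' up to the write position 'pos' (wrapping), minus 1 reserved byte
uint64_t journal_room(uint64_t pos, uint64_t start, uint64_t max_size,
                      uint64_t top /* first byte after the header */) {
  if (pos >= start)
    return (max_size - pos) + (start - top) - 1;
  return start - pos - 1;
}

// First JOURNAL FULL line above: pos=66404352, start=67256320, so
// room = 67256320 - 66404352 - 1 = 851967 -- the "851967" in the log.
// The pending entry needs more than that, so the writer waits until a
// filestore commit advances 'start' and frees space.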
Re: OSD crash
On 05/27/2011 10:18 PM, Gregory Farnum wrote:
> Can you check out the recoverywq_fix branch and see if that prevents
> this issue? Or just apply the patch I've included below. :)
> -Greg

Looks as though this patch has helped. At least this osd has completed rebalancing.

Great! Thanks!

WBR,
Fyodor.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
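For the archive, the shape of the fix being tested here (a hedged reconstruction of the idea only, not Greg's actual recoverywq_fix patch): make every manipulation of pg->recovery_item go through the workqueue under its own lock, instead of letting PG-lock holders touch the xlist directly. Reusing the xlist sketch shown earlier:

#include <mutex>

struct PG;  // left opaque here

class RecoveryWQ {
  std::mutex qlock;               // the workqueue lock
  xlist<PG*> queue;               // assumes the xlist sketch above
public:
  void queue_pg(xlist<PG*>::item *i) {
    std::lock_guard<std::mutex> l(qlock);
    if (i->_list == nullptr)      // enqueue at most once
      queue.push_back(i);
  }
  void dequeue_pg(xlist<PG*>::item *i) {
    std::lock_guard<std::mutex> l(qlock);
    if (i->_list != nullptr)      // a racing removal becomes a no-op
      queue.remove(i);            // instead of tripping the assert
  }
};

The design point is that list membership is only ever tested and changed while qlock is held, so no thread can observe a half-moved item -- consistent with Fyodor's report that the osd completed rebalancing once the branch was applied.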