Yes, it looks similar, but not identical.
There is another potential race condition that may cause this.

We protect the TrackedOp::events structure with the lock mutex only in 
TrackedOp::mark_event; I couldn't find the lock taken anywhere else. The 
events structure should also be protected during dump, specifically 
within _dump().
I am taking care of that as well.
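
To illustrate the idea, here is a minimal sketch, not the actual Ceph code. TrackedOpSketch and event_count() are hypothetical names made up for this example, and I am assuming events is a simple vector of (timestamp, name) pairs. The point is only that the dump path takes the same mutex as mark_event before it touches events:

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for TrackedOp. Names mirror the thread, but this is
// an illustration only and not the Ceph implementation.
class TrackedOpSketch {
public:
  // Writer path: append an event while holding the lock.
  void mark_event(const std::string& name, double stamp) {
    std::lock_guard<std::mutex> l(lock);
    events.emplace_back(stamp, name);
  }

  // Reader path: copy the events under the same lock, then format the copy
  // without holding it. A concurrent mark_event() can no longer invalidate
  // the container while it is being iterated.
  std::vector<std::string> dump() const {
    std::vector<std::pair<double, std::string>> snapshot;
    {
      std::lock_guard<std::mutex> l(lock);
      snapshot = events;
    }
    std::vector<std::string> out;
    out.reserve(snapshot.size());
    for (const auto& ev : snapshot) {
      out.push_back(std::to_string(ev.first) + " " + ev.second);
    }
    return out;
  }

  // Helper for the example: number of recorded events, read under the lock.
  std::size_t event_count() const {
    std::lock_guard<std::mutex> l(lock);
    return events.size();
  }

private:
  mutable std::mutex lock;  // guards events
  std::vector<std::pair<double, std::string>> events;
};
```

Copying under the lock and formatting outside it keeps the critical section short, which matters if dump is called repeatedly from the admin socket while I/O is running. Holding the lock for the whole dump would also be correct, just slower for the writers.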

Thanks & Regards
Somnath
-----Original Message-----
From: Sage Weil [mailto:sw...@redhat.com] 
Sent: Monday, September 08, 2014 5:59 PM
To: Somnath Roy
Cc: Samuel Just; ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
Subject: RE: OSD is crashing while running admin socket

On Tue, 9 Sep 2014, Somnath Roy wrote:
> Created the following tracker and assigned to me.
> 
> http://tracker.ceph.com/issues/9384

By the way, this might be the same as or similar to
http://tracker.ceph.com/issues/8885

Thanks!
sage


> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Samuel Just [mailto:sam.j...@inktank.com]
> Sent: Monday, September 08, 2014 5:22 PM
> To: Somnath Roy
> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
> ceph-users@lists.ceph.com
> Subject: Re: OSD is crashing while running admin socket
> 
> That seems reasonable.  Bug away!
> -Sam
> 
> On Mon, Sep 8, 2014 at 5:11 PM, Somnath Roy <somnath....@sandisk.com> wrote:
> > Hi Sage/Sam,
> >
> >
> >
> > I hit an OSD crash with the latest Ceph master. Here is the 
> > backtrace.
> >
> >
> >
> > ceph version 0.85-677-gd5777c4
> > (d5777c421548e7f039bb2c77cb0df2e9c7404723)
> >
> > 1: ceph-osd() [0x990def]
> >
> > 2: (()+0xfbb0) [0x7f72ae6e6bb0]
> >
> > 3: (gsignal()+0x37) [0x7f72acc08f77]
> >
> > 4: (abort()+0x148) [0x7f72acc0c5e8]
> >
> > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f72ad5146e5]
> >
> > 6: (()+0x5e856) [0x7f72ad512856]
> >
> > 7: (()+0x5e883) [0x7f72ad512883]
> >
> > 8: (()+0x5eaae) [0x7f72ad512aae]
> >
> > 9: (ceph::buffer::list::substr_of(ceph::buffer::list const&, 
> > unsigned int, unsigned int)+0x277) [0xa88747]
> >
> > 10: (ceph::buffer::list::write(int, int, std::ostream&) const+0x81) 
> > [0xa89541]
> >
> > 11: (operator<<(std::ostream&, OSDOp const&)+0x1f6) [0x717a16]
> >
> > 12: (MOSDOp::print(std::ostream&) const+0x172) [0x6e5e32]
> >
> > 13: (TrackedOp::dump(utime_t, ceph::Formatter*) const+0x223) 
> > [0x6b6483]
> >
> > 14: (OpTracker::dump_ops_in_flight(ceph::Formatter*)+0xa7) 
> > [0x6b7057]
> >
> > 15: (OSD::asok_command(std::string, std::map<std::string, 
> > boost::variant<std::string, bool, long, double, 
> > std::vector<std::string, std::allocator<std::string> >, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_>, std::less<std::string>, 
> > std::allocator<std::pair<std::string const, 
> > boost::variant<std::string, bool, long, double, 
> > std::vector<std::string, std::allocator<std::string> >, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_> > > >&, std::string,
> > std::ostream&)+0x1d7) [0x612cb7]
> >
> > 16: (OSDSocketHook::call(std::string, std::map<std::string, 
> > boost::variant<std::string, bool, long, double, 
> > std::vector<std::string, std::allocator<std::string> >, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_>, std::less<std::string>, 
> > std::allocator<std::pair<std::string const, 
> > boost::variant<std::string, bool, long, double, 
> > std::vector<std::string, std::allocator<std::string> >, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_, boost::detail::variant::void_, 
> > boost::detail::variant::void_> > > >&, std::string,
> > ceph::buffer::list&)+0x67) [0x67c8b7]
> >
> > 17: (AdminSocket::do_accept()+0x1007) [0xa79817]
> >
> > 18: (AdminSocket::entry()+0x258) [0xa7b448]
> >
> > 19: (()+0x7f6e) [0x7f72ae6def6e]
> >
> > 20: (clone()+0x6d) [0x7f72acccc9cd]
> >
> > NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
> > needed to interpret this.
> >
> >
> >
> > Steps to reproduce:
> >
> > -----------------------
> >
> >
> >
> > 1.       Run I/O.
> >
> > 2.       While I/O is running, run the following command continuously:
> >
> >
> >
> > "ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight"
> >
> >
> >
> > 3.       At some point the OSD crashes.
> >
> >
> >
> > I think I have found the root cause:
> >
> >
> >
> > 1.       OpTracker::RemoveOnDelete::operator() calls
> > op->_unregistered(), which clears out message->data() and the payload.
> >
> > 2.       After that, if optracking is enabled, we call
> > unregister_inflight_op(), which removes the op from the xlist.
> >
> > 3.       Meanwhile, while dumping ops, we call
> > _dump_op_descriptor_unlocked() from TrackedOp::dump, which tries to 
> > print the message.
> >
> > 4.       So there is a race: the dump can try to print a message
> > whose ops (data) field has already been cleared.
> >
> >
> >
> > A possible fix: when optracking is enabled, call op->_unregistered()
> > only after the op has been removed from the xlist.
> >
> >
> >
> > With this fix, I am not getting the crash anymore.
> >
> >
> >
> > If my observation is correct, please let me know. I will raise a bug 
> > and will fix that as part of the overall optracker performance 
> > improvement (I will submit that pull request soon).
> >
> >
> >
> > Thanks & Regards
> >
> > Somnath
> >
> >
> > ________________________________
> >
> > PLEASE NOTE: The information contained in this electronic mail 
> > message is intended only for the use of the designated recipient(s) 
> > named above. If the reader of this message is not the intended 
> > recipient, you are hereby notified that you have received this 
> > message in error and that any review, dissemination, distribution, 
> > or copying of this message is strictly prohibited. If you have 
> > received this communication in error, please notify the sender by 
> > telephone or e-mail (as shown above) immediately and destroy any and 
> > all copies of this message in your possession (whether hard copies or 
> > electronically stored copies).
> >
> 
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com