We trashed one OSD and started backfilling it. After about 90 minutes it 
started crashing again:

I found http://tracker.ceph.com/issues/6101 
<http://tracker.ceph.com/issues/6101>
We'll disable snap trimming so it at least runs, but could someone suggest what 
the root cause is? Can OSD get backfilledl inconsistently from another 
(possibly also bad?) OSD? But then I would expect both OSDs to crash at the 
same time when trimming the same snap?


1111 2015-09-25 14:06:44.195039 2c9b65ed700 -1 osd.14 pg_epoch: 1216281 
pg[4.3d85( v 1216271'13915838 (1209157'13912606,1216271'13915838] 
local-les=1216278 n=241 ec=3 les/c 1216278/1216278 1215709/1216277/1216277) 
[14,36,59] r=     0 lpr=1216277 mlcod 0'0 active+clean 
snaptrimq=[857c0~1,857c3~1,857cc~1,857d0~1,857d3~1,857d8~1,857da~1,857dd~1,857e2~1,857e9~1,857ed~1,857f1~1,857f4~1,857fc~1,857fe~1,85800~1,85803~1,8580a~1,8580c~1,8580e~1,85811~1,85813~
     
1,85817~1,8581a~1,8581c~1,8581f~1,85821~1,85823~1,8582a~1,8582c~1,8582f~1,85838~1,8583a~1,8583d~1,85843~1,85848~1,8584b~1,8584d~1,85851~1,85854~1,85856~1,85859~1,8585b~1,8585d~1,85862~1,85a2f~2,85a34~1,85a43~16,85a5a~16,85
     
a71~1,85a73~2,85a76~6,85a7d~1,85a81~1,85a86~2,85a8a~1,85a8c~2,85a8f~2,85a93~f,85aa4~1,85aa9~2,85aac~1,85ab2~1,85ab5~3,85abc~1,85ac1~2]]
 trim_objectcould not find coid 
783dbd85/rbd_data.1a785181f15746a.00000000000238df/857c     0//4
1112 2015-09-25 14:06:44.266974 2c9b65ed700 -1 osd/ReplicatedPG.cc 
<http://replicatedpg.cc/>: In function 'ReplicatedPG::RepGather* 
ReplicatedPG::trim_object(const hobject_t&)' thread 2c9b65ed700 time 2015-09-25 
14:06:44.206977
1113 osd/ReplicatedPG.cc <http://replicatedpg.cc/>: 1510: FAILED assert(0)
1114 
1115  ceph version 0.67.11-82-ge5b6eea 
(e5b6eea91cc37434f78a987d2dd1d3edd4a23f3f)
1116  1: (ReplicatedPG::trim_object(hobject_t const&)+0x150) [0x6e8bd0]
1117  2: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim 
const&)+0x2e7) [0x6ee0d7]
1118  3: (boost::statechart::detail::reaction_result 
boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, 
ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::n     a, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na>, 
(boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost:
     :mpl::list<boost::statechart::custom_reaction<ReplicatedPG::SnapTrim>, 
boost::statechart::transition<ReplicatedPG::Reset, ReplicatedPG::NotTrimming, 
boost::statechart::detail::no_context<ReplicatedPG::Reset>, &(boost::stat     
echart::detail::no_context<ReplicatedPG::Reset>::no_function(ReplicatedPG::Reset
 const&))>, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,      
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, 
boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, 
ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_     ::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> 
>(boost::statechart::simple_state<ReplicatedPG::T     rimmingObjects, 
ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mp     l_::na, mpl_::na, 
mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>&, 
boost::statechart::event_base const&, void const*)+0x96) [0x740fa6]
1119  4: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, 
ReplicatedPG::NotTrimming, std::allocator<void>, 
boost::statechart::null_exception_translator>::process_queued_events()+0x137) 
[0x71bdf7]
1120  5: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, 
ReplicatedPG::NotTrimming, std::allocator<void>, 
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
 const&)+0x26) [0x     71cfe6]
1121  6: (ReplicatedPG::snap_trimmer()+0x4ed) [0x6b59ad]
1122  7: (OSD::SnapTrimWQ::_process(PG*)+0x14) [0x790c54]
1123  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68c) [0x9a69cc]
1124  9: (ThreadPool::WorkThread::entry()+0x10) [0x9a7c20]
1125  10: (()+0x7e9a) [0x2ca012a8e9a]
1126  11: (clone()+0x6d) [0x2c9ff81d38d]
1127  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed 
to interpret this.
1128 
1129 --- begin dump of recent events ---
1130    -47> 2015-09-25 13:30:40.023697 2ca018cd780  5 asok(0x3005650) 
register_command perfcounters_dump hook 0x3006d90
1131    -46> 2015-09-25 13:30:40.023726 2ca018cd780  5 asok(0x3005650) 
register_command 1 hook 0x3006d90
1132    -45> 2015-09-25 13:30:40.023731 2ca018cd780  5 asok(0x3005650) 
register_command perf dump hook 0x3006d90
1133    -44> 2015-09-25 13:30:40.023743 2ca018cd780  5 asok(0x3005650) 
register_command perfcounters_schema hook 0x3006d90
1134    -43> 2015-09-25 13:30:40.023749 2ca018cd780  5 asok(0x3005650) 
register_command 2 hook 0x3006d90
1135    -42> 2015-09-25 13:30:40.023751 2ca018cd780  5 asok(0x3005650) 
register_command perf schema hook 0x3006d90
1136    -41> 2015-09-25 13:30:40.023756 2ca018cd780  5 asok(0x3005650) 
register_command config show hook 0x3006d90
1137    -40> 2015-09-25 13:30:40.023763 2ca018cd780  5 asok(0x3005650) 
register_command config set hook 0x3006d90
1138    -39> 2015-09-25 13:30:40.023766 2ca018cd780  5 asok(0x3005650) 
register_command config get hook 0x3006d90
1139    -38> 2015-09-25 13:30:40.023770 2ca018cd780  5 asok(0x3005650) 
register_command log flush hook 0x3006d90
1140    -37> 2015-09-25 13:30:40.023773 2ca018cd780  5 asok(0x3005650) 
register_command log dump hook 0x3006d90
1141    -36> 2015-09-25 13:30:40.023777 2ca018cd780  5 asok(0x3005650) 
register_command log reopen hook 0x3006d90
1142    -35> 2015-09-25 13:30:40.025429 2ca018cd780  0 ceph version 
0.67.11-82-ge5b6eea (e5b6eea91cc37434f78a987d2dd1d3edd4a23f3f), process 
ceph-osd, pid 3251242



> On 25 Sep 2015, at 13:42, Jan Schermer <[email protected]> wrote:
> 
> Hi,
> we recently migrated some of our nodes to Ubuntu 12, which helped everything 
> quite a bit.
> 
> But we hit a snag where the upstart initscript would not set the file open 
> ulimit correctly and some OSDs ran out of fds.
> 
> Some roblems manifested since then on the node where this happened such as 
> scrub errors (which were corrected - don't ask me how, I was sleeping :-))  - 
> but not two of the OSDs on this node started failing with SIGABORT:
> 
> 2015-09-25 10:45:17.860461 361ea17e700 -1 osd.12 pg_epoch: 1209679 pg[4.3d85( 
> v 1209679'13913518 (1209090'13910508,1209679'13913518] local-les=1209679 
> n=235 ec=3 les/c 1209679/1209679 1209678/1209678/1209678) [12,36,59] r=0 
> lpr=1209678 mlcod 1209656'13913517 active+clean 
> snaptrimq=[857c0~1,857f4~1,85a43~2]] trim_objectcould not find coid 
> 783dbd85/rbd_data.1a785181f15746a.00000000000238df/857c0//4
> 2015-09-25 10:45:17.862019 361ea17e700 -1 osd/ReplicatedPG.cc 
> <http://replicatedpg.cc/>: In function 'ReplicatedPG::RepGather* 
> ReplicatedPG::trim_object(const hobject_t&)' thread 361ea17e700 time 
> 2015-09-25 10:45:17.860501
> osd/ReplicatedPG.cc <http://replicatedpg.cc/>: 1510: FAILED assert(0)
> 
>  ceph version 0.67.11-82-ge5b6eea (e5b6eea91cc37434f78a987d2dd1d3edd4a23f3f)
>  1: (ReplicatedPG::trim_object(hobject_t const&)+0x150) [0x6e8bd0]
>  2: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim 
> const&)+0x2e7) [0x6ee0d7]
>  3: (boost::statechart::detail::reaction_result 
> boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, 
> ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, 
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
> mpl_::na, mpl_::na, mpl_::na>, 
> (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list<boost::statechart::custom_reaction<ReplicatedPG::SnapTrim>,
>  boost::statechart::transition<ReplicatedPG::Reset, 
> ReplicatedPG::NotTrimming, 
> boost::statechart::detail::no_context<ReplicatedPG::Reset>, 
> &(boost::statechart::detail::no_context<ReplicatedPG::Reset>::no_function(ReplicatedPG::Reset
>  const&))>, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, 
> boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, 
> ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, 
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
> mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> 
> >(boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, 
> ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, 
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
> mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>&, 
> boost::statechart::event_base const&, void const*)+0x96) [0x740fa6]
>  4: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, 
> ReplicatedPG::NotTrimming, std::allocator<void>, 
> boost::statechart::null_exception_translator>::process_queued_events()+0x137) 
> [0x71bdf7]
>  5: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, 
> ReplicatedPG::NotTrimming, std::allocator<void>, 
> boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
>  const&)+0x26) [0x71cfe6]
>  6: (ReplicatedPG::snap_trimmer()+0x4ed) [0x6b59ad]
>  7: (OSD::SnapTrimWQ::_process(PG*)+0x14) [0x790c54]
>  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68c) [0x9a69cc]
>  9: (ThreadPool::WorkThread::entry()+0x10) [0x9a7c20]
>  10: (()+0x7e9a) [0x36258121e9a]
>  11: (clone()+0x6d) [0x3625669638d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
> interpret this.
> (full log available on request, file system will be thrashed shortly so tell 
> me if it's helpful to look for something in there)
> 
> Could this all be caused by the OSD running out of file descriptors? Is it 
> supposed to handle a problem like this (meaning both the assert that happened 
> now and the file descriptor limit) gracefuly? Or is it a known issue that 
> this could happen?
> The thing about upstart is it apparently keeps restarting the OSD, which 
> makes the problem even worse.
> 
> Luckily we caught this in time and it only happened on one node, so we are 
> thrashing all the OSDs here.
> 
> Looks like a problem that could hit anyone, and if it actually damages data 
> then it could be pretty bad and maybe worth looking into - tell me what more 
> is needed.
> 
> Config:
> XFS filesystem
> Ubuntu 12 with 3.14.37 kernel
> FIEMAP disabled
> Ceph Dumpling (0.67.11-82-ge5b6eea)
> 
> Thanks
> 
> Jan

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Reply via email to