Re: [ceph-users] OSDs crash after deleting unfound object in Luminous 12.2.8

2018-10-18 Thread Mike Lovell
re-adding the list.

i'm glad to hear you got things back to a working state. one thing you
might want to check is the hit_set_history in the pg data. if the missing
hit sets are no longer in the history, then it is probably safe to go back
to the normal builds. that is, until you have to mark another hit set
missing. :)  i think the code that removes the hit set from the pg data runs
before that assert, so it's possible it still removed it from the history.

mike

On Thu, Oct 18, 2018 at 9:11 AM Lawrence Smith <
lawrence.sm...@uni-muenster.de> wrote:

> Hi Mike,
>
> Thanks a bunch for your writeup, that was exactly the problem and
> solution! All I did was comment out the assert and add an if (obc) { } guard
> after it to make sure I don't run into a segfault, and now the cluster is healthy
> once again. I am not sure if ceph will register a mismatch in a byte count
> while scrubbing due to the missing object, but I don't think so.
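>
> (If I want to double check later, I suppose I can kick off a deep scrub on
> the affected pg once things settle, with the real pg id in place of the
> placeholder here:
>
>   ceph pg deep-scrub 5.1f
>
> and see whether it flags any inconsistency.)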
>
> Anyway, I just wanted to thank you for your help!
>
> Best wishes,
>
> Lawrence
>
> On 10/13/2018 02:00 AM, Mike Lovell wrote:
>
> what was the object name that you marked lost? was it one of the cache
> tier hit_sets?
>
> the trace you have does seem to be failing when the OSD is trying to
> remove a hit set that is no longer needed. i ran into a similar problem
> which might have been why that bug you listed was created. maybe providing
> what i have since discovered about hit sets might help.
>
> the hit sets are what the cache tier uses to know which objects have been
> accessed in a given period of time. these hit sets are then stored in the
> object store using an object name that is generated. for the version you're
> running, the code for that generation is at
> https://github.com/ceph/ceph/blob/v12.2.8/src/osd/PrimaryLogPG.cc#L12667.
> it's basically "hit_set_<pgid>_archive_<start>_<end>", where the start and
> end times are recorded in the hit set history. that hit set history is stored
> as part of the PG metadata. you can get a list of all of the hit sets the
> PG has by looking at 'ceph pg <pgid> query' and checking the
> ['info']['hit_set_history']['history'] array. each entry in that array has
> the information on each hit set for the PG and the times are what is used
> in generation of the object name. there should be one ceph object for each
> hit set listed in that array.
>
> if you told the cluster to mark one of the hit set objects as lost, it's
> possible the OSD cannot get that object and is hitting the assert(obc) near
> the end of PrimaryLogPG::hit_set_trim in the same source file referenced
> above. you can potentially verify this by a couple of methods. i think if you
> set debug_osd to 20, it should log a line saying something like
> "hit_set_trim removing hit_set_<pgid>_archive_<start>_<end>". if that name
> matches one of the ones you marked lost, then this is almost certainly the cause.
> you can also do a find on the OSD directory, if you're using file store,
> and look for the right file name. something like 'find
> /var/lib/ceph/osd/ceph-<id>/current/<pgid>_head -name
> hit\*set\*<pgid>\*archive\*' should work. include the \ to escape the * so
> bash doesn't interpret it. if you're using bluestore, i think you can use
> the ceph-objectstore-tool while the osd is stopped to get a list of
> objects. you'll probably want to only look in the .ceph-internal namespace
> since the hit sets are stored in that namespace.
>
> there are a couple potential ways to get around this. what we did when we
> had the problem was run a custom build of the ceph-osd where we commented
> out the assert(obc); line in hit_set_trim. that build was only run for long
> enough to get the cluster back online and then to flush and evict the
> entire cache, remove the cache, restart using the normal ceph builds, and
> then recreate the cache.
>
> the other options are things that i don't know for sure will work.
> if you're using file store, you might be able to just copy another hit set
> to the file name of the missing hit set object. this should be pretty
> benign and it's just going to remove the object in a moment anyway. also,
> i'm not entirely sure how to come up with what directory to put the object
> in if the osd has done any directory splitting. maybe someone on the list
> will know how to do this. there might be a way with the
> ceph-objectstore-tool to write in the object but i couldn't find one in my
> testing on hammer.
>
> the last option i can think of is that if you can completely stop any
> traffic to the pools in question, it's possible the OSDs won't crash.
> hit_set_trim doesn't appear to get called if there is no client traffic
> reaching the osds and the hit sets aren't being updated. if you can stop
> anything from using the pools in question and guarantee nothing will come
> in, then it might be possible to keep the OSDs up long enough to flush
> everything from the cache tier, remove it, and recreate it. this option
> seems like a long shot and i don't know for sure it'll work. it just seemed
> to me like the OSDs would stay up 

Re: [ceph-users] OSDs crash after deleting unfound object in Luminous 12.2.8

2018-10-12 Thread Mike Lovell
what was the object name that you marked lost? was it one of the cache tier
hit_sets?

the trace you have does seem to be failing when the OSD is trying to remove
a hit set that is no longer needed. i ran into a similar problem which
might have been why that bug you listed was created. maybe providing what i
have since discovered about hit sets might help.

the hit sets are what the cache tier uses to know which objects have been
accessed in a given period of time. these hit sets are then stored in the
object store using an object name that is generated. for the version you're
running, the code for that generation is at
https://github.com/ceph/ceph/blob/v12.2.8/src/osd/PrimaryLogPG.cc#L12667.
it's basically "hit_set_<pgid>_archive_<start>_<end>", where the start and
end times are recorded in the hit set history. that hit set history is stored
as part of the PG metadata. you can get a list of all of the hit sets the
PG has by looking at 'ceph pg <pgid> query' and checking the
['info']['hit_set_history']['history'] array. each entry in that array has
the information on each hit set for the PG and the times are what is used
in generation of the object name. there should be one ceph object for each
hit set listed in that array.
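
for example, something like this should dump that array (assuming you have jq
handy; 5.1f is just a placeholder pg id):

  ceph pg 5.1f query | jq '.info.hit_set_history.history'

the start and end times in each entry are what end up in the archive object
names, so you can line them up against the hit set objects you expect to exist.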

if you told the cluster to mark one of the hit set objects as lost, it's
possible the OSD cannot get that object and is hitting the assert(obc) near
the end of PrimaryLogPG::hit_set_trim in the same source file referenced
above. you can potentially verify this by a couple of methods. i think if you
set debug_osd to 20, it should log a line saying something like
"hit_set_trim removing hit_set_<pgid>_archive_<start>_<end>". if that name
matches one of the ones you marked lost, then this is almost certainly the cause.
you can also do a find on the OSD directory, if you're using file store,
and look for the right file name. something like 'find
/var/lib/ceph/osd/ceph-<id>/current/<pgid>_head -name
hit\*set\*<pgid>\*archive\*' should work. include the \ to escape the * so
bash doesn't interpret it. if you're using bluestore, i think you can use
the ceph-objectstore-tool while the osd is stopped to get a list of
objects. you'll probably want to only look in the .ceph-internal namespace
since the hit sets are stored in that namespace.
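
a rough sketch of those checks, with osd id 12, pg 5.1f, and the paths being
placeholders for whatever your cluster actually has:

  # bump the log level on the running osd
  ceph tell osd.12 injectargs '--debug_osd 20'

  # filestore: look for the hit set archive objects on disk
  find /var/lib/ceph/osd/ceph-12/current/5.1f_head -name 'hit*set*archive*'

  # bluestore (with the osd stopped): list the pg's objects and grep for hit sets
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
      --pgid 5.1f --op list | grep hit_set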

there are a couple potential ways to get around this. what we did when we
had the problem was run a custom build of the ceph-osd where we commented
out the assert(obc); line in hit_set_trim. that build was only run for long
enough to get the cluster back online and then to flush and evict the
entire cache, remove the cache, restart using the normal ceph builds, and
then recreate the cache.
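
for reference, the flush/evict and cache removal part was just the standard
cache tiering commands, roughly like the following (pool names here are made
up, and you'd want to check the cache tiering docs for your release, including
switching the cache mode off before removing the overlay):

  rados -p cachepool cache-flush-evict-all
  ceph osd tier remove-overlay basepool
  ceph osd tier remove basepool cachepool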

the other options are things that i don't know for sure will work. if
you're using file store, you might be able to just copy another hit set to
the file name of the missing hit set object. this should be pretty benign
and it's just going to remove the object in a moment anyway. also, i'm not
entirely sure how to come up with what directory to put the object in if
the osd has done any directory splitting. maybe someone on the list will
know how to do this. there might be a way with the ceph-objectstore-tool to
write in the object but i couldn't find one in my testing on hammer.

the last option i can think of is that if you can completely stop any
traffic to the pools in question, it's possible the OSDs won't crash.
hit_set_trim doesn't appear to get called if there is no client traffic
reaching the osds and the hit sets aren't being updated. if you can stop
anything from using the pools in question and guarantee nothing will come
in, then it might be possible to keep the OSDs up long enough to flush
everything from the cache tier, remove it, and recreate it. this option
seems like a long shot and i don't know for sure it'll work. it just seemed
to me like the OSDs would stay up in a similar scenario on my hammer test
cluster. it's possible things have changed in luminous and hit_set_trim gets
called more often. i also didn't test whether the process of flushing and
evicting the objects in the cache caused hit_set_trim to get called.

hopefully that gives you some more info on what might be going on and ways
around it. i'm not entirely sure why the assert(obc); is still in
hit_set_trim. there was a little bit of discussion about removing it,
since hitting it means the object it's trying to remove is already gone
anyway. i think that discussion only happened briefly in irc, and i guess
the change never went anywhere because no one followed up on it.

good luck and hopefully you don't blame me if things get worse. :)
mike

On Fri, Oct 12, 2018 at 7:34 AM Lawrence Smith <
lawrence.sm...@uni-muenster.de> wrote:

> Hi all,
>
> we are running a luminous 12.2.8 cluster with a 3 fold replicated cache
> pool with a min_size of 2. We recently encountered an "object unfound"
> error in one of our pgs in this pool. After marking this object lost,
> two of the acting osds crashed and were unable to start up again, with
> only the primary osd staying up. Hoping the cluster might 

Re: [ceph-users] OSDs crash after deleting unfound object in Luminous 12.2.8

2018-10-12 Thread Paul Emmerich
It would be helpful to have a full crash log with debug osd = 0/20 and
information about which pool and pg you marked the object as lost.
You might be able to use ceph-objectstore-tool to remove the bad
object from the OSD if it still exists in either the cache pool or
underlying pool.
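
For example, very much a sketch with placeholder paths, pg id, and object
name, and only with the OSD stopped (on filestore you may also need
--journal-path):

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
      --pgid 5.1f '<object name>' remove

You can take the exact object spec to pass in from a prior run with
'--op list'.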

Ugly fix if that doesn't work: patch the code to just ignore the missing
object instead of asserting. Only do that after verifying that it's actually
crashing on the object you deleted; ideally you'd also check the
object name and only skip the assert for the known bad object. Also, no
guarantee that this is safe to do, but I think it should be safe here
(after a very short look at the crashing code).

I also once had to hardcode the name of a badly corrupted object into
an OSD so it would ignore it, to prevent a crash of the OSD holding the last
surviving copy of a PG, fun times... (ceph-objectstore-tool wouldn't even
recognize that an object with that name existed in my case)


Paul
On Fri, Oct 12, 2018 at 15:34 Lawrence Smith wrote:
>
> Hi all,
>
> we are running a luminous 12.2.8 cluster with a 3 fold replicated cache
> pool with a min_size of 2. We recently encountered an "object unfound"
> error in one of our pgs in this pool. After marking this object lost,
> two of the acting osds crashed and were unable to start up again, with
> only the primary osd staying up. Hoping the cluster might remap the
> copies of this pg, we marked the two crashed osds as out. Now the
> primary osd of this pg has also gone down leaving again only one active
> osd with the cluster reporting a degraded filesystem. All the affected
> osds are running filestore, while about half the cluster has already
> been upgraded to run bluestore osds.
>
> All three of the crashed osds fail to restart, reporting the following
> error during startup:
>
> Oct 12 13:19:12 kaa-109 ceph-osd[166266]:  0> 2018-10-12
> 13:19:12.782652 7f1f2d79b700 -1
> /var/tmp/portage/sys-cluster/ceph-12.2.8/work/ceph-12.2.8/src/osd/PrimaryLogPG.cc:
> In function '
> void PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned
> int)' thread 7f1f2d79b700 time 2018-10-12 13:19:12.779813
> /var/tmp/portage/sys-cluster/ceph-12.2.8/work/ceph-12.2.8/src/osd/PrimaryLogPG.cc:
> 12985: FAILED assert(obc)
>
>   ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0)
> luminous (stable)
>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x109) [0x562265bfda9c]
>   2:
> (PrimaryLogPG::hit_set_trim(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >&, unsigned int)+0x942)
> [0x5622657d6cea]
>   3: (PrimaryLogPG::hit_set_persist()+0xa4b) [0x5622657e5fab]
>   4: (PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x426a)
> [0x562265800c64]
>   5: (PrimaryLogPG::do_request(boost::intrusive_ptr&,
> ThreadPool::TPHandle&)+0xc1f) [0x5622657b94ed]
>   6: (OSD::dequeue_op(boost::intrusive_ptr,
> boost::intrusive_ptr, ThreadPool::TPHandle&)+0x396)
> [0x562265655cf8]
>   7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr
> const&)+0x5a) [0x5622658c09a6]
>   8: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x1ab6) [0x562265657918]
>   9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5cd)
> [0x562265c026f5]
>   10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x562265c05e72]
>   11: (()+0x751e) [0x7f1f4fd7f51e]
>   12: (clone()+0x3f) [0x7f1f4ed7ef0f]
>
> A search in the bug tracker revealed that a similar error has been
> resolved for jewel http://tracker.ceph.com/issues/19223, yet I don't
> know if this is in any way relevant.
>
> We are currently at a loss how to get these osds back up. Any
> suggestions how to approach this would be very welcome. If there is any
> further information that is needed or additional context please let me know.
>
> Thanks,
>
> Lawrence
>
>



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


[ceph-users] OSDs crash after deleting unfound object in Luminous 12.2.8

2018-10-12 Thread Lawrence Smith

Hi all,

we are running a luminous 12.2.8 cluster with a 3 fold replicated cache 
pool with a min_size of 2. We recently encountered an "object unfound" 
error in one of our pgs in this pool. After marking this object lost, 
two of the acting osds crashed and were unable to start up again, with 
only the primary osd staying up. Hoping the cluster might remap the 
copies of this pg, we marked the two crashed osds as out. Now the 
primary osd of this pg has also gone down leaving again only one active 
osd with the cluster reporting a degraded filesystem. All the affected 
osds are running filestore, while about half the cluster has already 
been upgraded to run bluestore osds.


All three of the crashed osds fail to restart, reporting the following 
error during startup:


Oct 12 13:19:12 kaa-109 ceph-osd[166266]:  0> 2018-10-12 
13:19:12.782652 7f1f2d79b700 -1 
/var/tmp/portage/sys-cluster/ceph-12.2.8/work/ceph-12.2.8/src/osd/PrimaryLogPG.cc: 
In function '
void PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned 
int)' thread 7f1f2d79b700 time 2018-10-12 13:19:12.779813
/var/tmp/portage/sys-cluster/ceph-12.2.8/work/ceph-12.2.8/src/osd/PrimaryLogPG.cc: 
12985: FAILED assert(obc)


 ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) 
luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x109) [0x562265bfda9c]
 2: 
(PrimaryLogPG::hit_set_trim(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >&, unsigned int)+0x942) 
[0x5622657d6cea]

 3: (PrimaryLogPG::hit_set_persist()+0xa4b) [0x5622657e5fab]
 4: (PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x426a) 
[0x562265800c64]
 5: (PrimaryLogPG::do_request(boost::intrusive_ptr&, 
ThreadPool::TPHandle&)+0xc1f) [0x5622657b94ed]
 6: (OSD::dequeue_op(boost::intrusive_ptr, 
boost::intrusive_ptr, ThreadPool::TPHandle&)+0x396) 
[0x562265655cf8]
 7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr 
const&)+0x5a) [0x5622658c09a6]
 8: (OSD::ShardedOpWQ::_process(unsigned int, 
ceph::heartbeat_handle_d*)+0x1ab6) [0x562265657918]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5cd) 
[0x562265c026f5]

 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x562265c05e72]
 11: (()+0x751e) [0x7f1f4fd7f51e]
 12: (clone()+0x3f) [0x7f1f4ed7ef0f]

A search in the bug tracker revealed that a similar error has been 
resolved for jewel http://tracker.ceph.com/issues/19223, yet I don't 
know if this is in any way relevant.


We are currently at a loss how to get these osds back up. Any 
suggestions how to approach this would be very welcome. If there is any 
further information that is needed or additional context please let me know.


Thanks,

Lawrence

