Thank you. That helped quite a lot. Now I'm just stuck with one OSD
crashing with:

osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*,
spg_t, epoch_t*, ceph::bufferlist*)' thread 7f36bbdd6880 time
2016-11-23 13:42:43.278539
osd/PG.cc: 2911: FAILED assert(r > 0)

 ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0xbde2c5]
 2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
ceph::buffer::list*)+0x8ba) [0x7cf4da]
 3: (OSD::load_pgs()+0x9ef) [0x6bd31f]
 4: (OSD::init()+0x181a) [0x6c0e8a]
 5: (main()+0x29dd) [0x6484bd]
 6: (__libc_start_main()+0xf5) [0x7f36b916bb15]
 7: /usr/bin/ceph-osd() [0x661ea9]
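
In case it's useful for narrowing this down, I'm assuming I can list
which PGs that OSD holds with ceph-objectstore-tool (the id and paths
below are placeholders for my setup):

    # with the OSD stopped
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
        --journal-path /var/lib/ceph/osd/ceph-<id>/journal --op list-pgs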

On Wed, Nov 23, 2016 at 12:31 PM, Nick Fisk <[email protected]> wrote:
>> -----Original Message-----
>> From: Daznis [mailto:[email protected]]
>> Sent: 23 November 2016 10:17
>> To: [email protected]
>> Cc: ceph-users <[email protected]>
>> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD.
>>
>> Hi,
>>
>>
>> It looks like one of my colleagues increased the PG count before the
>> rebalance finished. I was flushing the whole cache tier and it's
>> currently stuck at ~80 GB of data because of the OSD crashes. I will
>> look into the hitset counts and check what can be done, and will
>> provide an update if I find anything or fix the issue.
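>>
>> For reference, I'm flushing roughly like this (<cache-pool> stands in
>> for our cache pool's name):
>>
>>     rados -p <cache-pool> cache-flush-evict-all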
>
> So I'm guessing that when the PGs split, the stats/hit_sets were not
> in the state the OSD expected, which caused the crash. I would expect
> this was caused by the PG splitting rather than by introducing the
> extra OSDs. If you manage to get things stable by bumping up the
> hitset count, then you probably want to run a scrub to clean up the
> stats, which may then stop this happening the next time a hitset comes
> round to being trimmed.
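>
> Something like this should kick the scrub off, using the pg from your
> warning as an example (adjust for whichever PGs report invalid stats):
>
>     ceph pg scrub 15.8d
>     # or, if a plain scrub doesn't clean up the stats:
>     ceph pg deep-scrub 15.8d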
>
>>
>>
>> On Wed, Nov 23, 2016 at 12:04 PM, Nick Fisk <[email protected]> wrote:
>> > Hi Daznis,
>> >
>> > I'm not sure how much help I can be, but I will try my best.
>> >
>> > I think the post-split stats error is probably benign, although it
>> > suggests you also increased the number of PGs in your cache pool? If
>> > so, did you do this before or after you added the extra OSDs? This
>> > may have been the cause.
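>> >
>> > (To confirm, something like this should show the current PG counts
>> > for the cache pool; the pool name is a placeholder:
>> >
>> >     ceph osd pool get <cache-pool> pg_num
>> >     ceph osd pool get <cache-pool> pgp_num
>> > )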
>> >
>> > On to the actual assert: this looks like it's part of the code that
>> > trims the tiering hitsets. I don't understand why it's crashing out,
>> > but I would imagine it must be related to an invalid or missing
>> > hitset.
>> >
>> > https://github.com/ceph/ceph/blob/v0.94.9/src/osd/ReplicatedPG.cc#L10485
>> >
>> > The only thing I could think of from looking at the code is that the
>> > function loops through all hitsets above the max number
>> > (hit_set_count). I wonder if setting this number higher would mean it
>> > won't try to trim any hitsets, and would let things recover?
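>> >
>> > Something along these lines, if you want to try it (the pool name is
>> > a placeholder, and 32 is just an arbitrary value higher than your
>> > current setting; check that first):
>> >
>> >     ceph osd pool get <cache-pool> hit_set_count
>> >     ceph osd pool set <cache-pool> hit_set_count 32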
>> >
>> > DISCLAIMER
>> > This is a hunch, it might not work or could possibly even make things
>> > worse. Otherwise wait for someone who has a better idea to comment.
>> >
>> > Nick
>> >
>> >
>> >
>> >> -----Original Message-----
>> >> From: ceph-users [mailto:[email protected]] On Behalf
>> >> Of Daznis
>> >> Sent: 23 November 2016 05:57
>> >> To: ceph-users <[email protected]>
>> >> Subject: [ceph-users] Ceph strange issue after adding a cache OSD.
>> >>
>> >> Hello,
>> >>
>> >>
>> >> The story goes like this.
>> >> I have added another 3 drives to the caching layer. OSDs were added
>> >> to the crush map one by one after each successful rebalance. When I
>> >> added the last OSD and went away for about an hour, I noticed that
>> >> it still hadn't finished rebalancing. Further investigation showed
>> >> me that one of the older cache SSDs was restarting like crazy before
>> >> it could fully boot. So I shut it down and waited for a rebalance
>> >> without that OSD. Less than an hour later I had another 2 OSDs
>> >> restarting like crazy. I tried running scrubs on the PGs the logs
>> >> asked me to, but that did not help. I'm currently stuck with "8
>> >> scrub errors" and a completely dead cluster.
>> >>
>> >> log_channel(cluster) log [WRN] : pg 15.8d has invalid (post-split)
>> >> stats; must scrub before tier agent can activate
>> >>
>> >>
>> >> I need help stopping the OSD from crashing. Crash log:
>> >>      0> 2016-11-23 06:41:43.365602 7f935b4eb700 -1
>> >> osd/ReplicatedPG.cc: In function 'void
>> >> ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)'
>> >> thread 7f935b4eb700 time 2016-11-23 06:41:43.363067
>> >> osd/ReplicatedPG.cc: 10521: FAILED assert(obc)
>> >>
>> >>  ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
>> >>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> >> const*)+0x85) [0xbde2c5]
>> >>  2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
>> >> int)+0x75f) [0x87e89f]
>> >>  3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f8bb]
>> >>  4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a)
>> >> [0x8a11aa]
>> >>  5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&,
>> >> ThreadPool::TPHandle&)+0x68a) [0x83c37a]
>> >>  6: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
>> >> std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405)
>> >> [0x69af05]
>> >>  7: (OSD::ShardedOpWQ::_process(unsigned int,
>> >> ceph::heartbeat_handle_d*)+0x333) [0x69b473]
>> >>  8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f)
>> >> [0xbcd9cf]
>> >>  9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcfb00]
>> >>  10: (()+0x7dc5) [0x7f93b9df4dc5]
>> >>  11: (clone()+0x6d) [0x7f93b88d5ced]
>> >>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed 
>> >> to interpret this.
>> >>
>> >>
>> >> I have tried looking with full debug enabled, but those logs didn't
>> >> help me much. I have tried to evict the cache layer, but some
>> >> objects are stuck and can't be removed. Any suggestions would be
>> >> greatly appreciated.
>> >
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
