Hi,

Looks like one of my colleagues increased the PG count before the
rebalance had finished. I was flushing the whole cache tier and it's
currently stuck on ~80 GB of data because of the OSD crashes. I will
look into the hitset counts and check what can be done, and will
provide an update if I find anything or fix the issue.
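
For reference, the hitset settings I'll be checking on the cache pool
can be read back with the normal pool gets, something like this (the
pool name is just a placeholder):

    ceph osd pool get <cache-pool> hit_set_count
    ceph osd pool get <cache-pool> hit_set_period
    ceph osd pool get <cache-pool> hit_set_type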


On Wed, Nov 23, 2016 at 12:04 PM, Nick Fisk <[email protected]> wrote:
> Hi Daznis,
>
> I'm not sure how much help I can be, but I will try my best.
>
> I think the post-split stats error is probably benign, although it
> suggests you also increased the number of PGs in your cache pool? If
> so, did you do this before or after you added the extra OSDs? This
> may have been the cause.
>
> On to the actual assert: this looks like it's part of the code which
> trims the tiering hit sets. I don't understand why it's crashing out,
> but I would imagine it must be related to an invalid or missing
> hitset.
>
> https://github.com/ceph/ceph/blob/v0.94.9/src/osd/ReplicatedPG.cc#L10485
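>
> Roughly (paraphrasing from memory rather than quoting the Hammer
> source exactly), the trim loop does something like:
>
>   for (unsigned num = info.hit_set.history.size(); num > max; --num) {
>     list<pg_hit_set_info_t>::iterator p = info.hit_set.history.begin();
>     hobject_t oid = get_hit_set_archive_object(p->begin, p->end);
>     ObjectContextRef obc = get_object_context(oid, false);
>     assert(obc);  // <-- the assert that is failing for you
>     // ... remove the archive object and drop it from the history ...
>   }
>
> i.e. it looks up the object context for each hitset archive object it
> wants to trim and asserts that it exists, which would explain the
> crash if one of those objects has gone missing.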
>
> The only thing I could think of from looking at the code is that the
> function loops through all hitsets above the maximum number
> (hit_set_count). I wonder if setting this number higher would mean it
> won't try to trim any hitsets and would let things recover?
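>
> If you want to try that, it would just be the normal pool setting,
> something along the lines of (pool name and value are placeholders;
> pick something comfortably above your current hit_set_count):
>
>   ceph osd pool set <cache-pool> hit_set_count 32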
>
> DISCLAIMER
> This is a hunch; it might not work and could possibly even make things
> worse. Otherwise, wait for someone who has a better idea to comment.
>
> Nick
>
>
>
>> -----Original Message-----
>> From: ceph-users [mailto:[email protected]] On Behalf Of 
>> Daznis
>> Sent: 23 November 2016 05:57
>> To: ceph-users <[email protected]>
>> Subject: [ceph-users] Ceph strange issue after adding a cache OSD.
>>
>> Hello,
>>
>>
>> The story goes like this.
>> I have added another 3 drives to the caching layer. The OSDs were
>> added to the crush map one by one after each successful rebalance.
>> When I added the last OSD and came back after about an hour, I
>> noticed it still had not finished rebalancing. Further investigation
>> showed that one of the older cache SSD OSDs was restarting like crazy
>> before it could fully boot, so I shut it down and waited for a
>> rebalance without that OSD. Less than an hour later I had another 2
>> OSDs restarting like crazy. I tried running scrubs on the PGs the
>> logs asked me to, but that did not help. I'm currently stuck with
>> "8 scrub errors" and a completely dead cluster.
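>>
>> (For the record, each OSD was added with the usual crush add step,
>> roughly:
>>
>>   ceph osd crush add osd.<id> <weight> host=<cache-node>
>>
>> with the id, weight and host being placeholders here, and then left
>> to finish rebalancing before the next one went in.)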
>>
>> log_channel(cluster) log [WRN] : pg 15.8d has invalid (post-split) stats; 
>> must scrub before tier agent can activate
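>>
>> The scrubs I ran were just the manual per-PG ones, e.g. for the PG in
>> that warning:
>>
>>   ceph pg scrub 15.8d
>>   ceph pg deep-scrub 15.8d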
>>
>>
>> I need help to stop the OSDs from crashing. Crash log:
>>      0> 2016-11-23 06:41:43.365602 7f935b4eb700 -1
>> osd/ReplicatedPG.cc: In function 'void
>> ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)'
>> thread 7f935b4eb700 time 2016-11-23 06:41:43.363067
>> osd/ReplicatedPG.cc: 10521: FAILED assert(obc)
>>
>>  ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x85) [0xbde2c5]
>>  2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
>> int)+0x75f) [0x87e89f]
>>  3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f8bb]
>>  4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a) [0x8a11aa]
>>  5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&,
>> ThreadPool::TPHandle&)+0x68a) [0x83c37a]
>>  6: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
>> std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x69af05]
>>  7: (OSD::ShardedOpWQ::_process(unsigned int,
>> ceph::heartbeat_handle_d*)+0x333) [0x69b473]
>>  8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) 
>> [0xbcd9cf]
>>  9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcfb00]
>>  10: (()+0x7dc5) [0x7f93b9df4dc5]
>>  11: (clone()+0x6d) [0x7f93b88d5ced]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
>> interpret this.
>>
>>
>> I have tried looking at the logs with full debug enabled, but they
>> didn't help me much. I have tried to evict the cache layer, but some
>> objects are stuck and can't be removed. Any suggestions would be
>> greatly appreciated.
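>>
>> (The eviction attempt was the standard rados call, something like:
>>
>>   rados -p <cache-pool> cache-flush-evict-all
>>
>> with <cache-pool> being our cache tier pool.)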
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
