Hi Daznis,

I'm not sure how much help I can be, but I will try my best.

I think the post-split stats error is probably benign, although it suggests you 
also increased the number of PGs in your cache pool? If so, did you do this 
before or after you added the extra OSDs? That may have been the cause.
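
If you're not sure, you can quickly confirm the current PG counts on the cache 
pool with something like this (the pool name below is just a placeholder for 
your actual cache pool):

    ceph osd pool get <cache-pool> pg_num
    ceph osd pool get <cache-pool> pgp_num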

On to the actual assert: this looks like it's part of the code which trims the 
tiering hit sets. I don't understand why it's crashing out, but I would imagine 
it must be related to an invalid or missing hitset.

https://github.com/ceph/ceph/blob/v0.94.9/src/osd/ReplicatedPG.cc#L10485
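
If you want to sanity-check that theory, it might be worth seeing whether the 
hitset archive objects actually exist in the cache pool. Something along these 
lines should list them (pool name is a placeholder, and I'm not 100% sure the 
archive objects show up in a plain listing, as they may sit in a separate RADOS 
namespace, but it's quick to try):

    rados -p <cache-pool> ls | grep hit_set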

The only thing I could think of from looking at the code is that the function 
loops through all hitsets that are above the max number (hit_set_count). I 
wonder whether setting this number higher would mean it won't try to trim any 
hitsets, and would let things recover?
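
If you want to give that a go, something like this should do it (pool name is a 
placeholder, and I'd note down the current value first so you can set it back 
afterwards):

    ceph osd pool get <cache-pool> hit_set_count
    ceph osd pool set <cache-pool> hit_set_count <something higher than current>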

DISCLAIMER
This is a hunch; it might not work and could possibly even make things worse. 
Otherwise, wait for someone who has a better idea to comment.

Nick



> -----Original Message-----
> From: ceph-users [mailto:[email protected]] On Behalf Of 
> Daznis
> Sent: 23 November 2016 05:57
> To: ceph-users <[email protected]>
> Subject: [ceph-users] Ceph strange issue after adding a cache OSD.
> 
> Hello,
> 
> 
> The story goes like this.
> I have added another 3 drives to the caching layer. OSDs were added to the 
> crush map one by one after each successful rebalance. When I added the last 
> OSD and went away for about an hour, I noticed that it still had not finished 
> rebalancing. Further investigation showed me that one of the older cache SSDs 
> was restarting like crazy before fully booting. So I shut it down and waited 
> for a rebalance without that OSD. Less than an hour later I had another 2 OSDs 
> restarting like crazy. I tried running scrubs on the PGs the logs asked me to, 
> but that did not help. I'm currently stuck with "8 scrub errors" and a 
> completely dead cluster.
> 
> log_channel(cluster) log [WRN] : pg 15.8d has invalid (post-split) stats; 
> must scrub before tier agent can activate
> 
> 
> I need help stopping the OSDs from crashing. Crash log:
>      0> 2016-11-23 06:41:43.365602 7f935b4eb700 -1
> osd/ReplicatedPG.cc: In function 'void
> ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)'
> thread 7f935b4eb700 time 2016-11-23 06:41:43.363067
> osd/ReplicatedPG.cc: 10521: FAILED assert(obc)
> 
>  ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x85) [0xbde2c5]
>  2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
> int)+0x75f) [0x87e89f]
>  3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f8bb]
>  4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a) [0x8a11aa]
>  5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&,
> ThreadPool::TPHandle&)+0x68a) [0x83c37a]
>  6: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x69af05]
>  7: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x333) [0x69b473]
>  8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) 
> [0xbcd9cf]
>  9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcfb00]
>  10: (()+0x7dc5) [0x7f93b9df4dc5]
>  11: (clone()+0x6d) [0x7f93b88d5ced]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
> interpret this.
> 
> 
> I have tried looking with full debug enabled, but those logs didn't help me 
> much. I have tried to evict the cache layer, but some
> objects are stuck and can't be removed. Any suggestions would be greatly 
> appreciated.

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
