It looks like the OSDs are failing to create threads: the traces all
assert in Thread::create() (common/Thread.cc:154).

Threads count against kernel.pid_max, so try raising it to 4194303 in
/etc/sysctl.conf.
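
For reference, here is a quick read-only way to see how close a node is to the limit before changing anything. It is only a sketch and assumes a Linux /proc filesystem; the commented commands at the end are the usual sysctl invocations for applying the change and need root.

```shell
#!/bin/sh
# Read-only checks; nothing below changes the system.

# Current ceiling on process/thread IDs.
pid_max=$(cat /proc/sys/kernel/pid_max)

# The fourth field of /proc/loadavg is "running/total" scheduling
# entities (processes plus threads); the part after the slash is the
# number of IDs currently in use.
threads=$(cut -d' ' -f4 /proc/loadavg | cut -d/ -f2)

echo "kernel.pid_max: $pid_max"
echo "threads in use: $threads"

# To raise the limit on the running system (as root):
#   sysctl -w kernel.pid_max=4194303
# To keep it across reboots, add this line to /etc/sysctl.conf
# and load it with `sysctl -p`:
#   kernel.pid_max = 4194303
```

If "threads in use" is close to kernel.pid_max while the OSDs are peering, the limit is the likely culprit.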

Cheers,
Brad

----- Original Message -----
> From: "Kenneth Waegeman" <kenneth.waege...@ugent.be>
> To: ceph-users@lists.ceph.com
> Sent: Tuesday, 8 December, 2015 10:45:11 PM
> Subject: [ceph-users] ceph new installation of ceph 0.9.2 issue and crashing osds
> 
> Hi,
> 
> I installed Ceph 9.2.0 on a new cluster of 3 nodes, with 50 OSDs on each
> node (300GB disks, 96GB RAM).
> 
> While installing, I ran into an issue where I could not even log in as
> the ceph user, so I increased some limits in security/limits.conf:
> 
> ceph            -       nproc           1048576
> ceph            -       nofile          1048576
> 
> I could then install the other OSDs.
> 
> After the cluster was installed, I added some extra pools. When creating
> the PGs of these pools, the OSDs of the cluster started to fail with
> stack traces. If I try to restart them, they keep failing. I don't
> know whether this is an actual bug in Infernalis or a limit that is
> still not high enough. I've increased the nproc and nofile entries even
> more, but no luck. Does anyone have a clue? Here are the stack traces I
> see:
> 
> Mostly this one:
> 
>     -12> 2015-12-08 10:17:18.995243 7fa9063c5700  5 osd.12 pg_epoch: 904
> pg[3.3b(unlocked)] enter Initial
>     -11> 2015-12-08 10:17:18.995279 7fa9063c5700  5 write_log with:
> dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615,
> dirty_divergent_priors: false, divergent_priors: 0, writeout_from:
> 4294967295'18446744073709551615, trimmed:
>     -10> 2015-12-08 10:17:18.995292 7fa9063c5700  5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive] exit Initial 0.000048
> 0 0.000000
>      -9> 2015-12-08 10:17:18.995301 7fa9063c5700  5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive] enter Reset
>      -8> 2015-12-08 10:17:18.995310 7fa9063c5700  5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] exit Reset 0.000008
> 1 0.000017
>      -7> 2015-12-08 10:17:18.995326 7fa9063c5700  5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] enter Started
>      -6> 2015-12-08 10:17:18.995332 7fa9063c5700  5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] enter Start
>      -5> 2015-12-08 10:17:18.995338 7fa9063c5700  1 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] state<Start>:
> transitioning to Primary
>      -4> 2015-12-08 10:17:18.995345 7fa9063c5700  5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] exit Start 0.000012
> 0 0.000000
>      -3> 2015-12-08 10:17:18.995352 7fa9063c5700  5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] enter
> Started/Primary
>      -2> 2015-12-08 10:17:18.995358 7fa9063c5700  5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 creating] enter
> Started/Primary/Peering
>      -1> 2015-12-08 10:17:18.995365 7fa9063c5700  5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 creating+peering] enter
> Started/Primary/Peering/GetInfo
>       0> 2015-12-08 10:17:18.998472 7fa9063c5700 -1 common/Thread.cc: In
> function 'void Thread::create(size_t)' thread 7fa9063c5700 time
> 2015-12-08 10:17:18.995438
> common/Thread.cc: 154: FAILED assert(ret == 0)
> 
>   ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x85) [0x7fa91924ebe5]
>   2: (Thread::create(unsigned long)+0x8a) [0x7fa91923325a]
>   3: (SimpleMessenger::connect_rank(entity_addr_t const&, int,
> PipeConnection*, Message*)+0x185) [0x7fa919229105]
>   4: (SimpleMessenger::get_connection(entity_inst_t const&)+0x3ba)
> [0x7fa9192298ea]
>   5: (OSDService::get_con_osd_cluster(int, unsigned int)+0x1ab)
> [0x7fa918c7318b]
>   6: (OSD::do_queries(std::map<int, std::map<spg_t, pg_query_t,
> std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > >,
> std::less<int>, std::allocator<std::pair<int const, std::map<spg_t,
> pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const,
> pg_query_t> > > > > >&, std::shared_ptr<OSDMap const>)+0x1f1)
> [0x7fa918c9b061]
>   7: (OSD::dispatch_context(PG::RecoveryCtx&, PG*,
> std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0x142)
> [0x7fa918cb5832]
>   8: (OSD::handle_pg_create(std::shared_ptr<OpRequest>)+0x133e)
> [0x7fa918cb820e]
>   9: (OSD::dispatch_op(std::shared_ptr<OpRequest>)+0x220) [0x7fa918cbc0c0]
>   10: (OSD::do_waiters()+0x1c2) [0x7fa918cbc382]
>   11: (OSD::ms_dispatch(Message*)+0x227) [0x7fa918cbd727]
>   12: (DispatchQueue::entry()+0x649) [0x7fa91930a939]
>   13: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fa91922eb1d]
>   14: (()+0x7df5) [0x7fa9172e3df5]
>   15: (clone()+0x6d) [0x7fa915b8c1ad]
>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
> 
> Also these:
> 
> --- begin dump of recent events ---
>     -13> 2015-12-08 10:17:19.033845 7f409fa08700  5 osd.15 pg_epoch: 903
> pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902)
> [15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] exit
> Started/Primary/Peering/GetInfo 2.225918 4 0.000124
>     -12> 2015-12-08 10:17:19.033874 7f409fa08700  5 osd.15 pg_epoch: 903
> pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902)
> [15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] enter
> Started/Primary/Peering/GetLog
>     -11> 2015-12-08 10:17:19.033920 7f409fa08700  5 osd.15 pg_epoch: 903
> pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902)
> [15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] exit
> Started/Primary/Peering/GetLog 0.000046 0 0.000000
>     -10> 2015-12-08 10:17:19.033936 7f409fa08700  5 osd.15 pg_epoch: 903
> pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902)
> [15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] enter
> Started/Primary/Peering/GetMissing
>      -9> 2015-12-08 10:17:19.033949 7f409fa08700  5 osd.15 pg_epoch: 903
> pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902)
> [15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] exit
> Started/Primary/Peering/GetMissing 0.000013 0 0.000000
>      -8> 2015-12-08 10:17:19.033962 7f409fa08700  5 osd.15 pg_epoch: 903
> pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902)
> [15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] exit
> Started/Primary/Peering 2.226044 0 0.000000
>      -7> 2015-12-08 10:17:19.033975 7f409fa08700  5 osd.15 pg_epoch: 903
> pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902)
> [15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating] enter
> Started/Primary/Active
>      -6> 2015-12-08 10:17:19.060423 7f40a4a12700  1 --
> 10.143.20.31:6863/8526 <== osd.94 10.143.20.32:0/13947 2 ====
> osd_ping(ping e903 stamp 2015-12-08 10:17:19.059261) v2 ==== 47+0+0
> (3897539321 0 0) 0x7f40bffab400 con 0x7f40c3baf8c0
>      -5> 2015-12-08 10:17:19.060447 7f40a4a12700  1 --
> 10.143.20.31:6863/8526 --> 10.143.20.32:0/13947 -- osd_ping(ping_reply
> e903 stamp 2015-12-08 10:17:19.059261) v2 -- ?+0 0x7f40c33f4000 con
> 0x7f40c3baf8c0
>      -4> 2015-12-08 10:17:19.060573 7f40a320f700  1 --
> 10.143.20.31:6862/8526 <== osd.94 10.143.20.32:0/13947 2 ====
> osd_ping(ping e903 stamp 2015-12-08 10:17:19.059261) v2 ==== 47+0+0
> (3897539321 0 0) 0x7f40bffab000 con 0x7f40c3bb1860
>      -3> 2015-12-08 10:17:19.069801 7f40a0a0a700 10 monclient: tick
>      -2> 2015-12-08 10:17:19.069814 7f40a0a0a700 10 monclient:
> _check_auth_rotating have uptodate secrets (they expire after 2015-12-08
> 10:16:49.069813)
>      -1> 2015-12-08 10:17:19.069820 7f40a0a0a700 10 monclient: renew
> subs? (now: 2015-12-08 10:17:19.069820; renew after: 2015-12-08
> 10:19:46.766797) -- no
>       0> 2015-12-08 10:17:19.121951 7f40a6215700 -1 *** Caught signal
> (Aborted) **
>   in thread 7f40a6215700
> 
>   ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>   1: (()+0x7e6ab2) [0x7f40bb7aeab2]
>   2: (()+0xf130) [0x7f40b9940130]
>   3: (gsignal()+0x37) [0x7f40b81205d7]
>   4: (abort()+0x148) [0x7f40b8121cc8]
>   5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f40b8a249b5]
>   6: (()+0x5e926) [0x7f40b8a22926]
>   7: (()+0x5e953) [0x7f40b8a22953]
>   8: (()+0x5eb73) [0x7f40b8a22b73]
>   9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x27a) [0x7f40bb8a3dda]
>   10: (Thread::create(unsigned long)+0x8a) [0x7f40bb88825a]
>   11: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0x7f40bb87df0f]
>   12: (Accepter::entry()+0x365) [0x7f40bb941155]
>   13: (()+0x7df5) [0x7f40b9938df5]
>   14: (clone()+0x6d) [0x7f40b81e11ad]
>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
> 
> 
>     -11> 2015-12-08 10:17:19.028810 7f7c7fe48700  5 -- op tracker --
> seq: 207, time: 2015-12-08 10:17:18.969336, event: dispatched, op:
> osd_pg_create(pg1.8d,902@2015-12-08 10:17:16.612111;
> pg1.a6,902@2015-12-08 10:17:16.612121; pg1.f0,902@2015-12-08 10:17:16.612151;
> pg1.1f5,902@2015-12-08 10:17:16.612249; pg1.232,902@2015-12-08 10:17:16.612272;
> pg1.250,902@2015-12-08 10:17:16.612290; pg1.27a,902@2015-12-08 10:17:16.612303;
> pg1.2f8,902@2015-12-08 10:17:16.612363; pg1.322,902@2015-12-08 10:17:16.612381;
> pg1.323,902@2015-12-08 10:17:16.612381; pg1.421,902@2015-12-08 10:17:16.612487;
> pg1.44d,902@2015-12-08 10:17:16.612505; pg1.55e,902@2015-12-08 10:17:16.612579;
> pg1.619,902@2015-12-08 10:17:16.612642; pg1.761,902@2015-12-08 10:17:16.612764;
> pg1.7f4,902@2015-12-08 10:17:16.612812; pg1.80a,902@2015-12-08 10:17:16.612820;
> pg1.90d,902@2015-12-08 10:17:16.612902; pg1.d81,902@2015-12-08 10:17:16.613179;
> pg1.dc3,902@2015-12-08 10:17:16.613198; pg1.df9,902@2015-12-08 10:17:16.613212;
> pg1.e21,902@2015-12-08 10:17:16.613222; pg1.f4a,902@2015-12-08 10:17:16.613299;
> pg1.fa8,902@2015-12-08 10:17:16.613322; pg2.61,903@2015-12-08 10:17:17.753976;
> pg2.aa,903@2015-12-08 10:17:17.754024; pg2.106,903@2015-12-08 10:17:17.754086;
> pg3.a6,904@2015-12-08 10:17:18.835306; pg3.113,904@2015-12-08 10:17:18.835370;
> pg3.121,904@2015-12-08 10:17:18.835379; pg3.123,904@2015-12-08 10:17:18.835380; )
>     -10> 2015-12-08 10:17:19.028871 7f7c7fe48700  5 -- op tracker --
> seq: 207, time: 2015-12-08 10:17:19.028871, event: wait for new map, op:
> osd_pg_create(pg1.8d,902@2015-12-08 10:17:16.612111;
> pg1.a6,902@2015-12-08 10:17:16.612121; pg1.f0,902@2015-12-08 10:17:16.612151;
> pg1.1f5,902@2015-12-08 10:17:16.612249; pg1.232,902@2015-12-08 10:17:16.612272;
> pg1.250,902@2015-12-08 10:17:16.612290; pg1.27a,902@2015-12-08 10:17:16.612303;
> pg1.2f8,902@2015-12-08 10:17:16.612363; pg1.322,902@2015-12-08 10:17:16.612381;
> pg1.323,902@2015-12-08 10:17:16.612381; pg1.421,902@2015-12-08 10:17:16.612487;
> pg1.44d,902@2015-12-08 10:17:16.612505; pg1.55e,902@2015-12-08 10:17:16.612579;
> pg1.619,902@2015-12-08 10:17:16.612642; pg1.761,902@2015-12-08 10:17:16.612764;
> pg1.7f4,902@2015-12-08 10:17:16.612812; pg1.80a,902@2015-12-08 10:17:16.612820;
> pg1.90d,902@2015-12-08 10:17:16.612902; pg1.d81,902@2015-12-08 10:17:16.613179;
> pg1.dc3,902@2015-12-08 10:17:16.613198; pg1.df9,902@2015-12-08 10:17:16.613212;
> pg1.e21,902@2015-12-08 10:17:16.613222; pg1.f4a,902@2015-12-08 10:17:16.613299;
> pg1.fa8,902@2015-12-08 10:17:16.613322; pg2.61,903@2015-12-08 10:17:17.753976;
> pg2.aa,903@2015-12-08 10:17:17.754024; pg2.106,903@2015-12-08 10:17:17.754086;
> pg3.a6,904@2015-12-08 10:17:18.835306; pg3.113,904@2015-12-08 10:17:18.835370;
> pg3.121,904@2015-12-08 10:17:18.835379; pg3.123,904@2015-12-08 10:17:18.835380; )
>      -9> 2015-12-08 10:17:19.028934 7f7c7fe48700  1 --
> 10.143.20.31:6948/26671 <== mon.0 10.143.20.31:6789/0 1046 ====
> osd_map(904..904 src has 251..904) v3 ==== 671+0+0 (150079995 0 0)
> 0x7f7c96edfcc0 con 0x7f7c960c1340
>      -8> 2015-12-08 10:17:19.028936 7f7c7e645700  5 -- op tracker --
> seq: 208, time: 2015-12-08 10:17:18.836032, event: header_read, op:
> pg_log(2.ce epoch 904 log log((0'0,0'0], crt=0'0) query_epoch 904)
>      -7> 2015-12-08 10:17:19.028946 7f7c7e645700  5 -- op tracker --
> seq: 208, time: 2015-12-08 10:17:18.836034, event: throttled, op:
> pg_log(2.ce epoch 904 log log((0'0,0'0], crt=0'0) query_epoch 904)
>      -6> 2015-12-08 10:17:19.028951 7f7c7e645700  5 -- op tracker --
> seq: 208, time: 2015-12-08 10:17:18.836069, event: all_read, op:
> pg_log(2.ce epoch 904 log log((0'0,0'0], crt=0'0) query_epoch 904)
>      -5> 2015-12-08 10:17:19.028957 7f7c7e645700  5 -- op tracker --
> seq: 208, time: 2015-12-08 10:17:19.028592, event: dispatched, op:
> pg_log(2.ce epoch 904 log log((0'0,0'0], crt=0'0) query_epoch 904)
>      -4> 2015-12-08 10:17:19.028962 7f7c7e645700  5 -- op tracker --
> seq: 208, time: 2015-12-08 10:17:19.028962, event: wait for new map, op:
> pg_log(2.ce epoch 904 log log((0'0,0'0], crt=0'0) query_epoch 904)
>      -3> 2015-12-08 10:17:19.028973 7f7c7e645700  1 --
> 10.143.20.31:6949/26671 <== osd.79 10.143.20.32:6917/9273 4 ====
> osd_map(903..903 src has 251..903) v3 ==== 1643+0+0 (496228184 0 0)
> 0x7f7c962e8ac0 con 0x7f7c964831e0
>      -2> 2015-12-08 10:17:19.029014 7f7c7fe48700  3 osd.37 903
> handle_osd_map epochs [904,904], i have 903, src has [251,904]
>      -1> 2015-12-08 10:17:19.030416 7f7c7ae3e700 -1 common/Thread.cc: In
> function 'void Thread::create(size_t)' thread 7f7c7ae3e700 time
> 2015-12-08 10:17:19.029219
> common/Thread.cc: 154: FAILED assert(ret == 0)
> 
>   ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x85) [0x7f7c92cd1be5]
>   2: (Thread::create(unsigned long)+0x8a) [0x7f7c92cb625a]
>   3: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0x7f7c92cabf0f]
>   4: (Accepter::entry()+0x365) [0x7f7c92d6f155]
>   5: (()+0x7df5) [0x7f7c90d66df5]
>   6: (clone()+0x6d) [0x7f7c8f60f1ad]
> 
> 
> 
> Thanks for helping!
> 
> Kenneth
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 