Hi Karan,

We faced the same issue and resolved it by increasing the open file limit and
the maximum number of threads.

Config reference

/etc/security/limits.conf

root hard nofile 65535

sysctl -w kernel.pid_max=4194303
http://tracker.ceph.com/issues/10554#change-47024
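To confirm that the new limits actually reach the OSD daemons, you can read `kernel.pid_max` and the "Max open files" row of `/proc/<pid>/limits` for every `ceph-osd` process. This matters because limits.conf is applied by PAM to login sessions, so daemons started from init scripts may not pick it up. The minimal Python sketch below assumes a Linux host with `/proc` mounted and OSD processes whose comm name is `ceph-osd`. The helper names are hypothetical, not Ceph tools.

```python
"""Report kernel.pid_max and the open-file limit of each running ceph-osd.

Sketch only: assumes Linux with /proc mounted and OSD processes whose
comm name is "ceph-osd". Helper names are hypothetical, not Ceph tools.
"""
from pathlib import Path

# Targets taken from the reply: limits.conf nofile and kernel.pid_max.
TARGET_NOFILE = 65535
TARGET_PID_MAX = 4194303


def parse_max_open_files(limits_text: str) -> tuple:
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Row layout: Max open files <soft> <hard> files
            soft, hard = line.split()[3:5]
            return tuple(float("inf") if v == "unlimited" else int(v)
                         for v in (soft, hard))
    raise ValueError("no 'Max open files' row found")


def osd_pids(proc: Path = Path("/proc")) -> list:
    """PIDs of processes whose comm name is ceph-osd."""
    pids = []
    for pid_dir in proc.glob("[0-9]*"):
        try:
            if (pid_dir / "comm").read_text().strip() == "ceph-osd":
                pids.append(int(pid_dir.name))
        except OSError:
            # The process exited (or is unreadable) while we were scanning.
            continue
    return sorted(pids)


def main() -> None:
    pid_max_file = Path("/proc/sys/kernel/pid_max")
    if not pid_max_file.exists():
        print("no /proc/sys/kernel/pid_max; this sketch needs a Linux host")
        return
    pid_max = int(pid_max_file.read_text())
    flag = "" if pid_max >= TARGET_PID_MAX else "  <-- below suggested value"
    print(f"kernel.pid_max = {pid_max}{flag}")
    for pid in osd_pids():
        try:
            soft, hard = parse_max_open_files(Path(f"/proc/{pid}/limits").read_text())
        except (OSError, ValueError):
            continue
        # The soft limit is the one the process actually runs into.
        flag = "" if soft >= TARGET_NOFILE else "  <-- below suggested value"
        print(f"ceph-osd pid {pid}: max open files soft={soft} hard={hard}{flag}")


if __name__ == "__main__":
    main()
```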

Cheers

Mohamed Pakkeer

On Mon, Mar 9, 2015 at 4:20 PM, Azad Aliyar <[email protected]>
wrote:

> *Check Max Thread Count:* If you have a node with a lot of OSDs, you may
> be hitting the default maximum number of threads (usually 32k, i.e.
> kernel.pid_max = 32768), especially during recovery. You can use sysctl to
> raise the limit to the maximum possible value (4194303) and see whether
> that helps. For example:
>
> sysctl -w kernel.pid_max=4194303
>
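As a rough illustration of why a node with many OSDs can run out, the sketch below divides the node-wide PID budget by the number of OSDs on that node. It assumes the 239 OSDs reported in the original message are spread evenly over the 4 nodes (about 60 per node). It also ignores every other process on the host, so the real per-OSD headroom is smaller. The function name is hypothetical.

```python
import math


def threads_per_osd_budget(pid_max: int, total_osds: int, nodes: int) -> int:
    """Average threads each OSD could use before one node exhausts pid_max.

    Assumes OSDs are spread evenly (rounded up per node) and ignores every
    other process on the host, so real headroom is smaller than this.
    """
    osds_per_node = math.ceil(total_osds / nodes)
    return pid_max // osds_per_node


# 239 OSDs on 4 nodes, as reported in the original message.
print(threads_per_osd_budget(32768, 239, 4))    # default pid_max -> 546
print(threads_per_osd_budget(4194303, 239, 4))  # raised pid_max  -> 69905
```

With the default limit, each OSD could average only about 546 threads. The backtrace in the original message shows a thread being created for an accepted connection (`SimpleMessenger::add_accept_pipe` calling `Thread::create`), so a recovery with many peer connections per OSD can plausibly reach that.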
> If raising the maximum thread count resolves the issue, you can make it
> permanent by adding a kernel.pid_max setting to /etc/sysctl.conf. For
> example:
>
> kernel.pid_max = 4194303
>
>
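To see how close a node is to the limit before and after the change, a small Python sketch can count the threads currently in use (one entry per thread under `/proc/<pid>/task`) and compare that count with `kernel.pid_max` and `kernel.threads-max`. Per the pthread_create(3) man page, hitting either limit, or the per-user RLIMIT_NPROC, makes `pthread_create` fail with EAGAIN. That is consistent with the `FAILED assert(ret == 0)` in `Thread::create` reported in the original message. The sketch assumes Linux with `/proc` mounted, and the helper names are hypothetical. The count is only a snapshot, since threads start and exit while it runs.

```python
"""Snapshot of thread usage versus the kernel's thread/PID limits (Linux only).

Sketch: helper names are hypothetical; counts race with threads starting
and exiting while /proc is being scanned.
"""
from pathlib import Path


def count_threads(proc: Path = Path("/proc")) -> int:
    """Count threads system-wide: one /proc/<pid>/task/<tid> entry per thread."""
    total = 0
    for pid_dir in proc.glob("[0-9]*"):
        try:
            total += sum(1 for _ in (pid_dir / "task").iterdir())
        except OSError:
            # Process exited, or its task directory is not readable.
            continue
    return total


def headroom(threads_in_use: int, pid_max: int, threads_max: int) -> int:
    """Approximate threads left before the tighter kernel-wide limit is reached."""
    return min(pid_max, threads_max) - threads_in_use


def read_int(path: str) -> int:
    return int(Path(path).read_text())


if __name__ == "__main__":
    if Path("/proc/sys/kernel/pid_max").exists():
        pid_max = read_int("/proc/sys/kernel/pid_max")
        threads_max = read_int("/proc/sys/kernel/threads-max")
        in_use = count_threads()
        print(f"threads in use: {in_use}")
        print(f"kernel.pid_max: {pid_max}, kernel.threads-max: {threads_max}")
        print(f"approximate headroom: {headroom(in_use, pid_max, threads_max)}")
    else:
        print("this sketch needs a Linux host with /proc mounted")
```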
> On Mon, Mar 9, 2015 at 4:11 PM, Karan Singh <[email protected]> wrote:
>
>> Hello community, I need help fixing a long-standing Ceph problem.
>>
>> The cluster is unhealthy and multiple OSDs are DOWN. When I try to
>> restart the OSDs, I get this error:
>>
>>
>> 2015-03-09 12:22:16.312774 7f760dac9700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f760dac9700 time 2015-03-09 12:22:16.311970
>> common/Thread.cc: 129: FAILED assert(ret == 0)
>>
>>
>> *Environment*: 4 nodes running OSDs and monitors, Firefly latest,
>> CentOS 6.5, kernel 3.17.2-1.el6.elrepo.x86_64
>>
>> Tried upgrading from 0.80.7 to 0.80.8, but no luck.
>>
>> Tried the CentOS stock kernel 2.6.32, but no luck.
>>
>> Memory is not the problem; more than 150 GB is free.
>>
>>
>> Has anyone ever faced this problem?
>>
>> *Cluster status*
>>
>>     cluster 2bd3283d-67ef-4316-8b7e-d8f4747eae33
>>      health HEALTH_WARN 7334 pgs degraded; 1185 pgs down; 1 pgs incomplete; 1735 pgs peering; 8938 pgs stale; 1736 pgs stuck inactive; 8938 pgs stuck stale; 10320 pgs stuck unclean; recovery 6061/31080 objects degraded (19.501%); 111/196 in osds are down; clock skew detected on mon.pouta-s02, mon.pouta-s03
>>      monmap e3: 3 mons at {pouta-s01=10.XXX.50.1:6789/0,pouta-s02=10.XXX.50.2:6789/0,pouta-s03=10.XXX.50.3:6789/0}, election epoch 1312, quorum 0,1,2 pouta-s01,pouta-s02,pouta-s03
>>      osdmap e26633: 239 osds: 85 up, 196 in
>>       pgmap v60389: 17408 pgs, 13 pools, 42345 MB data, 10360 objects
>>             4699 GB used, 707 TB / 711 TB avail
>>             6061/31080 objects degraded (19.501%)
>>                   14 down+remapped+peering
>>                   39 active
>>                 3289 active+clean
>>                  547 peering
>>                  663 stale+down+peering
>>                  705 stale+active+remapped
>>                    1 active+degraded+remapped
>>                    1 stale+down+incomplete
>>                  484 down+peering
>>                  455 active+remapped
>>                 3696 stale+active+degraded
>>                    4 remapped+peering
>>                   23 stale+down+remapped+peering
>>                   51 stale+active
>>                 3637 active+degraded
>>                 3799 stale+active+clean
>>
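Each count in the health summary is just the state breakdown under it, summed by flag. As a worked check, the plain Python below uses counts copied from the status output above and sums every PG whose state contains a given flag. It reproduces the summary: 8938 stale, 1185 down, 1735 peering, 7334 degraded and 1 incomplete, out of 17408 PGs in total. The "stuck" counts depend on how long PGs have been in a state, so they cannot be derived this way.

```python
"""Tally PG states from the `ceph -s` output in this thread by flag."""

# State breakdown copied from the pgmap section of the status output.
PG_STATES = {
    "down+remapped+peering": 14,
    "active": 39,
    "active+clean": 3289,
    "peering": 547,
    "stale+down+peering": 663,
    "stale+active+remapped": 705,
    "active+degraded+remapped": 1,
    "stale+down+incomplete": 1,
    "down+peering": 484,
    "active+remapped": 455,
    "stale+active+degraded": 3696,
    "remapped+peering": 4,
    "stale+down+remapped+peering": 23,
    "stale+active": 51,
    "active+degraded": 3637,
    "stale+active+clean": 3799,
}


def tally(states: dict, flag: str) -> int:
    """Number of PGs whose '+'-separated state includes `flag`."""
    return sum(n for state, n in states.items() if flag in state.split("+"))


print("total", sum(PG_STATES.values()))  # total 17408
for flag in ("stale", "down", "peering", "degraded", "incomplete"):
    print(flag, tally(PG_STATES, flag))
# stale 8938, down 1185, peering 1735, degraded 7334, incomplete 1
```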
>> *OSD logs*
>>
>> 2015-03-09 12:22:16.312774 7f760dac9700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f760dac9700 time 2015-03-09 12:22:16.311970
>> common/Thread.cc: 129: FAILED assert(ret == 0)
>>
>>  ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
>>  1: (Thread::create(unsigned long)+0x8a) [0xaf41da]
>>  2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xae84fa]
>>  3: (Accepter::entry()+0x265) [0xb5c635]
>>  4: /lib64/libpthread.so.0() [0x3c8a6079d1]
>>  5: (clone()+0x6d) [0x3c8a2e89dd]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>>
>> *More information in the Ceph tracker issue:*
>> http://tracker.ceph.com/issues/10988#change-49018
>>
>>
>> ****************************************************************
>> Karan Singh
>> Systems Specialist , Storage Platforms
>> CSC - IT Center for Science,
>> Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
>> mobile: +358 503 812758
>> tel. +358 9 4572001
>> fax +358 9 4572302
>> http://www.csc.fi/
>> ****************************************************************
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> [email protected]
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
>    Warm Regards,  Azad Aliyar
>  Linux Server Engineer
>  *Email* :  [email protected]   *|*   *Skype* :   spark.azad
>  3rd Floor, Leela Infopark, Phase-2, Kakanad, Kochi-30, Kerala, India
>  *Phone*: +91 484 6561696, *Mobile*: 91-8129270421.
>
>
>


-- 
Thanks & Regards
K.Mohamed Pakkeer
Mobile- 0091-8754410114