This usually happens on larger clusters when you hit the max fd limit.
Add
max open files = 131072
in the [global] section of ceph.conf to fix it (default is 16384).
sage
On Thu, 5 Jun 2014, Cao, Buddy wrote:
>
> Hi, several osds were down/out with similar logs as below, could you help?
>
>
>
> -38> 2014-06-05 10:27:54.700832 7f2ceead6700 1 --
> 192.168.40.11:6800/19542 <== osd.11 192.168.40.11:6822/20298 2 ====
> pg_notify(0.aa4(2) epoch 7) v5 ==== 812+0+0 (3873498789 0 0) 0x57a0540 con
> 0x49d14a0
>
> -37> 2014-06-05 10:27:54.700874 7f2ceead6700 5 -- op tracker -- , seq:
> 2028, time: 2014-06-05 10:27:54.396463, event: header_read, request:
> pg_notify(0.aa4(2) epoch 7) v5
>
> -36> 2014-06-05 10:27:54.701091 7f2ceead6700 5 -- op tracker -- , seq:
> 2028, time: 2014-06-05 10:27:54.396465, event: throttled, request:
> pg_notify(0.aa4(2) epoch 7) v5
>
> -35> 2014-06-05 10:27:54.701126 7f2ceead6700 5 -- op tracker -- , seq:
> 2028, time: 2014-06-05 10:27:54.396543, event: all_read, request:
> pg_notify(0.aa4(2) epoch 7) v5
>
> -34> 2014-06-05 10:27:54.701187 7f2ceead6700 5 -- op tracker -- , seq:
> 2028, time: 2014-06-05 10:27:54.700870, event: dispatched, request:
> pg_notify(0.aa4(2) epoch 7) v5
>
> -33> 2014-06-05 10:27:54.701235 7f2ceead6700 5 -- op tracker -- , seq:
> 2028, time: 2014-06-05 10:27:54.701234, event: waiting_for_osdmap, request:
> pg_notify(0.aa4(2) epoch 7) v5
>
> -32> 2014-06-05 10:27:54.701267 7f2ceead6700 5 -- op tracker -- , seq:
> 2028, time: 2014-06-05 10:27:54.701267, event: started, request:
> pg_notify(0.aa4(2) epoch 7) v5
>
> -31> 2014-06-05 10:27:54.700940 7f2ce92cb700 5 osd.0 pg_epoch: 7
> pg[0.1dc1( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [69,56,21] r=-1 lpr=7
> pi=3-6/1 crt=0'0 inactive NOTIFY] exit Start 0.003321 0 0.000000
>
> -30> 2014-06-05 10:27:54.702163 7f2ceead6700 5 -- op tracker -- , seq:
> 2028, time: 2014-06-05 10:27:54.702163, event: done, request:
> pg_notify(0.aa4(2) epoch 7) v5
>
> -29> 2014-06-05 10:27:54.701690 7f2ce92cb700 5 osd.0 pg_epoch: 7
> pg[0.1dc1( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [69,56,21] r=-1 lpr=7
> pi=3-6/1 crt=0'0 inactive NOTIFY] enter Started/Stray
>
> -28> 2014-06-05 10:27:54.697805 7f2ce8aca700 5 osd.0 pg_epoch: 7
> pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 3/3/3) [0,100,64] r=0 lpr=3
> crt=0'0 mlcod 0'0 active] exit Started/Primary/Active 7.527967 0 0.000000
>
> -27> 2014-06-05 10:27:54.702437 7f2ce8aca700 5 osd.0 pg_epoch: 7
> pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 3/3/3) [0,100,64] r=0 lpr=3
> crt=0'0 mlcod 0'0 active] exit Started/Primary 8.986091 0 0.000000
>
> -26> 2014-06-05 10:27:54.702467 7f2ce8aca700 5 osd.0 pg_epoch: 7
> pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 3/3/3) [0,100,64] r=0 lpr=3
> crt=0'0 mlcod 0'0 active] exit Started 8.986256 0 0.000000
>
> -25> 2014-06-05 10:27:54.702486 7f2ce8aca700 5 osd.0 pg_epoch: 7
> pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 3/3/3) [0,100,64] r=0 lpr=3
> crt=0'0 mlcod 0'0 active] enter Reset
>
> -24> 2014-06-05 10:27:54.702822 7f2ce8aca700 5 osd.0 pg_epoch: 7
> pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [54,95,68] r=-1 lpr=7
> pi=3-6/1 crt=0'0 inactive NOTIFY] exit Reset 0.000336 1 0.005574
>
> -23> 2014-06-05 10:27:54.702900 7f2ce8aca700 5 osd.0 pg_epoch: 7
> pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [54,95,68] r=-1 lpr=7
> pi=3-6/1 crt=0'0 inactive NOTIFY] enter Started
>
> -22> 2014-06-05 10:27:54.703025 7f2ce8aca700 5 osd.0 pg_epoch: 7
> pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [54,95,68] r=-1 lpr=7
> pi=3-6/1 crt=0'0 inactive NOTIFY] enter Start
>
> -21> 2014-06-05 10:27:54.703235 7f2ce8aca700 1 osd.0 pg_epoch: 7
> pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [54,95,68] r=-1 lpr=7
> pi=3-6/1 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
>
> -20> 2014-06-05 10:27:54.704919 7f2ce8aca700 5 osd.0 pg_epoch: 7
> pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [54,95,68] r=-1 lpr=7
> pi=3-6/1 crt=0'0 inactive NOTIFY] exit Start 0.001894 0 0.000000
>
> -19> 2014-06-05 10:27:54.704835 7f2ceead6700 1 --
> 192.168.40.11:6800/19542 <== osd.15 192.168.40.11:6830/20581 1 ====
> pg_notify(2.60(2) epoch 7) v5 ==== 812+0+0 (3557830937 0 0) 0x4a6f540 con
> 0x3e3bb20
>
> -18> 2014-06-05 10:27:54.705050 7f2ceead6700 5 -- op tracker -- , seq:
> 2029, time: 2014-06-05 10:27:54.598458, event: header_read, request:
> pg_notify(2.60(2) epoch 7) v5
>
> -17> 2014-06-05 10:27:54.705079 7f2ceead6700 5 -- op tracker -- , seq:
> 2029, time: 2014-06-05 10:27:54.598461, event: throttled, request:
> pg_notify(2.60(2) epoch 7) v5
>
> -16> 2014-06-05 10:27:54.705124 7f2ce8aca700 5 osd.0 pg_epoch: 7
> pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [54,95,68] r=-1 lpr=7
> pi=3-6/1 crt=0'0 inactive NOTIFY] enter Started/Stray
>
> -15> 2014-06-05 10:27:54.705353 7f2ceead6700 5 -- op tracker -- , seq:
> 2029, time: 2014-06-05 10:27:54.598689, event: all_read, request:
> pg_notify(2.60(2) epoch 7) v5
>
> -14> 2014-06-05 10:27:54.706925 7f2ceead6700 5 -- op tracker -- , seq:
> 2029, time: 2014-06-05 10:27:54.705046, event: dispatched, request:
> pg_notify(2.60(2) epoch 7) v5
>
> -13> 2014-06-05 10:27:54.707038 7f2ceead6700 5 -- op tracker -- , seq:
> 2029, time: 2014-06-05 10:27:54.707037, event: waiting_for_osdmap, request:
> pg_notify(2.60(2) epoch 7) v5
>
> -12> 2014-06-05 10:27:54.707075 7f2ceead6700 5 -- op tracker -- , seq:
> 2029, time: 2014-06-05 10:27:54.707075, event: started, request:
> pg_notify(2.60(2) epoch 7) v5
>
> -11> 2014-06-05 10:27:54.708042 7f2ceead6700 5 osd.0 pg_epoch: 7
> pg[2.60(unlocked)] enter Initial
>
> -10> 2014-06-05 10:27:54.709885 7f2ceead6700 5 osd.0 pg_epoch: 7
> pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 lpr=0
> pi=1-6/2 crt=0'0 mlcod 0'0 inactive] exit Initial 0.001842 0 0.000000
>
> -9> 2014-06-05 10:27:54.711556 7f2ceead6700 5 osd.0 pg_epoch: 7
> pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 lpr=0
> pi=1-6/2 crt=0'0 mlcod 0'0 inactive] enter Reset
>
> -8> 2014-06-05 10:27:54.711606 7f2ceead6700 5 osd.0 pg_epoch: 7
> pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 lpr=7
> pi=1-6/2 crt=0'0 mlcod 0'0 inactive] exit Reset 0.000051 1 0.001718
>
> -7> 2014-06-05 10:27:54.711648 7f2ceead6700 5 osd.0 pg_epoch: 7
> pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 lpr=7
> pi=1-6/2 crt=0'0 mlcod 0'0 inactive] enter Started
>
> -6> 2014-06-05 10:27:54.711714 7f2ceead6700 5 osd.0 pg_epoch: 7
> pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 lpr=7
> pi=1-6/2 crt=0'0 mlcod 0'0 inactive] enter Start
>
> -5> 2014-06-05 10:27:54.711781 7f2ceead6700 1 osd.0 pg_epoch: 7
> pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 lpr=7
> pi=1-6/2 crt=0'0 mlcod 0'0 inactive] state<Start>: transitioning to Primary
>
> -4> 2014-06-05 10:27:54.713010 7f2ceead6700 5 osd.0 pg_epoch: 7
> pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 lpr=7
> pi=1-6/2 crt=0'0 mlcod 0'0 inactive] exit Start 0.001295 0 0.000000
>
> -3> 2014-06-05 10:27:54.713164 7f2ceead6700 5 osd.0 pg_epoch: 7
> pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 lpr=7
> pi=1-6/2 crt=0'0 mlcod 0'0 inactive] enter Started/Primary
>
> -2> 2014-06-05 10:27:54.713234 7f2ceead6700 5 osd.0 pg_epoch: 7
> pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 lpr=7
> pi=1-6/2 crt=0'0 mlcod 0'0 inactive] enter Started/Primary/Peering
>
> -1> 2014-06-05 10:27:54.713319 7f2ceead6700 5 osd.0 pg_epoch: 7
> pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 lpr=7
> pi=1-6/2 crt=0'0 mlcod 0'0 peering] enter Started/Primary/Peering/GetInfo
>
> 0> 2014-06-05 10:27:54.711223 7f2ce92cb700 -1 common/Thread.cc: In
> function 'void Thread::create(size_t)' thread 7f2ce92cb700 time 2014-06-05
> 10:27:54.703693
>
> common/Thread.cc: 110: FAILED assert(ret == 0)
>
>
>
> ceph version 0.80 (b78644e7dee100e48dfeca32c9270a6b210d3003)
>
> 1: (Thread::create(unsigned long)+0x8a) [0xa82dea]
>
> 2: (SimpleMessenger::connect_rank(entity_addr_t const&, int, Connection*,
> Message*)+0x17b) [0xa2b01b]
>
> 3: (SimpleMessenger::get_connection(entity_inst_t const&)+0x180) [0xa2f6c0]
>
> 4: (OSDService::get_con_osd_cluster(int, unsigned int)+0x1d9) [0x6061c9]
>
> 5: (OSD::compat_must_dispatch_immediately(PG*)+0x296) [0x6064a6]
>
> 6: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&,
> ThreadPool::TPHandle&)+0x4a6) [0x6475d6]
>
> 7: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&,
> ThreadPool::TPHandle&)+0x16) [0x69dbf6]
>
> 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0x551) [0xab4581]
>
> 9: (ThreadPool::WorkThread::entry()+0x10) [0xab75c0]
>
> 10: /lib64/libpthread.so.0() [0x3eebc079d1]
>
> 11: (clone()+0x6d) [0x3eeb4e8b6d]
>
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
>
>
>
> --- logging levels ---
>
> 0/ 5 none
>
> 0/ 1 lockdep
>
> 0/ 1 context
>
> 1/ 1 crush
>
> 1/ 5 mds
>
> 1/ 5 mds_balancer
>
> 1/ 5 mds_locker
>
> 1/ 5 mds_log
>
> 1/ 5 mds_log_expire
>
> 1/ 5 mds_migrator
>
> 0/ 1 buffer
>
> 0/ 1 timer
>
> 0/ 1 filer
>
> 0/ 1 striper
>
> 0/ 1 objecter
>
> 0/ 5 rados
>
> 0/ 5 rbd
>
> 0/ 5 journaler
>
> 0/ 5 objectcacher
>
> 0/ 5 client
>
> 0/ 5 osd
>
> 0/ 5 optracker
>
> 0/ 5 objclass
>
> 1/ 3 filestore
>
> 1/ 3 keyvaluestore
>
> 1/ 3 journal
>
> 0/ 5 ms
>
> 1/ 5 mon
>
> 0/10 monc
>
> 1/ 5 paxos
>
> 0/ 5 tp
>
> 1/ 5 auth
>
> 1/ 5 crypto
>
> 1/ 1 finisher
>
> 1/ 5 heartbeatmap
>
> 1/ 5 perfcounter
>
> 1/ 5 rgw
>
> 1/ 5 javaclient
>
> 1/ 5 asok
>
> 1/ 1 throttle
>
> -2/-2 (syslog threshold)
>
> -1/-1 (stderr threshold)
>
> max_recent 10000
>
> max_new 1000
>
> log_file /var/log/ceph/ceph-osd.0.log
>
> --- end dump of recent events ---
>
> 2014-06-05 10:27:55.096296 7f2ca8628700 0 -- 192.168.40.11:6800/19542 >>
> 192.168.40.11:6808/19811 pipe(0x4adc100 sd=629 :6800 s=2 pgs=111 cs=1 l=0
> c=0x4a0c780).fault with nothing to send, going to standby
>
> 2014-06-05 10:27:55.129281 7f2ce92cb700 -1 *** Caught signal (Aborted) **
>
> in thread 7f2ce92cb700
>
>
>
> ceph version 0.80 (b78644e7dee100e48dfeca32c9270a6b210d3003)
>
> 1: /usr/bin/ceph-osd() [0x9aa211]
>
> 2: /lib64/libpthread.so.0() [0x3eebc0f710]
>
> 3: (gsignal()+0x35) [0x3eeb432925]
>
> 4: (abort()+0x175) [0x3eeb434105]
>
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x3eed8bea5d]
>
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
>
>
>
> --- begin dump of recent events ---
>
> -6> 2014-06-05 10:27:54.716422 7f2cec2d1700 1 --
> 192.168.40.11:6801/19542 <== osd.119 192.168.40.14:0/29977 3 ====
> osd_ping(ping e7 stamp 2014-06-05 10:27:54.713281) v2 ==== 47+0+0 (243568413
> 0 0) 0x4a68380 con 0x3e3c200
>
> -5> 2014-06-05 10:27:54.716480 7f2cec2d1700 1 --
> 192.168.40.11:6801/19542 --> 192.168.40.14:0/29977 -- osd_ping(ping_reply e7
> stamp 2014-06-05 10:27:54.713281) v2 -- ?+0 0x5436740 con 0x3e3c200
>
> -4> 2014-06-05 10:27:54.717351 7f2ced2d3700 1 --
> 192.168.50.11:6801/19542 <== osd.119 192.168.40.14:0/29977 3 ====
> osd_ping(ping e7 stamp 2014-06-05 10:27:54.713281) v2 ==== 47+0+0 (243568413
> 0 0) 0x4a68fc0 con 0x3e3c8e0
>
> -3> 2014-06-05 10:27:55.096211 7f2ca8628700 2 --
> 192.168.40.11:6800/19542 >> 192.168.40.11:6808/19811 pipe(0x4adc100 sd=629
> :6800 s=2 pgs=111 cs=1 l=0 c=0x4a0c780).reader couldn't read tag, (0)
> Success
>
> -2> 2014-06-05 10:27:55.096258 7f2ca8628700 2 --
> 192.168.40.11:6800/19542 >> 192.168.40.11:6808/19811 pipe(0x4adc100 sd=629
> :6800 s=2 pgs=111 cs=1 l=0 c=0x4a0c780).fault (0) Success
>
> -1> 2014-06-05 10:27:55.096296 7f2ca8628700 0 --
> 192.168.40.11:6800/19542 >> 192.168.40.11:6808/19811 pipe(0x4adc100 sd=629
> :6800 s=2 pgs=111 cs=1 l=0 c=0x4a0c780).fault with nothing to send, going to
> standby
>
> 0> 2014-06-05 10:27:55.129281 7f2ce92cb700 -1 *** Caught signal
> (Aborted) **
>
> in thread 7f2ce92cb700
>
>
>
> ceph version 0.80 (b78644e7dee100e48dfeca32c9270a6b210d3003)
>
> 1: /usr/bin/ceph-osd() [0x9aa211]
>
> 2: /lib64/libpthread.so.0() [0x3eebc0f710]
>
> 3: (gsignal()+0x35) [0x3eeb432925]
>
> 4: (abort()+0x175) [0x3eeb434105]
>
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x3eed8bea5d]
>
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
>
>
>
> --- logging levels ---
>
> 0/ 5 none
>
> 0/ 1 lockdep
>
> 0/ 1 context
>
> 1/ 1 crush
>
> 1/ 5 mds
>
> 1/ 5 mds_balancer
>
> 1/ 5 mds_locker
>
> 1/ 5 mds_log
>
> 1/ 5 mds_log_expire
>
> 1/ 5 mds_migrator
>
> 0/ 1 buffer
>
> 0/ 1 timer
>
> 0/ 1 filer
>
> 0/ 1 striper
>
> 0/ 1 objecter
>
> 0/ 5 rados
>
> 0/ 5 rbd
>
> 0/ 5 journaler
>
> 0/ 5 objectcacher
>
> 0/ 5 client
>
> 0/ 5 osd
>
> 0/ 5 optracker
>
> 0/ 5 objclass
>
> 1/ 3 filestore
>
> 1/ 3 keyvaluestore
>
> 1/ 3 journal
>
> 0/ 5 ms
>
> 1/ 5 mon
>
> 0/10 monc
>
> 1/ 5 paxos
>
> 0/ 5 tp
>
> 1/ 5 auth
>
> 1/ 5 crypto
>
> 1/ 1 finisher
>
> 1/ 5 heartbeatmap
>
> 1/ 5 perfcounter
>
> 1/ 5 rgw
>
> 1/ 5 javaclient
>
> 1/ 5 asok
>
> 1/ 1 throttle
>
> -2/-2 (syslog threshold)
>
> -1/-1 (stderr threshold)
>
> max_recent 10000
>
> max_new 1000
>
> log_file /var/log/ceph/ceph-osd.0.log
>
> --- end dump of recent events ---
>
>
>
>
>
> Wei Cao (Buddy)
>
>
>
>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com