Hi, Sage,
Any news about this issue? We keep reproducing it again and again.
We pulled the disk out and carefully tested the raw disk performance; it is perfectly normal.
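(For reference, a raw-disk check of this kind can be done with something like fio; the device name and parameters below are illustrative, not the exact command that was run:)

    # Sequential-write check directly against the raw device
    # WARNING: destructive to any data on /dev/sdX (placeholder device name)
    fio --name=rawseq --filename=/dev/sdX --rw=write --bs=4M \
        --ioengine=libaio --direct=1 --iodepth=32 \
        --runtime=60 --time_based --group_reporting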
Xiaoxi
-----Original Message-----
From: Chen, Xiaoxi
Sent: May 16, 2013 6:38
To: 'Sage Weil'
Subject: RE: [ceph-users] OSD state flipping when cluster-network in high
utilization
Uploaded to /home/cephdrop/xiaoxi_flip_osd/osdlog.tar.gz
Thanks
-----Original Message-----
From: Sage Weil [mailto:[email protected]]
Sent: May 15, 2013 23:52
To: Chen, Xiaoxi
Subject: RE: [ceph-users] OSD state flipping when cluster-network in high
utilization
Can you sftp it to [email protected], password asdf? Thanks!
sage
On Wed, 15 May 2013, Chen, Xiaoxi wrote:
> Hi Sage,
> I can reproduce it with logging enabled, but the log is quite big (30MB+).
> Where should I put it?
> From the log and from reading the code, it seems that, for some as yet unclear reason, the
> OSD op thread (OSD::op_tp) gets blocked for more than 15s, which results in a
> timeout. Then, in the "handle_osd_ping" function, the ping request is dropped
> because "is_healthy" returns false.
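> (The "timed out after 15" below presumably corresponds to the default 15s op thread
> timeout. As a minimal ceph.conf sketch -- assuming "osd op thread timeout" really is the
> knob behind this warning, value illustrative -- it could be raised while investigating:)
>
>     [osd]
>         # heartbeat_map warns when an op_tp thread runs longer than this (default 15s)
>         osd op thread timeout = 60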
> I have a full log with debug osd = 20, but still have no idea why the op
> thread is blocked for that long; part of the log is pasted below:
>
> 2013-05-15 15:29:06.870834 7f0253340700 10 osd.319 4113 dequeue_op 0xf4821e0 prio 63 cost 4096 latency 0.000173 osd_op(client.4703.0:2114483 rbd_data.11ff6b8b4567.0000000000000339 [write 1605632~4096] 3.3e4c40d7) v4 pg pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean]
> 2013-05-15 15:29:06.870862 7f0253340700 20 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] op_has_sufficient_caps pool=3 (xiaoxi_rbd) owner=0 need_read_cap=0 need_write_cap=1 need_class_read_cap=0 need_class_write_cap=0 -> yes
> 2013-05-15 15:29:06.870871 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] do_op osd_op(client.4703.0:2114483 rbd_data.11ff6b8b4567.0000000000000339 [write 1605632~4096] 3.3e4c40d7) v4 may_write
> 2013-05-15 15:29:22.117110 7f025d354700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
> 2013-05-15 15:29:22.127345 7f0269b6d700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
> 2013-05-15 15:29:22.276280 7f025d354700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
> 2013-05-15 15:29:22.356731 7f025d354700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
> 2013-05-15 15:29:22.359808 7f025d354700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
> 2013-05-15 15:29:22.513247 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] get_snapset_context rbd_data.11ff6b8b4567.0000000000000339 0 -> 1
> 2013-05-15 15:29:22.513270 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] populate_obc_watchers 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3
> 2013-05-15 15:29:22.513276 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] get_object_context 0xd896840 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3 0 -> 1 read 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3(4103'5267 client.4703.0:2109755 wrlock_by=unknown.0.0:0)
> 2013-05-15 15:29:22.513284 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] find_object_context 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3 @head
> 2013-05-15 15:29:22.513290 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] do_op mode is idle(wr=0)
> 2013-05-15 15:29:22.513295 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] do_op mode now rmw(wr=0)
>
> -----Original Message-----
> From: Sage Weil [mailto:[email protected]]
> Sent: May 15, 2013 11:40
> To: Chen, Xiaoxi
> Cc: Mark Nelson; [email protected]; [email protected]
> Subject: RE: [ceph-users] OSD state flipping when cluster-network in
> high utilization
>
> On Wed, 15 May 2013, Chen, Xiaoxi wrote:
>
> > >How responsive generally is the machine under load? Is there available
> > >CPU?
> > The machine works well, and the affected OSDs tend to be the same ones, seemingly
> > because they have relatively slower disks (the disk type is the same but the
> > latency is a bit higher, 8ms -> 10ms).
> >
> > Top shows no idle % but still 30+% iowait; my colleague tells me
> > that iowait can be treated as free CPU.
> >
> > Another piece of information: offloading the heartbeat to a 1Gb NIC doesn't
> > solve the problem. What's more, when we run a random write test, we
> > can still see this flipping happen. So I would say it may be
> > related to the CPU scheduler: the heartbeat thread (in a busy OSD)
> > fails to get enough CPU cycles.
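> > (For reference, a minimal ceph.conf sketch of a heartbeat offload of this kind -- the
> > addresses are illustrative, and "osd heartbeat addr" is assumed to be the option that
> > binds the heartbeat messenger to the 1Gb interface:)
> >
> >     [global]
> >         public network  = 192.168.10.0/24    # client-facing 10Gb
> >         cluster network = 192.168.20.0/24    # replication 10Gb
> >     [osd.319]
> >         # bind this OSD's heartbeats to its address on the separate 1Gb network
> >         osd heartbeat addr = 192.168.30.19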
>
> Can you try reproducing with logs? If we can reproduce that will give us
> some clues.
>
> sage
>
> >
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Sage Weil
> > Sent: May 15, 2013 7:23
> > To: Chen, Xiaoxi
> > Cc: Mark Nelson; [email protected]; [email protected]
> > Subject: Re: [ceph-users] OSD state flipping when cluster-network in
> > high utilization
> >
> > On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> > > I like the idea of leaving the ping in the cluster network because it can
> > > help us detect switch/NIC failures.
> > >
> > > What confuses me is that I keep pinging every Ceph node's cluster IP, and it
> > > is fine during the whole run with less than 1 ms latency, so why does the
> > > heartbeat still suffer? Top shows my CPU is not 100% utilized (with
> > > ~30% iowait). Enabling jumbo frames **seems** to make things
> > > worse (just a feeling, no data to support it).
> >
> > I say "ping" in teh general sense.. it's not using ICMP, but sending small
> > messages over a TCP session, doing some minimal processing on the other
> > end, and sending them back. If the machine is heavily loaded and that
> > thread doesn't get scheduled or somehow blocks, it may be problematic.
> >
> > How responsive generally is the machine under load? Is there available CPU?
> >
> > We can try to enable debugging to see what is going on.. 'debug ms = 1'
> > and 'debug osd = 20' is everything we would need, but will incur additional
> > load itself and may spoil the experiment...
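> > (As a sketch, the usual route would be injectargs on a running OSD, or ceph.conf for a
> > persistent setting -- osd.319 is just the example OSD from the log above:)
> >
> >     # turn debugging up on a running OSD without restarting it
> >     ceph tell osd.319 injectargs '--debug-osd 20 --debug-ms 1'
> >
> >     # or persistently in ceph.conf:
> >     [osd]
> >         debug osd = 20
> >         debug ms = 1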
> >
> > sage
> >
> > >
> > > Sent from my iPhone
> > >
> > > On 2013-5-14, at 23:36, "Mark Nelson" <[email protected]> wrote:
> > >
> > > > On 05/14/2013 10:30 AM, Sage Weil wrote:
> > > >> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> > > >>>
> > > >>> Hi
> > > >>>
> > > >>> We are suffering from our OSDs flipping between up and down (OSD
> > > >>> X gets voted down due to 3 missed pings, and after a while it
> > > >>> tells the monitor "map xxx wrongly marked me down"). This happens while we
> > > >>> are running a sequential write performance test on top of RBDs,
> > > >>> and the cluster network NICs are under really high utilization (8Gb/s+
> > > >>> on a 10Gb network).
> > > >>>
> > > >>> Is this expected behavior, or how can I prevent it from
> > > >>> happening?
> > > >>
> > > >> You can increase the heartbeat grace period. The pings are
> > > >> handled by a separate thread on the backside interface (if there is
> > > >> one).
> > > >> If you are missing pings then the network or scheduler is
> > > >> preventing those (small) messages from being processed (there
> > > >> is almost no lock contention in that path). Which means it
> > > >> really is taking ~20 seconds or whatever to handle those
> > > >> messages. It's really a question of how unresponsive you want to
> > > >> permit the OSDs to be before you consider it a failure..
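> > > >> (A minimal ceph.conf sketch of that, value illustrative -- the default grace is 20s:)
> > > >>
> > > >>     [global]
> > > >>         # tolerate slower heartbeat replies before peers report an OSD down
> > > >>         osd heartbeat grace = 40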
> > > >>
> > > >> sage
> > > >>
> > > >>
> > > >
> > > > It might be worth testing out how long pings or other network traffic
> > > > are taking during these tests. There may be some TCP tuning you can
> > > > do here, or even consider using a separate network for the mons.
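> > > > (A rough sketch of the sort of TCP tuning meant here -- standard Linux sysctls,
> > > > values purely illustrative for a 10GbE link:)
> > > >
> > > >     # allow larger TCP send/receive windows on the 10GbE links
> > > >     sysctl -w net.core.rmem_max=16777216
> > > >     sysctl -w net.core.wmem_max=16777216
> > > >     sysctl -w net.ipv4.tcp_rmem='4096 87380 16777216'
> > > >     sysctl -w net.ipv4.tcp_wmem='4096 65536 16777216'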
> > > >
> > > > Mark
> > >
>