Hi, Sage,
Any news about this issue? We keep reproducing it again and again.
We pulled the disk out and carefully tested the raw disk performance; it is perfectly normal.
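(For reference, a raw-disk check of this kind can be done with something like fio; the device name and parameters below are illustrative, not the exact command that was run:)

    # Sequential-write check directly against the raw device
    # WARNING: destructive to any data on /dev/sdX (placeholder device name)
    fio --name=rawseq --filename=/dev/sdX --rw=write --bs=4M \
        --ioengine=libaio --direct=1 --iodepth=32 \
        --runtime=60 --time_based --group_reporting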
Xiaoxi
-----Original Message-----
From: Chen, Xiaoxi
Sent: May 16, 2013 6:38
To: 'Sage Weil'
Subject: RE: [ceph-users] OSD state flipping when cluster-network in high
utilization
Uploaded to /home/cephdrop/xiaoxi_flip_osd/osdlog.tar.gz
Thanks
-----Original Message-----
From: Sage Weil [mailto:[email protected]]
Sent: May 15, 2013 23:52
To: Chen, Xiaoxi
Subject: RE: [ceph-users] OSD state flipping when cluster-network in high
utilization
Can you sftp it to [email protected], password asdf? Thanks!
sage
On Wed, 15 May 2013, Chen, Xiaoxi wrote:
> Hi Sage,
> I can reproduce it with logging enabled, but the log is quite big (30MB+).
> Where should I put it?
> From the log and from reading the code, it seems that, for some as yet unclear reason, the
> OSD op thread (OSD::op_tp) gets blocked for more than 15s, which results in a
> timeout. Then, in the "handle_osd_ping" function, the ping request is dropped
> because "is_healthy" returns false.
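> (The "timed out after 15" below presumably corresponds to the default 15s op thread
> timeout. As a minimal ceph.conf sketch -- assuming "osd op thread timeout" really is the
> knob behind this warning, value illustrative -- it could be raised while investigating:)
>
>     [osd]
>         # heartbeat_map warns when an op_tp thread runs longer than this (default 15s)
>         osd op thread timeout = 60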
> I have a full log with debug osd = 20, but still have no idea why the op
> thread is blocked for that long; part of the log is pasted below:
>
> 2013-05-15 15:29:06.870834 7f0253340700 10 osd.319 4113 dequeue_op 0xf4821e0 prio 63 cost 4096 latency 0.000173 osd_op(client.4703.0:2114483 rbd_data.11ff6b8b4567.0000000000000339 [write 1605632~4096] 3.3e4c40d7) v4 pg pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean]
> 2013-05-15 15:29:06.870862 7f0253340700 20 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] op_has_sufficient_caps pool=3 (xiaoxi_rbd) owner=0 need_read_cap=0 need_write_cap=1 need_class_read_cap=0 need_class_write_cap=0 -> yes
> 2013-05-15 15:29:06.870871 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] do_op osd_op(client.4703.0:2114483 rbd_data.11ff6b8b4567.0000000000000339 [write 1605632~4096] 3.3e4c40d7) v4 may_write
> 2013-05-15 15:29:22.117110 7f025d354700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
> 2013-05-15 15:29:22.127345 7f0269b6d700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
> 2013-05-15 15:29:22.276280 7f025d354700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
> 2013-05-15 15:29:22.356731 7f025d354700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
> 2013-05-15 15:29:22.359808 7f025d354700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
> 2013-05-15 15:29:22.513247 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] get_snapset_context rbd_data.11ff6b8b4567.0000000000000339 0 -> 1
> 2013-05-15 15:29:22.513270 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] populate_obc_watchers 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3
> 2013-05-15 15:29:22.513276 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] get_object_context 0xd896840 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3 0 -> 1 read 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3(4103'5267 client.4703.0:2109755 wrlock_by=unknown.0.0:0)
> 2013-05-15 15:29:22.513284 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] find_object_context 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3 @head
> 2013-05-15 15:29:22.513290 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] do_op mode is idle(wr=0)
> 2013-05-15 15:29:22.513295 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] do_op mode now rmw(wr=0)
>
> -----Original Message-----
> From: Sage Weil [mailto:[email protected]]
> Sent: May 15, 2013 11:40
> To: Chen, Xiaoxi
> Cc: Mark Nelson; [email protected]; [email protected]
> Subject: RE: [ceph-users] OSD state flipping when cluster-network in
> high utilization
>
> On Wed, 15 May 2013, Chen, Xiaoxi wrote:
>
> > >How responsive generally is the machine under load? Is there available
> > >CPU?
> > The machine works well, and the affected OSDs tend to be the same ones, seemingly
> > because they have relatively slower disks (the disk type is the same but the
> > latency is a bit higher, 8ms -> 10ms).
> >
> > Top shows no idle % but still 30+% iowait; my colleague tells me
> > that iowait can be treated as free CPU.
> >
> > Another piece of information: offloading the heartbeat to a 1Gb NIC doesn't
> > solve the problem. What's more, when we run a random write test, we
> > can still see this flipping happen. So I would say it may be
> > related to the CPU scheduler: the heartbeat thread (in a busy OSD)
> > fails to get enough CPU cycles.
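> > (For reference, a minimal ceph.conf sketch of a heartbeat offload of this kind -- the
> > addresses are illustrative, and "osd heartbeat addr" is assumed to be the option that
> > binds the heartbeat messenger to the 1Gb interface:)
> >
> >     [global]
> >         public network  = 192.168.10.0/24    # client-facing 10Gb
> >         cluster network = 192.168.20.0/24    # replication 10Gb
> >     [osd.319]
> >         # bind this OSD's heartbeats to its address on the separate 1Gb network
> >         osd heartbeat addr = 192.168.30.19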
>
> Can you try reproducing with logs? If we can reproduce that will give us
> some clues.
>
> sage
>
> >
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Sage Weil
> > Sent: May 15, 2013 7:23
> > To: Chen, Xiaoxi
> > Cc: Mark Nelson; [email protected]; [email protected]
> > Subject: Re: [ceph-users] OSD state flipping when cluster-network in
> > high utilization
> >
> > On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> > > I like the idea of leaving the ping in the cluster network because it can
> > > help us detect switch/NIC failures.
> > >
> > > What confuses me is that I keep pinging every Ceph node's cluster IP, and it
> > > is fine during the whole run with less than 1 ms latency, so why does the
> > > heartbeat still suffer? Top shows my CPU is not 100% utilized (with
> > > ~30% iowait). Enabling jumbo frames **seems** to make things
> > > worse (just a feeling, no data to support it).
> >
> > I say "ping" in teh general sense.. it's not using ICMP, but sending small
> > messages over a TCP session, doing some minimal processing on the other
> > end, and sending them back. If the machine is heavily loaded and that
> > thread doesn't get scheduled or somehow blocks, it may be problematic.
> >
> > How responsive generally is the machine under load? Is there available CPU?
> >
> > We can try to enable debugging to see what is going on.. 'debug ms = 1'
> > and 'debug osd = 20' is everything we would need, but will incur additional
> > load itself and may spoil the experiment...
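> > (As a sketch, the usual route would be injectargs on a running OSD, or ceph.conf for a
> > persistent setting -- osd.319 is just the example OSD from the log above:)
> >
> >     # turn debugging up on a running OSD without restarting it
> >     ceph tell osd.319 injectargs '--debug-osd 20 --debug-ms 1'
> >
> >     # or persistently in ceph.conf:
> >     [osd]
> >         debug osd = 20
> >         debug ms = 1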
> >
> > sage
> >
> > >
> > > Sent from my iPhone
> > >
> > > On 2013-5-14, at 23:36, "Mark Nelson" <[email protected]> wrote:
> > >
> > > > On 05/14/2013 10:30 AM, Sage Weil wrote:
> > > >> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> > > >>>
> > > >>> Hi
> > > >>>
> > > >>> We are suffering from our OSDs flipping between up and down (OSD
> > > >>> X gets voted down due to 3 missed pings, and after a while it
> > > >>> tells the monitor "map xxx wrongly marked me down"). This happens while we
> > > >>> are running a sequential write performance test on top of RBDs,
> > > >>> and the cluster network NICs are under really high utilization (8Gb/s+
> > > >>> on a 10Gb network).
> > > >>>
> > > >>> Is this expected behavior, or how can I prevent it from
> > > >>> happening?
> > > >>
> > > >> You can increase the heartbeat grace period. The pings are
> > > >> handled by a separate thread on the backside interface (if there is
> > > >> one).
> > > >> If you are missing pings then the network or scheduler is
> > > >> preventing those (small) messages from being processed (there
> > > >> is almost no lock contention in that path). Which means it
> > > >> really is taking ~20 seconds or whatever to handle those
> > > >> messages. It's really a question of how unresponsive you want to
> > > >> permit the OSDs to be before you consider it a failure..
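> > > >> (A minimal ceph.conf sketch of that, value illustrative -- the default grace is 20s:)
> > > >>
> > > >>     [global]
> > > >>         # tolerate slower heartbeat replies before peers report an OSD down
> > > >>         osd heartbeat grace = 40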
> > > >>
> > > >> sage
> > > >>
> > > >>
> > > >
> > > > It might be worth testing out how long pings or other network traffic
> > > > are taking during these tests. There may be some TCP tuning you can
> > > > do here, or even consider using a separate network for the mons.
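> > > > (A rough sketch of the sort of TCP tuning meant here -- standard Linux sysctls,
> > > > values purely illustrative for a 10GbE link:)
> > > >
> > > >     # allow larger TCP send/receive windows on the 10GbE links
> > > >     sysctl -w net.core.rmem_max=16777216
> > > >     sysctl -w net.core.wmem_max=16777216
> > > >     sysctl -w net.ipv4.tcp_rmem='4096 87380 16777216'
> > > >     sysctl -w net.ipv4.tcp_wmem='4096 65536 16777216'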
> > > >
> > > > Mark
> > >
>