On Fri, Jul 17, 2015 at 11:15 AM, Quentin Hartman
<[email protected]> wrote:
> That looks a lot like what I was seeing initially. The OSDs getting marked
> out was relatively rare and it took a bit before I saw it.
Our problem happens "most of the time" and does not appear to be
confined to a specific Ceph cluster node or OSD:
$ sudo fgrep 'waiting for subops' ceph.log | sed -e 's/.* v4 //' |
sort | uniq -c | sort -n
1 currently waiting for subops from 0
1 currently waiting for subops from 10
1 currently waiting for subops from 11
1 currently waiting for subops from 12
1 currently waiting for subops from 3
1 currently waiting for subops from 7
2 currently waiting for subops from 13
2 currently waiting for subops from 16
2 currently waiting for subops from 4
3 currently waiting for subops from 15
4 currently waiting for subops from 6
4 currently waiting for subops from 8
7 currently waiting for subops from 2
Node f16: 0, 2, and 3 (3 out of 4)
Node f17: 4, 6, 7, 8, 10, 11, 12, 13 and 15 (9 out of 12)
Node f18: 16 (1 out of 12)
So f18 seems like the odd man out, in that it has *fewer* problems
than the other two.
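For what it's worth, the per-node grouping above was done by hand; a small
awk sketch could do it automatically. The OSD-to-node ranges used here
(f16: 0-3, f17: 4-15, f18: 16-27) are only inferred from the "N out of M"
counts above and may not match the real CRUSH layout, and the sample log
lines just stand in for the actual ceph.log:

```shell
# Sample stand-in for ceph.log (real lines have more fields before "v4").
cat > /tmp/sample.log <<'EOF'
2015-07-17 08:52:05 osd.2 v4 currently waiting for subops from 2
2015-07-17 08:52:06 osd.2 v4 currently waiting for subops from 2
2015-07-17 08:53:01 osd.9 v4 currently waiting for subops from 4
2015-07-17 08:54:10 osd.0 v4 currently waiting for subops from 16
EOF
# Tally "waiting for subops" complaints per node, using the assumed
# OSD-id ranges for each host.
awk '/waiting for subops from/ {
       osd = $NF + 0                 # last field is the blamed OSD id
       if (osd <= 3)       node = "f16"
       else if (osd <= 15) node = "f17"
       else                node = "f18"
       count[node]++
     }
     END { for (n in count) print n, count[n] }' /tmp/sample.log | sort
```

(If a real message blames several OSDs at once, e.g. "from 1,2", the
`$NF + 0` coercion only keeps the first id, so this is strictly a sketch.)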
There are a grand total of 2 RX errors across all the interfaces on
all three machines. (Each one has dual 10G interfaces bonded together
as active/failover.)
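(The RX-error tally came from reading the interface counters; something
like the following sysfs loop, run on each host, gives the same numbers.
Interface names will of course differ per machine.)

```shell
# Sum RX errors across all interfaces on this host, from the standard
# sysfs statistics counters.
total=0
for f in /sys/class/net/*/statistics/rx_errors; do
  [ -r "$f" ] || continue          # skip if the glob matched nothing
  n=$(cat "$f")
  printf '%s %s\n' "$f" "$n"
  total=$((total + n))
done
echo "total rx_errors: $total"
```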
The OSD log for the worst offender above (osd.2) says:
2015-07-17 08:52:05.441607 7f562ea0c700 0 log [WRN] : 1 slow
requests, 1 included below; oldest blocked for > 30.119568 secs
2015-07-17 08:52:05.441622 7f562ea0c700 0 log [WRN] : slow request
30.119568 seconds old, received at 2015-07-17 08:51:35.321991:
osd_sub_op(client.32913524.0:3149584 2.249
2792c249/rbd_data.15322ae8944a.000000000011b487/head//2 [] v
10705'944603 snapset=0=[]:[] snapc=0=[]) v11 currently started
2015-07-17 08:52:43.229770 7f560833f700 0 --
192.168.2.216:6813/16029552 >> 192.168.2.218:6810/7028653
pipe(0x25265180 sd=25 :6813 s=2 pgs=23894 cs=41 l=0
c=0x22be4c60).fault with nothing to send, going to standby
There are a bunch of those "fault with nothing to send, going to
standby" messages.
> The messages were like "So-and-so incorrectly marked us
> out" IIRC.
Nothing like that. Nor, with "ceph -w" running constantly, is there
any reference to anything being marked out at any point, even when the
problems are severe.
Thanks!
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com