On Wed, 26 Oct 2016, Trygve Vea wrote:
> Hi,
>
> We have two Ceph clusters: one exposing pools for both RGW and RBD
> (OpenStack/KVM), and one only for RBD.
>
> After upgrading both to Jewel, we have seen a significantly increased
> CPU footprint on the OSDs that are part of the cluster which includes
> RGW.
>
> This graph illustrates this: http://i.imgur.com/Z81LW5Y.png
That looks pretty significant!  This doesn't ring any bells--I don't
think it's something we've seen.  Can you do a 'perf top -p `pidof
ceph-osd`' on one of the OSDs and grab a snapshot of the output?  It
would be nice to compare to hammer, but I expect you've long since
upgraded all of the OSDs...

sage

> I wonder if anyone else has seen this behaviour, and if this is a
> symptom of a regression --- or if this was to be expected after moving
> from hammer to jewel.
>
> I have also observed that an OSD will occasionally be marked as down,
> but will recover by itself.
>
> This manifests itself in the osd logs as a series of lines like this:
>
> 2016-10-26 06:32:20.106602 7fa57a942700 1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fa575938700' had timed out after 15
>
> Some slow requests may be observed:
>
> 2016-10-26 06:32:35.899597 7fa5aa41b700 0 log_channel(cluster) log [WRN] : 1
> slow requests, 1 included below; oldest blocked for > 30.905777 secs
> 2016-10-26 06:32:35.899605 7fa5aa41b700 0 log_channel(cluster) log [WRN] :
> slow request 30.905777 seconds old, received at 2016-10-26 06:32:04.993791:
> replica scrub(pg:
> 3.2e,from:0'0,to:27810'772752,epoch:28538,start:3:74000000::::head,end:3:7400039b::::0,chunky:1,deep:1,seed:4294967295,version:6)
> currently reached_pg
>
> Some failing heartbeat_checks (usually only from a single osd):
>
> 2016-10-26 06:32:39.323412 7fa56f92c700 -1 osd.19 28538 heartbeat_check: no
> reply from osd.15 since back 2016-10-26 06:32:19.017249 front 2016-10-26
> 06:32:19.017249 (cutoff 2016-10-26 06:32:19.323409)
>
> A bunch of these (with the remote address targeting different osds):
>
> 2016-10-26 06:32:45.522391 7fa598ec0700 0 -- 169.254.169.254:6812/151031797
> >> 169.254.169.255:6802/41700 pipe(0x7fa5ebba7400 sd=160 :6812 s=2 pgs=4298
> cs=1 l=0 c=0x7fa5d7c26400).fault with nothing to send, going to standby
>
> 2016-10-26 06:32:45.525524 7fa5a5158700 0 log_channel(cluster) log [WRN] :
> map e28540 wrongly marked me down
>
> Followed by repeering, and then everything is fine again.
>
> I wonder if anyone has been suffering from similar behaviour, and
> whether this is a bug (known or unknown). One detail to keep in mind
> is that the osds for the rgw pools store replicas on different
> physical sites. However, we have no reason to believe that saturation
> or high latency is a problem.
>
> Regards
> --
> Trygve Vea
> Redpill Linpro AS
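If a live 'perf top' screen is awkward to capture, a recorded profile
works just as well for sharing -- a minimal sketch, assuming a single
ceph-osd process per host and that perf is installed:

    # Sample one OSD for 30 seconds, then dump a text report to attach.
    # (With several OSDs per host, pick one pid instead of pidof's list.)
    pid=$(pidof ceph-osd)
    perf record -g -p "$pid" -- sleep 30
    perf report --stdio > osd-perf-report.txt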
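Also, the slow request above is a replica deep scrub (deep:1), so one
quick experiment -- assuming you can tolerate pausing scrubs for a
while -- is to check whether the heartbeat timeouts stop while
scrubbing is off:

    # Pause all scrubbing cluster-wide and watch the osd logs for the
    # 'heartbeat_map is_healthy ... had timed out' lines to disappear.
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # ...observe for a while, then re-enable...
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub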
