So I've narrowed this down a bit further. I *think* this is happening during bucket listing: I started a radosgw process with increased logging and killed it as soon as I saw the RSS jump. The jump was accompanied by a ton of log lines from 'RGWRados::cls_bucket_list' printing out the names of the files in one of the buckets, probably 5000 lines in total.
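For anyone who wants to repeat the "stop it as soon as RSS jumps" step without watching top by hand, here is a rough sketch in Python. It assumes a Linux host where /proc/<pid>/status exposes a VmRSS line. The function names, the 2x threshold and the one-second polling interval are illustrative choices only, not anything taken from radosgw or Ceph.

```python
import re
import time
from typing import Optional


def parse_vm_rss_kib(status_text: str) -> Optional[int]:
    """Return the VmRSS value (in KiB) from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid: int) -> Optional[int]:
    """Read the current RSS of a process; None if it has exited."""
    try:
        with open(f"/proc/{pid}/status") as handle:
            return parse_vm_rss_kib(handle.read())
    except FileNotFoundError:
        return None


def exceeds_baseline(rss_kib: int, baseline_kib: int, factor: float = 2.0) -> bool:
    """True once RSS has grown past factor * baseline (hypothetical threshold)."""
    return rss_kib > baseline_kib * factor


def wait_for_jump(pid: int, baseline_kib: int, factor: float = 2.0,
                  interval_s: float = 1.0) -> Optional[int]:
    """Poll until RSS exceeds the threshold; return that RSS, or None on exit."""
    while True:
        rss_kib = read_rss_kib(pid)
        if rss_kib is None:
            return None
        if exceeds_baseline(rss_kib, baseline_kib, factor):
            return rss_kib
        time.sleep(interval_s)
```

With a baseline of roughly 1084M RSS (the steady-state figure quoted further down the thread), calling wait_for_jump(pid, 1084 * 1024) would return once the process passes about 2G, at which point the process can be stopped and its log inspected.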
The OP of the request that generated the bucket list was '25RGWListBucket_ObjStore_S3', and it appears to have been made by one of the RGW nodes in the other site.

Any ideas?

Ben.

On Mon, 27 Jun 2016 at 10:47 Ben Agricola <[email protected]> wrote:

> Hi Pritha,
>
> Urgh, not sure what happened to the formatting there - let's try again.
>
> At the time, the 'primary' cluster (i.e. the one with the active data set)
> was receiving backup files from a small number of machines, prior to
> replication being enabled it was using ~10% RAM on the RadosGW boxes.
>
> Without replication enabled, neither cluster sees any spikes in memory
> usage under normal operation, with a slight increase when deep scrubbing
> (I'm monitoring cluster memory usage as a whole so OSD memory increases
> would account for that).
>
> Neither cluster was performing a deep scrub at the time. The 'secondary'
> cluster (i.e. the one I was trying to sync data to, which now has
> replication disabled again) has now had a RadosGW process running under
> normal load since June 17 with replication disabled and is using 1084M
> RSS. This matches with historical graphing for the primary cluster, which
> has hovered around 1G RSS for RadosGW processes for the last 6 months.
>
> I've just tested this out this morning and enabling replication caused all
> RadosGW processes to increase in memory usage (and continue increasing)
> from ~1000M RSS to ~20G RSS in about 2 minutes. As soon as replication is
> enabled (as in, within seconds) RSS of RadosGW on both clusters starts to
> increase and does not drop. This appears to happen during metadata sync
> as well as during normal data syncing.
>
> I then killed all RadosGW processes on the 'primary' side, and memory
> usage of the RadosGW processes on the 'secondary' side continue to increase
> in usage at the same rate. There are no further messages in the RadosGW
> log as this is occurring (since there is no client traffic and no further
> replication traffic). If I kill the active RadosGW processes then they
> start back up and normal memory usage resumes.
>
> Cheers,
>
> Ben.
>
> On Mon, 27 Jun 2016 at 10:39 Ben Agricola <[email protected]> wrote:
>
>> Hi Pritha,
>>
>> At the time, the 'primary' cluster (i.e. the one with the active data set)
>> was receiving backup files from a small number of machines, prior to
>> replication being enabled it was using ~10% RAM on the RadosGW boxes.
>>
>> Without replication enabled, neither cluster sees any spikes in memory
>> usage under normal operation, with a slight increase when deep scrubbing
>> (I'm monitoring cluster memory usage as a whole so OSD memory increases
>> would account for that). Neither cluster was performing a deep scrub at
>> the time. The 'secondary' cluster (i.e. the one I was trying to sync data
>> to, which now has replication disabled again) has now had a RadosGW
>> process running under normal load since June 17 with replication disabled
>> and is using 1084M RSS. This matches with historical graphing for the
>> primary cluster, which has hovered around 1G RSS for RadosGW processes
>> for the last 6 months.
>>
>> I've just tested this out this morning and enabling replication caused all
>> RadosGW processes to increase in memory usage (and continue increasing)
>> from ~1000M RSS to ~20G RSS in about 2 minutes. As soon as replication is
>> enabled (as in, within seconds) RSS of RadosGW on both clusters starts to
>> increase and does not drop. This appears to happen during metadata sync
>> as well as during normal data syncing as well.
>>
>> I then killed all RadosGW processes on the 'primary' side, and memory
>> usage of the RadosGW processes on the 'secondary' side continue to
>> increase in usage at the same rate. There are no further messages in the
>> RadosGW log as this is occurring (since there is no client traffic and no
>> further replication traffic).
>>
>> If I kill the active RadosGW processes then they start back up and normal
>> memory usage resumes.
>>
>> Cheers,
>>
>> Ben.
>>
>> ----- Original Message -----
>> > From: "Pritha Srivastava" <prsrivas@...>
>> > To: ceph-users@...
>> > Sent: Monday, June 27, 2016 07:32:23
>> > Subject: Re: [ceph-users] Jewel Multisite RGW Memory Issues
>>
>> > Do you know if the memory usage is high only during load from clients
>> > and is steady otherwise?
>> > What was the nature of the workload at the time of the sync operation?
>>
>> > Thanks,
>> > Pritha
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
