Re: [ceph-users] Jewel Multisite RGW Memory Issues

Ben Agricola Fri, 08 Jul 2016 01:42:59 -0700

So I've narrowed this down a bit further. I *think* this is happening
during bucket listing: I started a radosgw process with increased logging
and killed it as soon as I saw the RSS jump. This was accompanied by a ton
of logs from 'RGWRados::cls_bucket_list' printing out the names of the
files in one of the buckets - probably 5000 lines total.
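
For anyone who wants to catch the same moment without watching top by hand,
here is a minimal sketch of how the RSS jump could be detected. It assumes a
Linux host where /proc/<pid>/status exposes a VmRSS line; the function names
and the 2x threshold are illustrative choices, not anything taken from
radosgw or Ceph tooling.

```python
import re
import time


def parse_vmrss_kib(status_text):
    """Return the VmRSS value in KiB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid):
    """Read the current RSS of a process, or None if it has gone away."""
    try:
        with open(f"/proc/{pid}/status") as f:
            return parse_vmrss_kib(f.read())
    except FileNotFoundError:
        return None


def rss_jumped(baseline_kib, current_kib, factor=2.0):
    """True when RSS exceeds the baseline by `factor` (an assumed threshold)."""
    return current_kib > baseline_kib * factor


def watch(pid, interval_s=1.0, factor=2.0):
    """Poll RSS until it jumps past the baseline; return the RSS seen, or None."""
    baseline = read_rss_kib(pid)
    if baseline is None:
        return None  # no such process, or no VmRSS line to read
    while True:
        time.sleep(interval_s)
        current = read_rss_kib(pid)
        if current is None:
            return None  # process exited while we were watching
        if rss_jumped(baseline, current, factor):
            return current
```

Calling watch() with the PID of the radosgw process blocks until RSS passes
the threshold, at which point the process can be stopped and the logs
inspected.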


The OP of the request that generated the bucket list was
'25RGWListBucket_ObjStore_S3', and appears to have been made by one of the
RGW nodes in the other site.

Any ideas?

Ben.


On Mon, 27 Jun 2016 at 10:47 Ben Agricola <[email protected]> wrote:

> Hi Pritha,
>
> Urgh, not sure what happened to the formatting there - let's try again.
>
> At the time, the 'primary' cluster (i.e. the one with the active data set)
> was receiving backup files from a small number of machines. Prior to
> replication being enabled, it was using ~10% RAM on the RadosGW boxes.
>
> Without replication enabled, neither cluster sees any spikes in memory
> usage under normal operation, with a slight increase when deep scrubbing
> (I'm monitoring cluster memory usage as a whole so OSD memory increases
> would account for that).
>
> Neither cluster was performing a deep scrub at the time. The 'secondary'
> cluster (i.e. the one I was trying to sync data to, which now has
> replication disabled again) has now had a RadosGW process running under
> normal load since June 17 with replication disabled and is using 1084M
> RSS. This matches with historical graphing for the primary cluster, which
> has hovered around 1G RSS for RadosGW processes for the last 6 months.
>
> I've just tested this out this morning and enabling replication caused all
> RadosGW processes to increase in memory usage (and continue increasing)
> from ~1000M RSS to ~20G RSS in about 2 minutes. As soon as replication is
> enabled (as in, within seconds) RSS of RadosGW on both clusters starts to
> increase and does not drop. This appears to happen during metadata sync
> as well as during normal data syncing.
>
>
> I then killed all RadosGW processes on the 'primary' side, and memory
> usage of the RadosGW processes on the 'secondary' side continued to
> increase at the same rate. There are no further messages in the RadosGW
> log as this is occurring (since there is no client traffic and no further
> replication traffic). If I kill the active RadosGW processes then they
> start back up and normal memory usage resumes.
>
> Cheers,
>
> Ben.
>
>
> On Mon, 27 Jun 2016 at 10:39 Ben Agricola <[email protected]> wrote:
>
>>
>> ----- Original Message -----
>> > From: "Pritha Srivastava" <prsrivas@...>
>> > To: ceph-users@...
>> > Sent: Monday, June 27, 2016 07:32:23
>> > Subject: Re: [ceph-users] Jewel Multisite RGW Memory Issues
>>
>> > Do you know if the memory usage is high only during load from clients and
>> > is steady otherwise?
>> > What was the nature of the workload at the time of the sync operation?
>>
>> > Thanks,
>> > Pritha
>>
>>

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
