Hi Mark, I am currently looking at this issue (Kotresh is busy with some other work), so could you please share the latest log with me?
Thanks, Sunny On Fri, Jul 13, 2018 at 12:41 PM Mark Betham <[email protected]> wrote: > > Hi Kotresh, > > I was wondering if you had found any time to take a look at the issue I am > currently experiencing with geo-replication and memory usage. > > If you require any further information then please do not hesitate to ask. > > Many thanks, > > Mark Betham > > > On Wed, 20 Jun 2018 at 11:27, Mark Betham > <[email protected]> wrote: >> >> Hi Kotresh, >> >> Many thanks for your prompt response. No need to apologise, any help you >> can provide is greatly appreciated. >> >> I look forward to receiving your update next week. >> >> Many thanks, >> >> Mark Betham >> >> On Wed, 20 Jun 2018 at 10:55, Kotresh Hiremath Ravishankar >> <[email protected]> wrote: >>> >>> Hi Mark, >>> >>> Sorry, I was busy and could not take a serious look at the logs. I can >>> update you on Monday. >>> >>> Thanks, >>> Kotresh HR >>> >>> On Wed, Jun 20, 2018 at 12:32 PM, Mark Betham >>> <[email protected]> wrote: >>>> >>>> Hi Kotresh, >>>> >>>> I was wondering if you had made any progress with regards to the issue I >>>> am currently experiencing with geo-replication. >>>> >>>> For info the fault remains and effectively requires a restart of the >>>> geo-replication service on a daily basis to reclaim the used memory on the >>>> slave node. >>>> >>>> If you require any further information then please do not hesitate to ask. >>>> >>>> Many thanks, >>>> >>>> Mark Betham >>>> >>>> >>>> On Mon, 11 Jun 2018 at 08:24, Mark Betham >>>> <[email protected]> wrote: >>>>> >>>>> Hi Kotresh, >>>>> >>>>> Many thanks. I will shortly set up a share on my GDrive and send the link >>>>> directly to yourself. >>>>> >>>>> For Info; >>>>> The Geo-Rep slave failed again over the weekend but it did not recover >>>>> this time. It looks to have become unresponsive at around 14:40 UTC on >>>>> 9th June. I have attached an image showing the mem usage and you can see >>>>> from this when the system failed. The system was totally unresponsive >>>>> and required a cold power off and then power on in order to recover the >>>>> server. >>>>> >>>>> Many thanks for your help. >>>>> >>>>> Mark Betham. >>>>> >>>>> On 11 June 2018 at 05:53, Kotresh Hiremath Ravishankar >>>>> <[email protected]> wrote: >>>>>> >>>>>> Hi Mark, >>>>>> >>>>>> Google drive works for me. >>>>>> >>>>>> Thanks, >>>>>> Kotresh HR >>>>>> >>>>>> On Fri, Jun 8, 2018 at 3:00 PM, Mark Betham >>>>>> <[email protected]> wrote: >>>>>>> >>>>>>> Hi Kotresh, >>>>>>> >>>>>>> The memory issue re-occurred again. This is indicating it will occur >>>>>>> around once a day. >>>>>>> >>>>>>> Again no traceback listed in the log, the only update in the log was as >>>>>>> follows; >>>>>>> [2018-06-08 08:26:43.404261] I [resource(slave):1020:service_loop] >>>>>>> GLUSTER: connection inactive, stopping timeout=120 >>>>>>> [2018-06-08 08:29:19.357615] I [syncdutils(slave):271:finalize] <top>: >>>>>>> exiting. >>>>>>> [2018-06-08 08:31:02.432002] I [resource(slave):1502:connect] GLUSTER: >>>>>>> Mounting gluster volume locally... >>>>>>> [2018-06-08 08:31:03.716967] I [resource(slave):1515:connect] GLUSTER: >>>>>>> Mounted gluster volume duration=1.2729 >>>>>>> [2018-06-08 08:31:03.717411] I [resource(slave):1012:service_loop] >>>>>>> GLUSTER: slave listening >>>>>>> >>>>>>> I have attached an image showing the latest memory usage pattern. >>>>>>> >>>>>>> Can you please advise how I can pass the log data across to you? As >>>>>>> soon as I know this I will get the data uploaded for your review.
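[The remount cycle in the log excerpt above ("connection inactive" through "slave listening") is the main landmark for lining the slave log up against the memory graphs. Below is a minimal, illustrative Python sketch for pulling those event timestamps out of a slave log; it is not part of GlusterFS, and the log path used is only an assumed example.]

import re

# Illustrative only: extract the connect/disconnect events quoted above
# from a geo-rep slave log so the restart cycle can be correlated with
# the memory-usage graph. The path below is a hypothetical example;
# point it at the real slave-side gsyncd log.
LOG_PATH = "/var/log/glusterfs/geo-replication-slaves/slave.gluster.log"  # hypothetical
EVENTS = ("connection inactive", "exiting.", "Mounting gluster volume",
          "Mounted gluster volume", "slave listening")
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]")

with open(LOG_PATH) as log:
    for line in log:
        if any(event in line for event in EVENTS):
            match = STAMP.match(line)
            if match:
                # e.g. "2018-06-08 08:26:43.404261 [...] connection inactive ..."
                print(match.group(1), line.strip())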
>>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> Mark Betham >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 7 June 2018 at 08:19, Mark Betham >>>>>>> <[email protected]> wrote: >>>>>>>> >>>>>>>> Hi Kotresh, >>>>>>>> >>>>>>>> Many thanks for your prompt response. >>>>>>>> >>>>>>>> Below are my responses to your questions; >>>>>>>> >>>>>>>> 1. Is this trace back consistently hit? I just wanted to confirm >>>>>>>> whether it's transient which occurs once in a while and gets back to >>>>>>>> normal? >>>>>>>> It appears not. As soon as the geo-rep recovered yesterday from the >>>>>>>> high memory usage it immediately began rising again until it consumed >>>>>>>> all of the available ram. But this time nothing was committed to the >>>>>>>> log file. >>>>>>>> I would like to add here that this current instance of geo-rep was >>>>>>>> only brought online at the start of this week due to the issues with >>>>>>>> glibc on CentOS 7.5. This is the first time I have had geo-rep >>>>>>>> running with Gluster ver 3.12.9, both storage clusters at each >>>>>>>> physical site were only rebuilt approx. 4 weeks ago, due to the >>>>>>>> previous version in use going EOL. Prior to this I had been running >>>>>>>> 3.13.2 (3.13.X now EOL) at each of the sites and it is worth noting >>>>>>>> that the same behaviour was also seen on this version of Gluster, >>>>>>>> unfortunately I do not have any of the log data from then but I do not >>>>>>>> recall seeing any instances of the trace back message mentioned. >>>>>>>> >>>>>>>> 2. Please upload the complete geo-rep logs from both master and slave. >>>>>>>> I have the log files, just checking to make sure there is no >>>>>>>> confidential info inside. The logfiles are too big to send via email, >>>>>>>> even when compressed. Do you have a preferred method to allow me to >>>>>>>> share this data with you or would a share from my Google drive be >>>>>>>> sufficient? >>>>>>>> >>>>>>>> 3. Are the gluster versions same across master and slave? >>>>>>>> Yes, all gluster versions are the same across the two sites for all >>>>>>>> storage nodes. See below for version info taken from the current >>>>>>>> geo-rep master. >>>>>>>> >>>>>>>> glusterfs 3.12.9 >>>>>>>> Repository revision: git://git.gluster.org/glusterfs.git >>>>>>>> Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/> >>>>>>>> GlusterFS comes with ABSOLUTELY NO WARRANTY. >>>>>>>> It is licensed to you under your choice of the GNU Lesser >>>>>>>> General Public License, version 3 or any later version (LGPLv3 >>>>>>>> or later), or the GNU General Public License, version 2 (GPLv2), >>>>>>>> in all cases as published by the Free Software Foundation. >>>>>>>> >>>>>>>> glusterfs-geo-replication-3.12.9-1.el7.x86_64 >>>>>>>> glusterfs-gnfs-3.12.9-1.el7.x86_64 >>>>>>>> glusterfs-libs-3.12.9-1.el7.x86_64 >>>>>>>> glusterfs-server-3.12.9-1.el7.x86_64 >>>>>>>> glusterfs-3.12.9-1.el7.x86_64 >>>>>>>> glusterfs-api-3.12.9-1.el7.x86_64 >>>>>>>> glusterfs-events-3.12.9-1.el7.x86_64 >>>>>>>> centos-release-gluster312-1.0-1.el7.centos.noarch >>>>>>>> glusterfs-client-xlators-3.12.9-1.el7.x86_64 >>>>>>>> glusterfs-cli-3.12.9-1.el7.x86_64 >>>>>>>> python2-gluster-3.12.9-1.el7.x86_64 >>>>>>>> glusterfs-rdma-3.12.9-1.el7.x86_64 >>>>>>>> glusterfs-fuse-3.12.9-1.el7.x86_64 >>>>>>>> >>>>>>>> I have also attached another screenshot showing the memory usage from >>>>>>>> the Gluster slave for the last 48 hours. 
This shows memory saturation >>>>>>>> from yesterday, which correlates with the trace back sent yesterday, >>>>>>>> and the subsequent memory saturation which occurred over the last 24 >>>>>>>> hours. For info, all times are in UTC. >>>>>>>> >>>>>>>> Please advise the preferred method to get the log data across to you >>>>>>>> and also if you require any further information. >>>>>>>> >>>>>>>> Many thanks, >>>>>>>> >>>>>>>> Mark Betham >>>>>>>> >>>>>>>> >>>>>>>> On 7 June 2018 at 04:42, Kotresh Hiremath Ravishankar >>>>>>>> <[email protected]> wrote: >>>>>>>>> >>>>>>>>> Hi Mark, >>>>>>>>> >>>>>>>>> Few questions. >>>>>>>>> >>>>>>>>> 1. Is this trace back consistently hit? I just wanted to confirm >>>>>>>>> whether it's transient which occurs once in a while and gets back to >>>>>>>>> normal? >>>>>>>>> 2. Please upload the complete geo-rep logs from both master and slave. >>>>>>>>> 3. Are the gluster versions same across master and slave? >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Kotresh HR >>>>>>>>> >>>>>>>>> On Wed, Jun 6, 2018 at 7:10 PM, Mark Betham >>>>>>>>> <[email protected]> wrote: >>>>>>>>>> >>>>>>>>>> Dear Gluster-Users, >>>>>>>>>> >>>>>>>>>> I have geo-replication setup and configured between 2 Gluster pools >>>>>>>>>> located at different sites. What I am seeing is an error being >>>>>>>>>> reported within the geo-replication slave log as follows; >>>>>>>>>> >>>>>>>>>> [2018-06-05 12:05:26.767615] E >>>>>>>>>> [syncdutils(slave):331:log_raise_exception] <top>: FAIL: >>>>>>>>>> Traceback (most recent call last): >>>>>>>>>> File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", >>>>>>>>>> line 361, in twrap >>>>>>>>>> tf(*aa) >>>>>>>>>> File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line >>>>>>>>>> 1009, in <lambda> >>>>>>>>>> t = syncdutils.Thread(target=lambda: (repce.service_loop(), >>>>>>>>>> File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 90, >>>>>>>>>> in service_loop >>>>>>>>>> self.q.put(recv(self.inf)) >>>>>>>>>> File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 61, >>>>>>>>>> in recv >>>>>>>>>> return pickle.load(inf) >>>>>>>>>> ImportError: No module named >>>>>>>>>> h_2013-04-26-04:02:49-2013-04-26_11:02:53.gz.15WBuUh >>>>>>>>>> [2018-06-05 12:05:26.768085] E [repce(slave):117:worker] <top>: call >>>>>>>>>> failed: >>>>>>>>>> Traceback (most recent call last): >>>>>>>>>> File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line >>>>>>>>>> 113, in worker >>>>>>>>>> res = getattr(self.obj, rmeth)(*in_data[2:]) >>>>>>>>>> TypeError: getattr(): attribute name must be string >>>>>>>>>> >>>>>>>>>> From this point in time the slave server begins to consume all of >>>>>>>>>> its available RAM until it becomes non-responsive. Eventually the >>>>>>>>>> gluster service seems to kill off the offending process and the >>>>>>>>>> memory is returned to the system. Once the memory has been returned >>>>>>>>>> to the remote slave system the geo-replication often recovers and >>>>>>>>>> data transfer resumes. >>>>>>>>>> >>>>>>>>>> I have attached the full geo-replication slave log containing the >>>>>>>>>> error shown above. I have also attached an image file showing the >>>>>>>>>> memory usage of the affected storage server. >>>>>>>>>> >>>>>>>>>> We are currently running Gluster version 3.12.9 on top of CentOS 7.5 >>>>>>>>>> x86_64. The system has been fully patched and is running the latest >>>>>>>>>> software, excluding glibc which had to be downgraded to get >>>>>>>>>> geo-replication working. 
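[For context on the traceback quoted above: the ImportError is raised from pickle.load() inside repce's recv(), and the unpickler will try to import whatever module name it finds in the incoming data, so unexpected bytes on the worker's RPC stream can surface as "No module named <something that looks like a filename>". The following is a minimal sketch of that mechanism only, not GlusterFS code, and the module name in it is deliberately made up.]

import io
import pickle

def recv(stream):
    # Same shape as repce.recv() in the traceback above: read one pickled
    # message off the worker's RPC stream.
    return pickle.load(stream)

# Simulate unexpected bytes arriving on that stream. In protocol-0 pickle,
# 'c<module>\n<name>\n' is the GLOBAL opcode, so the unpickler tries to
# import whatever text it reads as a module name. The name here is
# deliberately bogus.
wire = io.BytesIO(b"cnot_a_real_module\nsome_attr\n.")
try:
    recv(wire)
except ImportError as err:
    print("unpickling unexpected data raised:", err)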
>>>>>>>>>> >>>>>>>>>> The Gluster volume runs on a dedicated partition using the XFS >>>>>>>>>> filesystem which in turn is running on a LVM thin volume. The >>>>>>>>>> physical storage is presented as a single drive due to the >>>>>>>>>> underlying disks being part of a raid 10 array. >>>>>>>>>> >>>>>>>>>> The Master volume which is being replicated has a total of 2.2 TB of >>>>>>>>>> data to be replicated. The total size of the volume fluctuates very >>>>>>>>>> little as data being removed equals the new data coming in. This >>>>>>>>>> data is made up of many thousands of files across many separated >>>>>>>>>> directories. Data file sizes vary from the very small (>1K) to the >>>>>>>>>> large (>1Gb). The Gluster service itself is running with a single >>>>>>>>>> volume in a replicated configuration across 3 bricks at each of the >>>>>>>>>> sites. The delta changes being replicated are on average about >>>>>>>>>> 100GB per day, where this includes file creation / deletion / >>>>>>>>>> modification. >>>>>>>>>> >>>>>>>>>> The config for the geo-replication session is as follows, taken from >>>>>>>>>> the current source server; >>>>>>>>>> >>>>>>>>>> special_sync_mode: partial >>>>>>>>>> gluster_log_file: >>>>>>>>>> /var/log/glusterfs/geo-replication/glustervol0/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1.gluster.log >>>>>>>>>> ssh_command: ssh -oPasswordAuthentication=no >>>>>>>>>> -oStrictHostKeyChecking=no -i >>>>>>>>>> /var/lib/glusterd/geo-replication/secret.pem >>>>>>>>>> change_detector: changelog >>>>>>>>>> session_owner: 40e9e77a-034c-44a2-896e-59eec47e8a84 >>>>>>>>>> state_file: >>>>>>>>>> /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/monitor.status >>>>>>>>>> gluster_params: aux-gfid-mount acl >>>>>>>>>> log_rsync_performance: true >>>>>>>>>> remote_gsyncd: /nonexistent/gsyncd >>>>>>>>>> working_dir: >>>>>>>>>> /var/lib/misc/glusterfsd/glustervol0/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1 >>>>>>>>>> state_detail_file: >>>>>>>>>> /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1-detail.status >>>>>>>>>> gluster_command_dir: /usr/sbin/ >>>>>>>>>> pid_file: >>>>>>>>>> /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/monitor.pid >>>>>>>>>> georep_session_working_dir: >>>>>>>>>> /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/ >>>>>>>>>> ssh_command_tar: ssh -oPasswordAuthentication=no >>>>>>>>>> -oStrictHostKeyChecking=no -i >>>>>>>>>> /var/lib/glusterd/geo-replication/tar_ssh.pem >>>>>>>>>> master.stime_xattr_name: >>>>>>>>>> trusted.glusterfs.40e9e77a-034c-44a2-896e-59eec47e8a84.ccfaed9b-ff4b-4a55-acfa-03f092cdf460.stime >>>>>>>>>> changelog_log_file: >>>>>>>>>> /var/log/glusterfs/geo-replication/glustervol0/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1-changes.log >>>>>>>>>> socketdir: /var/run/gluster >>>>>>>>>> volume_id: 40e9e77a-034c-44a2-896e-59eec47e8a84 >>>>>>>>>> ignore_deletes: false >>>>>>>>>> state_socket_unencoded: >>>>>>>>>> /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1.socket >>>>>>>>>> log_file: >>>>>>>>>> /var/log/glusterfs/geo-replication/glustervol0/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1.log 
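[Since the workaround so far has been a daily restart of geo-replication to reclaim memory on the slave, a small sampler along the lines of the sketch below can record the resident set size of the gsyncd processes over time and make it easier to tie the growth to specific log events. It is illustrative only, not part of GlusterFS; the "gsyncd" match string and the 60-second interval are assumptions.]

import os
import time

def gsyncd_rss_kib():
    """Return {pid: VmRSS in KiB} for processes whose cmdline mentions gsyncd."""
    rss = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/cmdline" % pid, "rb") as fh:
                cmdline = fh.read().replace(b"\0", b" ")
            if b"gsyncd" not in cmdline:
                continue
            with open("/proc/%s/status" % pid) as fh:
                for line in fh:
                    if line.startswith("VmRSS:"):
                        rss[int(pid)] = int(line.split()[1])  # value reported in kB
                        break
        except (IOError, OSError):
            continue  # process exited between listdir() and open()
    return rss

if __name__ == "__main__":
    # Sample once a minute; the interval is an arbitrary assumption.
    while True:
        print(time.strftime("%Y-%m-%d %H:%M:%S"), gsyncd_rss_kib())
        time.sleep(60)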
>>>>>>>>>> >>>>>>>>>> If any further information is required in order to troubleshoot this >>>>>>>>>> issue then please let me know. >>>>>>>>>> >>>>>>>>>> I would be very grateful for any help or guidance received. >>>>>>>>>> >>>>>>>>>> Many thanks, >>>>>>>>>> >>>>>>>>>> Mark Betham.
_______________________________________________ Gluster-users mailing list [email protected] https://lists.gluster.org/mailman/listinfo/gluster-users
