Hi Mark, I am currently looking at this issue (Kotresh is busy with some other work), so could you please share the latest log with me?
Thanks, Sunny On Fri, Jul 13, 2018 at 12:41 PM Mark Betham <[email protected]> wrote: > > Hi Kotresh, > > I was wondering if you had found any time to take a look at the issue I am > currently experiencing with geo-replication and memory usage. > > If you require any further information then please do not hesitate to ask. > > Many thanks, > > Mark Betham > > > On Wed, 20 Jun 2018 at 11:27, Mark Betham > <[email protected]> wrote: >> >> Hi Kotresh, >> >> Many thanks for your prompt response. No need to apologise, any help you >> can provide is greatly appreciated. >> >> I look forward to receiving your update next week. >> >> Many thanks, >> >> Mark Betham >> >> On Wed, 20 Jun 2018 at 10:55, Kotresh Hiremath Ravishankar >> <[email protected]> wrote: >>> >>> Hi Mark, >>> >>> Sorry, I was busy and could not take a serious look at the logs. I can >>> update you on Monday. >>> >>> Thanks, >>> Kotresh HR >>> >>> On Wed, Jun 20, 2018 at 12:32 PM, Mark Betham >>> <[email protected]> wrote: >>>> >>>> Hi Kotresh, >>>> >>>> I was wondering if you had made any progress with regards to the issue I >>>> am currently experiencing with geo-replication. >>>> >>>> For info the fault remains and effectively requires a restart of the >>>> geo-replication service on a daily basis to reclaim the used memory on the >>>> slave node. >>>> >>>> If you require any further information then please do not hesitate to ask. >>>> >>>> Many thanks, >>>> >>>> Mark Betham >>>> >>>> >>>> On Mon, 11 Jun 2018 at 08:24, Mark Betham >>>> <[email protected]> wrote: >>>>> >>>>> Hi Kotresh, >>>>> >>>>> Many thanks. I will shortly set up a share on my GDrive and send the link >>>>> directly to yourself. >>>>> >>>>> For Info; >>>>> The Geo-Rep slave failed again over the weekend but it did not recover >>>>> this time. It looks to have become unresponsive at around 14:40 UTC on >>>>> 9th June. I have attached an image showing the mem usage and you can see >>>>> from this when the system failed. The system was totally unresponsive >>>>> and required a cold power off and then power on in order to recover the >>>>> server. >>>>> >>>>> Many thanks for your help. >>>>> >>>>> Mark Betham. >>>>> >>>>> On 11 June 2018 at 05:53, Kotresh Hiremath Ravishankar >>>>> <[email protected]> wrote: >>>>>> >>>>>> Hi Mark, >>>>>> >>>>>> Google drive works for me. >>>>>> >>>>>> Thanks, >>>>>> Kotresh HR >>>>>> >>>>>> On Fri, Jun 8, 2018 at 3:00 PM, Mark Betham >>>>>> <[email protected]> wrote: >>>>>>> >>>>>>> Hi Kotresh, >>>>>>> >>>>>>> The memory issue re-occurred again. This is indicating it will occur >>>>>>> around once a day. >>>>>>> >>>>>>> Again no traceback listed in the log, the only update in the log was as >>>>>>> follows; >>>>>>> [2018-06-08 08:26:43.404261] I [resource(slave):1020:service_loop] >>>>>>> GLUSTER: connection inactive, stopping timeout=120 >>>>>>> [2018-06-08 08:29:19.357615] I [syncdutils(slave):271:finalize] <top>: >>>>>>> exiting. >>>>>>> [2018-06-08 08:31:02.432002] I [resource(slave):1502:connect] GLUSTER: >>>>>>> Mounting gluster volume locally... >>>>>>> [2018-06-08 08:31:03.716967] I [resource(slave):1515:connect] GLUSTER: >>>>>>> Mounted gluster volume duration=1.2729 >>>>>>> [2018-06-08 08:31:03.717411] I [resource(slave):1012:service_loop] >>>>>>> GLUSTER: slave listening >>>>>>> >>>>>>> I have attached an image showing the latest memory usage pattern. >>>>>>> >>>>>>> Can you please advise how I can pass the log data across to you? As >>>>>>> soon as I know this I will get the data uploaded for your review.
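[The remount cycle in the log excerpt above ("connection inactive" through "slave listening") is the main landmark for lining the slave log up against the memory graphs. Below is a minimal, illustrative Python sketch for pulling those event timestamps out of a slave log; it is not part of GlusterFS, and the log path used is only an assumed example.]

import re

# Illustrative only: extract the connect/disconnect events quoted above
# from a geo-rep slave log so the restart cycle can be correlated with
# the memory-usage graph. The path below is a hypothetical example;
# point it at the real slave-side gsyncd log.
LOG_PATH = "/var/log/glusterfs/geo-replication-slaves/slave.gluster.log"  # hypothetical
EVENTS = ("connection inactive", "exiting.", "Mounting gluster volume",
          "Mounted gluster volume", "slave listening")
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]")

with open(LOG_PATH) as log:
    for line in log:
        if any(event in line for event in EVENTS):
            match = STAMP.match(line)
            if match:
                # e.g. "2018-06-08 08:26:43.404261 [...] connection inactive ..."
                print(match.group(1), line.strip())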
>>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> Mark Betham >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 7 June 2018 at 08:19, Mark Betham >>>>>>> <[email protected]> wrote: >>>>>>>> >>>>>>>> Hi Kotresh, >>>>>>>> >>>>>>>> Many thanks for your prompt response. >>>>>>>> >>>>>>>> Below are my responses to your questions; >>>>>>>> >>>>>>>> 1. Is this trace back consistently hit? I just wanted to confirm >>>>>>>> whether it's transient which occurs once in a while and gets back to >>>>>>>> normal? >>>>>>>> It appears not. As soon as the geo-rep recovered yesterday from the >>>>>>>> high memory usage it immediately began rising again until it consumed >>>>>>>> all of the available ram. But this time nothing was committed to the >>>>>>>> log file. >>>>>>>> I would like to add here that this current instance of geo-rep was >>>>>>>> only brought online at the start of this week due to the issues with >>>>>>>> glibc on CentOS 7.5. This is the first time I have had geo-rep >>>>>>>> running with Gluster ver 3.12.9, both storage clusters at each >>>>>>>> physical site were only rebuilt approx. 4 weeks ago, due to the >>>>>>>> previous version in use going EOL. Prior to this I had been running >>>>>>>> 3.13.2 (3.13.X now EOL) at each of the sites and it is worth noting >>>>>>>> that the same behaviour was also seen on this version of Gluster, >>>>>>>> unfortunately I do not have any of the log data from then but I do not >>>>>>>> recall seeing any instances of the trace back message mentioned. >>>>>>>> >>>>>>>> 2. Please upload the complete geo-rep logs from both master and slave. >>>>>>>> I have the log files, just checking to make sure there is no >>>>>>>> confidential info inside. The logfiles are too big to send via email, >>>>>>>> even when compressed. Do you have a preferred method to allow me to >>>>>>>> share this data with you or would a share from my Google drive be >>>>>>>> sufficient? >>>>>>>> >>>>>>>> 3. Are the gluster versions same across master and slave? >>>>>>>> Yes, all gluster versions are the same across the two sites for all >>>>>>>> storage nodes. See below for version info taken from the current >>>>>>>> geo-rep master. >>>>>>>> >>>>>>>> glusterfs 3.12.9 >>>>>>>> Repository revision: git://git.gluster.org/glusterfs.git >>>>>>>> Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/> >>>>>>>> GlusterFS comes with ABSOLUTELY NO WARRANTY. >>>>>>>> It is licensed to you under your choice of the GNU Lesser >>>>>>>> General Public License, version 3 or any later version (LGPLv3 >>>>>>>> or later), or the GNU General Public License, version 2 (GPLv2), >>>>>>>> in all cases as published by the Free Software Foundation. >>>>>>>> >>>>>>>> glusterfs-geo-replication-3.12.9-1.el7.x86_64 >>>>>>>> glusterfs-gnfs-3.12.9-1.el7.x86_64 >>>>>>>> glusterfs-libs-3.12.9-1.el7.x86_64 >>>>>>>> glusterfs-server-3.12.9-1.el7.x86_64 >>>>>>>> glusterfs-3.12.9-1.el7.x86_64 >>>>>>>> glusterfs-api-3.12.9-1.el7.x86_64 >>>>>>>> glusterfs-events-3.12.9-1.el7.x86_64 >>>>>>>> centos-release-gluster312-1.0-1.el7.centos.noarch >>>>>>>> glusterfs-client-xlators-3.12.9-1.el7.x86_64 >>>>>>>> glusterfs-cli-3.12.9-1.el7.x86_64 >>>>>>>> python2-gluster-3.12.9-1.el7.x86_64 >>>>>>>> glusterfs-rdma-3.12.9-1.el7.x86_64 >>>>>>>> glusterfs-fuse-3.12.9-1.el7.x86_64 >>>>>>>> >>>>>>>> I have also attached another screenshot showing the memory usage from >>>>>>>> the Gluster slave for the last 48 hours. 
This shows memory saturation >>>>>>>> from yesterday, which correlates with the trace back sent yesterday, >>>>>>>> and the subsequent memory saturation which occurred over the last 24 >>>>>>>> hours. For info, all times are in UTC. >>>>>>>> >>>>>>>> Please advise the preferred method to get the log data across to you >>>>>>>> and also if you require any further information. >>>>>>>> >>>>>>>> Many thanks, >>>>>>>> >>>>>>>> Mark Betham >>>>>>>> >>>>>>>> >>>>>>>> On 7 June 2018 at 04:42, Kotresh Hiremath Ravishankar >>>>>>>> <[email protected]> wrote: >>>>>>>>> >>>>>>>>> Hi Mark, >>>>>>>>> >>>>>>>>> Few questions. >>>>>>>>> >>>>>>>>> 1. Is this trace back consistently hit? I just wanted to confirm >>>>>>>>> whether it's transient which occurs once in a while and gets back to >>>>>>>>> normal? >>>>>>>>> 2. Please upload the complete geo-rep logs from both master and slave. >>>>>>>>> 3. Are the gluster versions same across master and slave? >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Kotresh HR >>>>>>>>> >>>>>>>>> On Wed, Jun 6, 2018 at 7:10 PM, Mark Betham >>>>>>>>> <[email protected]> wrote: >>>>>>>>>> >>>>>>>>>> Dear Gluster-Users, >>>>>>>>>> >>>>>>>>>> I have geo-replication setup and configured between 2 Gluster pools >>>>>>>>>> located at different sites. What I am seeing is an error being >>>>>>>>>> reported within the geo-replication slave log as follows; >>>>>>>>>> >>>>>>>>>> [2018-06-05 12:05:26.767615] E >>>>>>>>>> [syncdutils(slave):331:log_raise_exception] <top>: FAIL: >>>>>>>>>> Traceback (most recent call last): >>>>>>>>>> File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", >>>>>>>>>> line 361, in twrap >>>>>>>>>> tf(*aa) >>>>>>>>>> File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line >>>>>>>>>> 1009, in <lambda> >>>>>>>>>> t = syncdutils.Thread(target=lambda: (repce.service_loop(), >>>>>>>>>> File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 90, >>>>>>>>>> in service_loop >>>>>>>>>> self.q.put(recv(self.inf)) >>>>>>>>>> File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 61, >>>>>>>>>> in recv >>>>>>>>>> return pickle.load(inf) >>>>>>>>>> ImportError: No module named >>>>>>>>>> h_2013-04-26-04:02:49-2013-04-26_11:02:53.gz.15WBuUh >>>>>>>>>> [2018-06-05 12:05:26.768085] E [repce(slave):117:worker] <top>: call >>>>>>>>>> failed: >>>>>>>>>> Traceback (most recent call last): >>>>>>>>>> File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line >>>>>>>>>> 113, in worker >>>>>>>>>> res = getattr(self.obj, rmeth)(*in_data[2:]) >>>>>>>>>> TypeError: getattr(): attribute name must be string >>>>>>>>>> >>>>>>>>>> From this point in time the slave server begins to consume all of >>>>>>>>>> its available RAM until it becomes non-responsive. Eventually the >>>>>>>>>> gluster service seems to kill off the offending process and the >>>>>>>>>> memory is returned to the system. Once the memory has been returned >>>>>>>>>> to the remote slave system the geo-replication often recovers and >>>>>>>>>> data transfer resumes. >>>>>>>>>> >>>>>>>>>> I have attached the full geo-replication slave log containing the >>>>>>>>>> error shown above. I have also attached an image file showing the >>>>>>>>>> memory usage of the affected storage server. >>>>>>>>>> >>>>>>>>>> We are currently running Gluster version 3.12.9 on top of CentOS 7.5 >>>>>>>>>> x86_64. The system has been fully patched and is running the latest >>>>>>>>>> software, excluding glibc which had to be downgraded to get >>>>>>>>>> geo-replication working. 
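[For context on the traceback quoted above: the ImportError is raised from pickle.load() inside repce's recv(), and the unpickler will try to import whatever module name it finds in the incoming data, so unexpected bytes on the worker's RPC stream can surface as "No module named <something that looks like a filename>". The following is a minimal sketch of that mechanism only, not GlusterFS code, and the module name in it is deliberately made up.]

import io
import pickle

def recv(stream):
    # Same shape as repce.recv() in the traceback above: read one pickled
    # message off the worker's RPC stream.
    return pickle.load(stream)

# Simulate unexpected bytes arriving on that stream. In protocol-0 pickle,
# 'c<module>\n<name>\n' is the GLOBAL opcode, so the unpickler tries to
# import whatever text it reads as a module name. The name here is
# deliberately bogus.
wire = io.BytesIO(b"cnot_a_real_module\nsome_attr\n.")
try:
    recv(wire)
except ImportError as err:
    print("unpickling unexpected data raised:", err)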
>>>>>>>>>> >>>>>>>>>> The Gluster volume runs on a dedicated partition using the XFS >>>>>>>>>> filesystem which in turn is running on a LVM thin volume. The >>>>>>>>>> physical storage is presented as a single drive due to the >>>>>>>>>> underlying disks being part of a raid 10 array. >>>>>>>>>> >>>>>>>>>> The Master volume which is being replicated has a total of 2.2 TB of >>>>>>>>>> data to be replicated. The total size of the volume fluctuates very >>>>>>>>>> little as data being removed equals the new data coming in. This >>>>>>>>>> data is made up of many thousands of files across many separated >>>>>>>>>> directories. Data file sizes vary from the very small (>1K) to the >>>>>>>>>> large (>1Gb). The Gluster service itself is running with a single >>>>>>>>>> volume in a replicated configuration across 3 bricks at each of the >>>>>>>>>> sites. The delta changes being replicated are on average about >>>>>>>>>> 100GB per day, where this includes file creation / deletion / >>>>>>>>>> modification. >>>>>>>>>> >>>>>>>>>> The config for the geo-replication session is as follows, taken from >>>>>>>>>> the current source server; >>>>>>>>>> >>>>>>>>>> special_sync_mode: partial >>>>>>>>>> gluster_log_file: >>>>>>>>>> /var/log/glusterfs/geo-replication/glustervol0/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1.gluster.log >>>>>>>>>> ssh_command: ssh -oPasswordAuthentication=no >>>>>>>>>> -oStrictHostKeyChecking=no -i >>>>>>>>>> /var/lib/glusterd/geo-replication/secret.pem >>>>>>>>>> change_detector: changelog >>>>>>>>>> session_owner: 40e9e77a-034c-44a2-896e-59eec47e8a84 >>>>>>>>>> state_file: >>>>>>>>>> /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/monitor.status >>>>>>>>>> gluster_params: aux-gfid-mount acl >>>>>>>>>> log_rsync_performance: true >>>>>>>>>> remote_gsyncd: /nonexistent/gsyncd >>>>>>>>>> working_dir: >>>>>>>>>> /var/lib/misc/glusterfsd/glustervol0/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1 >>>>>>>>>> state_detail_file: >>>>>>>>>> /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1-detail.status >>>>>>>>>> gluster_command_dir: /usr/sbin/ >>>>>>>>>> pid_file: >>>>>>>>>> /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/monitor.pid >>>>>>>>>> georep_session_working_dir: >>>>>>>>>> /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/ >>>>>>>>>> ssh_command_tar: ssh -oPasswordAuthentication=no >>>>>>>>>> -oStrictHostKeyChecking=no -i >>>>>>>>>> /var/lib/glusterd/geo-replication/tar_ssh.pem >>>>>>>>>> master.stime_xattr_name: >>>>>>>>>> trusted.glusterfs.40e9e77a-034c-44a2-896e-59eec47e8a84.ccfaed9b-ff4b-4a55-acfa-03f092cdf460.stime >>>>>>>>>> changelog_log_file: >>>>>>>>>> /var/log/glusterfs/geo-replication/glustervol0/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1-changes.log >>>>>>>>>> socketdir: /var/run/gluster >>>>>>>>>> volume_id: 40e9e77a-034c-44a2-896e-59eec47e8a84 >>>>>>>>>> ignore_deletes: false >>>>>>>>>> state_socket_unencoded: >>>>>>>>>> /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1.socket >>>>>>>>>> log_file: >>>>>>>>>> /var/log/glusterfs/geo-replication/glustervol0/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1.log 
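[Since the workaround so far has been a daily restart of geo-replication to reclaim memory on the slave, a small sampler along the lines of the sketch below can record the resident set size of the gsyncd processes over time and make it easier to tie the growth to specific log events. It is illustrative only, not part of GlusterFS; the "gsyncd" match string and the 60-second interval are assumptions.]

import os
import time

def gsyncd_rss_kib():
    """Return {pid: VmRSS in KiB} for processes whose cmdline mentions gsyncd."""
    rss = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/cmdline" % pid, "rb") as fh:
                cmdline = fh.read().replace(b"\0", b" ")
            if b"gsyncd" not in cmdline:
                continue
            with open("/proc/%s/status" % pid) as fh:
                for line in fh:
                    if line.startswith("VmRSS:"):
                        rss[int(pid)] = int(line.split()[1])  # value reported in kB
                        break
        except (IOError, OSError):
            continue  # process exited between listdir() and open()
    return rss

if __name__ == "__main__":
    # Sample once a minute; the interval is an arbitrary assumption.
    while True:
        print(time.strftime("%Y-%m-%d %H:%M:%S"), gsyncd_rss_kib())
        time.sleep(60)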
>>>>>>>>>> >>>>>>>>>> If any further information is required in order to troubleshoot this >>>>>>>>>> issue then please let me know. >>>>>>>>>> >>>>>>>>>> I would be very grateful for any help or guidance received. >>>>>>>>>> >>>>>>>>>> Many thanks, >>>>>>>>>> >>>>>>>>>> Mark Betham.
_______________________________________________ Gluster-users mailing list [email protected] https://lists.gluster.org/mailman/listinfo/gluster-users
