Hello Richard,

Thank you for the logs. I am wondering if this could be a different
memory leak than the one addressed in that bug. Would it be possible
for you to capture a statedump of the client so that we can better
understand the memory allocation pattern? Details on gathering a
statedump can be found at [1]. Please make sure that /var/run/gluster
exists before triggering the statedump.
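To trigger the dump on the fuse client, sending SIGUSR1 to the mount
process is the usual way; the files land in /var/run/gluster by
default. Roughly:

    # on the client machine
    pgrep -af glusterfs           # identify the PID of the fuse mount process
    kill -USR1 <pid>              # writes a statedump under /var/run/gluster/

    # optionally, statedumps of the brick processes, from any server
    gluster volume statedump home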
Regards,
Vijay

[1] https://docs.gluster.org/en/v3/Troubleshooting/statedump/

On Fri, Sep 21, 2018 at 12:14 AM Richard Neuboeck <[email protected]> wrote:

> Hi again,
>
> In my limited (non-full-time-programmer) understanding, it's a memory
> leak in the gluster fuse client.
>
> Should I reopen the mentioned bug report or open a new one? Or would
> the community prefer an entirely different approach?
>
> Thanks
> Richard
>
> On 13.09.18 10:07, Richard Neuboeck wrote:
>> Hi,
>>
>> I've created excerpts from the brick and client logs, +/- 1 minute
>> around the kill event. The logs are still ~400-500MB, so I will put
>> them somewhere to download, since I have no idea what I should be
>> looking for and skimming them didn't reveal obvious problems to me.
>>
>> http://www.tbi.univie.ac.at/~hawk/gluster/brick_3min_excerpt.log
>> http://www.tbi.univie.ac.at/~hawk/gluster/mnt_3min_excerpt.log
>>
>> I was pointed in the direction of the following bug report:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1613512
>> It sounds right, but seems to have been addressed already.
>>
>> If there is anything I can do to help solve this problem, please let
>> me know. Thanks for your help!
>>
>> Cheers
>> Richard
>>
>> On 9/11/18 10:10 AM, Richard Neuboeck wrote:
>>> Hi,
>>>
>>> Since I feared that the logs would fill up the partition (again), I
>>> checked the systems daily and finally found the reason. The
>>> glusterfs process on the client runs out of memory and gets killed
>>> by the OOM killer after about four days. Since rsync runs for a
>>> couple of days longer until it ends, I never checked the whole time
>>> frame in the system logs and never stumbled upon the OOM message.
>>>
>>> Running out of memory on a 128GB RAM system, even with a DB
>>> occupying ~40% of that, is kind of strange though. Might there be a
>>> leak?
>>>
>>> This would also explain the erratic behavior I've experienced over
>>> the last 1.5 years while trying to work with our homes on
>>> glusterfs.
>>>
>>> Here is the kernel log message for the killed glusterfs process:
>>> https://gist.github.com/bleuchien/3d2b87985ecb944c60347d5e8660e36a
>>>
>>> I'm checking the brick and client trace logs, but those are 1TB and
>>> 2TB in size respectively, so searching them takes a while. I'll
>>> create gists for both logs around the time the process died.
>>>
>>> As soon as I have more details I'll post them.
>>>
>>> Here you can see a graphical representation of the memory usage of
>>> this system: https://imgur.com/a/4BINtfr
>>>
>>> Cheers
>>> Richard
>>>
>>> On 31.08.18 08:13, Raghavendra Gowdappa wrote:
>>>> On Fri, Aug 31, 2018 at 11:11 AM, Richard Neuboeck
>>>> <[email protected]> wrote:
>>>>> On 08/31/2018 03:50 AM, Raghavendra Gowdappa wrote:
>>>>>> +Mohit. +Milind
>>>>>>
>>>>>> @Mohit/Milind,
>>>>>>
>>>>>> Can you check the logs and see whether you can find anything
>>>>>> relevant?
>>>>>
>>>>> From glances at the system logs, nothing out of the ordinary
>>>>> occurred. However, I'll start another rsync and take a closer
>>>>> look. It will take a few days.
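While that rsync runs, it may also help to sample the client's memory
periodically so we can correlate the growth with the statedumps. A
minimal sketch (the pgrep pattern is only an illustration; match it to
your mount):

    # log the RSS of the glusterfs fuse client once a minute
    PID=$(pgrep -f 'glusterfs.*home' | head -n 1)
    while sleep 60; do
        echo "$(date -u '+%F %T') $(ps -o rss= -p "$PID") kB" >> /var/tmp/glusterfs-rss.log
    done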
>>>>>> On Thu, Aug 30, 2018 at 7:04 PM, Richard Neuboeck
>>>>>> <[email protected]> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm attaching a shortened version, since the whole client mount
>>>>>>> log is about 5.8GB. It includes the initial mount messages and
>>>>>>> the last two minutes of log entries.
>>>>>>>
>>>>>>> It ends very anticlimactically, without an obvious error. Is
>>>>>>> there anything specific I should be looking for?
>>>>>>
>>>>>> Normally I look at the logs around the disconnect messages to
>>>>>> find out the reason. But as you said, sometimes one sees just
>>>>>> disconnect messages without any reason. That normally points to
>>>>>> a cause for the disconnect in the network rather than a
>>>>>> glusterfs-initiated disconnect.
>>>>>
>>>>> The rsync source is currently serving our homes, so there are NFS
>>>>> connections 24/7. There don't seem to be any network-related
>>>>> interruptions -- a co-worker would be here faster than I could
>>>>> check the logs if the connection to the homes were broken ;-)
>>>>> The three gluster machines are, due to this problem, reduced to
>>>>> testing only, so there is nothing else running on them.
>>>>
>>>> Can you set diagnostics.client-log-level and
>>>> diagnostics.brick-log-level to TRACE and check the logs on both
>>>> ends of the connection, client and brick? To reduce the log size,
>>>> I would suggest rotating the existing logs and starting with fresh
>>>> logs just before you begin, so that only relevant entries are
>>>> captured. Also, can you take an strace of the client and brick
>>>> processes using:
>>>>
>>>>     strace -o <outputfile> -ff -v -p <pid>
>>>>
>>>> Attach both the logs and the strace output. Let's trace through
>>>> what the syscalls on the socket return and then decide whether to
>>>> inspect a tcpdump or not. If you don't want to repeat the tests
>>>> again, please capture a tcpdump too (on both ends of the
>>>> connection) and send it to us.
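For reference, the steps Raghavendra describes above translate to
commands roughly like the following; the volume name `home` is taken
from the volume info further down, and the interface and brick port
placeholders should be filled in from your setup (the brick's TCP port
is shown in `gluster volume status`):

    # raise the log levels on both ends
    gluster volume set home diagnostics.client-log-level TRACE
    gluster volume set home diagnostics.brick-log-level TRACE

    # trace syscalls of the client or brick process (one output file
    # per thread, because of -ff)
    strace -o <outputfile> -ff -v -p <pid>

    # optional packet capture, run on both ends of the connection
    tcpdump -i <interface> -w /var/tmp/gluster.pcap port <brick-port>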
>>>>>>> Cheers
>>>>>>> Richard
>>>>>>>
>>>>>>> On 08/30/2018 02:40 PM, Raghavendra Gowdappa wrote:
>>>>>>>> Normally the client logs will give a clue as to why the
>>>>>>>> disconnections are happening (ping timeout, wrong port, etc.).
>>>>>>>> Can you look into the client logs to figure out what's
>>>>>>>> happening? If you can't find anything, can you send the client
>>>>>>>> logs across?
>>>>>>>>
>>>>>>>> On Wed, Aug 29, 2018 at 6:11 PM, Richard Neuboeck
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> Hi Gluster Community,
>>>>>>>>>
>>>>>>>>> I have problems with a glusterfs 'Transport endpoint not
>>>>>>>>> connected' connection abort during file transfers that I can
>>>>>>>>> now replicate every time, but cannot pinpoint as to why it is
>>>>>>>>> happening.
>>>>>>>>>
>>>>>>>>> The volume is set up in replica 3 mode and accessed with the
>>>>>>>>> fuse gluster client. Both client and server are running
>>>>>>>>> CentOS and the supplied gluster version 3.12.11.
>>>>>>>>>
>>>>>>>>> The connection abort happens at different times during rsync,
>>>>>>>>> but occurs every time I try to sync all our files (1.1TB) to
>>>>>>>>> the empty volume.
>>>>>>>>>
>>>>>>>>> On neither the client nor the server side do I find errors in
>>>>>>>>> the gluster log files. rsync logs the obvious transfer
>>>>>>>>> problem. The only log that shows anything related is the
>>>>>>>>> server brick log, which states that the connection is
>>>>>>>>> shutting down:
>>>>>>>>>
>>>>>>>>> [2018-08-18 22:40:35.502510] I [MSGID: 115036]
>>>>>>>>> [server.c:527:server_rpc_notify] 0-home-server: disconnecting
>>>>>>>>> connection from
>>>>>>>>> brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
>>>>>>>>> [2018-08-18 22:40:35.502620] W
>>>>>>>>> [inodelk.c:499:pl_inodelk_log_cleanup] 0-home-server:
>>>>>>>>> releasing lock on eaeb0398-fefd-486d-84a7-f13744d1cf10 held by
>>>>>>>>> {client=0x7f83ec0b3ce0, pid=110423 lk-owner=d0fd5ffb427f0000}
>>>>>>>>> [2018-08-18 22:40:35.502692] W
>>>>>>>>> [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server:
>>>>>>>>> releasing lock on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
>>>>>>>>> {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f0000}
>>>>>>>>> [2018-08-18 22:40:35.502719] W
>>>>>>>>> [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server:
>>>>>>>>> releasing lock on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
>>>>>>>>> {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f0000}
>>>>>>>>> [2018-08-18 22:40:35.505950] I [MSGID: 101055]
>>>>>>>>> [client_t.c:443:gf_client_unref] 0-home-server: Shutting down
>>>>>>>>> connection
>>>>>>>>> brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
>>>>>>>>>
>>>>>>>>> Since I have been running another replica 3 setup for oVirt
>>>>>>>>> for a long time now, which is completely stable, I thought at
>>>>>>>>> first that I had made a mistake by setting different options.
>>>>>>>>> However, even when I reset those options I am able to
>>>>>>>>> reproduce the connection problem.
>>>>>>>>>
>>>>>>>>> The unoptimized volume setup looks like this:
>>>>>>>>>
>>>>>>>>> Volume Name: home
>>>>>>>>> Type: Replicate
>>>>>>>>> Volume ID: c92fa4cc-4a26-41ff-8c70-1dd07f733ac8
>>>>>>>>> Status: Started
>>>>>>>>> Snapshot Count: 0
>>>>>>>>> Number of Bricks: 1 x 3 = 3
>>>>>>>>> Transport-type: tcp
>>>>>>>>> Bricks:
>>>>>>>>> Brick1: sphere-four:/srv/gluster_home/brick
>>>>>>>>> Brick2: sphere-five:/srv/gluster_home/brick
>>>>>>>>> Brick3: sphere-six:/srv/gluster_home/brick
>>>>>>>>> Options Reconfigured:
>>>>>>>>> nfs.disable: on
>>>>>>>>> transport.address-family: inet
>>>>>>>>> cluster.quorum-type: auto
>>>>>>>>> cluster.server-quorum-type: server
>>>>>>>>> cluster.server-quorum-ratio: 50%
>>>>>>>>>
>>>>>>>>> The following additional options were used before:
>>>>>>>>>
>>>>>>>>> performance.cache-size: 5GB
>>>>>>>>> client.event-threads: 4
>>>>>>>>> server.event-threads: 4
>>>>>>>>> cluster.lookup-optimize: on
>>>>>>>>> features.cache-invalidation: on
>>>>>>>>> performance.stat-prefetch: on
>>>>>>>>> performance.cache-invalidation: on
>>>>>>>>> network.inode-lru-limit: 50000
>>>>>>>>> features.cache-invalidation-timeout: 600
>>>>>>>>> performance.md-cache-timeout: 600
>>>>>>>>> performance.parallel-readdir: on
>>>>>>>>>
>>>>>>>>> In this case the gluster servers, and also the client, are
>>>>>>>>> using a bonded network device running in adaptive load
>>>>>>>>> balancing mode.
>>>>>>>>>
>>>>>>>>> I've tried using the debug option for the client mount, but
>>>>>>>>> except for a ~0.5TB log file I didn't get information that
>>>>>>>>> seems helpful to me.
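In case you repeat the test, resetting a tuned option and remounting
with a more verbose client log level look roughly like this; the mount
point below is only an example, and log-level is the standard fuse
mount option:

    # return a single option to its default
    gluster volume reset home performance.parallel-readdir

    # fuse mount with a higher client log level
    mount -t glusterfs -o log-level=DEBUG sphere-four:/home /mnt/home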
>>>>>>>>> Transferring just a couple of GB works without problems.
>>>>>>>>>
>>>>>>>>> It may very well be that I'm already blind to the obvious,
>>>>>>>>> but after many long-running tests I can't find the crux in
>>>>>>>>> the setup.
>>>>>>>>>
>>>>>>>>> Does anyone have an idea how to approach this problem in a
>>>>>>>>> way that sheds some useful information?
>>>>>>>>>
>>>>>>>>> Any help is highly appreciated!
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Richard
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> /dev/null
_______________________________________________
Gluster-users mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-users
