Re: [Gluster-users] gluster connection interrupted during transfer

2018-11-21 Thread Richard Neuboeck
Hi Vijay,

this is an update on the 8 tests I've run so far. In short, all is well.

I followed your advice and created statedumps every 3 hours. The first 4 tests
ran with the default volume options. The last 4 tests ran with all the
performance optimizations I could find to improve small file performance.
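
For anyone wanting to reproduce this, a minimal sketch of triggering such
periodic statedumps (assuming a single fuse mount whose glusterfs process
writes its dump to /var/run/gluster when it receives SIGUSR1, as described in
the statedump documentation referenced further down in the thread):

  # find the fuse client process and ask it for a statedump
  PID=$(pgrep -x glusterfs | head -n 1)
  mkdir -p /var/run/gluster
  kill -USR1 "$PID"

  # an /etc/cron.d style entry repeating this every 3 hours
  0 */3 * * * root kill -USR1 $(pgrep -x glusterfs | head -n 1)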

During the runs the dump file size grew from ~100KB right after the mount to
~1GB, reflecting the memory footprint of the gluster client process.

Since every test ran without interruption, the memory leak seems to be
fixed in 3.12.14-1.el7.x86_64 on CentOS 7.

Thanks again for your help.
Cheers
Richard

On 15.10.18 10:48, Richard Neuboeck wrote:
> Hi Vijay,
> 
> sorry it took so long. I've upgraded the gluster server and client to
> the latest packages 3.12.14-1.el7.x86_64 available in CentOS.
> 
> Incredibly my first test after the update worked perfectly! I'll do
> another couple of rsyncs, maybe apply the performance improvements again
> and do statedumps all the way.
> 
> I'll report back if there are any more problems or if they are resolved.
> 
> Thanks for the help so far!
> Cheers
> Richard
> 
> 
> On 25.09.18 00:39, Vijay Bellur wrote:
>> Hello Richard,
>>
>> Thank you for the logs.
>>
>> I am wondering if this could be a different memory leak than the one
>> addressed in the bug. Would it be possible for you to obtain a
>> statedump of the client so that we can understand the memory allocation
>> pattern better? Details about gathering a statedump can be found at [1].
>> Please ensure that /var/run/gluster is present before triggering a
>> statedump.
>>
>> Regards,
>> Vijay
>>
>> [1] https://docs.gluster.org/en/v3/Troubleshooting/statedump/
>>
>>
>> On Fri, Sep 21, 2018 at 12:14 AM Richard Neuboeck wrote:
>>
>> Hi again,
>>
>> in my limited - non full time programmer - understanding it's a memory
>> leak in the gluster fuse client.
>>
>> Should I reopen the mentioned bugreport or open a new one? Or would the
>> community prefer an entirely different approach?
>>
>> Thanks
>> Richard
>>
>> On 13.09.18 10:07, Richard Neuboeck wrote:
>> > Hi,
>> >
>> > I've created excerpts from the brick and client logs +/- 1 minute to
>> > the kill event. Still the logs are ~400-500MB so will put them
>> > somewhere to download since I have no idea what I should be looking
>> > for and skimming them didn't reveal obvious problems to me.
>> >
>> > http://www.tbi.univie.ac.at/~hawk/gluster/brick_3min_excerpt.log
>> 
>> > http://www.tbi.univie.ac.at/~hawk/gluster/mnt_3min_excerpt.log
>> 
>> >
>> > I was pointed in the direction of the following Bugreport
>> > https://bugzilla.redhat.com/show_bug.cgi?id=1613512
>> > It sounds right but seems to have been addressed already.
>> >
>> > If there is anything I can do to help solve this problem please let
>> > me know. Thanks for your help!
>> >
>> > Cheers
>> > Richard
>> >
>> >
>> > On 9/11/18 10:10 AM, Richard Neuboeck wrote:
>> >> Hi,
>> >>
>> >> since I feared that the logs would fill up the partition (again) I
>> >> checked the systems daily and finally found the reason. The glusterfs
>> >> process on the client runs out of memory and gets killed by OOM
>> after
>> >> about four days. Since rsync runs for a couple of days longer till it
>> >> ends I never checked the whole time frame in the system logs and
>> never
>> >> stumbled upon the OOM message.
>> >>
>> >> Running out of memory on a 128GB RAM system even with a DB occupying
>> >> ~40% of that is kind of strange though. Might there be a leak?
>> >>
>> >> But this would explain the erratic behavior I've experienced over the
>> >> last 1.5 years while trying to work with our homes on glusterfs.
>> >>
>> >> Here is the kernel log message for the killed glusterfs process.
>> >> https://gist.github.com/bleuchien/3d2b87985ecb944c60347d5e8660e36a
>> >>
>> >> I'm checking the brick and client trace logs. But those are
>> respectively
>> >> 1TB and 2TB in size so searching in them takes a while. I'll be
>> creating
>> >> gists for both logs about the time when the process died.
>> >>
>> >> As soon as I have more details I'll post them.
>> >>
>> >> Here you can see a graphical representation of the memory usage
>> of this
>> >> system: https://imgur.com/a/4BINtfr
>> >>
>> >> Cheers
>> >> Richard
>> >>
>> >>
>> >>
>> >> On 31.08.18 08:13, Raghavendra Gowdappa wrote:
>> >>>
>> >>>
>> >>> On Fri, Aug 31, 2018 at 11:11 AM, Richard Neuboeck
>> >>> mailto:h...@tbi.univie.ac.at>
>> >> wrote:
>> >>>
>> >>>    

Re: [Gluster-users] gluster connection interrupted during transfer

2018-10-15 Thread Richard Neuboeck
Hi Vijay,

sorry it took so long. I've upgraded the gluster server and client to
the latest packages 3.12.14-1.el7.x86_64 available in CentOS.

Incredibly, my first test after the update worked perfectly! I'll do
another couple of rsyncs, maybe apply the performance improvements again,
and take statedumps all the way through.

I'll report back if there are any more problems or if they are resolved.

Thanks for the help so far!
Cheers
Richard


On 25.09.18 00:39, Vijay Bellur wrote:
> Hello Richard,
> 
> Thank you for the logs.
> 
> I am wondering if this could be a different memory leak than the one
> addressed in the bug. Would it be possible for you to obtain a
> statedump of the client so that we can understand the memory allocation
> pattern better? Details about gathering a statedump can be found at [1].
> Please ensure that /var/run/gluster is present before triggering a
> statedump.
> 
> Regards,
> Vijay
> 
> [1] https://docs.gluster.org/en/v3/Troubleshooting/statedump/
> 
> 
> On Fri, Sep 21, 2018 at 12:14 AM Richard Neuboeck wrote:
> 
> Hi again,
> 
> in my limited - non full time programmer - understanding it's a memory
> leak in the gluster fuse client.
> 
> Should I reopen the mentioned bugreport or open a new one? Or would the
> community prefer an entirely different approach?
> 
> Thanks
> Richard
> 
> On 13.09.18 10:07, Richard Neuboeck wrote:
> > Hi,
> >
> > I've created excerpts from the brick and client logs +/- 1 minute to
> > the kill event. Still the logs are ~400-500MB so will put them
> > somewhere to download since I have no idea what I should be looking
> > for and skimming them didn't reveal obvious problems to me.
> >
> > http://www.tbi.univie.ac.at/~hawk/gluster/brick_3min_excerpt.log
> 
> > http://www.tbi.univie.ac.at/~hawk/gluster/mnt_3min_excerpt.log
> 
> >
> > I was pointed in the direction of the following Bugreport
> > https://bugzilla.redhat.com/show_bug.cgi?id=1613512
> > It sounds right but seems to have been addressed already.
> >
> > If there is anything I can do to help solve this problem please let
> > me know. Thanks for your help!
> >
> > Cheers
> > Richard
> >
> >
> > On 9/11/18 10:10 AM, Richard Neuboeck wrote:
> >> Hi,
> >>
> >> since I feared that the logs would fill up the partition (again) I
> >> checked the systems daily and finally found the reason. The glusterfs
> >> process on the client runs out of memory and gets killed by OOM
> after
> >> about four days. Since rsync runs for a couple of days longer till it
> >> ends I never checked the whole time frame in the system logs and
> never
> >> stumbled upon the OOM message.
> >>
> >> Running out of memory on a 128GB RAM system even with a DB occupying
> >> ~40% of that is kind of strange though. Might there be a leak?
> >>
> >> But this would explain the erratic behavior I've experienced over the
> >> last 1.5 years while trying to work with our homes on glusterfs.
> >>
> >> Here is the kernel log message for the killed glusterfs process.
> >> https://gist.github.com/bleuchien/3d2b87985ecb944c60347d5e8660e36a
> >>
> >> I'm checking the brick and client trace logs. But those are
> respectively
> >> 1TB and 2TB in size so searching in them takes a while. I'll be
> creating
> >> gists for both logs about the time when the process died.
> >>
> >> As soon as I have more details I'll post them.
> >>
> >> Here you can see a graphical representation of the memory usage
> of this
> >> system: https://imgur.com/a/4BINtfr
> >>
> >> Cheers
> >> Richard
> >>
> >>
> >>
> >> On 31.08.18 08:13, Raghavendra Gowdappa wrote:
> >>>
> >>>
> >>> On Fri, Aug 31, 2018 at 11:11 AM, Richard Neuboeck
> >>> mailto:h...@tbi.univie.ac.at>
> >> wrote:
> >>>
> >>>     On 08/31/2018 03:50 AM, Raghavendra Gowdappa wrote:
> >>>     > +Mohit. +Milind
> >>>     >
> >>>     > @Mohit/Milind,
> >>>     >
> >>>     > Can you check logs and see whether you can find anything
> relevant?
> >>>
> >>>     From glances at the system logs nothing out of the ordinary
> >>>     occurred. However I'll start another rsync and take a closer
> look.
> >>>     It will take a few days.
> >>>
> >>>     >
> >>>     > On Thu, Aug 30, 2018 at 7:04 PM, Richard Neuboeck
> >>>     > mailto:h...@tbi.univie.ac.at>
> >
> >>>     
> 

Re: [Gluster-users] gluster connection interrupted during transfer

2018-09-24 Thread Vijay Bellur
Hello Richard,

Thank you for the logs.

I am wondering if this could be a different memory leak than the one
addressed in the bug. Would it be possible for you to obtain a statedump of
the client so that we can understand the memory allocation pattern better?
Details about gathering a statedump can be found at [1]. Please ensure that
/var/run/gluster is present before triggering a statedump.

Regards,
Vijay

[1] https://docs.gluster.org/en/v3/Troubleshooting/statedump/
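
As a rough illustration of what "memory allocation pattern" means here: once a
dump exists under /var/run/gluster, its per-translator memory-accounting
sections can be skimmed for the biggest allocators. A hedged sketch (the dump
file name is a placeholder, and the field names assume the usual statedump
memusage format with size= and num_allocs= entries):

  # pull out just the per-type headers and their size / allocation counters
  grep -E '^(\[.*memusage\]|size=|num_allocs=)' \
      /var/run/gluster/glusterdump.12345.dump > /tmp/memusage.txt

  # then sort the size lines to see how large the biggest pools get
  grep '^size=' /tmp/memusage.txt | sort -t= -k2 -n | tail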


On Fri, Sep 21, 2018 at 12:14 AM Richard Neuboeck 
wrote:

> Hi again,
>
> in my limited - non full time programmer - understanding it's a memory
> leak in the gluster fuse client.
>
> Should I reopen the mentioned bugreport or open a new one? Or would the
> community prefer an entirely different approach?
>
> Thanks
> Richard
>
> On 13.09.18 10:07, Richard Neuboeck wrote:
> > Hi,
> >
> > I've created excerpts from the brick and client logs +/- 1 minute to
> > the kill event. Still the logs are ~400-500MB so will put them
> > somewhere to download since I have no idea what I should be looking
> > for and skimming them didn't reveal obvious problems to me.
> >
> > http://www.tbi.univie.ac.at/~hawk/gluster/brick_3min_excerpt.log
> > http://www.tbi.univie.ac.at/~hawk/gluster/mnt_3min_excerpt.log
> >
> > I was pointed in the direction of the following Bugreport
> > https://bugzilla.redhat.com/show_bug.cgi?id=1613512
> > It sounds right but seems to have been addressed already.
> >
> > If there is anything I can do to help solve this problem please let
> > me know. Thanks for your help!
> >
> > Cheers
> > Richard
> >
> >
> > On 9/11/18 10:10 AM, Richard Neuboeck wrote:
> >> Hi,
> >>
> >> since I feared that the logs would fill up the partition (again) I
> >> checked the systems daily and finally found the reason. The glusterfs
> >> process on the client runs out of memory and gets killed by OOM after
> >> about four days. Since rsync runs for a couple of days longer till it
> >> ends I never checked the whole time frame in the system logs and never
> >> stumbled upon the OOM message.
> >>
> >> Running out of memory on a 128GB RAM system even with a DB occupying
> >> ~40% of that is kind of strange though. Might there be a leak?
> >>
> >> But this would explain the erratic behavior I've experienced over the
> >> last 1.5 years while trying to work with our homes on glusterfs.
> >>
> >> Here is the kernel log message for the killed glusterfs process.
> >> https://gist.github.com/bleuchien/3d2b87985ecb944c60347d5e8660e36a
> >>
> >> I'm checking the brick and client trace logs. But those are respectively
> >> 1TB and 2TB in size so searching in them takes a while. I'll be creating
> >> gists for both logs about the time when the process died.
> >>
> >> As soon as I have more details I'll post them.
> >>
> >> Here you can see a graphical representation of the memory usage of this
> >> system: https://imgur.com/a/4BINtfr
> >>
> >> Cheers
> >> Richard
> >>
> >>
> >>
> >> On 31.08.18 08:13, Raghavendra Gowdappa wrote:
> >>>
> >>>
> >>> On Fri, Aug 31, 2018 at 11:11 AM, Richard Neuboeck
> >>> mailto:h...@tbi.univie.ac.at>> wrote:
> >>>
> >>> On 08/31/2018 03:50 AM, Raghavendra Gowdappa wrote:
> >>> > +Mohit. +Milind
> >>> >
> >>> > @Mohit/Milind,
> >>> >
> >>> > Can you check logs and see whether you can find anything
> relevant?
> >>>
> >>> From glances at the system logs nothing out of the ordinary
> >>> occurred. However I'll start another rsync and take a closer look.
> >>> It will take a few days.
> >>>
> >>> >
> >>> > On Thu, Aug 30, 2018 at 7:04 PM, Richard Neuboeck
> >>> > mailto:h...@tbi.univie.ac.at>
> >>> >>
> wrote:
> >>> >
> >>> > Hi,
> >>> >
> >>> > I'm attaching a shortened version since the whole is about
> 5.8GB of
> >>> > the client mount log. It includes the initial mount messages
> and the
> >>> > last two minutes of log entries.
> >>> >
> >>> > It ends very anticlimactic without an obvious error. Is there
> >>> > anything specific I should be looking for?
> >>> >
> >>> >
> >>> > Normally I look logs around disconnect msgs to find out the
> reason.
> >>> > But as you said, sometimes one can see just disconnect msgs
> without
> >>> > any reason. That normally points to reason for disconnect in the
> >>> > network rather than a Glusterfs initiated disconnect.
> >>>
> >>> The rsync source is serving our homes currently so there are NFS
> >>> connections 24/7. There don't seem to be any network related
> >>> interruptions
> >>>
> >>>
> >>> Can you set diagnostics.client-log-level and
> diagnostics.brick-log-level
> >>> to TRACE and check logs of both ends of connections - client and brick?
> >>> To reduce the logsize, I would suggest to logrotate existing logs and
> >>> start with fresh logs when you are about to start so that only

Re: [Gluster-users] gluster connection interrupted during transfer

2018-09-21 Thread Richard Neuboeck
Hi again,

in my limited understanding (I'm not a full-time programmer) it's a memory
leak in the gluster fuse client.

Should I reopen the mentioned bugreport or open a new one? Or would the
community prefer an entirely different approach?

Thanks
Richard

On 13.09.18 10:07, Richard Neuboeck wrote:
> Hi,
> 
> I've created excerpts from the brick and client logs +/- 1 minute to
> the kill event. Still the logs are ~400-500MB so will put them
> somewhere to download since I have no idea what I should be looking
> for and skimming them didn't reveal obvious problems to me.
> 
> http://www.tbi.univie.ac.at/~hawk/gluster/brick_3min_excerpt.log
> http://www.tbi.univie.ac.at/~hawk/gluster/mnt_3min_excerpt.log
> 
> I was pointed in the direction of the following Bugreport
> https://bugzilla.redhat.com/show_bug.cgi?id=1613512
> It sounds right but seems to have been addressed already.
> 
> If there is anything I can do to help solve this problem please let
> me know. Thanks for your help!
> 
> Cheers
> Richard
> 
> 
> On 9/11/18 10:10 AM, Richard Neuboeck wrote:
>> Hi,
>>
>> since I feared that the logs would fill up the partition (again) I
>> checked the systems daily and finally found the reason. The glusterfs
>> process on the client runs out of memory and gets killed by OOM after
>> about four days. Since rsync runs for a couple of days longer till it
>> ends I never checked the whole time frame in the system logs and never
>> stumbled upon the OOM message.
>>
>> Running out of memory on a 128GB RAM system even with a DB occupying
>> ~40% of that is kind of strange though. Might there be a leak?
>>
>> But this would explain the erratic behavior I've experienced over the
>> last 1.5 years while trying to work with our homes on glusterfs.
>>
>> Here is the kernel log message for the killed glusterfs process.
>> https://gist.github.com/bleuchien/3d2b87985ecb944c60347d5e8660e36a
>>
>> I'm checking the brick and client trace logs. But those are respectively
>> 1TB and 2TB in size so searching in them takes a while. I'll be creating
>> gists for both logs about the time when the process died.
>>
>> As soon as I have more details I'll post them.
>>
>> Here you can see a graphical representation of the memory usage of this
>> system: https://imgur.com/a/4BINtfr
>>
>> Cheers
>> Richard
>>
>>
>>
>> On 31.08.18 08:13, Raghavendra Gowdappa wrote:
>>>
>>>
>>> On Fri, Aug 31, 2018 at 11:11 AM, Richard Neuboeck
>>> mailto:h...@tbi.univie.ac.at>> wrote:
>>>
>>> On 08/31/2018 03:50 AM, Raghavendra Gowdappa wrote:
>>> > +Mohit. +Milind
>>> > 
>>> > @Mohit/Milind,
>>> > 
>>> > Can you check logs and see whether you can find anything relevant?
>>>
>>> From glances at the system logs nothing out of the ordinary
>>> occurred. However I'll start another rsync and take a closer look.
>>> It will take a few days.
>>>
>>> > 
>>> > On Thu, Aug 30, 2018 at 7:04 PM, Richard Neuboeck
>>> > mailto:h...@tbi.univie.ac.at>
>>> >> wrote:
>>> > 
>>> >     Hi,
>>> > 
>>> >     I'm attaching a shortened version since the whole is about 5.8GB 
>>> of
>>> >     the client mount log. It includes the initial mount messages and 
>>> the
>>> >     last two minutes of log entries.
>>> > 
>>> >     It ends very anticlimactic without an obvious error. Is there
>>> >     anything specific I should be looking for?
>>> > 
>>> > 
>>> > Normally I look logs around disconnect msgs to find out the reason.
>>> > But as you said, sometimes one can see just disconnect msgs without
>>> > any reason. That normally points to reason for disconnect in the
>>> > network rather than a Glusterfs initiated disconnect.
>>>
>>> The rsync source is serving our homes currently so there are NFS
>>> connections 24/7. There don't seem to be any network related
>>> interruptions 
>>>
>>>
>>> Can you set diagnostics.client-log-level and diagnostics.brick-log-level
>>> to TRACE and check logs of both ends of connections - client and brick?
>>> To reduce the logsize, I would suggest to logrotate existing logs and
>>> start with fresh logs when you are about to start so that only relevant
>>> logs are captured. Also, can you take strace of client and brick process
>>> using:
>>>
>>> strace -o  -ff -v -p 
>>>
>>> attach both logs and strace. Let's trace through what syscalls on socket
>>> return and then decide whether to inspect tcpdump or not. If you don't
>>> want to repeat tests again, please capture tcpdump too (on both ends of
>>> connection) and send them to us.
>>>
>>>
>>> - a co-worker would be here faster than I could check
>>> the logs if the connection to home would be broken ;-)
>>> The three gluster machines are due to this problem reduced to only
>>> testing so there is nothing else running.
>>>
>>>
>>> > 
>>> >     Cheers
>>> >     Richard
>>> > 
>>> >     On

Re: [Gluster-users] gluster connection interrupted during transfer

2018-09-13 Thread Richard Neuboeck
Hi,

I've created excerpts from the brick and client logs +/- 1 minute around
the kill event. Still, the logs are ~400-500MB, so I will put them
somewhere to download, since I have no idea what I should be looking
for and skimming them didn't reveal obvious problems to me.

http://www.tbi.univie.ac.at/~hawk/gluster/brick_3min_excerpt.log
http://www.tbi.univie.ac.at/~hawk/gluster/mnt_3min_excerpt.log
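
For anyone wanting to do the same, such a window can be cut out of the huge
logs using the timestamps gluster prefixes to every line; the timestamps and
file names in this sketch are placeholders, not the real values:

  START='2018-09-10 03:14'   # about one minute before the kill event
  END='2018-09-10 03:16'     # about one minute after
  sed -n "/^\[$START/,/^\[$END/p" mnt-home.log > mnt_3min_excerpt.log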

I was pointed in the direction of the following Bugreport
https://bugzilla.redhat.com/show_bug.cgi?id=1613512
It sounds right but seems to have been addressed already.

If there is anything I can do to help solve this problem please let
me know. Thanks for your help!

Cheers
Richard


On 9/11/18 10:10 AM, Richard Neuboeck wrote:
> Hi,
> 
> since I feared that the logs would fill up the partition (again) I
> checked the systems daily and finally found the reason. The glusterfs
> process on the client runs out of memory and gets killed by OOM after
> about four days. Since rsync runs for a couple of days longer till it
> ends I never checked the whole time frame in the system logs and never
> stumbled upon the OOM message.
> 
> Running out of memory on a 128GB RAM system even with a DB occupying
> ~40% of that is kind of strange though. Might there be a leak?
> 
> But this would explain the erratic behavior I've experienced over the
> last 1.5 years while trying to work with our homes on glusterfs.
> 
> Here is the kernel log message for the killed glusterfs process.
> https://gist.github.com/bleuchien/3d2b87985ecb944c60347d5e8660e36a
> 
> I'm checking the brick and client trace logs. But those are respectively
> 1TB and 2TB in size so searching in them takes a while. I'll be creating
> gists for both logs about the time when the process died.
> 
> As soon as I have more details I'll post them.
> 
> Here you can see a graphical representation of the memory usage of this
> system: https://imgur.com/a/4BINtfr
> 
> Cheers
> Richard
> 
> 
> 
> On 31.08.18 08:13, Raghavendra Gowdappa wrote:
>>
>>
>> On Fri, Aug 31, 2018 at 11:11 AM, Richard Neuboeck
>> mailto:h...@tbi.univie.ac.at>> wrote:
>>
>> On 08/31/2018 03:50 AM, Raghavendra Gowdappa wrote:
>> > +Mohit. +Milind
>> > 
>> > @Mohit/Milind,
>> > 
>> > Can you check logs and see whether you can find anything relevant?
>>
>> From glances at the system logs nothing out of the ordinary
>> occurred. However I'll start another rsync and take a closer look.
>> It will take a few days.
>>
>> > 
>> > On Thu, Aug 30, 2018 at 7:04 PM, Richard Neuboeck
>> > mailto:h...@tbi.univie.ac.at>
>> >> wrote:
>> > 
>> >     Hi,
>> > 
>> >     I'm attaching a shortened version since the whole is about 5.8GB of
>> >     the client mount log. It includes the initial mount messages and 
>> the
>> >     last two minutes of log entries.
>> > 
>> >     It ends very anticlimactic without an obvious error. Is there
>> >     anything specific I should be looking for?
>> > 
>> > 
>> > Normally I look logs around disconnect msgs to find out the reason.
>> > But as you said, sometimes one can see just disconnect msgs without
>> > any reason. That normally points to reason for disconnect in the
>> > network rather than a Glusterfs initiated disconnect.
>>
>> The rsync source is serving our homes currently so there are NFS
>> connections 24/7. There don't seem to be any network related
>> interruptions 
>>
>>
>> Can you set diagnostics.client-log-level and diagnostics.brick-log-level
>> to TRACE and check logs of both ends of connections - client and brick?
>> To reduce the logsize, I would suggest to logrotate existing logs and
>> start with fresh logs when you are about to start so that only relevant
>> logs are captured. Also, can you take strace of client and brick process
>> using:
>>
>> strace -o  -ff -v -p 
>>
>> attach both logs and strace. Let's trace through what syscalls on socket
>> return and then decide whether to inspect tcpdump or not. If you don't
>> want to repeat tests again, please capture tcpdump too (on both ends of
>> connection) and send them to us.
>>
>>
>> - a co-worker would be here faster than I could check
>> the logs if the connection to home would be broken ;-)
>> The three gluster machines are due to this problem reduced to only
>> testing so there is nothing else running.
>>
>>
>> > 
>> >     Cheers
>> >     Richard
>> > 
>> >     On 08/30/2018 02:40 PM, Raghavendra Gowdappa wrote:
>> >     > Normally client logs will give a clue on why the disconnections 
>> are
>> >     > happening (ping-timeout, wrong port etc). Can you look into 
>> client
>> >     > logs to figure out what's happening? If you can't find anything, 
>> can
>> >     > you send across client logs?
>> >     > 
>> >     > On Wed, Aug 29, 2018 at 6:11 PM, Richard Neuboeck

Re: [Gluster-users] gluster connection interrupted during transfer

2018-09-11 Thread Richard Neuboeck
Hi,

since I feared that the logs would fill up the partition (again) I
checked the systems daily and finally found the reason. The glusterfs
process on the client runs out of memory and gets killed by the OOM killer
after about four days. Since rsync runs for a couple of days longer until it
ends, I never checked the whole time frame in the system logs and never
stumbled upon the OOM message.

Running out of memory on a 128GB RAM system even with a DB occupying
~40% of that is kind of strange though. Might there be a leak?
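
A quick way to confirm such growth without waiting for the OOM killer is to
record the client's memory use over time. A minimal sketch (assumes a single
glusterfs fuse client on the machine; the log path is arbitrary):

  PID=$(pgrep -x glusterfs | head -n 1)
  while sleep 60; do
      # rss/vsz in KB; the line goes empty once the process has been killed
      echo "$(date -u '+%F %T') $(ps -o rss=,vsz= -p "$PID")" >> /var/tmp/glusterfs-mem.log
  done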

But this would explain the erratic behavior I've experienced over the
last 1.5 years while trying to work with our homes on glusterfs.

Here is the kernel log message for the killed glusterfs process.
https://gist.github.com/bleuchien/3d2b87985ecb944c60347d5e8660e36a

I'm checking the brick and client trace logs. But those are 1TB and 2TB in
size, respectively, so searching them takes a while. I'll be creating
gists for both logs around the time when the process died.

As soon as I have more details I'll post them.

Here you can see a graphical representation of the memory usage of this
system: https://imgur.com/a/4BINtfr

Cheers
Richard



On 31.08.18 08:13, Raghavendra Gowdappa wrote:
> 
> 
> On Fri, Aug 31, 2018 at 11:11 AM, Richard Neuboeck
> mailto:h...@tbi.univie.ac.at>> wrote:
> 
> On 08/31/2018 03:50 AM, Raghavendra Gowdappa wrote:
> > +Mohit. +Milind
> > 
> > @Mohit/Milind,
> > 
> > Can you check logs and see whether you can find anything relevant?
> 
> From glances at the system logs nothing out of the ordinary
> occurred. However I'll start another rsync and take a closer look.
> It will take a few days.
> 
> > 
> > On Thu, Aug 30, 2018 at 7:04 PM, Richard Neuboeck
> > mailto:h...@tbi.univie.ac.at>
> >> wrote:
> > 
> >     Hi,
> > 
> >     I'm attaching a shortened version since the whole is about 5.8GB of
> >     the client mount log. It includes the initial mount messages and the
> >     last two minutes of log entries.
> > 
> >     It ends very anticlimactic without an obvious error. Is there
> >     anything specific I should be looking for?
> > 
> > 
> > Normally I look logs around disconnect msgs to find out the reason.
> > But as you said, sometimes one can see just disconnect msgs without
> > any reason. That normally points to reason for disconnect in the
> > network rather than a Glusterfs initiated disconnect.
> 
> The rsync source is serving our homes currently so there are NFS
> connections 24/7. There don't seem to be any network related
> interruptions 
> 
> 
> Can you set diagnostics.client-log-level and diagnostics.brick-log-level
> to TRACE and check logs of both ends of connections - client and brick?
> To reduce the logsize, I would suggest to logrotate existing logs and
> start with fresh logs when you are about to start so that only relevant
> logs are captured. Also, can you take strace of client and brick process
> using:
> 
> strace -o  -ff -v -p 
> 
> attach both logs and strace. Let's trace through what syscalls on socket
> return and then decide whether to inspect tcpdump or not. If you don't
> want to repeat tests again, please capture tcpdump too (on both ends of
> connection) and send them to us.
> 
> 
> - a co-worker would be here faster than I could check
> the logs if the connection to home would be broken ;-)
> The three gluster machines are due to this problem reduced to only
> testing so there is nothing else running.
> 
> 
> > 
> >     Cheers
> >     Richard
> > 
> >     On 08/30/2018 02:40 PM, Raghavendra Gowdappa wrote:
> >     > Normally client logs will give a clue on why the disconnections 
> are
> >     > happening (ping-timeout, wrong port etc). Can you look into client
> >     > logs to figure out what's happening? If you can't find anything, 
> can
> >     > you send across client logs?
> >     > 
> >     > On Wed, Aug 29, 2018 at 6:11 PM, Richard Neuboeck
> >     > mailto:h...@tbi.univie.ac.at>
> >
> >     
>  >     wrote:
> >     >
> >     >     Hi Gluster Community,
> >     >
> >     >     I have problems with a glusterfs 'Transport endpoint not
> >     connected'
> >     >     connection abort during file transfers that I can
> >     replicate (all the
> >     >     time now) but not pinpoint as to why this is happening.
> >     >
> >     >     The volume is set up in replica 3 mode and accessed with
> >     the fuse
> >     >     gluster client. Both client and server are running CentOS
> >     and the
> >     >     supplied 3.12.11 version

Re: [Gluster-users] gluster connection interrupted during transfer

2018-08-30 Thread Raghavendra Gowdappa
On Fri, Aug 31, 2018 at 11:11 AM, Richard Neuboeck 
wrote:

> On 08/31/2018 03:50 AM, Raghavendra Gowdappa wrote:
> > +Mohit. +Milind
> >
> > @Mohit/Milind,
> >
> > Can you check logs and see whether you can find anything relevant?
>
> From glances at the system logs nothing out of the ordinary
> occurred. However I'll start another rsync and take a closer look.
> It will take a few days.
>
> >
> > On Thu, Aug 30, 2018 at 7:04 PM, Richard Neuboeck
> > mailto:h...@tbi.univie.ac.at>> wrote:
> >
> > Hi,
> >
> > I'm attaching a shortened version since the whole is about 5.8GB of
> > the client mount log. It includes the initial mount messages and the
> > last two minutes of log entries.
> >
> > It ends very anticlimactic without an obvious error. Is there
> > anything specific I should be looking for?
> >
> >
> > Normally I look logs around disconnect msgs to find out the reason.
> > But as you said, sometimes one can see just disconnect msgs without
> > any reason. That normally points to reason for disconnect in the
> > network rather than a Glusterfs initiated disconnect.
>
> The rsync source is serving our homes currently so there are NFS
> connections 24/7. There don't seem to be any network related
> interruptions


Can you set diagnostics.client-log-level and diagnostics.brick-log-level to
TRACE and check the logs on both ends of the connection - client and brick? To
reduce the log size, I would suggest rotating the existing logs and starting
with fresh logs when you are about to begin, so that only relevant logs are
captured. Also, can you take an strace of the client and brick processes using:

strace -o  -ff -v -p 

Attach both logs and the strace output. Let's trace through what the syscalls
on the socket return and then decide whether to inspect a tcpdump or not. If
you don't want to repeat the tests again, please capture a tcpdump too (on both
ends of the connection) and send them to us.
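
A concrete sketch of that capture setup, using the volume name from the
original report; the PIDs, output paths, interface name and brick port range
are placeholders/assumptions, not values taken from this thread:

  # rotate or clear the existing logs first so only the new run is captured,
  # then raise both log levels to TRACE for the 'home' volume
  gluster volume set home diagnostics.brick-log-level TRACE
  gluster volume set home diagnostics.client-log-level TRACE

  # attach strace to the running brick and fuse client processes
  strace -o /var/tmp/brick.strace -ff -v -p <glusterfsd-pid>
  strace -o /var/tmp/client.strace -ff -v -p <glusterfs-pid>

  # optional packet capture on both ends (24007 is glusterd, bricks usually
  # listen on 49152 and up; interface name is an assumption)
  tcpdump -i bond0 -w /var/tmp/gluster.pcap tcp port 24007 or tcp portrange 49152-49251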


- a co-worker would be here faster than I could check
> the logs if the connection to home would be broken ;-)
> The three gluster machines are due to this problem reduced to only
> testing so there is nothing else running.
>
>
> >
> > Cheers
> > Richard
> >
> > On 08/30/2018 02:40 PM, Raghavendra Gowdappa wrote:
> > > Normally client logs will give a clue on why the disconnections are
> > > happening (ping-timeout, wrong port etc). Can you look into client
> > > logs to figure out what's happening? If you can't find anything,
> can
> > > you send across client logs?
> > >
> > > On Wed, Aug 29, 2018 at 6:11 PM, Richard Neuboeck
> > > mailto:h...@tbi.univie.ac.at>
> > >>
> > wrote:
> > >
> > > Hi Gluster Community,
> > >
> > > I have problems with a glusterfs 'Transport endpoint not
> > connected'
> > > connection abort during file transfers that I can
> > replicate (all the
> > > time now) but not pinpoint as to why this is happening.
> > >
> > > The volume is set up in replica 3 mode and accessed with
> > the fuse
> > > gluster client. Both client and server are running CentOS
> > and the
> > > supplied 3.12.11 version of gluster.
> > >
> > > The connection abort happens at different times during
> > rsync but
> > > occurs every time I try to sync all our files (1.1TB) to
> > the empty
> > > volume.
> > >
> > > Client and server side I don't find errors in the gluster
> > log files.
> > > rsync logs the obvious transfer problem. The only log that
> > shows
> > > anything related is the server brick log which states that the
> > > connection is shutting down:
> > >
> > > [2018-08-18 22:40:35.502510] I [MSGID: 115036]
> > > [server.c:527:server_rpc_notify] 0-home-server: disconnecting
> > > connection from
> > > brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
> > > [2018-08-18 22:40:35.502620] W
> > > [inodelk.c:499:pl_inodelk_log_cleanup] 0-home-server:
> > releasing lock
> > > on eaeb0398-fefd-486d-84a7-f13744d1cf10 held by
> > > {client=0x7f83ec0b3ce0, pid=110423 lk-owner=d0fd5ffb427f}
> > > [2018-08-18 22:40:35.502692] W
> > > [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server:
> > releasing lock
> > > on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
> > > {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f}
> > > [2018-08-18 22:40:35.502719] W
> > > [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server:
> > releasing lock
> > > on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
> > > {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f}
> > > [2018-08-18 22:40:35.505950] I [MSGID: 101055]
> > > [client_t.c:443:gf_client_unref] 0-home-server: Shutting down
> > > connection
> >  

Re: [Gluster-users] gluster connection interrupted during transfer

2018-08-30 Thread Richard Neuboeck
On 08/31/2018 03:50 AM, Raghavendra Gowdappa wrote:
> +Mohit. +Milind
> 
> @Mohit/Milind,
> 
> Can you check logs and see whether you can find anything relevant?

From glancing at the system logs, nothing out of the ordinary
occurred. However, I'll start another rsync and take a closer look.
It will take a few days.

> 
> On Thu, Aug 30, 2018 at 7:04 PM, Richard Neuboeck
> mailto:h...@tbi.univie.ac.at>> wrote:
> 
> Hi,
> 
> I'm attaching a shortened version since the whole is about 5.8GB of
> the client mount log. It includes the initial mount messages and the
> last two minutes of log entries.
> 
> It ends very anticlimactic without an obvious error. Is there
> anything specific I should be looking for?
> 
> 
> Normally I look logs around disconnect msgs to find out the reason.
> But as you said, sometimes one can see just disconnect msgs without
> any reason. That normally points to reason for disconnect in the
> network rather than a Glusterfs initiated disconnect.

The rsync source is currently serving our homes, so there are NFS
connections 24/7. There don't seem to be any network related
interruptions - a co-worker would be here faster than I could check
the logs if the connection to home were broken ;-)
Due to this problem the three gluster machines are reduced to
testing only, so there is nothing else running.


> 
> Cheers
> Richard
> 
> On 08/30/2018 02:40 PM, Raghavendra Gowdappa wrote:
> > Normally client logs will give a clue on why the disconnections are
> > happening (ping-timeout, wrong port etc). Can you look into client
> > logs to figure out what's happening? If you can't find anything, can
> > you send across client logs?
> > 
> > On Wed, Aug 29, 2018 at 6:11 PM, Richard Neuboeck
> > mailto:h...@tbi.univie.ac.at>
> >>
> wrote:
> >
> >     Hi Gluster Community,
> >
> >     I have problems with a glusterfs 'Transport endpoint not
> connected'
> >     connection abort during file transfers that I can
> replicate (all the
> >     time now) but not pinpoint as to why this is happening.
> >
> >     The volume is set up in replica 3 mode and accessed with
> the fuse
> >     gluster client. Both client and server are running CentOS
> and the
> >     supplied 3.12.11 version of gluster.
> >
> >     The connection abort happens at different times during
> rsync but
> >     occurs every time I try to sync all our files (1.1TB) to
> the empty
> >     volume.
> >
> >     Client and server side I don't find errors in the gluster
> log files.
> >     rsync logs the obvious transfer problem. The only log that
> shows
> >     anything related is the server brick log which states that the
> >     connection is shutting down:
> >
> >     [2018-08-18 22:40:35.502510] I [MSGID: 115036]
> >     [server.c:527:server_rpc_notify] 0-home-server: disconnecting
> >     connection from
> >     brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
> >     [2018-08-18 22:40:35.502620] W
> >     [inodelk.c:499:pl_inodelk_log_cleanup] 0-home-server:
> releasing lock
> >     on eaeb0398-fefd-486d-84a7-f13744d1cf10 held by
> >     {client=0x7f83ec0b3ce0, pid=110423 lk-owner=d0fd5ffb427f}
> >     [2018-08-18 22:40:35.502692] W
> >     [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server:
> releasing lock
> >     on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
> >     {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f}
> >     [2018-08-18 22:40:35.502719] W
> >     [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server:
> releasing lock
> >     on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
> >     {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f}
> >     [2018-08-18 22:40:35.505950] I [MSGID: 101055]
> >     [client_t.c:443:gf_client_unref] 0-home-server: Shutting down
> >     connection
> brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
> >
> >     Since I'm running another replica 3 setup for oVirt for a
> long time
> >     now which is completely stable I thought I made a mistake
> setting
> >     different options at first. However even when I reset
> those options
> >     I'm able to reproduce the connection problem.
> >
> >     The unoptimized volume setup looks like this:
> >
> >     Volume Name: home
> >     Type: Replicate
> >     Volume ID: c92fa4cc-4a26-41ff-8c70-1dd07f733ac8
> >     Status: Started
> >     Snapshot Count: 0
> >     Number of Bricks: 1 x 3 = 3
> >     Transport-type: tcp
> >     Bricks:
> >     Brick1: sphere-four:/srv/gluster_home/brick
> >     Brick2: sphere-five:/srv/gluster_home/brick
> >     Brick3: sphere-six:/srv/gluster_home/brick
>

Re: [Gluster-users] gluster connection interrupted during transfer

2018-08-30 Thread Raghavendra Gowdappa
+Mohit. +Milind

@Mohit/Milind,

Can you check logs and see whether you can find anything relevant?

On Thu, Aug 30, 2018 at 7:04 PM, Richard Neuboeck 
wrote:

> Hi,
>
> I'm attaching a shortened version since the whole is about 5.8GB of
> the client mount log. It includes the initial mount messages and the
> last two minutes of log entries.
>
> It ends very anticlimactic without an obvious error. Is there
> anything specific I should be looking for?
>

Normally I look at the logs around disconnect msgs to find out the reason. But
as you said, sometimes one can see just disconnect msgs without any reason.
That normally points to the reason for the disconnect being in the network
rather than a Glusterfs initiated disconnect.
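
A rough sketch of that kind of search, with the default CentOS log locations
as an assumption (the client log name assumes a hypothetical /mnt/home mount
point, and the brick log name is derived from the brick path and may differ):

  # client side: look for disconnects and ping timeouts in the mount log
  grep -n -E 'disconnect|ping-timeout|Transport endpoint' /var/log/glusterfs/mnt-home.log

  # brick side: show context around the disconnect notifications quoted in the report
  grep -n -B5 -A5 'server_rpc_notify' /var/log/glusterfs/bricks/srv-gluster_home-brick.log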


> Cheers
> Richard
>
> On 08/30/2018 02:40 PM, Raghavendra Gowdappa wrote:
> > Normally client logs will give a clue on why the disconnections are
> > happening (ping-timeout, wrong port etc). Can you look into client
> > logs to figure out what's happening? If you can't find anything, can
> > you send across client logs?
> >
> > On Wed, Aug 29, 2018 at 6:11 PM, Richard Neuboeck
> > mailto:h...@tbi.univie.ac.at>> wrote:
> >
> > Hi Gluster Community,
> >
> > I have problems with a glusterfs 'Transport endpoint not connected'
> > connection abort during file transfers that I can replicate (all the
> > time now) but not pinpoint as to why this is happening.
> >
> > The volume is set up in replica 3 mode and accessed with the fuse
> > gluster client. Both client and server are running CentOS and the
> > supplied 3.12.11 version of gluster.
> >
> > The connection abort happens at different times during rsync but
> > occurs every time I try to sync all our files (1.1TB) to the empty
> > volume.
> >
> > Client and server side I don't find errors in the gluster log files.
> > rsync logs the obvious transfer problem. The only log that shows
> > anything related is the server brick log which states that the
> > connection is shutting down:
> >
> > [2018-08-18 22:40:35.502510] I [MSGID: 115036]
> > [server.c:527:server_rpc_notify] 0-home-server: disconnecting
> > connection from
> > brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
> > [2018-08-18 22:40:35.502620] W
> > [inodelk.c:499:pl_inodelk_log_cleanup] 0-home-server: releasing lock
> > on eaeb0398-fefd-486d-84a7-f13744d1cf10 held by
> > {client=0x7f83ec0b3ce0, pid=110423 lk-owner=d0fd5ffb427f}
> > [2018-08-18 22:40:35.502692] W
> > [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server: releasing lock
> > on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
> > {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f}
> > [2018-08-18 22:40:35.502719] W
> > [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server: releasing lock
> > on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
> > {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f}
> > [2018-08-18 22:40:35.505950] I [MSGID: 101055]
> > [client_t.c:443:gf_client_unref] 0-home-server: Shutting down
> > connection brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
> >
> > Since I'm running another replica 3 setup for oVirt for a long time
> > now which is completely stable I thought I made a mistake setting
> > different options at first. However even when I reset those options
> > I'm able to reproduce the connection problem.
> >
> > The unoptimized volume setup looks like this:
> >
> > Volume Name: home
> > Type: Replicate
> > Volume ID: c92fa4cc-4a26-41ff-8c70-1dd07f733ac8
> > Status: Started
> > Snapshot Count: 0
> > Number of Bricks: 1 x 3 = 3
> > Transport-type: tcp
> > Bricks:
> > Brick1: sphere-four:/srv/gluster_home/brick
> > Brick2: sphere-five:/srv/gluster_home/brick
> > Brick3: sphere-six:/srv/gluster_home/brick
> > Options Reconfigured:
> > nfs.disable: on
> > transport.address-family: inet
> > cluster.quorum-type: auto
> > cluster.server-quorum-type: server
> > cluster.server-quorum-ratio: 50%
> >
> >
> > The following additional options were used before:
> >
> > performance.cache-size: 5GB
> > client.event-threads: 4
> > server.event-threads: 4
> > cluster.lookup-optimize: on
> > features.cache-invalidation: on
> > performance.stat-prefetch: on
> > performance.cache-invalidation: on
> > network.inode-lru-limit: 5
> > features.cache-invalidation-timeout: 600
> > performance.md-cache-timeout: 600
> > performance.parallel-readdir: on
> >
> >
> > In this case the gluster servers and also the client is using a
> > bonded network device running in adaptive load balancing mode.
> >
> > I've tried using the debug option for the client mount. But except
> > for a ~0.5TB log file I didn't get information that seems
> > helpful to me.
> >
> > Transferring just a couple of GB works without problems

Re: [Gluster-users] gluster connection interrupted during transfer

2018-08-30 Thread Raghavendra Gowdappa
Normally client logs will give a clue on why the disconnections are
happening (ping-timeout, wrong port, etc.). Can you look into the client
logs to figure out what's happening? If you can't find anything, can you
send across the client logs?

On Wed, Aug 29, 2018 at 6:11 PM, Richard Neuboeck 
wrote:

> Hi Gluster Community,
>
> I have problems with a glusterfs 'Transport endpoint not connected'
> connection abort during file transfers that I can replicate (all the
> time now) but not pinpoint as to why this is happening.
>
> The volume is set up in replica 3 mode and accessed with the fuse
> gluster client. Both client and server are running CentOS and the
> supplied 3.12.11 version of gluster.
>
> The connection abort happens at different times during rsync but
> occurs every time I try to sync all our files (1.1TB) to the empty
> volume.
>
> Client and server side I don't find errors in the gluster log files.
> rsync logs the obvious transfer problem. The only log that shows
> anything related is the server brick log which states that the
> connection is shutting down:
>
> [2018-08-18 22:40:35.502510] I [MSGID: 115036]
> [server.c:527:server_rpc_notify] 0-home-server: disconnecting
> connection from brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
> [2018-08-18 22:40:35.502620] W
> [inodelk.c:499:pl_inodelk_log_cleanup] 0-home-server: releasing lock
> on eaeb0398-fefd-486d-84a7-f13744d1cf10 held by
> {client=0x7f83ec0b3ce0, pid=110423 lk-owner=d0fd5ffb427f}
> [2018-08-18 22:40:35.502692] W
> [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server: releasing lock
> on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
> {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f}
> [2018-08-18 22:40:35.502719] W
> [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server: releasing lock
> on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
> {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f}
> [2018-08-18 22:40:35.505950] I [MSGID: 101055]
> [client_t.c:443:gf_client_unref] 0-home-server: Shutting down
> connection brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
>
> Since I'm running another replica 3 setup for oVirt for a long time
> now which is completely stable I thought I made a mistake setting
> different options at first. However even when I reset those options
> I'm able to reproduce the connection problem.
>
> The unoptimized volume setup looks like this:
>
> Volume Name: home
> Type: Replicate
> Volume ID: c92fa4cc-4a26-41ff-8c70-1dd07f733ac8
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: sphere-four:/srv/gluster_home/brick
> Brick2: sphere-five:/srv/gluster_home/brick
> Brick3: sphere-six:/srv/gluster_home/brick
> Options Reconfigured:
> nfs.disable: on
> transport.address-family: inet
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> cluster.server-quorum-ratio: 50%
>
>
> The following additional options were used before:
>
> performance.cache-size: 5GB
> client.event-threads: 4
> server.event-threads: 4
> cluster.lookup-optimize: on
> features.cache-invalidation: on
> performance.stat-prefetch: on
> performance.cache-invalidation: on
> network.inode-lru-limit: 5
> features.cache-invalidation-timeout: 600
> performance.md-cache-timeout: 600
> performance.parallel-readdir: on
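
These options map onto the usual gluster volume set/reset commands; a rough
sketch, assuming the volume name 'home' listed above:

  # apply a tuning option
  gluster volume set home performance.parallel-readdir on
  # revert a single option to its default
  gluster volume reset home performance.parallel-readdir
  # or revert every reconfigured option on the volume
  gluster volume reset home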
>
>
> In this case the gluster servers and also the client is using a
> bonded network device running in adaptive load balancing mode.
>
> I've tried using the debug option for the client mount. But except
> for a ~0.5TB log file I didn't get information that seems helpful to me.
>
> Transferring just a couple of GB works without problems.
>
> It may very well be that I'm already blind to the obvious but after
> many long running tests I can't find the crux in the setup.
>
> Does anyone have an idea as how to approach this problem in a way
> that sheds some useful information?
>
> Any help is highly appreciated!
> Cheers
> Richard
>
> --
> /dev/null
>
>
>
>
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] gluster connection interrupted during transfer

2018-08-30 Thread Richard Neuboeck
Hi Nithya,


On 08/30/2018 09:45 AM, Nithya Balachandran wrote:
> Hi Richard,
> 
> 
> 
> On 29 August 2018 at 18:11, Richard Neuboeck wrote:
> 
> Hi Gluster Community,
> 
> I have problems with a glusterfs 'Transport endpoint not connected'
> connection abort during file transfers that I can replicate (all the
> time now) but not pinpoint as to why this is happening.
> 
> The volume is set up in replica 3 mode and accessed with the fuse
> gluster client. Both client and server are running CentOS and the
> supplied 3.12.11 version of gluster.
> 
> The connection abort happens at different times during rsync but
> occurs every time I try to sync all our files (1.1TB) to the empty
> volume.
> 
> Client and server side I don't find errors in the gluster log files.
> rsync logs the obvious transfer problem. The only log that shows
> anything related is the server brick log which states that the
> connection is shutting down:
> 
> [2018-08-18 22:40:35.502510] I [MSGID: 115036]
> [server.c:527:server_rpc_notify] 0-home-server: disconnecting
> connection from
> brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
> [2018-08-18 22:40:35.502620] W
> [inodelk.c:499:pl_inodelk_log_cleanup] 0-home-server: releasing lock
> on eaeb0398-fefd-486d-84a7-f13744d1cf10 held by
> {client=0x7f83ec0b3ce0, pid=110423 lk-owner=d0fd5ffb427f}
> [2018-08-18 22:40:35.502692] W
> [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server: releasing lock
> on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
> {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f}
> [2018-08-18 22:40:35.502719] W
> [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server: releasing lock
> on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
> {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f}
> [2018-08-18 22:40:35.505950] I [MSGID: 101055]
> [client_t.c:443:gf_client_unref] 0-home-server: Shutting down
> connection brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0 
> 
> 
> Since I'm running another replica 3 setup for oVirt for a long time
> 
> 
> Is this setup running with the same gluster version and on the same
> nodes or is it a different cluster?


It's a different cluster (sphere-one, sphere-two and sphere-three)
but the same gluster version and basically the same hardware.

Cheers
Richard

> 
>  
> 
> now which is completely stable I thought I made a mistake setting
> different options at first. However even when I reset those options
> I'm able to reproduce the connection problem.
> 
> The unoptimized volume setup looks like this: 
> 
> 
> Volume Name: home
> Type: Replicate
> Volume ID: c92fa4cc-4a26-41ff-8c70-1dd07f733ac8
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: sphere-four:/srv/gluster_home/brick
> Brick2: sphere-five:/srv/gluster_home/brick
> Brick3: sphere-six:/srv/gluster_home/brick
> Options Reconfigured:
> nfs.disable: on
> transport.address-family: inet
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> cluster.server-quorum-ratio: 50%
> 
> 
> The following additional options were used before:
> 
> performance.cache-size: 5GB
> client.event-threads: 4
> server.event-threads: 4
> cluster.lookup-optimize: on
> features.cache-invalidation: on
> performance.stat-prefetch: on
> performance.cache-invalidation: on
> network.inode-lru-limit: 5
> features.cache-invalidation-timeout: 600
> performance.md-cache-timeout: 600
> performance.parallel-readdir: on
> 
> 
> In this case the gluster servers and also the client is using a
> bonded network device running in adaptive load balancing mode.
> 
> I've tried using the debug option for the client mount. But except
> for a ~0.5TB log file I didn't get information that seems
> helpful to me.
> 
> Transferring just a couple of GB works without problems.
> 
> It may very well be that I'm already blind to the obvious but after
> many long running tests I can't find the crux in the setup.
> 
> Does anyone have an idea as how to approach this problem in a way
> that sheds some useful information?
> 
> Any help is highly appreciated!
> Cheers
> Richard
> 
> -- 
> /dev/null
> 
> 
> 
> 
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org 
> https://lists.gluster.org/mailman/listinfo/gluster-users
> 
> 
> 


-- 
/dev/null



signature.asc
Description: OpenPGP digital signature
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.g

Re: [Gluster-users] gluster connection interrupted during transfer

2018-08-30 Thread Nithya Balachandran
Hi Richard,



On 29 August 2018 at 18:11, Richard Neuboeck  wrote:

> Hi Gluster Community,
>
> I have problems with a glusterfs 'Transport endpoint not connected'
> connection abort during file transfers that I can replicate (all the
> time now) but not pinpoint as to why this is happening.
>
> The volume is set up in replica 3 mode and accessed with the fuse
> gluster client. Both client and server are running CentOS and the
> supplied 3.12.11 version of gluster.
>
> The connection abort happens at different times during rsync but
> occurs every time I try to sync all our files (1.1TB) to the empty
> volume.
>
> Client and server side I don't find errors in the gluster log files.
> rsync logs the obvious transfer problem. The only log that shows
> anything related is the server brick log which states that the
> connection is shutting down:
>
> [2018-08-18 22:40:35.502510] I [MSGID: 115036]
> [server.c:527:server_rpc_notify] 0-home-server: disconnecting
> connection from brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
> [2018-08-18 22:40:35.502620] W
> [inodelk.c:499:pl_inodelk_log_cleanup] 0-home-server: releasing lock
> on eaeb0398-fefd-486d-84a7-f13744d1cf10 held by
> {client=0x7f83ec0b3ce0, pid=110423 lk-owner=d0fd5ffb427f}
> [2018-08-18 22:40:35.502692] W
> [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server: releasing lock
> on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
> {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f}
> [2018-08-18 22:40:35.502719] W
> [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server: releasing lock
> on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
> {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f}
> [2018-08-18 22:40:35.505950] I [MSGID: 101055]
> [client_t.c:443:gf_client_unref] 0-home-server: Shutting down
> connection brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0


> Since I'm running another replica 3 setup for oVirt for a long time
>

Is this setup running with the same gluster version and on the same nodes
or is it a different cluster?



> now which is completely stable I thought I made a mistake setting
> different options at first. However even when I reset those options
> I'm able to reproduce the connection problem.
>
> The unoptimized volume setup looks like this:


> Volume Name: home
> Type: Replicate
> Volume ID: c92fa4cc-4a26-41ff-8c70-1dd07f733ac8
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: sphere-four:/srv/gluster_home/brick
> Brick2: sphere-five:/srv/gluster_home/brick
> Brick3: sphere-six:/srv/gluster_home/brick
> Options Reconfigured:
> nfs.disable: on
> transport.address-family: inet
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> cluster.server-quorum-ratio: 50%
>
>
> The following additional options were used before:
>
> performance.cache-size: 5GB
> client.event-threads: 4
> server.event-threads: 4
> cluster.lookup-optimize: on
> features.cache-invalidation: on
> performance.stat-prefetch: on
> performance.cache-invalidation: on
> network.inode-lru-limit: 5
> features.cache-invalidation-timeout: 600
> performance.md-cache-timeout: 600
> performance.parallel-readdir: on
>
>
> In this case the gluster servers and also the client is using a
> bonded network device running in adaptive load balancing mode.
>
> I've tried using the debug option for the client mount. But except
> for a ~0.5TB log file I didn't get information that seems helpful to me.
>
> Transferring just a couple of GB works without problems.
>
> It may very well be that I'm already blind to the obvious but after
> many long running tests I can't find the crux in the setup.
>
> Does anyone have an idea as how to approach this problem in a way
> that sheds some useful information?
>
> Any help is highly appreciated!
> Cheers
> Richard
>
> --
> /dev/null
>
>
>
>
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users