Re: [Gluster-devel] [Gluster-infra] rebal-all-nodes-migrate.t always fails now

2019-06-04 Thread Yaniv Kaul
What was the result of this investigation? I suspect I'm seeing the same
issue on builder209 [1].
Y.

[1] https://build.gluster.org/job/centos7-regression/6302/consoleFull

On Fri, Apr 5, 2019 at 5:40 PM Michael Scherer  wrote:

> On Friday, 05 April 2019 at 16:55 +0530, Nithya Balachandran wrote:
> > On Fri, 5 Apr 2019 at 12:16, Michael Scherer 
> > wrote:
> >
> > > On Thursday, 04 April 2019 at 18:24 +0200, Michael Scherer wrote:
> > > > On Thursday, 04 April 2019 at 19:10 +0300, Yaniv Kaul wrote:
> > > > > I'm not convinced this is solved. Just had what I believe is a
> > > > > similar
> > > > > failure:
> > > > >
> > > > > *00:12:02.532* A dependency job for rpc-statd.service failed. See 'journalctl -xe' for details.
> > > > > *00:12:02.532* mount.nfs: rpc.statd is not running but is required for remote locking.
> > > > > *00:12:02.532* mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
> > > > > *00:12:02.532* mount.nfs: an incorrect mount option was specified
> > > > >
> > > > > (of course, it can always be my patch!)
> > > > >
> > > > > https://build.gluster.org/job/centos7-regression/5384/console
> > > >
> > > > same issue, different builder (206). I will check them all, as
> > > > the issue is more widespread than I expected (or it popped up
> > > > since the last time I checked).
> > >
> > > Deepshika noticed that the issue came back on one server
> > > (builder202) after a reboot, so the rpcbind issue is not related to
> > > the network initscript one; the RCA continues.
> > >
> > > We are looking for another workaround involving fiddling with the
> > > socket (until we find out why it uses ipv6 at boot, but not
> > > afterwards, when ipv6 is disabled).
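> > >
> > > A possible shape for that workaround, as a systemd drop-in; that
> > > these are the exact listeners shipped in rpcbind.socket on EL7 is
> > > an assumption:
> > >
> > > # /etc/systemd/system/rpcbind.socket.d/no-ipv6.conf
> > > [Socket]
> > > # empty assignments clear the shipped listener lists,
> > > # then we re-add only the ipv4 and unix ones
> > > ListenStream=
> > > ListenDatagram=
> > > ListenStream=/var/run/rpcbind.sock
> > > ListenStream=0.0.0.0:111
> > > ListenDatagram=0.0.0.0:111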
> > >
> >
> > Could this be relevant?
> > https://access.redhat.com/solutions/2798411
>
> Good catch.
>
> So, we already do that; Nigel took care of it (after 2 days of
> research). But I didn't know the exact symptoms, and decided to double
> check just in case.
>
> And... there is no sysctl.conf in the initrd. Running dracut -v -f does
> not change anything.
>
> Running "dracut -v -f -H" takes care of that (and fixes the problem),
> but:
> - our ansible script already runs that
> - -H is hostonly, which is already the default on EL7 according to the
> doc.
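>
> For reference, a quick way to check is with lsinitrd (shipped with
> dracut); that the file shows up under a sysctl path inside the image is
> an assumption here:
>
> # list sysctl-related files embedded in the current initrd
> lsinitrd /boot/initramfs-$(uname -r).img | grep -i sysctl
> # rebuild the initrd in host-only mode so host config is included
> dracut -v -f -H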
>
> However, if dracut-config-generic is installed, it doesn't build a
> hostonly initrd, and so does not include the sysctl.conf file (which
> breaks rpcbind, which breaks the test suite).
>
> And for some reason, it is installed in the image in ec2 (likely the
> default), but not by default on the builders.
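>
> A quick way to tell the two kinds of hosts apart (assuming an RPM
> based image):
>
> # present on the ec2 image, absent (by default) on the builders
> rpm -q dracut-config-generic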
>
> So what happens is that after a kernel upgrade, dracut rebuilds a
> generic initrd instead of a hostonly one, which breaks things. And the
> kernel was likely upgraded recently (upgrades happen nightly, for some
> value of "night"), so we didn't see that earlier, nor with a fresh
> system.
>
>
> So now, we have several solutions:
> - be explicit about using hostonly in dracut, so this doesn't happen
> again (or not for this reason); a sketch follows this list
>
> - disable ipv6 in rpcbind in a cleaner way (to be tested), e.g. along
> the lines of the socket drop-in sketched above
>
> - get the test suite to work with ipv6
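>
> A sketch of the first option (the drop-in file name is arbitrary;
> hostonly is a documented dracut.conf setting, and a 99- prefixed
> drop-in should win over the one dracut-config-generic ships):
>
> # force host-only images even when dracut-config-generic is installed
> echo 'hostonly="yes"' > /etc/dracut.conf.d/99-force-hostonly.conf
> dracut -v -f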
>
> In the long term, I also want to monitor the processes, but for that, I
> need a VPN between the nagios server and ec2, and that project got
> blocked by several issues (like EC2 not supporting ecdsa keys, which we
> use for ansible, so we have to go back to RSA for fully automated
> deployment; and openvpn requires certificates, so I need a newer python
> openssl to do what I want, and RHEL 7 is too old, etc, etc).
>
> As the weekend approaches for me, I just rebuilt the initrd for the
> time being. I guess forcing hostonly is the safest fix for now, but
> this will be for Monday.
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
>
>
___

Community Meeting Calendar:

APAC Schedule -
Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/836554017

NA/EMEA Schedule -
Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/486278655

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



[Gluster-devel] GETXATTR op pending on index xlator for more than 10 hours

2019-06-04 Thread Xie Changlong

Hi all,

Today I found gnfs GETXATTR bailing out on gluster release 3.12.0. I have a
simple 4 x 2 Distributed-Replicate volume.

[2019-06-03 19:58:33.085880] E [rpc-clnt.c:185:call_bail] 0-cl25vol01-client-4:
bailing out frame type(GlusterFS 3.3) op(GETXATTR(18)) xid=0x21de4275
sent = 2019-06-03 19:28:30.552356. timeout = 1800 for 10.3.133.57:49153

xid = 0x21de4275 = 568214133

Then I dumped brick 10.3.133.57:49153 and found the GETXATTR op pending on
the index xlator for more than 10 hours!
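
A hedged sketch of the steps (gluster volume statedump is the usual way
to produce these dumps; that the dumps land in /var/run/gluster is the
default but may differ per build):

# convert the xid from the bail-out log to decimal
printf '%d\n' 0x21de4275          # -> 568214133
# dump all bricks of the volume, then search for the stuck frame
gluster volume statedump cl25vol01
grep -rn 568214133 /var/run/gluster/*.dump.*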

[root@node0001 gluster]# grep -rn 568214133 gluster-brick-1-cl25vol01.6078.dump.15596*
gluster-brick-1-cl25vol01.6078.dump.1559617125:5093:unique=568214133
gluster-brick-1-cl25vol01.6078.dump.1559618121:5230:unique=568214133
gluster-brick-1-cl25vol01.6078.dump.1559618912:5434:unique=568214133
gluster-brick-1-cl25vol01.6078.dump.1559628467:6921:unique=568214133

[root@node0001 gluster]# date -d @1559617125
Tue Jun  4 10:58:45 CST 2019
[root@node0001 gluster]# date -d @1559628467
Tue Jun  4 14:07:47 CST 2019

[global.callpool.stack.115]
stack=0x7f8b342623c0
uid=500
gid=500
pid=-6
unique=568214133
lk-owner=faff
op=stack
type=0
cnt=4

[global.callpool.stack.115.frame.1]
frame=0x7f8b1d6fb540
ref_count=0
translator=cl25vol01-index
complete=0
parent=cl25vol01-quota
wind_from=quota_getxattr
wind_to=(this->children->xlator)->fops->getxattr
unwind_to=default_getxattr_cbk

[global.callpool.stack.115.frame.2]
frame=0x7f8b30a14da0
ref_count=1
translator=cl25vol01-quota
complete=0
parent=cl25vol01-io-stats
wind_from=io_stats_getxattr
wind_to=(this->children->xlator)->fops->getxattr
unwind_to=io_stats_getxattr_cbk

[global.callpool.stack.115.frame.3]
frame=0x7f8b6debada0
ref_count=1
translator=cl25vol01-io-stats
complete=0
parent=cl25vol01-server
wind_from=server_getxattr_resume
wind_to=FIRST_CHILD(this)->fops->getxattr
unwind_to=server_getxattr_cbk

[global.callpool.stack.115.frame.4]
frame=0x7f8b21962a60
ref_count=1
translator=cl25vol01-server
complete=0



I've checked the code logic and found nothing; any advice? The setup is
still in this state on my side, so we can dig deeper.

Thanks



Re: [Gluster-devel] Memory leak in glusterfs

2019-06-04 Thread ABHISHEK PALIWAL
Hi Team,

Please respond to the issue I raised.

Regards,
Abhishek

On Fri, May 17, 2019 at 2:46 PM ABHISHEK PALIWAL 
wrote:

> Anyone please reply
>
> On Thu, May 16, 2019, 10:49 ABHISHEK PALIWAL 
> wrote:
>
>> Hi Team,
>>
>> I uploaded some valgrind logs from my gluster 5.4 setup. It writes to
>> the volume every 15 minutes. I stopped glusterd and then copied the
>> logs away. The test was running for some simulated days. They are
>> zipped in valgrind-54.zip.
>>
>> Lots of info in valgrind-2730.log. Lots of possibly lost bytes in
>> glusterfs and even some definitely lost bytes.
>>
>> ==2737== 1,572,880 bytes in 1 blocks are possibly lost in loss record 391
>> of 391
>> ==2737== at 0x4C29C25: calloc (in
>> /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
>> ==2737== by 0xA22485E: ??? (in
>> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
>> ==2737== by 0xA217C94: ??? (in
>> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
>> ==2737== by 0xA21D9F8: ??? (in
>> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
>> ==2737== by 0xA21DED9: ??? (in
>> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
>> ==2737== by 0xA21E685: ??? (in
>> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
>> ==2737== by 0xA1B9D8C: init (in
>> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
>> ==2737== by 0x4E511CE: xlator_init (in /usr/lib64/libglusterfs.so.0.0.1)
>> ==2737== by 0x4E8A2B8: ??? (in /usr/lib64/libglusterfs.so.0.0.1)
>> ==2737== by 0x4E8AAB3: glusterfs_graph_activate (in
>> /usr/lib64/libglusterfs.so.0.0.1)
>> ==2737== by 0x409C35: glusterfs_process_volfp (in /usr/sbin/glusterfsd)
>> ==2737== by 0x409D99: glusterfs_volumes_init (in /usr/sbin/glusterfsd)
>> ==2737==
>> ==2737== LEAK SUMMARY:
>> ==2737== definitely lost: 1,053 bytes in 10 blocks
>> ==2737== indirectly lost: 317 bytes in 3 blocks
>> ==2737== possibly lost: 2,374,971 bytes in 524 blocks
>> ==2737== still reachable: 53,277 bytes in 201 blocks
>> ==2737== suppressed: 0 bytes in 0 blocks
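>>
>> A hedged sketch of how to get the '???' frames above resolved and
>> re-capture the report (the package name glusterfs-debuginfo and the
>> log path are assumptions; glusterd -N keeps it in the foreground):
>>
>> # install debug symbols so valgrind can name frames in glusterd.so
>> yum install -y glusterfs-debuginfo
>> # re-run glusterd under valgrind with a full leak check
>> valgrind --leak-check=full --show-leak-kinds=all \
>>   --log-file=/tmp/valgrind-glusterd.log glusterd -N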
>>
>> --
>>
>>
>>
>>
>> Regards
>> Abhishek Paliwal
>>
>

-- 




Regards
Abhishek Paliwal