Re: [Gluster-devel] spurious failures for ./tests/basic/afr/root-squash-self-heal.t
From the glusterd log:

[2016-08-31 07:54:24.817811] E [run.c:191:runner_log] (-->/build/install/lib/glusterfs/3.9dev/xlator/mgmt/glusterd.so(+0xe1c30) [0x7f1a34ebac30] -->/build/install/lib/glusterfs/3.9dev/xlator/mgmt/glusterd.so(+0xe1794) [0x7f1a34eba794] -->/build/install/lib/libglusterfs.so.0(runner_log+0x1ae) [0x7f1a3fa15cea] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=patchy --first=yes --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
[2016-08-31 07:54:24.819166]:++ G_LOG:./tests/basic/afr/root-squash-self-heal.t: TEST: 20 1 afr_child_up_status patchy 0 ++

The above is spawned from a "volume start force". I checked the brick logs and the killed brick had started successfully.

Links to failures:
https://build.gluster.org/job/centos6-regression/429/console
https://build.gluster.org/job/netbsd7-regression/358/consoleFull

Thanks,
Susant

- Original Message -
> From: "Susant Palai"
> To: "gluster-devel"
> Sent: Thursday, 1 September, 2016 12:13:01 PM
> Subject: [Gluster-devel] spurious failures for ./tests/basic/afr/root-squash-self-heal.t
>
> Hi,
> $subject is failing spuriously for one of my patches.
> One of the test cases is: EXPECT_WITHIN $PROCESS_UP_TIMEOUT "1" afr_child_up_status $V0 0
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
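For readers unfamiliar with the test framework: the failing check polls afr_child_up_status until it reports "1" or $PROCESS_UP_TIMEOUT expires. The sketch below only illustrates that polling pattern; the names afr_child_up_status, $V0 and $PROCESS_UP_TIMEOUT come from the quoted test, while poll_expect and its loop body are illustrative stand-ins, not the framework's actual EXPECT_WITHIN implementation.

---
#!/bin/bash
# Illustration of the EXPECT_WITHIN polling pattern used by the .t tests.
# Usage: poll_expect <timeout-seconds> <expected-value> <command...>
poll_expect ()
{
    local timeout=$1 expected=$2 got;
    shift 2;
    local end=$((SECONDS + timeout));
    while [ $SECONDS -lt $end ]; do
        got=$("$@" 2>/dev/null);
        [ "$got" = "$expected" ] && return 0;   # expected value seen, check passes
        sleep 1;                                # retry until the timeout
    done
    echo "expected '$expected', got '$got'" >&2;
    return 1;
}

# Roughly what the failing line asserts: brick 0 of volume $V0 is reported
# up within $PROCESS_UP_TIMEOUT seconds of the "volume start force".
# poll_expect "$PROCESS_UP_TIMEOUT" "1" afr_child_up_status "$V0" 0
---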
Re: [Gluster-devel] Spurious failures in ec/quota.t and distribute/bug-860663.t
- Original Message -
> From: "Poornima Gurusiddaiah"
> To: "Gluster Devel" , "Manikandan Selvaganesan" , "Susant Palai" , "Nithya Balachandran"
> Sent: Tuesday, March 1, 2016 4:49:51 PM
> Subject: [Gluster-devel] Spurious failures in ec/quota.t and distribute/bug-860663.t
>
> Hi,
>
> I see these test cases failing spuriously,
>
> ./tests/basic/ec/quota.t Failed Tests: 1-13, 16, 18, 20, 2
> https://build.gluster.org/job/rackspace-regression-2GB-triggered/18637/consoleFull
> ./tests/bugs/distribute/bug-860663.t Failed Test: 13
> https://build.gluster.org/job/rackspace-regression-2GB-triggered/18622/consoleFull

The test which failed is just an umount; not sure why it failed:

# Unmount and remount to make sure we're doing fresh lookups.
TEST umount $M0

Alternatively we can have another fresh mount on, say, $M1 and run the future tests there (see the sketch below). Can you check whether patch [1] fixes your issue (push your patch as a dependency of [1])?

[1] http://review.gluster.org/13567

> Could any one from Quota and dht look into it?
> Regards,
> Poornima
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
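The alternative suggested above (keeping $M0 mounted and taking a fresh mount on $M1 for the later lookups) would look roughly like this in the .t framework. This is only a sketch: $H0, $V0 and $M1 follow the conventions used by these tests, and $M1/somefile is a hypothetical path standing in for whatever the later checks actually access.

---
# Instead of: TEST umount $M0
# take a second, fresh mount and run the post-remount lookups there.
TEST glusterfs --volfile-server=$H0 --volfile-id=$V0 $M1

# Fresh lookups now go through $M1, so a busy or slow unmount of $M0
# can no longer fail the test.
TEST stat $M1/somefile
---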
Re: [Gluster-devel] Spurious failures in ec/quota.t and distribute/bug-860663.t
Thank You, have rebased the patch. Regards, Poornima - Original Message - > From: "Xavier Hernandez" <xhernan...@datalab.es> > To: "Poornima Gurusiddaiah" <pguru...@redhat.com>, "Gluster Devel" > <gluster-devel@gluster.org>, "Manikandan > Selvaganesan" <mselv...@redhat.com>, "Susant Palai" <spa...@redhat.com>, > "Nithya Balachandran" <nbala...@redhat.com> > Sent: Tuesday, March 1, 2016 4:57:11 PM > Subject: Re: [Gluster-devel] Spurious failures in ec/quota.t and > distribute/bug-860663.t > > Hi Poornima, > > On 01/03/16 12:19, Poornima Gurusiddaiah wrote: > > Hi, > > > > I see these test cases failing spuriously, > > > > ./tests/basic/ec/quota.t Failed Tests: 1-13, 16, 18, 20, 2 > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/18637/consoleFull > > This is already solved by http://review.gluster.org/13446/. It has been > merged just a couple hours ago. > > Xavi > > > > > ./tests/bugs/distribute/bug-860663.t Failed Test: 13 > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/18622/consoleFull > > > > Could any one from Quota and dht look into it? > > > > Regards, > > Poornima > > > > > > ___ > > Gluster-devel mailing list > > Gluster-devel@gluster.org > > http://www.gluster.org/mailman/listinfo/gluster-devel > > > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures in ec/quota.t and distribute/bug-860663.t
Hi Poornima, Below patch might solve the regression failure for ''./tests/basic/ec/quota.t' http://review.gluster.org/#/c/13446/ http://review.gluster.org/#/c/13447/ Thanks, Vijay On Tue, Mar 1, 2016 at 4:49 PM, Poornima Gurusiddaiahwrote: > Hi, > > I see these test cases failing spuriously, > > ./tests/basic/ec/quota.t Failed Tests: 1-13, 16, 18, 20, 2 > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/18637/consoleFull > > ./tests/bugs/distribute/bug-860663.t Failed Test: 13 > https://build.gluster.org/job/rackspace-regression-2GB-triggered/18622/consoleFull > > Could any one from Quota and dht look into it? > > Regards, > Poornima > > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures
Thanks Michael! Thanks and Regards, Kotresh H R - Original Message - > From: "Michael Scherer"> To: "Kotresh Hiremath Ravishankar" > Cc: "Krutika Dhananjay" , "Atin Mukherjee" > , "Gaurav Garg" > , "Aravinda" , "Gluster Devel" > > Sent: Thursday, 24 September, 2015 11:09:52 PM > Subject: Re: Spurious failures > > Le jeudi 24 septembre 2015 à 07:59 -0400, Kotresh Hiremath Ravishankar a > écrit : > > Thank you:) and also please check the script I had given passes in all > > machines > > So it worked everywhere, but on slave0 and slave1. Not sure what is > wrong, or if they are used, I will check later. > > > -- > Michael Scherer > Sysadmin, Community Infrastructure and Platform, OSAS > > > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures
On Thursday, 24 September 2015 at 06:50 -0400, Kotresh Hiremath Ravishankar wrote:
> >>> Ok, this definitely requires some tests and thoughts. It only uses ipv4 too?
> >>> (I guess yes, since ipv6 is removed from the rackspace build slaves)
>
> Yes!
>
> Could we know when these settings can be done on all linux slave machines?
> If it takes some time, we should consider moving all geo-rep testcases under bad tests till then.

I will do that this afternoon, now that I have a clear idea of what needs to be done.
(I already pushed the path change.)

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures
>>> Ok, this definitely requires some tests and toughts. It only use ipv4 >>> too ? >>> (I guess yes, since ipv6 is removed from the rackspace build slaves) Yes! Could we know when can these settings be done on all linux slave machines? If it takes sometime, we should consider moving all geo-rep testcases under bad tests till then. Thanks and Regards, Kotresh H R - Original Message - > From: "Michael Scherer"> To: "Kotresh Hiremath Ravishankar" > Cc: "Krutika Dhananjay" , "Atin Mukherjee" > , "Gaurav Garg" > , "Aravinda" , "Gluster Devel" > > Sent: Thursday, 24 September, 2015 1:18:16 PM > Subject: Re: Spurious failures > > Le jeudi 24 septembre 2015 à 02:24 -0400, Kotresh Hiremath Ravishankar a > écrit : > > Hi, > > > > >>>So, it is ok if I restrict that to be used only on 127.0.0.1 ? > > I think no, testcases use 'H0' to create volumes > > H0=${H0:=`hostname`}; > > Geo-rep expects passwordLess SSH to 'H0' > > > > Ok, this definitely requires some tests and toughts. It only use ipv4 > too ? > (I guess yes, since ipv6 is removed from the rackspace build slaves) > -- > Michael Scherer > Sysadmin, Community Infrastructure and Platform, OSAS > > > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures
Thank you:) and also please check the script I had given passes in all machines Thanks and Regards, Kotresh H R - Original Message - > From: "Michael Scherer"> To: "Kotresh Hiremath Ravishankar" > Cc: "Krutika Dhananjay" , "Atin Mukherjee" > , "Gaurav Garg" > , "Aravinda" , "Gluster Devel" > > Sent: Thursday, 24 September, 2015 5:00:43 PM > Subject: Re: Spurious failures > > Le jeudi 24 septembre 2015 à 06:50 -0400, Kotresh Hiremath Ravishankar a > écrit : > > >>> Ok, this definitely requires some tests and toughts. It only use ipv4 > > >>> too ? > > >>> (I guess yes, since ipv6 is removed from the rackspace build slaves) > > > > Yes! > > > > Could we know when can these settings be done on all linux slave > > machines? > > If it takes sometime, we should consider moving all geo-rep testcases > > under bad tests > > till then. > > I will do that this afternoon, now I have a clear idea of what need to > be done. > ( I already pushed the path change ) > > -- > Michael Scherer > Sysadmin, Community Infrastructure and Platform, OSAS > > > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures
Hi, >>>So, it is ok if I restrict that to be used only on 127.0.0.1 ? I think no, testcases use 'H0' to create volumes H0=${H0:=`hostname`}; Geo-rep expects passwordLess SSH to 'H0' Thanks and Regards, Kotresh H R - Original Message - > From: "Michael Scherer"> To: "Kotresh Hiremath Ravishankar" > Cc: "Krutika Dhananjay" , "Atin Mukherjee" > , "Gaurav Garg" > , "Aravinda" , "Gluster Devel" > > Sent: Wednesday, 23 September, 2015 5:05:58 PM > Subject: Re: Spurious failures > > Le mercredi 23 septembre 2015 à 06:24 -0400, Kotresh Hiremath > Ravishankar a écrit : > > Hi Michael, > > > > Please find my replies below. > > > > >>> Root login using password should be disabled, so no. If that's still > > >>> working and people use it, that's gonna change soon, too much problems > > >>> with it. > > > > Ok > > > > >>>Can you be more explicit on where should the user come from so I can > > >>>properly integrate that ? > > > > It's just PasswordLess SSH from root to root on to same host. > > 1. Generate ssh key: > > #ssh-keygen > > 2. Add it to /root/.ssh/authorized_keys > > #ssh-copy-id -i root@host > > > > Requirement by geo-replication: > > 'ssh root@host' should not ask for password > > So, it is ok if I restrict that to be used only on 127.0.0.1 ? > > -- > Michael Scherer > Sysadmin, Community Infrastructure and Platform, OSAS > > > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures
On Thursday, 24 September 2015 at 02:24 -0400, Kotresh Hiremath Ravishankar wrote:
> Hi,
>
> >>> So, it is ok if I restrict that to be used only on 127.0.0.1 ?
> I think no, testcases use 'H0' to create volumes
> H0=${H0:=`hostname`};
> Geo-rep expects passwordLess SSH to 'H0'

Ok, this definitely requires some tests and thoughts. It only uses ipv4 too?
(I guess yes, since ipv6 is removed from the rackspace build slaves)

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures
On Thursday, 24 September 2015 at 07:59 -0400, Kotresh Hiremath Ravishankar wrote:
> Thank you :) and also please check that the script I had given passes on all machines

So it worked everywhere except on slave0 and slave1. Not sure what is wrong, or whether they are even used; I will check later.

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures
Hi Michael,

Please find my replies below.

>>> Root login using password should be disabled, so no. If that's still
>>> working and people use it, that's gonna change soon, too much problems
>>> with it.

Ok

>>> Can you be more explicit on where should the user come from so I can
>>> properly integrate that ?

It's just passwordless SSH from root to root on the same host.
1. Generate an ssh key:
   # ssh-keygen
2. Add it to /root/.ssh/authorized_keys:
   # ssh-copy-id -i root@host

Requirement by geo-replication: 'ssh root@host' should not ask for a password.

>>> There is something adding lots of line to /root/.ssh/authorized_keys on
>>> the slave, and this make me quite unconfortable, so if that's it, I
>>> rather have it done cleanly, and for that, I need to understand the
>>> test, and the requirement.

Yes, geo-rep is doing it. It adds an entry only once per session, but since the test is running continuously for different patches, it keeps building up. I will submit a patch to clean it up in the geo-rep test suite itself.

>>> I will do this one.

Thank you!

>>> Is georep supposed to work on other platform like freebsd ? ( because
>>> freebsd do not have bash, so I have to adapt to local way, but if that's
>>> not gonna be tested, I rather not spend too much time on reading the
>>> handbook for now )

As of now it is supported only on Linux; it has known issues with other platforms such as NetBSD.

Thanks and Regards,
Kotresh H R

- Original Message -
> From: "Michael Scherer"
> To: "Kotresh Hiremath Ravishankar"
> Cc: "Krutika Dhananjay" , "Atin Mukherjee" , "Gaurav Garg" , "Aravinda" , "Gluster Devel"
> Sent: Wednesday, September 23, 2015 3:30:39 PM
> Subject: Re: Spurious failures
>
> Le mercredi 23 septembre 2015 à 03:25 -0400, Kotresh Hiremath Ravishankar a écrit :
> > Hi Krutika,
> >
> > Looks like the prerequisites for geo-replication to work is changed
> > in slave21
> >
> > Hi Michael,
>
> Hi,
>
> > Could you please check following settings are made in all linux regression
> > machines?
>
> Yeah, I will add to salt.
>
> > Or provide me with root password so that I can verify.
>
> Root login using password should be disabled, so no. If that's still
> working and people use it, that's gonna change soon, too much problems
> with it.
>
> > 1. Setup Passwordless SSH for the root user:
>
> Can you be more explicit on where should the user come from so I can
> properly integrate that ?
>
> There is something adding lots of line to /root/.ssh/authorized_keys on
> the slave, and this make me quite unconfortable, so if that's it, I
> rather have it done cleanly, and for that, I need to understand the
> test, and the requirement.
>
> > 2. Add below line in /root/.bashrc. This is required as geo-rep does
> > "gluster --version" via ssh and it can't find the gluster PATH via ssh.
> > export PATH=$PATH:/build/install/sbin:/build/install/bin
>
> I will do this one.
>
> Is georep supposed to work on other platform like freebsd ? ( because
> freebsd do not have bash, so I have to adapt to local way, but if that's
> not gonna be tested, I rather not spend too much time on reading the
> handbook for now )
>
> > Once above settings are done, the following script should output proper
> > version.
> > ---
> > #!/bin/bash
> >
> > function SSHM()
> > {
> >     ssh -q \
> >         -oPasswordAuthentication=no \
> >         -oStrictHostKeyChecking=no \
> >         -oControlMaster=yes \
> >         "$@";
> > }
> >
> > function cmd_slave()
> > {
> >     local cmd_line;
> >     cmd_line=$(cat <<EOF
> > function do_verify() {
> >     ver=\$(gluster --version | head -1 | cut -f2 -d " ");
> >     echo \$ver;
> > };
> > source /etc/profile && do_verify;
> > EOF
> > );
> >     echo $cmd_line;
> > }
> >
> > HOST=$1
> > cmd_line=$(cmd_slave);
> > ver=`SSHM root@$HOST bash -c "'$cmd_line'"`;
> > echo $ver
> > ---
> >
> > I could verify for slave32.
> > [root@slave32 ~]# vi /tmp/gver.sh
> > [root@slave32 ~]# /tmp/gver.sh slave32
> > 3.8dev
> >
> > Please help me in verifying the same for all the linux regression machines.

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
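The two prerequisites Kotresh lists above (passwordless root SSH to the local hostname, and the PATH export in /root/.bashrc) can be scripted roughly as follows. This is a sketch only, assuming root access on the slave and that the regression build installs gluster under /build/install as stated above; it is not the salt state Michael mentions.

---
#!/bin/bash
# Sketch: apply the two geo-rep prerequisites described above on a slave.

# 1. Passwordless SSH from root to root on the same host.
[ -f /root/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa
ssh-copy-id -i /root/.ssh/id_rsa.pub root@"$(hostname)"

# 2. Make the regression build's gluster binaries visible over ssh.
grep -q '/build/install/sbin' /root/.bashrc || \
    echo 'export PATH=$PATH:/build/install/sbin:/build/install/bin' >> /root/.bashrc

# Verify: must not prompt for a password and should print a version string.
ssh -oPasswordAuthentication=no root@"$(hostname)" \
    'export PATH=$PATH:/build/install/sbin:/build/install/bin; gluster --version | head -1'
---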
Re: [Gluster-devel] Spurious failures
On Wednesday, 23 September 2015 at 06:24 -0400, Kotresh Hiremath Ravishankar wrote:
> Hi Michael,
>
> Please find my replies below.
>
> >>> Root login using password should be disabled, so no. If that's still
> >>> working and people use it, that's gonna change soon, too much problems
> >>> with it.
>
> Ok
>
> >>> Can you be more explicit on where should the user come from so I can
> >>> properly integrate that ?
>
> It's just PasswordLess SSH from root to root on to same host.
> 1. Generate ssh key:
>    #ssh-keygen
> 2. Add it to /root/.ssh/authorized_keys
>    #ssh-copy-id -i root@host
>
> Requirement by geo-replication:
> 'ssh root@host' should not ask for password

So, is it ok if I restrict that to be used only on 127.0.0.1?

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures
On Wednesday, 23 September 2015 at 03:25 -0400, Kotresh Hiremath Ravishankar wrote:
> Hi Krutika,
>
> Looks like the prerequisites for geo-replication to work is changed
> in slave21
>
> Hi Michael,

Hi,

> Could you please check following settings are made in all linux regression
> machines?

Yeah, I will add to salt.

> Or provide me with root password so that I can verify.

Root login using password should be disabled, so no. If that's still
working and people use it, that's gonna change soon, too much problems
with it.

> 1. Setup Passwordless SSH for the root user:

Can you be more explicit on where should the user come from so I can
properly integrate that ?

There is something adding lots of lines to /root/.ssh/authorized_keys on
the slave, and this makes me quite uncomfortable, so if that's it, I
rather have it done cleanly, and for that, I need to understand the
test, and the requirement.

> 2. Add below line in /root/.bashrc. This is required as geo-rep does "gluster
> --version" via ssh and it can't find the gluster PATH via ssh.
> export PATH=$PATH:/build/install/sbin:/build/install/bin

I will do this one.

Is georep supposed to work on other platforms like freebsd ? ( because
freebsd does not have bash, so I have to adapt to the local way, but if that's
not gonna be tested, I rather not spend too much time on reading the
handbook for now )

> Once above settings are done, the following script should output the proper
> version.
>
> ---
> #!/bin/bash
>
> function SSHM()
> {
>     ssh -q \
>         -oPasswordAuthentication=no \
>         -oStrictHostKeyChecking=no \
>         -oControlMaster=yes \
>         "$@";
> }
>
> function cmd_slave()
> {
>     local cmd_line;
>     cmd_line=$(cat <<EOF
> function do_verify() {
>     ver=\$(gluster --version | head -1 | cut -f2 -d " ");
>     echo \$ver;
> };
> source /etc/profile && do_verify;
> EOF
> );
>     echo $cmd_line;
> }
>
> HOST=$1
> cmd_line=$(cmd_slave);
> ver=`SSHM root@$HOST bash -c "'$cmd_line'"`;
> echo $ver
> ---
>
> I could verify for slave32.
> [root@slave32 ~]# vi /tmp/gver.sh
> [root@slave32 ~]# /tmp/gver.sh slave32
> 3.8dev
>
> Please help me in verifying the same for all the linux regression machines.

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures
Hi Krutika, It's failing with ++ gluster --mode=script --wignore volume geo-rep master slave21.cloud.gluster.org::slave create push-pem Gluster version mismatch between master and slave. I will look into it. Thanks and Regards, Kotresh H R - Original Message - > From: "Krutika Dhananjay"> To: "Atin Mukherjee" > Cc: "Gluster Devel" , "Gaurav Garg" > , "Aravinda" , > "Kotresh Hiremath Ravishankar" > Sent: Tuesday, September 22, 2015 9:03:44 PM > Subject: Re: Spurious failures > > Ah! Sorry. I didn't read that line. :) > > Just figured even ./tests/geo-rep/georep-basic-dr-rsync.t is added to bad > tests list. > > So it's just /tests/geo-rep/georep-basic-dr-tarssh.t for now. > > Thanks Atin! > > -Krutika > > - Original Message - > > > From: "Atin Mukherjee" > > To: "Krutika Dhananjay" > > Cc: "Gluster Devel" , "Gaurav Garg" > > , "Aravinda" , "Kotresh Hiremath > > Ravishankar" > > Sent: Tuesday, September 22, 2015 8:51:22 PM > > Subject: Re: Spurious failures > > > ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t (Wstat: > > 0 Tests: 8 Failed: 2) > > Failed tests: 6, 8 > > Files=1, Tests=8, 48 wallclock secs ( 0.01 usr 0.01 sys + 0.88 cusr > > 0.56 csys = 1.46 CPU) > > Result: FAIL > > ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t: bad > > status 1 > > *Ignoring failure from known-bad test > > ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t* > > [11:24:16] ./tests/bugs/glusterd/bug-1242543-replace-brick.t .. ok > > 17587 ms > > [11:24:16] > > All tests successful > > > On 09/22/2015 08:46 PM, Krutika Dhananjay wrote: > > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/14421/consoleFull > > > > > > Ctrl + f 'not ok'. > > > > > > -Krutika > > > > > > > > > > > > *From: *"Atin Mukherjee" > > > *To: *"Krutika Dhananjay" , "Gluster Devel" > > > > > > *Cc: *"Gaurav Garg" , "Aravinda" > > > , "Kotresh Hiremath Ravishankar" > > > > > > *Sent: *Tuesday, September 22, 2015 8:39:56 PM > > > *Subject: *Re: Spurious failures > > > > > > Krutika, > > > > > > ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t is > > > already a part of bad_tests () in both mainline and 3.7. Could you > > > provide me the link where this test has failed explicitly and that has > > > caused the regression to fail? > > > > > > ~Atin > > > > > > > > > On 09/22/2015 07:27 PM, Krutika Dhananjay wrote: > > > > Hi, > > > > > > > > The following tests seem to be failing consistently on the build > > > > machines in Linux: > > > > > > > > ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t .. > > > > > > > > ./tests/geo-rep/georep-basic-dr-rsync.t .. > > > > > > > > ./tests/geo-rep/georep-basic-dr-tarssh.t .. > > > > > > > > I have added these tests into the tracker etherpad. > > > > > > > > Meanwhile could someone from geo-rep and glusterd team take a look or > > > > perhaps move them to bad tests list? > > > > > > > > > > > > Here is one place where the three tests failed: > > > > > > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/14421/consoleFull > > > > > > > > -Krutika > > > > > > > > > > > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures
Hi Krutika,

Looks like the prerequisites for geo-replication to work have changed on slave21.

Hi Michael,

Could you please check that the following settings are made on all linux regression machines? Or provide me with the root password so that I can verify.

1. Setup Passwordless SSH for the root user:

2. Add the below line in /root/.bashrc. This is required as geo-rep does "gluster --version" via ssh and it can't find the gluster PATH via ssh.
   export PATH=$PATH:/build/install/sbin:/build/install/bin

Once the above settings are done, the following script should output the proper version.

---
#!/bin/bash

function SSHM()
{
    ssh -q \
        -oPasswordAuthentication=no \
        -oStrictHostKeyChecking=no \
        -oControlMaster=yes \
        "$@";
}

function cmd_slave()
{
    local cmd_line;
    cmd_line=$(cat <<EOF
function do_verify() {
    ver=\$(gluster --version | head -1 | cut -f2 -d " ");
    echo \$ver;
};
source /etc/profile && do_verify;
EOF
);
    echo $cmd_line;
}

HOST=$1
cmd_line=$(cmd_slave);
ver=`SSHM root@$HOST bash -c "'$cmd_line'"`;
echo $ver
---

I could verify for slave32.
[root@slave32 ~]# vi /tmp/gver.sh
[root@slave32 ~]# /tmp/gver.sh slave32
3.8dev

Please help me in verifying the same for all the linux regression machines.

Thanks and Regards,
Kotresh H R

- Original Message - > From: "Kotresh Hiremath Ravishankar"> To: "Krutika Dhananjay" > Cc: "Atin Mukherjee" , "Gluster Devel" > , "Gaurav Garg" > , "Aravinda" > Sent: Wednesday, September 23, 2015 12:31:12 PM > Subject: Re: Spurious failures > > Hi Krutika, > > It's failing with > > ++ gluster --mode=script --wignore volume geo-rep master > slave21.cloud.gluster.org::slave create push-pem > Gluster version mismatch between master and slave. > > I will look into it. > > Thanks and Regards, > Kotresh H R > > - Original Message - > > From: "Krutika Dhananjay" > > To: "Atin Mukherjee" > > Cc: "Gluster Devel" , "Gaurav Garg" > > , "Aravinda" , > > "Kotresh Hiremath Ravishankar" > > Sent: Tuesday, September 22, 2015 9:03:44 PM > > Subject: Re: Spurious failures > > > > Ah! Sorry. I didn't read that line. :) > > > > Just figured even ./tests/geo-rep/georep-basic-dr-rsync.t is added to bad > > tests list. > > > > So it's just /tests/geo-rep/georep-basic-dr-tarssh.t for now. > > > > Thanks Atin! > > > > -Krutika > > > > - Original Message - > > > > > From: "Atin Mukherjee" > > > To: "Krutika Dhananjay" > > > Cc: "Gluster Devel" , "Gaurav Garg" > > > , "Aravinda" , "Kotresh Hiremath > > > Ravishankar" > > > Sent: Tuesday, September 22, 2015 8:51:22 PM > > > Subject: Re: Spurious failures > > > > > ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t (Wstat: > > > 0 Tests: 8 Failed: 2) > > > Failed tests: 6, 8 > > > Files=1, Tests=8, 48 wallclock secs ( 0.01 usr 0.01 sys + 0.88 cusr > > > 0.56 csys = 1.46 CPU) > > > Result: FAIL > > > ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t: bad > > > status 1 > > > *Ignoring failure from known-bad test > > > ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t* > > > [11:24:16] ./tests/bugs/glusterd/bug-1242543-replace-brick.t .. ok > > > 17587 ms > > > [11:24:16] > > > All tests successful > > > > > On 09/22/2015 08:46 PM, Krutika Dhananjay wrote: > > > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/14421/consoleFull > > > > > > > > Ctrl + f 'not ok'. > > > > > > > > -Krutika > > > > > > > > > > > > > > > > *From: *"Atin Mukherjee" > > > > *To: *"Krutika Dhananjay" , "Gluster Devel" > > > > > > > > *Cc: *"Gaurav Garg" , "Aravinda" > > > > , "Kotresh Hiremath Ravishankar" > > > > > > > > *Sent: *Tuesday, September 22, 2015 8:39:56 PM > > > > *Subject: *Re: Spurious failures > > > > > > > > Krutika, > > > > > > > > ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t is > > > > already a part of bad_tests () in both mainline and 3.7. Could you > > > > provide me the link where this test has failed explicitly and that has > > > > caused the regression to fail?
> > > > > > > > ~Atin > > > > > > > > > > > > On 09/22/2015 07:27 PM, Krutika Dhananjay wrote: > > > > > Hi, > > > > > > > > > > The following tests seem to be failing consistently on the build > > > > > machines in Linux: > > > > > > > > > > ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t .. > > > > > > > > > > ./tests/geo-rep/georep-basic-dr-rsync.t .. > > > > > > > > > > ./tests/geo-rep/georep-basic-dr-tarssh.t .. > > > > > > > > > > I have added these tests into the tracker etherpad. > > > > > > > > > > Meanwhile could someone from geo-rep and glusterd team take a look or > > > > > perhaps move them to bad tests list? > > > > > > > > > > > > > > > Here is one place where the three tests failed: > > > > > > > > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/14421/consoleFull > > > >
Re: [Gluster-devel] Spurious failures
Krutika, ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t is already a part of bad_tests () in both mainline and 3.7. Could you provide me the link where this test has failed explicitly and that has caused the regression to fail? ~Atin On 09/22/2015 07:27 PM, Krutika Dhananjay wrote: > Hi, > > The following tests seem to be failing consistently on the build > machines in Linux: > > ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t .. > > ./tests/geo-rep/georep-basic-dr-rsync.t .. > > ./tests/geo-rep/georep-basic-dr-tarssh.t .. > > I have added these tests into the tracker etherpad. > > Meanwhile could someone from geo-rep and glusterd team take a look or > perhaps move them to bad tests list? > > > Here is one place where the three tests failed: > https://build.gluster.org/job/rackspace-regression-2GB-triggered/14421/consoleFull > > -Krutika > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
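The "Ignoring failure from known-bad test" lines visible in the regression output come from the test runner consulting its bad-tests list before deciding whether a failure is fatal. The sketch below is hypothetical (the real run-tests.sh uses its own function and list names); it only illustrates why a test that is in bad_tests() can fail without failing the whole run.

---
# Hypothetical sketch of known-bad-test handling in a regression wrapper.
BAD_TESTS="./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t
./tests/geo-rep/georep-basic-dr-tarssh.t"

is_bad_test ()
{
    echo "$BAD_TESTS" | grep -qx "$1";
}

run_one ()
{
    local t=$1;
    if ! prove "$t"; then
        if is_bad_test "$t"; then
            echo "Ignoring failure from known-bad test $t";
        else
            return 1;    # a real failure fails the whole regression run
        fi
    fi
}
---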
Re: [Gluster-devel] Spurious failures
https://build.gluster.org/job/rackspace-regression-2GB-triggered/14421/consoleFull Ctrl + f 'not ok'. -Krutika - Original Message - > From: "Atin Mukherjee"> To: "Krutika Dhananjay" , "Gluster Devel" > > Cc: "Gaurav Garg" , "Aravinda" , > "Kotresh Hiremath Ravishankar" > Sent: Tuesday, September 22, 2015 8:39:56 PM > Subject: Re: Spurious failures > Krutika, > ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t is > already a part of bad_tests () in both mainline and 3.7. Could you > provide me the link where this test has failed explicitly and that has > caused the regression to fail? > ~Atin > On 09/22/2015 07:27 PM, Krutika Dhananjay wrote: > > Hi, > > > > The following tests seem to be failing consistently on the build > > machines in Linux: > > > > ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t .. > > > > ./tests/geo-rep/georep-basic-dr-rsync.t .. > > > > ./tests/geo-rep/georep-basic-dr-tarssh.t .. > > > > I have added these tests into the tracker etherpad. > > > > Meanwhile could someone from geo-rep and glusterd team take a look or > > perhaps move them to bad tests list? > > > > > > Here is one place where the three tests failed: > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/14421/consoleFull > > > > -Krutika > > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures
./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t (Wstat: 0 Tests: 8 Failed: 2) Failed tests: 6, 8 Files=1, Tests=8, 48 wallclock secs ( 0.01 usr 0.01 sys + 0.88 cusr 0.56 csys = 1.46 CPU) Result: FAIL ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t: bad status 1 *Ignoring failure from known-bad test ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t* [11:24:16] ./tests/bugs/glusterd/bug-1242543-replace-brick.t .. ok 17587 ms [11:24:16] All tests successful On 09/22/2015 08:46 PM, Krutika Dhananjay wrote: > https://build.gluster.org/job/rackspace-regression-2GB-triggered/14421/consoleFull > > Ctrl + f 'not ok'. > > -Krutika > > > > *From: *"Atin Mukherjee"> *To: *"Krutika Dhananjay" , "Gluster Devel" > > *Cc: *"Gaurav Garg" , "Aravinda" > , "Kotresh Hiremath Ravishankar" > > *Sent: *Tuesday, September 22, 2015 8:39:56 PM > *Subject: *Re: Spurious failures > > Krutika, > > ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t is > already a part of bad_tests () in both mainline and 3.7. Could you > provide me the link where this test has failed explicitly and that has > caused the regression to fail? > > ~Atin > > > On 09/22/2015 07:27 PM, Krutika Dhananjay wrote: > > Hi, > > > > The following tests seem to be failing consistently on the build > > machines in Linux: > > > > ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t .. > > > > ./tests/geo-rep/georep-basic-dr-rsync.t .. > > > > ./tests/geo-rep/georep-basic-dr-tarssh.t .. > > > > I have added these tests into the tracker etherpad. > > > > Meanwhile could someone from geo-rep and glusterd team take a look or > > perhaps move them to bad tests list? > > > > > > Here is one place where the three tests failed: > > > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/14421/consoleFull > > > > -Krutika > > > > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures
Ah! Sorry. I didn't read that line. :) Just figured even ./tests/geo-rep/georep-basic-dr-rsync.t is added to bad tests list. So it's just /tests/geo-rep/georep-basic-dr-tarssh.t for now. Thanks Atin! -Krutika - Original Message - > From: "Atin Mukherjee"> To: "Krutika Dhananjay" > Cc: "Gluster Devel" , "Gaurav Garg" > , "Aravinda" , "Kotresh Hiremath > Ravishankar" > Sent: Tuesday, September 22, 2015 8:51:22 PM > Subject: Re: Spurious failures > ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t (Wstat: > 0 Tests: 8 Failed: 2) > Failed tests: 6, 8 > Files=1, Tests=8, 48 wallclock secs ( 0.01 usr 0.01 sys + 0.88 cusr > 0.56 csys = 1.46 CPU) > Result: FAIL > ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t: bad > status 1 > *Ignoring failure from known-bad test > ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t* > [11:24:16] ./tests/bugs/glusterd/bug-1242543-replace-brick.t .. ok > 17587 ms > [11:24:16] > All tests successful > On 09/22/2015 08:46 PM, Krutika Dhananjay wrote: > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/14421/consoleFull > > > > Ctrl + f 'not ok'. > > > > -Krutika > > > > > > > > *From: *"Atin Mukherjee" > > *To: *"Krutika Dhananjay" , "Gluster Devel" > > > > *Cc: *"Gaurav Garg" , "Aravinda" > > , "Kotresh Hiremath Ravishankar" > > > > *Sent: *Tuesday, September 22, 2015 8:39:56 PM > > *Subject: *Re: Spurious failures > > > > Krutika, > > > > ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t is > > already a part of bad_tests () in both mainline and 3.7. Could you > > provide me the link where this test has failed explicitly and that has > > caused the regression to fail? > > > > ~Atin > > > > > > On 09/22/2015 07:27 PM, Krutika Dhananjay wrote: > > > Hi, > > > > > > The following tests seem to be failing consistently on the build > > > machines in Linux: > > > > > > ./tests/bugs/glusterd/bug-1238706-daemons-stop-on-peer-cleanup.t .. > > > > > > ./tests/geo-rep/georep-basic-dr-rsync.t .. > > > > > > ./tests/geo-rep/georep-basic-dr-tarssh.t .. > > > > > > I have added these tests into the tracker etherpad. > > > > > > Meanwhile could someone from geo-rep and glusterd team take a look or > > > perhaps move them to bad tests list? > > > > > > > > > Here is one place where the three tests failed: > > > > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/14421/consoleFull > > > > > > -Krutika > > > > > > > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures in tests/basic/afr/arbiter.t
On 07/20/2015 12:45 PM, Niels de Vos wrote:

On Mon, Jul 20, 2015 at 09:25:15AM +0530, Ravishankar N wrote:
I'll take a look.

Thanks. I'm actually not sure if this is an arbiter.t issue, maybe I blamed it too early?
It's the first test that gets executed, and no others are tried after it failed.

Niels

Regards,
Ravi

On 07/20/2015 03:07 AM, Niels de Vos wrote:
I have seen several occurrences of failures in arbiter.t now. This is one of the errors:

https://build.gluster.org/job/rackspace-regression-2GB-triggered/12626/consoleFull

[21:20:20] ./tests/basic/afr/arbiter.t ..
not ok 7 Got N instead of Y
not ok 15
not ok 16 Got instead of 1
not ok 23 Got instead of 1
not ok 25 Got 0 when not expecting it
not ok 26
not ok 34 Got 0 instead of 1
not ok 35 Got 0 instead of 1
not ok 41 Got instead of 1
not ok 47 Got N instead of Y
Failed 10/47 subtests
[21:20:20]

Test Summary Report
---
./tests/basic/afr/arbiter.t (Wstat: 0 Tests: 47 Failed: 10)
Failed tests: 7, 15-16, 23, 25-26, 34-35, 41, 47

So the test #7 that failed is the check at line 16: EXPECT_WITHIN $UMOUNT_TIMEOUT Y force_umount $M0

Looking at mnt-glusterfs-0.log, I see that the unmount has already happened before the actual command was run, at least from the time stamp logged by the G_LOG() function.

[2015-07-19 21:16:21.784293] I [fuse-bridge.c:4946:fuse_thread_proc] 0-fuse: unmounting /mnt/glusterfs/0
[2015-07-19 21:16:21.784542] W [glusterfsd.c:1214:cleanup_and_exit] (--/lib64/libpthread.so.0(+0x79d1) [0x7fc3f41c49d1] --glusterfs(glusterfs_sigwaiter+0xe4) [0x409734] --glusterfs(cleanup_and_exit+0x87) [0x407ba7] ) 0-: received signum (15), shutting down
[2015-07-19 21:16:21.784571] I [fuse-bridge.c:5645:fini] 0-fuse: Unmounting '/mnt/glusterfs/0'.
[2015-07-19 21:16:21.785817332]:++ G_LOG:./tests/basic/afr/arbiter.t: TEST: 15 ! stat /mnt/glusterfs/0/.meta/graphs/active/patchy-replicate-0/options/arbiter-count ++
[2015-07-19 21:16:21.796574975]:++ G_LOG:./tests/basic/afr/arbiter.t: TEST: 16 Y force_umount /mnt/glusterfs/0 ++

I have no clue as to why that could have happened, because appending to the gluster log files using G_LOG() is done *before* the test is executed. In all my trial runs, the G_LOG message gets logged first, followed by the logs relevant to the actual command being run.

FWIW, http://review.gluster.org/#/c/4/ made the following change to arbiter.t amongst other test cases:

-TEST umount $M0
+EXPECT_WITHIN $UMOUNT_TIMEOUT Y force_umount $M0

But I'm not sure an umount -f has any impact for fuse mounts.

Regards,
Ravi

Files=1, Tests=47, 243 wallclock secs ( 0.04 usr 0.00 sys + 15.22 cusr 3.48 csys = 18.74 CPU)
Result: FAIL

Who could have a look at this?

Thanks,
Niels

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
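For context, the new form retries the unmount until it reports success or $UMOUNT_TIMEOUT expires, whereas the old TEST umount $M0 tried exactly once. A force_umount helper of the shape EXPECT_WITHIN expects could look roughly like the sketch below; the actual helper lives in the test framework's include files and may differ, and the Y/N convention simply matches the expectation quoted above.

---
# Sketch of a force_umount-style helper that EXPECT_WITHIN can poll.
force_umount ()
{
    # umount -f makes little difference for FUSE, as noted above; it is the
    # retry loop in EXPECT_WITHIN that absorbs a transient EBUSY.
    if umount -f "$1" 2>/dev/null || ! mountpoint -q "$1"; then
        echo Y;
    else
        echo N;
    fi
}

# EXPECT_WITHIN $UMOUNT_TIMEOUT Y force_umount $M0
---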
Re: [Gluster-devel] Spurious failures in tests/basic/afr/arbiter.t
On Mon, Jul 20, 2015 at 09:25:15AM +0530, Ravishankar N wrote:
I'll take a look.

Thanks. I'm actually not sure if this is an arbiter.t issue, maybe I blamed it too early?
It's the first test that gets executed, and no others are tried after it failed.

Niels

Regards,
Ravi

On 07/20/2015 03:07 AM, Niels de Vos wrote:
I have seen several occurrences of failures in arbiter.t now. This is one of the errors:

https://build.gluster.org/job/rackspace-regression-2GB-triggered/12626/consoleFull

[21:20:20] ./tests/basic/afr/arbiter.t ..
not ok 7 Got N instead of Y
not ok 15
not ok 16 Got instead of 1
not ok 23 Got instead of 1
not ok 25 Got 0 when not expecting it
not ok 26
not ok 34 Got 0 instead of 1
not ok 35 Got 0 instead of 1
not ok 41 Got instead of 1
not ok 47 Got N instead of Y
Failed 10/47 subtests
[21:20:20]

Test Summary Report
---
./tests/basic/afr/arbiter.t (Wstat: 0 Tests: 47 Failed: 10)
Failed tests: 7, 15-16, 23, 25-26, 34-35, 41, 47
Files=1, Tests=47, 243 wallclock secs ( 0.04 usr 0.00 sys + 15.22 cusr 3.48 csys = 18.74 CPU)
Result: FAIL

Who could have a look at this?

Thanks,
Niels

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures in tests/basic/afr/arbiter.t
I'll take a look. Regards, Ravi On 07/20/2015 03:07 AM, Niels de Vos wrote: I have seen several occurences of failures in arbiter.t now. This is one of the errors: https://build.gluster.org/job/rackspace-regression-2GB-triggered/12626/consoleFull [21:20:20] ./tests/basic/afr/arbiter.t .. not ok 7 Got N instead of Y not ok 15 not ok 16 Got instead of 1 not ok 23 Got instead of 1 not ok 25 Got 0 when not expecting it not ok 26 not ok 34 Got 0 instead of 1 not ok 35 Got 0 instead of 1 not ok 41 Got instead of 1 not ok 47 Got N instead of Y Failed 10/47 subtests [21:20:20] Test Summary Report --- ./tests/basic/afr/arbiter.t (Wstat: 0 Tests: 47 Failed: 10) Failed tests: 7, 15-16, 23, 25-26, 34-35, 41, 47 Files=1, Tests=47, 243 wallclock secs ( 0.04 usr 0.00 sys + 15.22 cusr 3.48 csys = 18.74 CPU) Result: FAIL Who could have look at this? Thanks, Niels ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures again
Sad but true. More tests are failing than passing, and the failures are often *clearly* unrelated to the patches they're supposedly testing. Let's revive the Etherpad, and use it to track progress as we clean this up. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] spurious failures in tests/bugs/snapshot/bug-1109889.t
This doesn't seem to have been fixed completely. My change [1] failed (again !) on this test [2], even after rebasing onto the fix [3]. [1]: https://review.gluster.org/11559 [2]: http://build.gluster.org/job/rackspace-regression-2GB-triggered/12152/consoleFull [3]: https://review.gluster.org/11579 On Thu, Jul 9, 2015 at 4:20 PM, Pranith Kumar Karampuri pkara...@redhat.com wrote: Sorry, seems like this is already fixed, I just need to rebase. Pranith On 07/09/2015 03:56 PM, Pranith Kumar Karampuri wrote: hi, Could you please look into http://build.gluster.org/job/rackspace-regression-2GB-triggered/12150/consoleFull Pranith ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] spurious failures in tests/bugs/snapshot/bug-1109889.t
Sorry, seems like this is already fixed, I just need to rebase. Pranith On 07/09/2015 03:56 PM, Pranith Kumar Karampuri wrote: hi, Could you please look into http://build.gluster.org/job/rackspace-regression-2GB-triggered/12150/consoleFull Pranith ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures again
On 07/08/2015 03:57 PM, Anuradha Talur wrote: - Original Message - From: Kaushal M kshlms...@gmail.com To: Gluster Devel gluster-devel@gluster.org Sent: Wednesday, July 8, 2015 3:42:12 PM Subject: [Gluster-devel] Spurious failures again I've been hitting spurious failures in Linux regression runs for my change [1]. The following tests failed, ./tests/basic/afr/replace-brick-self-heal.t [2] ./tests/bugs/replicate/bug-1238508-self-heal.t [3] ./tests/bugs/quota/afr-quota-xattr-mdata-heal.t [4] ./tests/bugs/quota/bug-1235182.t [5] ./tests/bugs/replicate/bug-977797.t [6] Ran ./tests/bugs/replicate/bug-977797.t multiple times in a loop, no failure observed. The logs in [6] seem inaccessible as well. Can AFR and quota owners look into this? Thanks. Kaushal [1] https://review.gluster.org/11559 [2] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12023/consoleFull Will look into this. [3] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12029/consoleFull For 3rd one the patch needs to be rebased. Ravi sent a fix http://review.gluster.org/#/c/11556/ . [4] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12044/consoleFull [5] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12060/consoleFull [6] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12071/consoleFull ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures again
On Wednesday 08 July 2015 03:42 PM, Kaushal M wrote: I've been hitting spurious failures in Linux regression runs for my change [1]. The following tests failed, ./tests/basic/afr/replace-brick-self-heal.t [2] ./tests/bugs/replicate/bug-1238508-self-heal.t [3] ./tests/bugs/quota/afr-quota-xattr-mdata-heal.t [4] I will look into this issue ./tests/bugs/quota/bug-1235182.t [5] I have submitted two patches to fix failures from 'bug-1235182.t' http://review.gluster.org/#/c/11561/ http://review.gluster.org/#/c/11510/ ./tests/bugs/replicate/bug-977797.t [6] Can AFR and quota owners look into this? Thanks. Kaushal [1] https://review.gluster.org/11559 [2] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12023/consoleFull [3] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12029/consoleFull [4] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12044/consoleFull [5] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12060/consoleFull [6] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12071/consoleFull ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures again
- Original Message - From: Kaushal M kshlms...@gmail.com To: Gluster Devel gluster-devel@gluster.org Sent: Wednesday, July 8, 2015 3:42:12 PM Subject: [Gluster-devel] Spurious failures again I've been hitting spurious failures in Linux regression runs for my change [1]. The following tests failed, ./tests/basic/afr/replace-brick-self-heal.t [2] ./tests/bugs/replicate/bug-1238508-self-heal.t [3] ./tests/bugs/quota/afr-quota-xattr-mdata-heal.t [4] ./tests/bugs/quota/bug-1235182.t [5] ./tests/bugs/replicate/bug-977797.t [6] Can AFR and quota owners look into this? Thanks. Kaushal [1] https://review.gluster.org/11559 [2] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12023/consoleFull Will look into this. [3] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12029/consoleFull For 3rd one the patch needs to be rebased. Ravi sent a fix http://review.gluster.org/#/c/11556/ . [4] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12044/consoleFull [5] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12060/consoleFull [6] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12071/consoleFull ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel -- Thanks, Anuradha. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures again
On Wednesday 08 July 2015 03:53 PM, Vijaikumar M wrote: On Wednesday 08 July 2015 03:42 PM, Kaushal M wrote: I've been hitting spurious failures in Linux regression runs for my change [1]. The following tests failed, ./tests/basic/afr/replace-brick-self-heal.t [2] ./tests/bugs/replicate/bug-1238508-self-heal.t [3] ./tests/bugs/quota/afr-quota-xattr-mdata-heal.t [4] I will look into this issue Patch submitted: http://review.gluster.org/#/c/11583/ ./tests/bugs/quota/bug-1235182.t [5] I have submitted two patches to fix failures from 'bug-1235182.t' http://review.gluster.org/#/c/11561/ http://review.gluster.org/#/c/11510/ ./tests/bugs/replicate/bug-977797.t [6] Can AFR and quota owners look into this? Thanks. Kaushal [1] https://review.gluster.org/11559 [2] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12023/consoleFull [3] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12029/consoleFull [4] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12044/consoleFull [5] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12060/consoleFull [6] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12071/consoleFull ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures again
I think our linux regression is again unstable. I am seeing at least 10 such test cases ( if not more) which have failed. I think we should again start maintaining an etherpad page (probably the same earlier one) and keep track of them otherwise it will be difficult to track what is fixed and what's not if we have to go through mails. Thoughts? -Atin Sent from one plus one On Jul 8, 2015 8:45 PM, Vijaikumar M vmall...@redhat.com wrote: On Wednesday 08 July 2015 03:53 PM, Vijaikumar M wrote: On Wednesday 08 July 2015 03:42 PM, Kaushal M wrote: I've been hitting spurious failures in Linux regression runs for my change [1]. The following tests failed, ./tests/basic/afr/replace-brick-self-heal.t [2] ./tests/bugs/replicate/bug-1238508-self-heal.t [3] ./tests/bugs/quota/afr-quota-xattr-mdata-heal.t [4] I will look into this issue Patch submitted: http://review.gluster.org/#/c/11583/ ./tests/bugs/quota/bug-1235182.t [5] I have submitted two patches to fix failures from 'bug-1235182.t' http://review.gluster.org/#/c/11561/ http://review.gluster.org/#/c/11510/ ./tests/bugs/replicate/bug-977797.t [6] Can AFR and quota owners look into this? Thanks. Kaushal [1] https://review.gluster.org/11559 [2] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12023/consoleFull [3] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12029/consoleFull [4] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12044/consoleFull [5] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12060/consoleFull [6] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12071/consoleFull ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures again
On 07/08/2015 11:16 PM, Atin Mukherjee wrote: I think our linux regression is again unstable. I am seeing at least 10 such test cases ( if not more) which have failed. I think we should again start maintaining an etherpad page (probably the same earlier one) and keep track of them otherwise it will be difficult to track what is fixed and what's not if we have to go through mails. Thoughts? Makes sense. The link is here https://public.pad.fsfe.org/p/gluster-spurious-failures Perhaps we should remove the entries and start fresh. -Ravi -Atin Sent from one plus one On Jul 8, 2015 8:45 PM, Vijaikumar M vmall...@redhat.com mailto:vmall...@redhat.com wrote: On Wednesday 08 July 2015 03:53 PM, Vijaikumar M wrote: On Wednesday 08 July 2015 03:42 PM, Kaushal M wrote: I've been hitting spurious failures in Linux regression runs for my change [1]. The following tests failed, ./tests/basic/afr/replace-brick-self-heal.t [2] ./tests/bugs/replicate/bug-1238508-self-heal.t [3] ./tests/bugs/quota/afr-quota-xattr-mdata-heal.t [4] I will look into this issue Patch submitted: http://review.gluster.org/#/c/11583/ ./tests/bugs/quota/bug-1235182.t [5] I have submitted two patches to fix failures from 'bug-1235182.t' http://review.gluster.org/#/c/11561/ http://review.gluster.org/#/c/11510/ ./tests/bugs/replicate/bug-977797.t [6] Can AFR and quota owners look into this? Thanks. Kaushal [1] https://review.gluster.org/11559 [2] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12023/consoleFull [3] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12029/consoleFull [4] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12044/consoleFull [5] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12060/consoleFull [6] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12071/consoleFull ___ Gluster-devel mailing list Gluster-devel@gluster.org mailto:Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] spurious failures tests/bugs/tier/bug-1205545-CTR-and-trash-integration.t
> This approach could still surprise the storage-admin when glusterfs(d) processes bind to ports in the range where brick ports are being assigned. We should make this predictable by reserving brick ports by setting net.ipv4.ip_local_reserved_ports. Initially reserve 50 ports starting at 49152. Subsequently, we could reserve ports on demand, say 50 more ports, when we exhaust the previously reserved range.
>
> net.ipv4.ip_local_reserved_ports doesn't interfere with explicit port allocation behaviour, i.e. if the socket uses a port other than zero. With this option we don't have to manage port assignment at a process level. Thoughts?
>
> If the reallocation can be done on demand, I do think this is a better approach to tackle this problem.

We could fix the predictability aspect in a different patch (see the sketch below). This patch, where we assign ports starting from 65535 in descending order, can be reviewed independently.

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
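To make the proposal concrete: reserving the brick range with net.ipv4.ip_local_reserved_ports, and widening it on demand, could look roughly like the sketch below. The range sizes are the assumed values from the proposal; the kernel keeps reserved ports out of automatic (ephemeral) port selection while still allowing an explicit bind() to them, which is the property relied on above.

---
# Reserve an initial block of 50 brick ports starting at 49152.
sysctl -w net.ipv4.ip_local_reserved_ports=49152-49201

# Later, when glusterd exhausts that block, extend the reservation on demand
# (the parameter accepts a comma-separated list of ports and ranges).
sysctl -w net.ipv4.ip_local_reserved_ports=49152-49251
---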
Re: [Gluster-devel] spurious failures tests/bugs/tier/bug-1205545-CTR-and-trash-integration.t
> It seems this is exactly what's happening. I have a question; I get the following data from netstat and grep:
>
> tcp    0    0 f6be17c0fbf5:1023    f6be17c0fbf5:24007   ESTABLISHED 31516/glusterfsd
> tcp    0    0 f6be17c0fbf5:49152   f6be17c0fbf5:490     ESTABLISHED 31516/glusterfsd
> unix   3    [ ]  STREAM  CONNECTED  988353  31516/glusterfsd /var/run/gluster/4878d6e905c5f6032140a00cc584df8a.socket
>
> Here 31516 is the brick pid. Looking at the data, line 2 is very clear: it shows the connection between the brick and a glusterfs client. The unix socket on line 3 is also clear: it is the unix socket connection that glusterd and the brick process use for communication. I am not able to understand line 1; which part of the brick process established a tcp connection with glusterd using port 1023?

This is the rpc connection from any glusterfs(d) process to glusterd to fetch the volfile on receiving notification from glusterd.

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] spurious failures tests/bugs/tier/bug-1205545-CTR-and-trash-integration.t
- Original Message - This is caused because when bind-insecure is turned on (which is the default now), it may happen that brick is not able to bind to port assigned by Glusterd for example 49192-49195... It seems to occur because the rpc_clnt connections are binding to ports in the same range. so brick fails to bind to a port which is already used by someone else. This bug already exist before http://review.gluster.org/#/c/11039/ when use rdma, i.e. even previously rdma binds to port = 1024 if it cannot find a free port 1024, even when bind insecure was turned off (ref to commit '0e3fd04e'). Since we don't have tests related to rdma we did not discover this issue previously. http://review.gluster.org/#/c/11039/ discovers the bug we encountered, however now the bug can be fixed by http://review.gluster.org/#/c/11512/ by making rpc_clnt to get port numbers from 65535 in a descending order, as a result port clash is minimized, also it fixes issues in rdma too This approach could still surprise the storage-admin when glusterfs(d) processes bind to ports in the range where brick ports are being assigned. We should make this predictable by reserving brick ports setting net.ipv4.ip_local_reserved_ports. Initially reserve 50 ports starting at 49152. Subsequently, we could reserve ports on demand, say 50 more ports, when we exhaust previously reserved range. net.ipv4.ip_local_reserved_ports doesn't interfere with explicit port allocation behaviour. i.e if the socket uses a port other than zero. With this option we don't have to manage ports assignment at a process level. Thoughts? ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] spurious failures tests/bugs/tier/bug-1205545-CTR-and-trash-integration.t
On 07/03/2015 11:58 AM, Krishnan Parthasarathi wrote: - Original Message - This is caused because when bind-insecure is turned on (which is the default now), it may happen that brick is not able to bind to port assigned by Glusterd for example 49192-49195... It seems to occur because the rpc_clnt connections are binding to ports in the same range. so brick fails to bind to a port which is already used by someone else. This bug already exist before http://review.gluster.org/#/c/11039/ when use rdma, i.e. even previously rdma binds to port = 1024 if it cannot find a free port 1024, even when bind insecure was turned off (ref to commit '0e3fd04e'). Since we don't have tests related to rdma we did not discover this issue previously. http://review.gluster.org/#/c/11039/ discovers the bug we encountered, however now the bug can be fixed by http://review.gluster.org/#/c/11512/ by making rpc_clnt to get port numbers from 65535 in a descending order, as a result port clash is minimized, also it fixes issues in rdma too This approach could still surprise the storage-admin when glusterfs(d) processes bind to ports in the range where brick ports are being assigned. We should make this predictable by reserving brick ports setting net.ipv4.ip_local_reserved_ports. Initially reserve 50 ports starting at 49152. Subsequently, we could reserve ports on demand, say 50 more ports, when we exhaust previously reserved range. net.ipv4.ip_local_reserved_ports doesn't interfere with explicit port allocation behaviour. i.e if the socket uses a port other than zero. With this option we don't have to manage ports assignment at a process level. Thoughts? If the reallocation can be done on demand, I do think this is a better approach to tackle this problem. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel -- ~Atin ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] spurious failures tests/bugs/tier/bug-1205545-CTR-and-trash-integration.t
This is caused because when bind-insecure is turned on (which is the default now), it may happen that a brick is not able to bind to the port assigned to it by glusterd, for example 49192-49195... It seems to occur because the rpc_clnt connections are binding to ports in the same range, so the brick fails to bind to a port that is already in use by someone else.

This bug already existed before http://review.gluster.org/#/c/11039/ when using rdma, i.e. even previously rdma would bind to a port >= 1024 if it could not find a free port < 1024, even when bind-insecure was turned off (ref commit '0e3fd04e'). Since we don't have tests related to rdma we did not discover this issue earlier. http://review.gluster.org/#/c/11039/ exposed the bug we encountered; the bug can now be fixed by http://review.gluster.org/#/c/11512/, which makes rpc_clnt request port numbers starting from 65535 in descending order. As a result port clashes are minimized, and it fixes the rdma issue as well.

Thanks to Raghavendra Talur for help in discovering the real cause.

Regards,
Prasanna Kalever

- Original Message -
From: Raghavendra Talur raghavendra.ta...@gmail.com
To: Krishnan Parthasarathi kpart...@redhat.com
Cc: Gluster Devel gluster-devel@gluster.org
Sent: Thursday, July 2, 2015 6:45:17 PM
Subject: Re: [Gluster-devel] spurious failures tests/bugs/tier/bug-1205545-CTR-and-trash-integration.t

On Thu, Jul 2, 2015 at 4:40 PM, Raghavendra Talur raghavendra.ta...@gmail.com wrote:
On Thu, Jul 2, 2015 at 10:52 AM, Krishnan Parthasarathi kpart...@redhat.com wrote:

A port assigned by Glusterd for a brick is found to be in use already by the brick. Any changes in Glusterd recently which can cause this? Or is it a test infra problem?

This issue is likely to be caused by http://review.gluster.org/11039. This patch changes the port allocation that happens for rpc_clnt based connections. Previously, ports allocated were < 1024. With this change, these connections (typically mount processes, gluster-nfs server processes, etc.) could end up using ports that bricks are being assigned to. IIUC, the intention of the patch was to make server processes lenient to inbound messages from ports >= 1024. If we don't require the use of ports >= 1024, we could leave the port allocation for rpc_clnt connections as before. Alternately, we could reserve the range of ports starting from 49152 for bricks by setting net.ipv4.ip_local_reserved_ports using sysctl(8). This is specific to Linux; I'm not aware of how this could be done on NetBSD, for instance.

It seems this is exactly what's happening. I have a question; I get the following data from netstat and grep:

tcp 0 0 f6be17c0fbf5:1023 f6be17c0fbf5:24007 ESTABLISHED 31516/glusterfsd
tcp 0 0 f6be17c0fbf5:49152 f6be17c0fbf5:490 ESTABLISHED 31516/glusterfsd
unix 3 [ ] STREAM CONNECTED 988353 31516/glusterfsd /var/run/gluster/4878d6e905c5f6032140a00cc584df8a.socket

Here 31516 is the brick pid. Looking at the data, line 2 is very clear: it shows the connection between the brick and a glusterfs client. The unix socket on line 3 is also clear: it is the unix socket connection that glusterd and the brick process use for communication. I am not able to understand line 1; which part of the brick process established a tcp connection with glusterd using port 1023? Note: this data is from a build which does not have the above-mentioned patch.

The patch which exposed this bug is being reverted till the underlying bug is also fixed. You can monitor the revert patches here:
master: http://review.gluster.org/11507
3.7 branch: http://review.gluster.org/11508

Please rebase your patches after the above patches are merged to ensure that your patches pass regression.

--
Raghavendra Talur
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
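For anyone chasing a similar clash, a quick way to list every socket a given brick process owns, including rpc_clnt connections that may have landed in the 49152+ brick range; the pid-file path below is illustrative and should be adjusted for the volume and brick in question:

# pid-file path is illustrative; glusterd keeps brick pid files under /var/lib/glusterd
BRICK_PID=$(cat /var/lib/glusterd/vols/patchy/run/*.pid)
# needs root to see the pid/program column
netstat -ntp 2>/dev/null | grep "${BRICK_PID}/"
# or, with iproute2
ss -ntp | grep "pid=${BRICK_PID},"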
Re: [Gluster-devel] spurious failures tests/bugs/tier/bug-1205545-CTR-and-trash-integration.t
Thanks Prasanna for the patches :)
-Atin
Sent from one plus one

On Jul 2, 2015 9:19 PM, Prasanna Kalever pkale...@redhat.com wrote:
> [...]
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] spurious failures tests/bugs/tier/bug-1205545-CTR-and-trash-integration.t
Yep will have a look - Original Message - From: Pranith Kumar Karampuri pkara...@redhat.com To: Joseph Fernandes josfe...@redhat.com, Gluster Devel gluster-devel@gluster.org Sent: Wednesday, July 1, 2015 1:44:44 PM Subject: spurious failures tests/bugs/tier/bug-1205545-CTR-and-trash-integration.t hi, http://build.gluster.org/job/rackspace-regression-2GB-triggered/11757/consoleFull has the logs. Could you please look into it. Pranith ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] spurious failures tests/bugs/tier/bug-1205545-CTR-and-trash-integration.t
:glusterfs_graph_init] 0-patchy-server: initializing translator failed
[2015-07-01 07:33:25.069808] E [MSGID: 101176] [graph.c:669:glusterfs_graph_activate] 0-graph: init failed
[2015-07-01 07:33:25.070183] W [glusterfsd.c:1214:cleanup_and_exit] (-- 0-: received signum (0), shutting down

Looks like it is assigned a port which is already in use. Saw the same error in another test failing for another patch set. Here is the link: http://build.gluster.org/job/rackspace-regression-2GB-triggered/11740/consoleFull

A port assigned by Glusterd for a brick is found to be in use already by the brick. Any changes in Glusterd recently which can cause this? Or is it a test infra problem?

Prasanna is looking into this for now. The status of the volume in glusterd is not "Started"; as a result the attach-tier command fails, i.e. the tiering rebalancer cannot run:

[2015-07-01 07:33:25.275092] E [MSGID: 106301] [glusterd-op-sm.c:4086:glusterd_op_ac_send_stage_op] 0-management: Staging of operation 'Volume Rebalance' failed on localhost : Volume patchy needs to be started to perform rebalance

But the volume is running in a crippled mode, so the mount works fine, i.e. TEST $GFS --volfile-id=/$V0 --volfile-server=$H0 $M0; works fine. Tests 9-12 failed because the attach failed.

Regards,
Joe

- Original Message -
From: Joseph Fernandes josfe...@redhat.com
To: Pranith Kumar Karampuri pkara...@redhat.com
Cc: Gluster Devel gluster-devel@gluster.org
Sent: Wednesday, July 1, 2015 1:59:41 PM
Subject: Re: [Gluster-devel] spurious failures tests/bugs/tier/bug-1205545-CTR-and-trash-integration.t

Yep will have a look

- Original Message -
From: Pranith Kumar Karampuri pkara...@redhat.com
To: Joseph Fernandes josfe...@redhat.com, Gluster Devel gluster-devel@gluster.org
Sent: Wednesday, July 1, 2015 1:44:44 PM
Subject: spurious failures tests/bugs/tier/bug-1205545-CTR-and-trash-integration.t

hi,
http://build.gluster.org/job/rackspace-regression-2GB-triggered/11757/consoleFull has the logs. Could you please look into it.

Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
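A guard along these lines in the tier test would make the "volume must be Started" dependency explicit; this is only a sketch, assuming the volinfo_field helper from tests/volume.rc and the usual $CLI/$V0/$PROCESS_UP_TIMEOUT variables of the regression framework:

# don't attempt attach-tier (and the tier rebalance it triggers) until
# glusterd actually reports the volume as Started
EXPECT_WITHIN $PROCESS_UP_TIMEOUT "Started" volinfo_field $V0 'Status'

Of course this only papers over the symptom in the test; the underlying port clash still needs the fix discussed above.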
Re: [Gluster-devel] Spurious failures? (master)
On 06/05/2015 02:12 AM, Shyam wrote: Just checking, This review request: http://review.gluster.org/#/c/11073/ Failed in the following tests: 1) Linux [20:20:16] ./tests/bugs/replicate/bug-880898.t .. not ok 4 This seems to be same RC as in self-heald.t where heal info is not failing sometimes when the brick is down. Failed 1/4 subtests [20:20:16] http://build.gluster.org/job/rackspace-regression-2GB-triggered/10088/consoleFull 2) NetBSD (Du seems to have faced the same) [11:56:45] ./tests/basic/afr/sparse-file-self-heal.t .. not ok 52 Got instead of 1 not ok 53 Got instead of 1 not ok 54 not ok 55 Got 2 instead of 0 not ok 56 Got d41d8cd98f00b204e9800998ecf8427e instead of b6d81b360a5672d80c27430f39153e2c not ok 60 Got 0 instead of 1 Failed 6/64 subtests [11:56:45] http://build.gluster.org/job/rackspace-netbsd7-regression-triggered/6233/consoleFull There is a bug in statedump code path, If it races with STACK_RESET then shd seems to crash. I see the following output indicating the process died. kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec] I have not done any analysis, and also the change request should not affect the paths that this test is failing on. Checking the logs for Linux did not throw any more light on the cause, although the brick logs are not updated(?) to reflect the volume create and start as per the TC in (1). Anyone know anything (more) about this? Shyam ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] spurious failures in tests/basic/volume-snapshot-clone.t
Hi,

As already discussed, if you encounter a failure in this or any other snapshot test, it would be great to provide the regression run instance so that we can have a look at the logs, if there are any. Also, I tried running the test in a loop as you suggested; after an hour and a half I stopped it so that I could use my machines to work on some patches. So please let us know when this or any snapshot test fails for anyone and we will look into it asap.

Regards,
Avra

On 05/05/2015 09:01 AM, Pranith Kumar Karampuri wrote:
hi Avra/Rajesh,
Any update on this test?
* tests/basic/volume-snapshot-clone.t
* http://review.gluster.org/#/c/10053/
* Came back on April 9
* http://build.gluster.org/job/rackspace-regression-2GB-triggered/6658/
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] spurious failures in tests/basic/volume-snapshot-clone.t
On 05/05/2015 10:32 AM, Avra Sengupta wrote: Hi, As already discussed, if you encounter this or any other snapshot tests, it would be great to provide the regression run instance so that we can have a look at the logs if there are any. Also I tried running the test in a loop as you suggested. After an hour and a half I stopped it so that I can use my machines to work on some patches. So please let us know when this or any snapshot tests fails for anyone and we will look into it asap. Please read the mail again to find the link which has the logs. ./tests/basic/volume-snapshot-clone.t (Wstat: 0 Tests: 41 Failed: 3) Failed tests: 36, 38, 40 Pranith Regards, Avra On 05/05/2015 09:01 AM, Pranith Kumar Karampuri wrote: hi Avra/Rajesh, Any update on this test? * tests/basic/volume-snapshot-clone.t * http://review.gluster.org/#/c/10053/ * Came back on April 9 * http://build.gluster.org/job/rackspace-regression-2GB-triggered/6658/ Pranith ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] spurious failures in tests/basic/volume-snapshot-clone.t
On 05/05/2015 10:43 AM, Pranith Kumar Karampuri wrote: On 05/05/2015 10:32 AM, Avra Sengupta wrote: Hi, As already discussed, if you encounter this or any other snapshot tests, it would be great to provide the regression run instance so that we can have a look at the logs if there are any. Also I tried running the test in a loop as you suggested. After an hour and a half I stopped it so that I can use my machines to work on some patches. So please let us know when this or any snapshot tests fails for anyone and we will look into it asap. Please read the mail again to find the link which has the logs. ./tests/basic/volume-snapshot-clone.t (Wstat: 0 Tests: 41 Failed: 3) Failed tests: 36, 38, 40 As repeatedly told, older regression run doesn't have the logs any more. Please find the link and try and fetch the logs. Please tell me if I am missing something here. [root@VM1 lab]# wget http://slave33.cloud.gluster.org/logs/glusterfs-logs-20150409:09:27:03.tgz . --2015-05-05 10:47:18-- http://slave33.cloud.gluster.org/logs/glusterfs-logs-20150409:09:27:03.tgz Resolving slave33.cloud.gluster.org... 104.130.217.7 Connecting to slave33.cloud.gluster.org|104.130.217.7|:80... failed: Connection refused. --2015-05-05 10:47:19-- http://./ Resolving failed: No address associated with hostname. wget: unable to resolve host address “.” [root@VM1 lab]# Regards, Avra Pranith Regards, Avra On 05/05/2015 09:01 AM, Pranith Kumar Karampuri wrote: hi Avra/Rajesh, Any update on this test? * tests/basic/volume-snapshot-clone.t * http://review.gluster.org/#/c/10053/ * Came back on April 9 * http://build.gluster.org/job/rackspace-regression-2GB-triggered/6658/ Pranith ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
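A side note on the wget invocation above: wget takes no positional destination argument, so the trailing "." is parsed as a second URL, which is what produces the "unable to resolve host address" error; the connection-refused part simply means the slave is no longer serving logs. If the archive were still available, something like this would fetch and unpack it (paths are illustrative):

wget -P /tmp "http://slave33.cloud.gluster.org/logs/glusterfs-logs-20150409:09:27:03.tgz"
tar -xzf /tmp/glusterfs-logs-20150409*.tgz -C /tmp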
Re: [Gluster-devel] spurious failures in tests/basic/volume-snapshot-clone.t
On 05/05/2015 10:48 AM, Avra Sengupta wrote: On 05/05/2015 10:43 AM, Pranith Kumar Karampuri wrote: On 05/05/2015 10:32 AM, Avra Sengupta wrote: Hi, As already discussed, if you encounter this or any other snapshot tests, it would be great to provide the regression run instance so that we can have a look at the logs if there are any. Also I tried running the test in a loop as you suggested. After an hour and a half I stopped it so that I can use my machines to work on some patches. So please let us know when this or any snapshot tests fails for anyone and we will look into it asap. Please read the mail again to find the link which has the logs. ./tests/basic/volume-snapshot-clone.t (Wstat: 0 Tests: 41 Failed: 3) Failed tests: 36, 38, 40 As repeatedly told, older regression run doesn't have the logs any more. Please find the link and try and fetch the logs. Please tell me if I am missing something here. [root@VM1 lab]# wget http://slave33.cloud.gluster.org/logs/glusterfs-logs-20150409:09:27:03.tgz . --2015-05-05 10:47:18-- http://slave33.cloud.gluster.org/logs/glusterfs-logs-20150409:09:27:03.tgz Resolving slave33.cloud.gluster.org... 104.130.217.7 Connecting to slave33.cloud.gluster.org|104.130.217.7|:80... failed: Connection refused. --2015-05-05 10:47:19-- http://./ Resolving failed: No address associated with hostname. wget: unable to resolve host address “.” [root@VM1 lab]# Ah! my bad, will let you know if it happens again. Pranith Regards, Avra Pranith Regards, Avra On 05/05/2015 09:01 AM, Pranith Kumar Karampuri wrote: hi Avra/Rajesh, Any update on this test? * tests/basic/volume-snapshot-clone.t * http://review.gluster.org/#/c/10053/ * Came back on April 9 * http://build.gluster.org/job/rackspace-regression-2GB-triggered/6658/ Pranith ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] spurious failures in tests/basic/afr/sparse-file-self-heal.t
On 05/02/2015 10:14 AM, Krishnan Parthasarathi wrote: If glusterd itself fails to come up, of course the test will fail :-). Is it still happening? Pranith, Did you get a chance to see glusterd logs and find why glusterd didn't come up? Please paste the relevant logs in this thread. No :-(. The etherpad doesn't have any links :-(. Justin any help here? Pranith ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] spurious failures in tests/basic/afr/sparse-file-self-heal.t
On 05/02/2015 08:17 AM, Pranith Kumar Karampuri wrote: hi, As per the etherpad: https://public.pad.fsfe.org/p/gluster-spurious-failures * tests/basic/afr/sparse-file-self-heal.t (Wstat: 0 Tests: 64 Failed: 35) * Failed tests: 1-6, 11, 20-30, 33-34, 36, 41, 50-61, 64 * Happens in master (Mon 30th March - git commit id 3feaf1648528ff39e23748ac9004a77595460c9d) * (hasn't yet been added to BZs) If glusterd itself fails to come up, of course the test will fail :-). Is it still happening? We have not been actively curating this list for the last few days and am not certain if this failure happens anymore. Investigating why a regression run fails for our patches and fixing them (though unrelated to our patch) should be the most effective way going ahead. -Vijay ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] spurious failures in tests/basic/afr/sparse-file-self-heal.t
If glusterd itself fails to come up, of course the test will fail :-). Is it still happening? Pranith, Did you get a chance to see glusterd logs and find why glusterd didn't come up? Please paste the relevant logs in this thread. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
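When the log links are available again, pulling the glusterd startup errors out of the regression archive is usually enough to see why it did not come up. A rough sketch, assuming the archive keeps the usual /var/log/glusterfs layout; the file names and paths here are illustrative:

tar -xzf glusterfs-logs-<timestamp>.tgz
# gluster log lines carry the severity letter between the timestamp and the
# source location, so this surfaces Error and Critical messages from glusterd
grep -E '\] [EC] \[' var/log/glusterfs/*glusterd*.log | tail -n 40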
Re: [Gluster-devel] Spurious Failures in regression runs
I'll take a look at the hangs. Regards, Nithya - Original Message - From: Justin Clift jus...@gluster.org To: Vijay Bellur vbel...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org, Nithya Balachandran nbala...@redhat.com Sent: Tuesday, 31 March, 2015 5:40:29 AM Subject: Re: [Gluster-devel] Spurious Failures in regression runs On 30 Mar 2015, at 18:54, Vijay Bellur vbel...@redhat.com wrote: Hi All, We are attempting to capture all known spurious regression failures from the jenkins instance in build.gluster.org at [1]. The issues listed in the etherpad impede our patch merging workflow and need to be sorted out before we branch release-3.7. If you happen to be the owner of one or more issues in the etherpad, can you please look into the failures and have them addressed soon? To help show up more regression failures, we ran 20x new VM's in Rackspace with a full regression test each of master head branch: * Two hung regression tests on tests/bugs/posix/bug-1113960.t * Still hung in case anyone wants to check them out * 162.242.167.96 * 162.242.167.132 * Both allowing remote root login, and using our jenkins slave password as their root pw * 2 x failures on ./tests/basic/afr/sparse-file-self-heal.t Failed tests: 1-6, 11, 20-30, 33-34, 36, 41, 50-61, 64 Added to etherpad * 1 x failure on ./tests/bugs/disperse/bug-1187474.t Failed tests: 11-12 Added to etherpad * 1 x failure on ./tests/basic/uss.t Failed test: 153 Already on etherpad Looks like our general failure rate is improving. :) The hangs are a bit worrying though. :( Regards and best wishes, Justin Clift -- GlusterFS - http://www.gluster.org An open source, distributed file system scaling to several petabytes, and handling thousands of clients. My personal twitter: twitter.com/realjustinclift ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious Failures in regression runs
On 30 Mar 2015, at 18:54, Vijay Bellur vbel...@redhat.com wrote: Hi All, We are attempting to capture all known spurious regression failures from the jenkins instance in build.gluster.org at [1]. The issues listed in the etherpad impede our patch merging workflow and need to be sorted out before we branch release-3.7. If you happen to be the owner of one or more issues in the etherpad, can you please look into the failures and have them addressed soon? To help show up more regression failures, we ran 20x new VM's in Rackspace with a full regression test each of master head branch: * Two hung regression tests on tests/bugs/posix/bug-1113960.t * Still hung in case anyone wants to check them out * 162.242.167.96 * 162.242.167.132 * Both allowing remote root login, and using our jenkins slave password as their root pw * 2 x failures on ./tests/basic/afr/sparse-file-self-heal.t Failed tests: 1-6, 11, 20-30, 33-34, 36, 41, 50-61, 64 Added to etherpad * 1 x failure on ./tests/bugs/disperse/bug-1187474.t Failed tests: 11-12 Added to etherpad * 1 x failure on ./tests/basic/uss.t Failed test: 153 Already on etherpad Looks like our general failure rate is improving. :) The hangs are a bit worrying though. :( Regards and best wishes, Justin Clift -- GlusterFS - http://www.gluster.org An open source, distributed file system scaling to several petabytes, and handling thousands of clients. My personal twitter: twitter.com/realjustinclift ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] spurious failures in split-brain-healing.t
On 03/10/2015 06:55 PM, Emmanuel Dreyfus wrote:
> 3) later I hit this, I do not know yet if it is a consequence or not:
> assertion list_empty (&priv->table.lru[i]) failed: file quick-read.c, line 1052, function qr_inode_table_destroy
This happens in debug builds only; it should be fixed with http://review.gluster.org/#/c/9819/
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures because of nfs and snapshots
On 05/21/2014 08:50 PM, Vijaikumar M wrote: KP, Atin and myself did some debugging and found that there was a deadlock in glusterd. When creating a volume snapshot, the back-end operation 'taking a lvm_snapshot and starting brick' for the each brick are executed in parallel using synctask framework. brick_start was releasing a big_lock with brick_connect and does a lock again. This caused a deadlock in some race condition where main-thread waiting for one of the synctask thread to finish and synctask-thread waiting for the big_lock. We are working on fixing this issue. If this fix is going to take more time, can we please log a bug to track this problem and remove the test cases that need to be addressed from the test unit? This way other valid patches will not be blocked by the failure of the snapshot test unit. We can introduce these tests again as part of the fix for the problem. -Vijay ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures because of nfs and snapshots
- Original Message - From: Atin Mukherjee amukh...@redhat.com To: gluster-devel@gluster.org, Pranith Kumar Karampuri pkara...@redhat.com Sent: Wednesday, May 21, 2014 3:39:21 PM Subject: Re: Fwd: Re: [Gluster-devel] Spurious failures because of nfs and snapshots On 05/21/2014 11:42 AM, Atin Mukherjee wrote: On 05/21/2014 10:54 AM, SATHEESARAN wrote: Guys, This is the issue pointed out by Pranith with regard to Barrier. I was reading through it. But I wanted to bring it to concern -- S Original Message Subject: Re: [Gluster-devel] Spurious failures because of nfs and snapshots Date: Tue, 20 May 2014 21:16:57 -0400 (EDT) From: Pranith Kumar Karampuri pkara...@redhat.com To:Vijaikumar M vmall...@redhat.com, Joseph Fernandes josfe...@redhat.com CC:Gluster Devel gluster-devel@gluster.org Hey, Seems like even after this fix is merged, the regression tests are failing for the same script. You can check the logs at http://build.gluster.org:443/logs/glusterfs-logs-20140520%3a14%3a06%3a46.tgz Pranith, Is this the correct link? I don't see any log having this sequence there. Also looking at the log from this mail, this is expected as per the barrier functionality, an enable request followed by another enable should always fail and the same happens for disable. Can you please confirm the link and which particular regression test is causing this issue, is it bug-1090042.t? --Atin Relevant logs: [2014-05-20 20:17:07.026045] : volume create patchy build.gluster.org:/d/backends/patchy1 build.gluster.org:/d/backends/patchy2 : SUCCESS [2014-05-20 20:17:08.030673] : volume start patchy : SUCCESS [2014-05-20 20:17:08.279148] : volume barrier patchy enable : SUCCESS [2014-05-20 20:17:08.476785] : volume barrier patchy enable : FAILED : Failed to reconfigure barrier. [2014-05-20 20:17:08.727429] : volume barrier patchy disable : SUCCESS [2014-05-20 20:17:08.926995] : volume barrier patchy disable : FAILED : Failed to reconfigure barrier. This log is for bug-1092841.t and its expected. Damn :-(. I think I screwed up the timestamps while checking Sorry about that :-(. But there are failures. Check http://build.gluster.org/job/regression/4501/consoleFull Pranith --Atin Pranith - Original Message - From: Pranith Kumar Karampuri pkara...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Joseph Fernandes josfe...@redhat.com, Vijaikumar M vmall...@redhat.com Sent: Tuesday, May 20, 2014 3:41:11 PM Subject: Re: Spurious failures because of nfs and snapshots hi, Please resubmit the patches on top of http://review.gluster.com/#/c/7753 to prevent frequent regression failures. 
Pranith - Original Message - From: Vijaikumar M vmall...@redhat.com To: Pranith Kumar Karampuri pkara...@redhat.com Cc: Joseph Fernandes josfe...@redhat.com, Gluster Devel gluster-devel@gluster.org Sent: Monday, May 19, 2014 2:40:47 PM Subject: Re: Spurious failures because of nfs and snapshots Brick disconnected with ping-time out: Here is the log message [2014-05-19 04:29:38.133266] I [MSGID: 100030] [glusterfsd.c:1998:main] 0-/build/install/sbin/glusterfsd: Started running /build/install/sbi n/glusterfsd version 3.5qa2 (args: /build/install/sbin/glusterfsd -s build.gluster.org --volfile-id /snaps/patchy_snap1/3f2ae3fbb4a74587b1a9 1013f07d327f.build.gluster.org.var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3 -p /var/lib/glusterd/snaps/patchy_snap1/3f2ae3f bb4a74587b1a91013f07d327f/run/build.gluster.org-var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.pid -S /var/run/51fe50a6faf0aae006c815da946caf3a.socket --brick-name /var/run/gluster/snaps/3f2ae3fbb4a74587b1a91013f07d327f/brick3 -l /build/install/var/log/glusterfs/br icks/var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.log --xlator-option *-posix.glusterd-uuid=494ef3cd-15fc-4c8c-8751-2d441ba 7b4b0 --brick-port 49164 --xlator-option 3f2ae3fbb4a74587b1a91013f07d327f-server.listen-port=49164) 2 [2014-05-19 04:29:38.141118] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-glusterfs: defaulting ping-timeout to 30secs 3 [2014-05-19 04:30:09.139521] C [rpc-clnt-ping.c:105:rpc_clnt_ping_timer_expired] 0-glusterfs: server 10.3.129.13:24007 has not responded in the last 30 seconds, disconnecting. Patch 'http://review.gluster.org/#/c/7753/' will fix the problem, where ping-timer will be disabled by default for all the rpc connection except for glusterd-glusterd (set to 30sec) and client-glusterd (set to 42sec). Thanks, Vijay On Monday 19 May 2014 11:56 AM, Pranith Kumar Karampuri wrote: The latest build failure also has the same issue: Download it from here: http://build.gluster.org:443/logs/glusterfs-logs
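Coming back to the barrier log quoted near the top of this message: those FAILED lines are exactly what bug-1092841.t deliberately checks for, since a second enable (or disable) of the barrier must be rejected. Roughly, using the regression framework's TEST / "TEST !" helpers (a sketch, not the literal test):

TEST   $CLI volume barrier $V0 enable
TEST ! $CLI volume barrier $V0 enable     # already enabled, so this must fail
TEST   $CLI volume barrier $V0 disable
TEST ! $CLI volume barrier $V0 disable    # already disabled, so this must fail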
Re: [Gluster-devel] Spurious failures because of nfs and snapshots
KP, Atin and myself did some debugging and found that there was a deadlock in glusterd. When creating a volume snapshot, the back-end operation 'taking a lvm_snapshot and starting brick' for the each brick are executed in parallel using synctask framework. brick_start was releasing a big_lock with brick_connect and does a lock again. This caused a deadlock in some race condition where main-thread waiting for one of the synctask thread to finish and synctask-thread waiting for the big_lock. We are working on fixing this issue. Thanks, Vijay On Wednesday 21 May 2014 12:23 PM, Vijaikumar M wrote: From the log: http://build.gluster.org:443/logs/glusterfs-logs-20140520%3a17%3a10%3a51.tgzit looks like glusterd was hung: *Glusterd log:** * 5305 [2014-05-20 20:08:55.040665] E [glusterd-snapshot.c:3805:glusterd_add_brick_to_snap_volume] 0-management: Unable to fetch snap device (vol1.brick_snapdevice0). Leaving empty 5306 [2014-05-20 20:08:55.649146] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 5307 [2014-05-20 20:08:55.663181] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 5308 [2014-05-20 20:16:55.541197] W [glusterfsd.c:1182:cleanup_and_exit] (-- 0-: received signum (15), shutting down Glusterd was hung when executing the testcase ./tests/bugs/bug-1090042.t. *Cli log:** *72649 [2014-05-20 20:12:51.960765] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect 72650 [2014-05-20 20:12:51.960850] T [socket.c:2689:socket_connect] (--/build/install/lib/libglusterfs.so.0(gf_timer_proc+0x1a2) [0x7ff8b6609994] (--/build/install/lib/libgfrpc.so.0(rpc_clnt_reconnect+0x137) [0x7ff8b5d3305b] (- -/build/install/lib/libgfrpc.so.0(rpc_transport_connect+0x74) [0x7ff8b5d30071]))) 0-glusterfs: connect () called on transport already connected 72651 [2014-05-20 20:12:52.960943] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect 72652 [2014-05-20 20:12:52.960999] T [socket.c:2697:socket_connect] 0-glusterfs: connecting 0x1e0fcc0, state=0 gen=0 sock=-1 72653 [2014-05-20 20:12:52.961038] W [dict.c:1059:data_to_str] (--/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(+0xb5f3) [0x7ff8ad9e95f3] (--/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(socket_clien t_get_remote_sockaddr+0x10a) [0x7ff8ad9ed568] (--/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(client_fill_address_family+0xf1) [0x7ff8ad9ec7d0]))) 0-dict: data is NULL 72654 [2014-05-20 20:12:52.961070] W [dict.c:1059:data_to_str] (--/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(+0xb5f3) [0x7ff8ad9e95f3] (--/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(socket_clien t_get_remote_sockaddr+0x10a) [0x7ff8ad9ed568] (--/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(client_fill_address_family+0x100) [0x7ff8ad9ec7df]))) 0-dict: data is NULL 72655 [2014-05-20 20:12:52.961079] E [name.c:140:client_fill_address_family] 0-glusterfs: transport.address-family not specified. 
Could not guess default value from (remote-host:(null) or transport.unix.connect-path:(null)) optio ns 72656 [2014-05-20 20:12:54.961273] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect 72657 [2014-05-20 20:12:54.961404] T [socket.c:2689:socket_connect] (--/build/install/lib/libglusterfs.so.0(gf_timer_proc+0x1a2) [0x7ff8b6609994] (--/build/install/lib/libgfrpc.so.0(rpc_clnt_reconnect+0x137) [0x7ff8b5d3305b] (- -/build/install/lib/libgfrpc.so.0(rpc_transport_connect+0x74) [0x7ff8b5d30071]))) 0-glusterfs: connect () called on transport already connected 72658 [2014-05-20 20:12:55.120645] D [cli-cmd.c:384:cli_cmd_submit] 0-cli: Returning 110 72659 [2014-05-20 20:12:55.120723] D [cli-rpc-ops.c:8716:gf_cli_snapshot] 0-cli: Returning 110 Now we need to find why glusterd was hung. Thanks, Vijay On Wednesday 21 May 2014 06:46 AM, Pranith Kumar Karampuri wrote: Hey, Seems like even after this fix is merged, the regression tests are failing for the same script. You can check the logs athttp://build.gluster.org:443/logs/glusterfs-logs-20140520%3a14%3a06%3a46.tgz Relevant logs: [2014-05-20 20:17:07.026045] : volume create patchy build.gluster.org:/d/backends/patchy1 build.gluster.org:/d/backends/patchy2 : SUCCESS [2014-05-20 20:17:08.030673] : volume start patchy : SUCCESS [2014-05-20 20:17:08.279148] : volume barrier patchy enable : SUCCESS [2014-05-20 20:17:08.476785] : volume barrier patchy enable : FAILED : Failed to reconfigure barrier. [2014-05-20 20:17:08.727429] : volume barrier patchy disable : SUCCESS [2014-05-20 20:17:08.926995] : volume barrier patchy disable : FAILED : Failed to reconfigure barrier. Pranith - Original Message - From: Pranith Kumar Karampuripkara...@redhat.com To: Gluster Develgluster-devel@gluster.org Cc: Joseph Fernandesjosfe...@redhat.com, Vijaikumar
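On the question of why glusterd was hung: grabbing the stacks of the stuck process on the slave, while it is still hung, usually shows the main thread and the synctask threads blocking on each other. A rough sketch, assuming pstack (or gdb) is installed on the machine:

pstack "$(pidof glusterd)" > /tmp/glusterd-stacks.txt
# look for the main thread waiting on the synctasks and the synctask
# threads waiting on the big lock
grep -B2 -A8 -iE 'synctask|lock' /tmp/glusterd-stacks.txt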
Re: [Gluster-devel] Spurious failures because of nfs and snapshots
hi, Please resubmit the patches on top of http://review.gluster.com/#/c/7753 to prevent frequent regression failures. Pranith - Original Message - From: Vijaikumar M vmall...@redhat.com To: Pranith Kumar Karampuri pkara...@redhat.com Cc: Joseph Fernandes josfe...@redhat.com, Gluster Devel gluster-devel@gluster.org Sent: Monday, May 19, 2014 2:40:47 PM Subject: Re: Spurious failures because of nfs and snapshots Brick disconnected with ping-time out: Here is the log message [2014-05-19 04:29:38.133266] I [MSGID: 100030] [glusterfsd.c:1998:main] 0-/build/install/sbin/glusterfsd: Started running /build/install/sbi n/glusterfsd version 3.5qa2 (args: /build/install/sbin/glusterfsd -s build.gluster.org --volfile-id /snaps/patchy_snap1/3f2ae3fbb4a74587b1a9 1013f07d327f.build.gluster.org.var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3 -p /var/lib/glusterd/snaps/patchy_snap1/3f2ae3f bb4a74587b1a91013f07d327f/run/build.gluster.org-var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.pid -S /var/run/51fe50a6faf0aae006c815da946caf3a.socket --brick-name /var/run/gluster/snaps/3f2ae3fbb4a74587b1a91013f07d327f/brick3 -l /build/install/var/log/glusterfs/br icks/var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.log --xlator-option *-posix.glusterd-uuid=494ef3cd-15fc-4c8c-8751-2d441ba 7b4b0 --brick-port 49164 --xlator-option 3f2ae3fbb4a74587b1a91013f07d327f-server.listen-port=49164) 2 [2014-05-19 04:29:38.141118] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-glusterfs: defaulting ping-timeout to 30secs 3 [2014-05-19 04:30:09.139521] C [rpc-clnt-ping.c:105:rpc_clnt_ping_timer_expired] 0-glusterfs: server 10.3.129.13:24007 has not responded in the last 30 seconds, disconnecting. Patch 'http://review.gluster.org/#/c/7753/' will fix the problem, where ping-timer will be disabled by default for all the rpc connection except for glusterd-glusterd (set to 30sec) and client-glusterd (set to 42sec). Thanks, Vijay On Monday 19 May 2014 11:56 AM, Pranith Kumar Karampuri wrote: The latest build failure also has the same issue: Download it from here: http://build.gluster.org:443/logs/glusterfs-logs-20140518%3a22%3a27%3a31.tgz Pranith - Original Message - From: Vijaikumar M vmall...@redhat.com To: Joseph Fernandes josfe...@redhat.com Cc: Pranith Kumar Karampuri pkara...@redhat.com, Gluster Devel gluster-devel@gluster.org Sent: Monday, 19 May, 2014 11:41:28 AM Subject: Re: Spurious failures because of nfs and snapshots Hi Joseph, In the log mentioned below, it say ping-time is set to default value 30sec.I think issue is different. Can you please point me to the logs where you where able to re-create the problem. Thanks, Vijay On Monday 19 May 2014 09:39 AM, Pranith Kumar Karampuri wrote: hi Vijai, Joseph, In 2 of the last 3 build failures, http://build.gluster.org/job/regression/4479/console, http://build.gluster.org/job/regression/4478/console this test(tests/bugs/bug-1090042.t) failed. Do you guys think it is better to revert this test until the fix is available? Please send a patch to revert the test case if you guys feel so. You can re-submit it along with the fix to the bug mentioned by Joseph. Pranith. 
- Original Message - From: Joseph Fernandes josfe...@redhat.com To: Pranith Kumar Karampuri pkara...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Friday, 16 May, 2014 5:13:57 PM Subject: Re: Spurious failures because of nfs and snapshots Hi All, tests/bugs/bug-1090042.t : I was able to reproduce the issue i.e when this test is done in a loop for i in {1..135} ; do ./bugs/bug-1090042.t When checked the logs [2014-05-16 10:49:49.003978] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2014-05-16 10:49:49.004035] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-management: defaulting ping-timeout to 30secs [2014-05-16 10:49:49.004303] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2014-05-16 10:49:49.004340] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-management: defaulting ping-timeout to 30secs The issue is with ping-timeout and is tracked under the bug https://bugzilla.redhat.com/show_bug.cgi?id=1096729 The workaround is mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1096729#c8 Regards, Joe - Original Message - From: Pranith Kumar Karampuri pkara...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Joseph Fernandes josfe...@redhat.com Sent: Friday, May 16, 2014 6:19:54 AM Subject: Spurious failures because of nfs and snapshots hi, In the latest build I fired for review.gluster.com/7766 (http://build.gluster.org/job/regression/4443/console) failed
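On the "resubmit on top of 7753" request at the top of this message, one way to do that is roughly the following; the patch-set number and remote are illustrative, so check the change page for the exact ref before fetching:

# fetch the dependency from Gerrit and rebase the local change onto it
git fetch https://review.gluster.org/glusterfs refs/changes/53/7753/1
git rebase FETCH_HEAD
# resubmit; regression will then run with the dependency included
git push origin HEAD:refs/for/master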
Re: [Gluster-devel] Spurious failures because of nfs and snapshots
Hey, Seems like even after this fix is merged, the regression tests are failing for the same script. You can check the logs at http://build.gluster.org:443/logs/glusterfs-logs-20140520%3a14%3a06%3a46.tgz Relevant logs: [2014-05-20 20:17:07.026045] : volume create patchy build.gluster.org:/d/backends/patchy1 build.gluster.org:/d/backends/patchy2 : SUCCESS [2014-05-20 20:17:08.030673] : volume start patchy : SUCCESS [2014-05-20 20:17:08.279148] : volume barrier patchy enable : SUCCESS [2014-05-20 20:17:08.476785] : volume barrier patchy enable : FAILED : Failed to reconfigure barrier. [2014-05-20 20:17:08.727429] : volume barrier patchy disable : SUCCESS [2014-05-20 20:17:08.926995] : volume barrier patchy disable : FAILED : Failed to reconfigure barrier. Pranith - Original Message - From: Pranith Kumar Karampuri pkara...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Joseph Fernandes josfe...@redhat.com, Vijaikumar M vmall...@redhat.com Sent: Tuesday, May 20, 2014 3:41:11 PM Subject: Re: Spurious failures because of nfs and snapshots hi, Please resubmit the patches on top of http://review.gluster.com/#/c/7753 to prevent frequent regression failures. Pranith - Original Message - From: Vijaikumar M vmall...@redhat.com To: Pranith Kumar Karampuri pkara...@redhat.com Cc: Joseph Fernandes josfe...@redhat.com, Gluster Devel gluster-devel@gluster.org Sent: Monday, May 19, 2014 2:40:47 PM Subject: Re: Spurious failures because of nfs and snapshots Brick disconnected with ping-time out: Here is the log message [2014-05-19 04:29:38.133266] I [MSGID: 100030] [glusterfsd.c:1998:main] 0-/build/install/sbin/glusterfsd: Started running /build/install/sbi n/glusterfsd version 3.5qa2 (args: /build/install/sbin/glusterfsd -s build.gluster.org --volfile-id /snaps/patchy_snap1/3f2ae3fbb4a74587b1a9 1013f07d327f.build.gluster.org.var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3 -p /var/lib/glusterd/snaps/patchy_snap1/3f2ae3f bb4a74587b1a91013f07d327f/run/build.gluster.org-var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.pid -S /var/run/51fe50a6faf0aae006c815da946caf3a.socket --brick-name /var/run/gluster/snaps/3f2ae3fbb4a74587b1a91013f07d327f/brick3 -l /build/install/var/log/glusterfs/br icks/var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.log --xlator-option *-posix.glusterd-uuid=494ef3cd-15fc-4c8c-8751-2d441ba 7b4b0 --brick-port 49164 --xlator-option 3f2ae3fbb4a74587b1a91013f07d327f-server.listen-port=49164) 2 [2014-05-19 04:29:38.141118] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-glusterfs: defaulting ping-timeout to 30secs 3 [2014-05-19 04:30:09.139521] C [rpc-clnt-ping.c:105:rpc_clnt_ping_timer_expired] 0-glusterfs: server 10.3.129.13:24007 has not responded in the last 30 seconds, disconnecting. Patch 'http://review.gluster.org/#/c/7753/' will fix the problem, where ping-timer will be disabled by default for all the rpc connection except for glusterd-glusterd (set to 30sec) and client-glusterd (set to 42sec). 
Thanks, Vijay On Monday 19 May 2014 11:56 AM, Pranith Kumar Karampuri wrote: The latest build failure also has the same issue: Download it from here: http://build.gluster.org:443/logs/glusterfs-logs-20140518%3a22%3a27%3a31.tgz Pranith - Original Message - From: Vijaikumar M vmall...@redhat.com To: Joseph Fernandes josfe...@redhat.com Cc: Pranith Kumar Karampuri pkara...@redhat.com, Gluster Devel gluster-devel@gluster.org Sent: Monday, 19 May, 2014 11:41:28 AM Subject: Re: Spurious failures because of nfs and snapshots Hi Joseph, In the log mentioned below, it say ping-time is set to default value 30sec.I think issue is different. Can you please point me to the logs where you where able to re-create the problem. Thanks, Vijay On Monday 19 May 2014 09:39 AM, Pranith Kumar Karampuri wrote: hi Vijai, Joseph, In 2 of the last 3 build failures, http://build.gluster.org/job/regression/4479/console, http://build.gluster.org/job/regression/4478/console this test(tests/bugs/bug-1090042.t) failed. Do you guys think it is better to revert this test until the fix is available? Please send a patch to revert the test case if you guys feel so. You can re-submit it along with the fix to the bug mentioned by Joseph. Pranith. - Original Message - From: Joseph Fernandes josfe...@redhat.com To: Pranith Kumar Karampuri pkara...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Friday, 16 May, 2014 5:13:57 PM Subject: Re: Spurious failures because of nfs and snapshots Hi All, tests/bugs/bug-1090042.t : I was able to reproduce the issue i.e when this test is done in a loop for i in
Re: [Gluster-devel] Spurious failures because of nfs and snapshots
Hi Joseph, In the log mentioned below, it say ping-time is set to default value 30sec.I think issue is different. Can you please point me to the logs where you where able to re-create the problem. Thanks, Vijay On Monday 19 May 2014 09:39 AM, Pranith Kumar Karampuri wrote: hi Vijai, Joseph, In 2 of the last 3 build failures, http://build.gluster.org/job/regression/4479/console, http://build.gluster.org/job/regression/4478/console this test(tests/bugs/bug-1090042.t) failed. Do you guys think it is better to revert this test until the fix is available? Please send a patch to revert the test case if you guys feel so. You can re-submit it along with the fix to the bug mentioned by Joseph. Pranith. - Original Message - From: Joseph Fernandes josfe...@redhat.com To: Pranith Kumar Karampuri pkara...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Friday, 16 May, 2014 5:13:57 PM Subject: Re: Spurious failures because of nfs and snapshots Hi All, tests/bugs/bug-1090042.t : I was able to reproduce the issue i.e when this test is done in a loop for i in {1..135} ; do ./bugs/bug-1090042.t When checked the logs [2014-05-16 10:49:49.003978] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2014-05-16 10:49:49.004035] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-management: defaulting ping-timeout to 30secs [2014-05-16 10:49:49.004303] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2014-05-16 10:49:49.004340] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-management: defaulting ping-timeout to 30secs The issue is with ping-timeout and is tracked under the bug https://bugzilla.redhat.com/show_bug.cgi?id=1096729 The workaround is mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1096729#c8 Regards, Joe - Original Message - From: Pranith Kumar Karampuri pkara...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Joseph Fernandes josfe...@redhat.com Sent: Friday, May 16, 2014 6:19:54 AM Subject: Spurious failures because of nfs and snapshots hi, In the latest build I fired for review.gluster.com/7766 (http://build.gluster.org/job/regression/4443/console) failed because of spurious failure. The script doesn't wait for nfs export to be available. I fixed that, but interestingly I found quite a few scripts with same problem. Some of the scripts are relying on 'sleep 5' which also could lead to spurious failures if the export is not available in 5 seconds. We found that waiting for 20 seconds is better, but 'sleep 20' would unnecessarily delay the build execution. So if you guys are going to write any scripts which has to do nfs mounts, please do it the following way: EXPECT_WITHIN 20 1 is_nfs_export_available; TEST mount -t nfs -o vers=3 $H0:/$V0 $N0; Please review http://review.gluster.com/7773 :-) I saw one more spurious failure in a snapshot related script tests/bugs/bug-1090042.t on the next build fired by Niels. Joesph (CCed) is debugging it. He agreed to reply what he finds and share it with us so that we won't introduce similar bugs in future. I encourage you guys to share what you fix to prevent spurious failures in future. Thanks Pranith ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures because of nfs and snapshots
- Original Message - From: Justin Clift jus...@gluster.org To: Pranith Kumar Karampuri pkara...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Monday, 19 May, 2014 10:26:04 AM Subject: Re: [Gluster-devel] Spurious failures because of nfs and snapshots On 16/05/2014, at 1:49 AM, Pranith Kumar Karampuri wrote: hi, In the latest build I fired for review.gluster.com/7766 (http://build.gluster.org/job/regression/4443/console) failed because of spurious failure. The script doesn't wait for nfs export to be available. I fixed that, but interestingly I found quite a few scripts with same problem. Some of the scripts are relying on 'sleep 5' which also could lead to spurious failures if the export is not available in 5 seconds. Cool. Fixing this NFS problem across all of the tests would be really welcome. That specific failed test (bug-1087198.t) is the most common one I've seen over the last few weeks, causing about half of all failures in master. Eliminating this class of regression failure would be really helpful. :) This particular class is eliminated :-). Patch was merged on Friday. Pranith + Justin -- Open Source and Standards @ Red Hat twitter.com/realjustinclift ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Spurious failures because of nfs and snapshots
On Thu, May 15, 2014 at 5:49 PM, Pranith Kumar Karampuri pkara...@redhat.com wrote: hi, In the latest build I fired for review.gluster.com/7766 ( http://build.gluster.org/job/regression/4443/console) failed because of spurious failure. The script doesn't wait for nfs export to be available. I fixed that, but interestingly I found quite a few scripts with same problem. Some of the scripts are relying on 'sleep 5' which also could lead to spurious failures if the export is not available in 5 seconds. We found that waiting for 20 seconds is better, but 'sleep 20' would unnecessarily delay the build execution. So if you guys are going to write any scripts which has to do nfs mounts, please do it the following way: EXPECT_WITHIN 20 1 is_nfs_export_available; TEST mount -t nfs -o vers=3 $H0:/$V0 $N0; Always please also add mount -o soft,intr in the regression scripts for mounting nfs. Becomes so much easier to cleanup any hung mess. We probably need an NFS mounting helper function which can be called like: TEST mount_nfs $H0:/$V0 $N0; Thanks Avati ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
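A minimal sketch of such a helper, assuming the is_nfs_export_available function and the $H0/$V0/$N0 conventions already used by the regression framework; the 20-second bound mirrors the suggestion above and the function name is only a proposal:

function mount_nfs {
        local exp="$1"
        local mnt="$2"
        # wait for gluster-nfs to actually export the volume before mounting
        EXPECT_WITHIN 20 "1" is_nfs_export_available
        # soft,intr so a hung mount can be cleaned up easily
        TEST mount -t nfs -o vers=3,soft,intr "$exp" "$mnt"
}

# usage: mount_nfs $H0:/$V0 $N0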