Re: [Gluster-devel] bug-857330/normal.t failure
Kaushal,

Rebalance status command seems to be failing sometimes. I sent a mail about such a spurious failure earlier today. Did you get a chance to look at the logs and confirm that rebalance didn't fail and it is indeed a timeout?

Pranith

----- Original Message -----
From: Kaushal M kshlms...@gmail.com
To: Pranith Kumar Karampuri pkara...@redhat.com
Cc: Justin Clift jus...@gluster.org, Gluster Devel gluster-devel@gluster.org
Sent: Thursday, May 22, 2014 4:40:25 PM
Subject: Re: [Gluster-devel] bug-857330/normal.t failure

> The test is waiting for rebalance to finish. This is a rebalance with some actual data, so it could have taken a long time to finish. I did set a pretty high timeout, but it seems like it's not enough for the new VMs.
>
> Possible options are:
> - Increase this timeout further
> - Reduce the amount of data. Currently this is 100 directories with 10 files each, of size between 10-500KB.
>
> ~kaushal
>
> On Thu, May 22, 2014 at 3:59 PM, Pranith Kumar Karampuri pkara...@redhat.com wrote:
>> Kaushal has more context about these, CCed. Keep the setup until he responds so that he can take a look.
>>
>> Pranith
>>
>> ----- Original Message -----
>> From: Justin Clift jus...@gluster.org
>> To: Pranith Kumar Karampuri pkara...@redhat.com
>> Cc: Gluster Devel gluster-devel@gluster.org
>> Sent: Thursday, May 22, 2014 3:54:46 PM
>> Subject: bug-857330/normal.t failure
>>
>> Hi Pranith,
>>
>> Ran a few VMs with your Gerrit CR 7835 applied, and in DEBUG mode (I think).
>> One of the VMs had a failure in bug-857330/normal.t:
>>
>>   Test Summary Report
>>   -------------------
>>   ./tests/basic/rpm.t              (Wstat: 0 Tests: 0 Failed: 0)
>>     Parse errors: Bad plan. You planned 8 tests but ran 0.
>>   ./tests/bugs/bug-857330/normal.t (Wstat: 0 Tests: 24 Failed: 1)
>>     Failed test: 13
>>   Files=230, Tests=4369, 5407 wallclock secs ( 2.13 usr 1.73 sys + 941.82 cusr 645.54 csys = 1591.22 CPU)
>>   Result: FAIL
>>
>> Seems to be this test:
>>
>>   COMMAND="volume rebalance $V0 status"
>>   PATTERN="completed"
>>   EXPECT_WITHIN 300 $PATTERN get-task-status
>>
>> Is this one on your radar already? Btw, this VM is still online. Can give you access to retrieve logs if useful.
>>
>> + Justin
>> --
>> Open Source and Standards @ Red Hat
>> twitter.com/realjustinclift

_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
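The failing line relies on the test framework's polling helper. As a rough sketch of how such a helper behaves (hypothetical function name and structure, not the framework's actual implementation): it re-runs the command every second until the output matches the pattern, the command itself fails, or the timeout elapses.

```shell
#!/bin/sh
# expect_within TIMEOUT PATTERN CMD...
# Poll CMD until its output contains PATTERN (success), CMD fails
# (immediate failure), or TIMEOUT seconds elapse (failure).
expect_within () {
    timeout=$1; pattern=$2; shift 2
    end=$(( $(date +%s) + timeout ))
    while [ "$(date +%s)" -lt "$end" ]; do
        out=$("$@") || return 1          # command failed: give up now
        case $out in
            *"$pattern"*) return 0 ;;    # output matched: success
        esac
        sleep 1
    done
    return 1                             # timed out without a match
}

# Example usage (hypothetical wrapper around the status command):
# expect_within 300 completed gluster volume rebalance "$V0" status
```

Note the middle branch: a single failed invocation aborts the whole wait immediately, which is exactly why a transient status-command failure can sink the test even when the timeout is generous.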
Re: [Gluster-devel] bug-857330/normal.t failure
Thanks Justin, I found the problem. The VM can be deleted now.

Turns out, there was more than enough time for the rebalance to complete. But we hit a race, which caused a command to fail.

The particular test that failed waits for rebalance to finish. It does this by running the 'gluster volume rebalance status' command and checking the result. The EXPECT_WITHIN function runs this command until we have a match, the command fails, or the timeout happens.

For a rebalance status command, glusterd sends a request to the rebalance process (as a brick_op) to get the latest stats. It had done the same in this case as well. But while glusterd was waiting for the reply, the rebalance completed and the process stopped itself. This caused the RPC connection between glusterd and the rebalance process to close, which caused all the pending requests to be unwound as failures, which in turn led to the command failing.

I cannot think of a way to avoid this race from within glusterd. For this particular test, we could avoid using the 'rebalance status' command if we directly checked the rebalance process state using its pid etc. I don't particularly approve of this approach, as I think I used the 'rebalance status' command for a reason. But I currently cannot recall the reason, and if I cannot come up with it soon, I wouldn't mind changing the test to avoid 'rebalance status'.

~kaushal

On Thu, May 22, 2014 at 5:22 PM, Justin Clift jus...@gluster.org wrote:
> On 22/05/2014, at 12:32 PM, Kaushal M wrote:
>> I haven't yet. But I will.
>> Justin, can I take a peek inside the VM?
>
> Sure.
>
> IP:       23.253.57.20
> User:     root
> Password: foobar123
>
> The stdout log from the regression test is in /tmp/regression.log. The GlusterFS git repo is in /root/glusterfs. Um, you should be able to find everything else pretty easily.
>
> Btw, this is just a temp VM, so feel free to do anything you want with it. When you're finished with it let me know so I can delete it. :)
>
> + Justin
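Kaushal's pid-based alternative could look roughly like this. This is only a sketch of the idea; the pidfile path and function name are illustrative assumptions, not the actual glusterd layout.

```shell
#!/bin/sh
# rebalance_done PIDFILE
# Decide whether the rebalance process has finished by probing its pid
# directly, instead of asking glusterd over RPC.  Avoids the race where
# the status command fails because the process exited mid-request.
rebalance_done () {
    pidfile=$1
    [ -f "$pidfile" ] || return 0        # no pidfile: nothing is running
    pid=$(cat "$pidfile")
    if kill -0 "$pid" 2>/dev/null; then
        return 1                         # process alive: still rebalancing
    fi
    return 0                             # process gone: rebalance finished
}

# Example usage with a hypothetical pidfile location:
# rebalance_done /var/run/gluster/rebalance-$V0.pid && echo "completed"
```

The trade-off Kaushal notes still applies: this only proves the process exited, not that the rebalance *succeeded*, which may be exactly the reason the test used 'rebalance status' in the first place.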
Re: [Gluster-devel] bug-857330/normal.t failure
----- Original Message -----
From: Kaushal M kshlms...@gmail.com
To: Justin Clift jus...@gluster.org, Gluster Devel gluster-devel@gluster.org
Sent: Thursday, May 22, 2014 6:04:29 PM
Subject: Re: [Gluster-devel] bug-857330/normal.t failure

> Thanks Justin, I found the problem. The VM can be deleted now.
>
> But while glusterd was waiting for the reply, the rebalance completed and the process stopped itself. This caused the RPC connection between glusterd and the rebalance process to close, which caused all the pending requests to be unwound as failures, which in turn led to the command failing.

Do you think we can print the status of the process as 'not-responding' when such a thing happens, instead of failing the command?

Pranith

> I cannot think of a way to avoid this race from within glusterd. For this particular test, we could avoid using the 'rebalance status' command if we directly checked the rebalance process state using its pid etc. I don't particularly approve of this approach, as I think I used the 'rebalance status' command for a reason. But I currently cannot recall the reason, and if I cannot come up with it soon, I wouldn't mind changing the test to avoid 'rebalance status'.
>
> ~kaushal
Re: [Gluster-devel] bug-857330/normal.t failure
----- Original Message -----
> On 22/05/2014, at 1:34 PM, Kaushal M wrote:
>> Thanks Justin, I found the problem. The VM can be deleted now.
>
> Done. :)
>
>> Turns out, there was more than enough time for the rebalance to complete. But we hit a race, which caused a command to fail. While glusterd was waiting for the rebalance process's reply to a status request, the rebalance completed and the process stopped itself. This closed the RPC connection, unwound all pending requests as failures, and in turn failed the command.
>>
>> I cannot think of a way to avoid this race from within glusterd. For this particular test, we could avoid using the 'rebalance status' command if we directly checked the rebalance process state using its pid etc.

I think it's the rebalance daemon's life cycle which is problematic. It makes it inconvenient, if not impossible, for glusterd to gather progress/status deterministically. The rebalance process could wait for the rebalance-commit subcommand to terminate. No other daemon managed by glusterd has this kind of life cycle. I don't see any good reason why rebalance should kill itself on completion of data migration. Thoughts?

~Krish

> Hmmm, is it the kind of thing where the rebalance status command should retry, if its connection gets closed by a just-completed rebalance (as happened here)? Or would that not work as well?
>
> + Justin
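Justin's retry idea could be prototyped on the client side without touching glusterd at all. A minimal sketch, with the function name, retry count, and delay all being arbitrary illustrative choices:

```shell
#!/bin/sh
# status_with_retry TRIES CMD...
# Run CMD; if it fails (for example because the rebalance process exited
# and tore down the RPC connection mid-request), retry up to TRIES times
# with a short pause, and only then report failure.
status_with_retry () {
    tries=$1; shift
    i=0
    while [ "$i" -lt "$tries" ]; do
        if out=$("$@"); then
            printf '%s\n' "$out"         # success: pass the output through
            return 0
        fi
        i=$((i + 1))
        sleep 1                          # give a restarted/settled state a moment
    done
    return 1                             # persistently failing: real error
}

# Example usage (hypothetical wrapper):
# status_with_retry 3 gluster volume rebalance "$V0" status
```

A retry masks the one-shot disconnect race, but as the thread notes it papers over the underlying life-cycle issue: after the process has exited, a retried status command would need glusterd to answer from its own cached state rather than from the (now gone) rebalance process.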