Re: [Gluster-devel] bug-857330/normal.t failure

2014-05-22 Thread Pranith Kumar Karampuri
Kaushal,
   Rebalance status command seems to be failing sometimes. I sent a mail about 
such a spurious failure earlier today. Did you get a chance to look at the logs 
and confirm that rebalance didn't fail and it is indeed a timeout?

Pranith
- Original Message -
 From: Kaushal M kshlms...@gmail.com
 To: Pranith Kumar Karampuri pkara...@redhat.com
 Cc: Justin Clift jus...@gluster.org, Gluster Devel 
 gluster-devel@gluster.org
 Sent: Thursday, May 22, 2014 4:40:25 PM
 Subject: Re: [Gluster-devel] bug-857330/normal.t failure
 
 The test is waiting for rebalance to finish. This is a rebalance with some
 actual data so it could have taken a long time to finish. I did set a
 pretty high timeout, but it seems like it's not enough for the new VMs.
 
 Possible options are,
 - Increase this timeout further
 - Reduce the amount of data. Currently this is 100 directories with 10
 files each of size between 10-500KB
 
 ~kaushal
 
 
 On Thu, May 22, 2014 at 3:59 PM, Pranith Kumar Karampuri 
 pkara...@redhat.com wrote:
 
  Kaushal (CCed) has more context about these. Keep the setup until he
  responds so that he can take a look.
 
  Pranith
  - Original Message -
   From: Justin Clift jus...@gluster.org
   To: Pranith Kumar Karampuri pkara...@redhat.com
   Cc: Gluster Devel gluster-devel@gluster.org
   Sent: Thursday, May 22, 2014 3:54:46 PM
   Subject: bug-857330/normal.t failure
  
   Hi Pranith,
  
   Ran a few VMs with your Gerrit CR 7835 applied, and in DEBUG
   mode (I think).
  
   One of the VMs had a failure in bug-857330/normal.t:
  
 Test Summary Report
 ---
  ./tests/basic/rpm.t              (Wstat: 0 Tests: 0 Failed: 0)
    Parse errors: Bad plan.  You planned 8 tests but ran 0.
  ./tests/bugs/bug-857330/normal.t (Wstat: 0 Tests: 24 Failed: 1)
    Failed test:  13
  Files=230, Tests=4369, 5407 wallclock secs ( 2.13 usr  1.73 sys + 941.82 cusr 645.54 csys = 1591.22 CPU)
 Result: FAIL
  
   Seems to be this test:
  
 COMMAND=volume rebalance $V0 status
 PATTERN=completed
 EXPECT_WITHIN 300 $PATTERN get-task-status
  
   Is this one on your radar already?
  
   Btw, this VM is still online.  Can give you access to retrieve logs
   if useful.
  
   + Justin
  
   --
   Open Source and Standards @ Red Hat
  
   twitter.com/realjustinclift
  
  
  ___
  Gluster-devel mailing list
  Gluster-devel@gluster.org
  http://supercolony.gluster.org/mailman/listinfo/gluster-devel
 
 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] bug-857330/normal.t failure

2014-05-22 Thread Kaushal M
Thanks Justin, I found the problem. The VM can be deleted now.

Turns out, there was more than enough time for the rebalance to complete.
But we hit a race, which caused a command to fail.

The particular test that failed is waiting for rebalance to finish. It does
this by doing a 'gluster volume rebalance  status' command and checking
the result. The EXPECT_WITHIN function re-runs this command until we get a
match, the command fails, or the timeout expires.
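
(Roughly, EXPECT_WITHIN is a polling loop along the lines of the sketch below;
this is only an illustration of the behaviour described above, not the exact
definition from the test framework.)

    expect_within () {
        # expect_within <timeout-secs> <pattern> <command...>
        # Re-run the command every second until its output matches the
        # pattern. The loop also ends if the command itself fails or the
        # timeout expires -- and the command failing is exactly the window
        # that the race described below falls into.
        local timeout=$1 pattern=$2
        shift 2
        local end=$((SECONDS + timeout))
        while [ $SECONDS -lt $end ]; do
            local out
            out=$("$@") || return 1            # command failed
            echo "$out" | grep -q "$pattern" && return 0
            sleep 1
        done
        return 1                               # timed out
    }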

For a rebalance status command, glusterd sends a request to the rebalance
process (as a brick_op) to get the latest stats. It had done the same in
this case as well. But while glusterd was waiting for the reply, the
rebalance completed and the process stopped itself. This caused the rpc
connection between glusterd and the rebalance proc to close, and all pending
requests were unwound as failures, which in turn led to the command failing.

I cannot think of a way to avoid this race from within glusterd. For this
particular test, we could avoid using the 'rebalance status' command if we
directly checked the rebalance process state using its pid etc. I don't
particularly approve of this approach, as I think I used the 'rebalance
status' command for a reason. But I currently cannot recall the reason, and
if I cannot come up with it soon, I wouldn't mind changing the test to avoid
'rebalance status'.
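
If we do end up dropping it, a pid-based check could look something like the
sketch below; the helper name and the pid-file path are only illustrative, I'd
have to check what glusterd actually writes out.

    rebalance_daemon_state () {
        # Print "running" while the rebalance daemon is alive and "stopped"
        # once it has exited. The pid-file location is an assumption made
        # for illustration only.
        local pidfile="/var/lib/glusterd/vols/$V0/rebalance/$NODE_UUID.pid"
        if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
            echo running
        else
            echo stopped
        fi
    }

    # The test would then wait on the daemon itself instead of the CLI:
    #   EXPECT_WITHIN 300 "stopped" rebalance_daemon_state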

~kaushal



On Thu, May 22, 2014 at 5:22 PM, Justin Clift jus...@gluster.org wrote:

 On 22/05/2014, at 12:32 PM, Kaushal M wrote:
  I haven't yet. But I will.
 
  Justin,
  Can I take a peek inside the VM?

 Sure.

   IP: 23.253.57.20
   User: root
   Password: foobar123

 The stdout log from the regression test is in /tmp/regression.log.

 The GlusterFS git repo is in /root/glusterfs.  Um, you should be
 able to find everything else pretty easily.

 Btw, this is just a temp VM, so feel free to do anything you want
 with it.  When you're finished with it let me know so I can delete
 it. :)

 + Justin


  ~kaushal
 
 
  On Thu, May 22, 2014 at 4:53 PM, Pranith Kumar Karampuri 
 pkara...@redhat.com wrote:
  Kaushal,
 Rebalance status command seems to be failing sometimes. I sent a mail
 about such a spurious failure earlier today. Did you get a chance to look at
 the logs and confirm that rebalance didn't fail and it is indeed a timeout?
 
  Pranith
  - Original Message -
   From: Kaushal M kshlms...@gmail.com
   To: Pranith Kumar Karampuri pkara...@redhat.com
   Cc: Justin Clift jus...@gluster.org, Gluster Devel 
 gluster-devel@gluster.org
   Sent: Thursday, May 22, 2014 4:40:25 PM
   Subject: Re: [Gluster-devel] bug-857330/normal.t failure
  
   The test is waiting for rebalance to finish. This is a rebalance with
 some
   actual data so it could have taken a long time to finish. I did set a
   pretty high timeout, but it seems like it's not enough for the new VMs.
  
   Possible options are,
   - Increase this timeout further
   - Reduce the amount of data. Currently this is 100 directories with 10
   files each of size between 10-500KB
  
   ~kaushal
  
  
   On Thu, May 22, 2014 at 3:59 PM, Pranith Kumar Karampuri 
   pkara...@redhat.com wrote:
  
 Kaushal (CCed) has more context about these. Keep the setup until he
responds so that he can take a look.
   
Pranith
- Original Message -
 From: Justin Clift jus...@gluster.org
 To: Pranith Kumar Karampuri pkara...@redhat.com
 Cc: Gluster Devel gluster-devel@gluster.org
 Sent: Thursday, May 22, 2014 3:54:46 PM
 Subject: bug-857330/normal.t failure

 Hi Pranith,

  Ran a few VMs with your Gerrit CR 7835 applied, and in DEBUG
 mode (I think).

  One of the VMs had a failure in bug-857330/normal.t:

   Test Summary Report
   ---
    ./tests/basic/rpm.t              (Wstat: 0 Tests: 0 Failed: 0)
      Parse errors: Bad plan.  You planned 8 tests but ran 0.
    ./tests/bugs/bug-857330/normal.t (Wstat: 0 Tests: 24 Failed: 1)
      Failed test:  13
    Files=230, Tests=4369, 5407 wallclock secs ( 2.13 usr  1.73 sys + 941.82 cusr 645.54 csys = 1591.22 CPU)
   Result: FAIL

 Seems to be this test:

   COMMAND=volume rebalance $V0 status
   PATTERN=completed
   EXPECT_WITHIN 300 $PATTERN get-task-status

 Is this one on your radar already?

 Btw, this VM is still online.  Can give you access to retrieve logs
 if useful.

 + Justin

 --
 Open Source and Standards @ Red Hat

 twitter.com/realjustinclift


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
   
  
 

 --
 Open Source and Standards @ Red Hat

 twitter.com/realjustinclift


___
Gluster-devel mailing list
Gluster-devel

Re: [Gluster-devel] bug-857330/normal.t failure

2014-05-22 Thread Pranith Kumar Karampuri


- Original Message -
 From: Kaushal M kshlms...@gmail.com
 To: Justin Clift jus...@gluster.org, Gluster Devel 
 gluster-devel@gluster.org
 Sent: Thursday, May 22, 2014 6:04:29 PM
 Subject: Re: [Gluster-devel] bug-857330/normal.t failure
 
 Thanks Justin, I found the problem. The VM can be deleted now.
 
 Turns out, there was more than enough time for the rebalance to complete. But
 we hit a race, which caused a command to fail.
 
 The particular test that failed is waiting for rebalance to finish. It does
 this by doing a 'gluster volume rebalance  status' command and checking
 the result. The EXPECT_WITHIN function re-runs this command until we get a
 match, the command fails, or the timeout expires.
 
 For a rebalance status command, glusterd sends a request to the rebalance
 process (as a brick_op) to get the latest stats. It had done the same in
 this case as well. But while glusterd was waiting for the reply, the
 rebalance completed and the process stopped itself. This caused the rpc
 connection between glusterd and the rebalance proc to close, and all pending
 requests were unwound as failures, which in turn led to the command failing.

Do you think we can print the status of the process as 'not-responding' when 
such a thing happens, instead of failing the command?

Pranith

 
 I cannot think of a way to avoid this race from within glusterd. For this
 particular test, we could avoid using the 'rebalance status' command if we
 directly checked the rebalance process state using its pid etc. I don't
 particularly approve of this approach, as I think I used the 'rebalance
 status' command for a reason. But I currently cannot recall the reason, and
 if I cannot come up with it soon, I wouldn't mind changing the test to avoid
 'rebalance status'.
 
 ~kaushal
 
 
 
 On Thu, May 22, 2014 at 5:22 PM, Justin Clift  jus...@gluster.org  wrote:
 
 
 
 On 22/05/2014, at 12:32 PM, Kaushal M wrote:
  I haven't yet. But I will.
  
  Justin,
   Can I take a peek inside the VM?
 
 Sure.
 
 IP: 23.253.57.20
 User: root
 Password: foobar123
 
 The stdout log from the regression test is in /tmp/regression.log.
 
 The GlusterFS git repo is in /root/glusterfs. Um, you should be
 able to find everything else pretty easily.
 
 Btw, this is just a temp VM, so feel free to do anything you want
 with it. When you're finished with it let me know so I can delete
 it. :)
 
 + Justin
 
 
  ~kaushal
  
  
  On Thu, May 22, 2014 at 4:53 PM, Pranith Kumar Karampuri 
  pkara...@redhat.com  wrote:
  Kaushal,
  Rebalance status command seems to be failing sometimes. I sent a mail about
   such a spurious failure earlier today. Did you get a chance to look at the
  logs and confirm that rebalance didn't fail and it is indeed a timeout?
  
  Pranith
  - Original Message -
   From: Kaushal M  kshlms...@gmail.com 
   To: Pranith Kumar Karampuri  pkara...@redhat.com 
   Cc: Justin Clift  jus...@gluster.org , Gluster Devel 
   gluster-devel@gluster.org 
   Sent: Thursday, May 22, 2014 4:40:25 PM
   Subject: Re: [Gluster-devel] bug-857330/normal.t failure
   
   The test is waiting for rebalance to finish. This is a rebalance with
   some
   actual data so it could have taken a long time to finish. I did set a
   pretty high timeout, but it seems like it's not enough for the new VMs.
   
   Possible options are,
   - Increase this timeout further
   - Reduce the amount of data. Currently this is 100 directories with 10
   files each of size between 10-500KB
   
   ~kaushal
   
   
   On Thu, May 22, 2014 at 3:59 PM, Pranith Kumar Karampuri 
   pkara...@redhat.com  wrote:
   
 Kaushal (CCed) has more context about these. Keep the setup until he
responds so that he can take a look.

Pranith
- Original Message -
 From: Justin Clift  jus...@gluster.org 
 To: Pranith Kumar Karampuri  pkara...@redhat.com 
 Cc: Gluster Devel  gluster-devel@gluster.org 
 Sent: Thursday, May 22, 2014 3:54:46 PM
 Subject: bug-857330/normal.t failure
 
 Hi Pranith,
 
  Ran a few VMs with your Gerrit CR 7835 applied, and in DEBUG
 mode (I think).
 
  One of the VMs had a failure in bug-857330/normal.t:
 
 Test Summary Report
 ---
  ./tests/basic/rpm.t              (Wstat: 0 Tests: 0 Failed: 0)
    Parse errors: Bad plan. You planned 8 tests but ran 0.
  ./tests/bugs/bug-857330/normal.t (Wstat: 0 Tests: 24 Failed: 1)
    Failed test: 13
  Files=230, Tests=4369, 5407 wallclock secs ( 2.13 usr 1.73 sys + 941.82 cusr 645.54 csys = 1591.22 CPU)
 Result: FAIL
 
 Seems to be this test:
 
 COMMAND=volume rebalance $V0 status
 PATTERN=completed
 EXPECT_WITHIN 300 $PATTERN get-task-status
 
 Is this one on your radar already?
 
 Btw, this VM is still online. Can give you access to retrieve logs
 if useful.
 
 + Justin
 
 --
 Open Source and Standards @ Red Hat

Re: [Gluster-devel] bug-857330/normal.t failure

2014-05-22 Thread Krishnan Parthasarathi

- Original Message -
 On 22/05/2014, at 1:34 PM, Kaushal M wrote:
  Thanks Justin, I found the problem. The VM can be deleted now.
 
 Done. :)
 
 
  Turns out, there was more than enough time for the rebalance to complete.
  But we hit a race, which caused a command to fail.
  
  The particular test that failed is waiting for rebalance to finish. It does
  this by doing a 'gluster volume rebalance  status' command and checking
   the result. The EXPECT_WITHIN function re-runs this command until we get a
   match, the command fails, or the timeout expires.
  
  For a rebalance status command, glusterd sends a request to the rebalance
  process (as a brick_op) to get the latest stats. It had done the same in
  this case as well. But while glusterd was waiting for the reply, the
  rebalance completed and the process stopped itself. This caused the rpc
   connection between glusterd and the rebalance proc to close, and all pending
   requests were unwound as failures, which in turn led to the command failing.
  
  I cannot think of a way to avoid this race from within glusterd. For this
  particular test, we could avoid using the 'rebalance status' command if we
  directly checked the rebalance process state using its pid etc. I don't
  particularly approve of this approach, as I think I used the 'rebalance
  status' command for a reason. But I currently cannot recall the reason,
   and if I cannot come up with it soon, I wouldn't mind changing the test to
   avoid 'rebalance status'.
 

I think it's the rebalance daemon's life cycle that is problematic. It makes it
inconvenient, if not impossible, for glusterd to gather progress/status
deterministically.
The rebalance process could instead wait for the rebalance-commit subcommand
before terminating.
No other daemon managed by glusterd has this kind of life cycle.
I don't see any good reason why rebalance should kill itself on completion
of data migration.

Thoughts?

~Krish

 Hmmm, is it the kind of thing where the rebalance status command
 should retry, if its connection gets closed by a just-completed-
 rebalance (as happened here)?
 
 Or would that not work as well?
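 
 Something like the sketch below, maybe?  (Just illustrative and untested; it
 reuses the get-task-status helper from the failing test, and the wrapper name
 is made up.)
 
     get_task_status_with_retry () {
         # Wrap the existing get-task-status check: if the status command
         # fails (for example because a just-completed rebalance daemon
         # closed the connection mid-request), retry once instead of
         # letting EXPECT_WITHIN treat the first error as fatal.
         local out
         if out=$(get-task-status); then
             echo "$out"
             return 0
         fi
         sleep 1
         get-task-status
     }
 
     # e.g.  EXPECT_WITHIN 300 $PATTERN get_task_status_with_retry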
 
 + Justin
 
 --
 Open Source and Standards @ Red Hat
 
 twitter.com/realjustinclift
 
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://supercolony.gluster.org/mailman/listinfo/gluster-devel
 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel