Re: [Gluster-devel] GlusterFS-3.4.4beta is slipping

2014-05-15 Thread Ravishankar N

On 05/15/2014 08:37 AM, Pranith Kumar Karampuri wrote:


- Original Message -

From: "Kaleb S. KEITHLEY" 
To: "Gluster Devel" 
Sent: Wednesday, May 14, 2014 5:04:02 PM
Subject: [Gluster-devel] GlusterFS-3.4.4beta is slipping


At last week's community meeting we tentatively agreed that today — May
14th — we would ship 3.4.4beta.

Three changes for 3.4.4 need to be reviewed before they can be merged:

1 Ubuntu code audit results (blocking inclusion in Ubuntu Main repo):
https://bugzilla.redhat.com/show_bug.cgi?id=1086460
http://review.gluster.org/#/c/7583/
(also http://review.gluster.org/#/c/7605/ for 3.5.1)

Done with the review. Please address the comments and re-submit.


2 Addition of new server after upgrade from 3.3 results in peer rejected:
https://bugzilla.redhat.com/show_bug.cgi?id=1090298
http://review.gluster.org/#/c/7729/

Kp may have a better idea, CCed him.


Abandoned this patch. The reason and the workaround are detailed in the 
review comments [1].
TL;DR fix: after upgrading all nodes to 3.4 and before doing any new 
peer probes, run a dummy volume set operation, viz. `gluster volume set 
<volname> brick-log-level INFO`, to update the checksum.
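
Expressed as a tiny script, the workaround might look like this (a sketch only: the helper name is invented, and the `gluster volume set` line is the one quoted above):

```shell
#!/bin/sh
# Hypothetical helper for the 3.3 -> 3.4 upgrade workaround.
# Run once after ALL nodes are upgraded and BEFORE any new peer probe;
# pass the name of any volume that already exists in the cluster.
refresh_volume_checksum() {
    volname=$1
    # A dummy "volume set" makes glusterd rewrite the volume info files,
    # updating the stored checksum to the 3.4 format.
    gluster volume set "$volname" brick-log-level INFO
}
```

With that in place, `refresh_volume_checksum myvol` would be run on the upgraded cluster before probing the new node.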


The documentation for upgrading to 3.4 [2] seems to be pointing to 
Vijay's blog. The above fix could be included in the steps mentioned 
there. (CC'ing Vijay)


Regards,
Ravi

[1] 
http://review.gluster.org/#/c/7729/1/xlators/mgmt/glusterd/src/glusterd-store.c
[2] 
http://www.gluster.org/community/documentation/index.php/Main_Page#GlusterFS_3.4



3 Disabling NFS causes E level errors in nfs.log:
https://bugzilla.redhat.com/show_bug.cgi?id=1095330
http://review.gluster.org/#/c/7699/

Reviewed it and merged.
I do not maintain the nfs component, but the patch is the same as the one in 3.5 except 
for the tabs-to-spaces conversion. I tested it on my machine before merging.


And Joe Julian added https://bugzilla.redhat.com/show_bug.cgi?id=1095596
as a blocker for 3.4.4 — Stick to IANA standard while allocating brick
ports. This needs a port or rebase of http://review.gluster.com/#/c/3339/

--

Kaleb





___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel




Re: [Gluster-devel] Borking Gluster

2014-05-15 Thread James
On Fri, May 16, 2014 at 12:06 AM, Krishnan Parthasarathi
 wrote:
> Which version of gluster are you using?

3.5.0 on CentOS 6.5 x86_64.
Sorry I forgot to mention it.
Cheers,
James

>
> thanks,
> Krish
>
> - Original Message -
>> Due to some weird automation things, I noticed the following:
>>
>> Given a cluster of hosts A,B,C,D
>>
>> It turns out that if you restart glusterd on host B while you are
>> running volume create on host A, this can cause host B to be borked.
>> This means: glusterd will refuse to start, and the only fix I found
>> was to delete the volume data from it, and re-create the volume. Not
>> sure if this is useful or not, but reproducing this is pretty easy in
>> case this uncovers a code path that isn't working properly.
>>
>> HTH,
>> James


Re: [Gluster-devel] Borking Gluster

2014-05-15 Thread Krishnan Parthasarathi
Which version of gluster are you using?

thanks,
Krish

- Original Message -
> Due to some weird automation things, I noticed the following:
> 
> Given a cluster of hosts A,B,C,D
> 
> It turns out that if you restart glusterd on host B while you are
> running volume create on host A, this can cause host B to be borked.
> This means: glusterd will refuse to start, and the only fix I found
> was to delete the volume data from it, and re-create the volume. Not
> sure if this is useful or not, but reproducing this is pretty easy in
> case this uncovers a code path that isn't working properly.
> 
> HTH,
> James


[Gluster-devel] Volume failed to create (but did)

2014-05-15 Thread James
Hi,

When automatically building volumes, a volume create failed:

volume create: puppet: failed: Commit failed on
----. Please check log file for
details.

The funny thing was that 'gluster volume info' showed a normal-looking
volume, and starting it worked fine.

Attached are all the logs. Hopefully someone can decipher this, and
maybe kill a gluster bug.

HTH,
James

PS:
The cluster was two hosts, replica 2, a single volume with two disks per
host, all running in VMs.


glusterfs.annex1.tar.gz
Description: GNU Zip compressed data


glusterfs.annex2.tar.gz
Description: GNU Zip compressed data


[Gluster-devel] Borking Gluster

2014-05-15 Thread James
Due to some weird automation things, I noticed the following:

Given a cluster of hosts A,B,C,D

It turns out that if you restart glusterd on host B while you are
running volume create on host A, this can cause host B to be borked.
This means: glusterd will refuse to start, and the only fix I found
was to delete the volume data from it, and re-create the volume. Not
sure if this is useful or not, but reproducing this is pretty easy in
case this uncovers a code path that isn't working properly.
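
A reproduction along these lines might be scripted roughly as follows (purely illustrative: the host names, brick paths, volume name, and the 1-second delay are all made up to show the timing of the two operations):

```shell
#!/bin/sh
# Hypothetical reproduction sketch for the glusterd bork:
# restart glusterd on hostB while hostA runs "volume create".
reproduce_bork() {
    # Kick off the restart on hostB just after the create starts...
    ( sleep 1; ssh hostB 'service glusterd restart' ) &
    # ...while this host (hostA) is busy creating the volume.
    gluster volume create testvol replica 2 \
        hostA:/bricks/b1 hostB:/bricks/b1
    wait    # reap the background restart
}
```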

HTH,
James


[Gluster-devel] Gluster SELinux attributes

2014-05-15 Thread James
Hey Gluster. I realized that things might have changed at some point,
so I was hoping someone could help me get this straight.

Can someone enumerate, or provide a reference to, a list of all the files
that should have specific SELinux attributes in Gluster, and what each of
them should be?

Currently, /var/lib/glusterd/glusterd.info has the seluser set to
'system_u', but something keeps changing it to 'unconfined_u', so
perhaps that is now what is correct. (It didn't used to be.)

What other files need attrs and what should they be?

Thanks,

James


Re: [Gluster-devel] Need inputs for command deprecation output

2014-05-15 Thread Ravishankar N

On 05/16/2014 07:23 AM, Pranith Kumar Karampuri wrote:


- Original Message -

From: "Ravishankar N" 
To: "Pranith Kumar Karampuri" , "Gluster Devel" 

Sent: Friday, May 16, 2014 7:15:58 AM
Subject: Re: [Gluster-devel] Need inputs for command deprecation output

On 05/16/2014 06:25 AM, Pranith Kumar Karampuri wrote:

Hi,
 As part of changing the behaviour of the 'volume heal' commands, I want the
 commands to show the following output. Any feedback on making them
 better would be awesome :-).

root@pranithk-laptop - ~
06:20:10 :) ⚡ gluster volume heal r2 info healed
This command has been deprecated

root@pranithk-laptop - ~
06:20:13 :( ⚡ gluster volume heal r2 info heal-failed
This command has been deprecated

When a command is deprecated, it still works the way it did, but it emits
a warning that it is no longer maintained, along with possible alternatives.
If I understand http://review.gluster.org/#/c/7766/ correctly, we are
not supporting these commands any more, in which case the right message
would be "Command not supported"

I am wondering if we should even let the command be sent to self-heal-daemons 
from glusterd.

How about
06:20:10 :) ⚡ gluster volume heal r2 info healed
Command not supported.

Makes sense; +1

Instead of
06:20:10 :) ⚡ gluster volume heal r2 info healed
brick: brick-1
status: Command not supported

brick: brick-2
status: Command not supported

Pranith

-Ravi

Pranith.


Re: [Gluster-devel] Need inputs for command deprecation output

2014-05-15 Thread Pranith Kumar Karampuri


- Original Message -
> From: "Ravishankar N" 
> To: "Pranith Kumar Karampuri" , "Gluster Devel" 
> 
> Sent: Friday, May 16, 2014 7:15:58 AM
> Subject: Re: [Gluster-devel] Need inputs for command deprecation output
> 
> On 05/16/2014 06:25 AM, Pranith Kumar Karampuri wrote:
> > Hi,
> > As part of changing the behaviour of the 'volume heal' commands, I want the
> > commands to show the following output. Any feedback on making them
> > better would be awesome :-).
> >
> > root@pranithk-laptop - ~
> > 06:20:10 :) ⚡ gluster volume heal r2 info healed
> > This command has been deprecated
> >
> > root@pranithk-laptop - ~
> > 06:20:13 :( ⚡ gluster volume heal r2 info heal-failed
> > This command has been deprecated
> When a command is deprecated, it still works the way it did, but it emits
> a warning that it is no longer maintained, along with possible alternatives.
> If I understand http://review.gluster.org/#/c/7766/ correctly, we are
> not supporting these commands any more, in which case the right message
> would be "Command not supported"

I am wondering if we should even let the command be sent to self-heal-daemons 
from glusterd.

How about
06:20:10 :) ⚡ gluster volume heal r2 info healed
Command not supported.

Instead of 
06:20:10 :) ⚡ gluster volume heal r2 info healed
brick: brick-1
status: Command not supported

brick: brick-2
status: Command not supported

Pranith
> 
> -Ravi
> > Pranith.


Re: [Gluster-devel] Need inputs for command deprecation output

2014-05-15 Thread Ravishankar N

On 05/16/2014 06:25 AM, Pranith Kumar Karampuri wrote:

Hi,
As part of changing the behaviour of the 'volume heal' commands, I want the 
commands to show the following output. Any feedback on making them better would 
be awesome :-).

root@pranithk-laptop - ~
06:20:10 :) ⚡ gluster volume heal r2 info healed
This command has been deprecated

root@pranithk-laptop - ~
06:20:13 :( ⚡ gluster volume heal r2 info heal-failed
This command has been deprecated
When a command is deprecated, it still works the way it did, but it emits 
a warning that it is no longer maintained, along with possible alternatives.
If I understand http://review.gluster.org/#/c/7766/ correctly, we are 
not supporting these commands any more, in which case the right message 
would be "Command not supported"


-Ravi

Pranith.


Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-15 Thread Pranith Kumar Karampuri


- Original Message -
> From: "Anand Avati" 
> To: "Pranith Kumar Karampuri" 
> Cc: "Gluster Devel" 
> Sent: Friday, May 16, 2014 6:30:44 AM
> Subject: Re: [Gluster-devel] Spurious failures because of nfs and snapshots
> 
> On Thu, May 15, 2014 at 5:49 PM, Pranith Kumar Karampuri <
> pkara...@redhat.com> wrote:
> 
> > hi,
> > In the latest build I fired for review.gluster.com/7766 (
> > http://build.gluster.org/job/regression/4443/console) failed because of
> > spurious failure. The script doesn't wait for nfs export to be available. I
> > fixed that, but interestingly I found quite a few scripts with same
> > problem. Some of the scripts are relying on 'sleep 5' which also could lead
> > to spurious failures if the export is not available in 5 seconds. We found
> > that waiting for 20 seconds is better, but 'sleep 20' would unnecessarily
> > delay the build execution. So if you guys are going to write any scripts
> > which has to do nfs mounts, please do it the following way:
> >
> > EXPECT_WITHIN 20 "1" is_nfs_export_available;
> > TEST mount -t nfs -o vers=3 $H0:/$V0 $N0;
> >
> 
> Please also always add `-o soft,intr` in the regression scripts when
> mounting nfs; it becomes much easier to clean up any "hung" mess. We
> probably need an NFS mounting helper function that can be called like:
> 
> TEST mount_nfs $H0:/$V0 $N0;

Will do. There seem to be some extra options (noac, etc.) for some of these, so I 
will add one more argument for any extra nfs mount options.
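
Such a helper might look roughly like this (a sketch, not the actual test-framework code; the defaults follow the soft,intr suggestion, and the third argument carries any extras such as noac):

```shell
# Hypothetical mount_nfs helper for the regression tests.
# Usage: mount_nfs <host>:/<volume> <mountpoint> [extra-options]
mount_nfs() {
    opts="vers=3,soft,intr"            # soft,intr: hung mounts stay cleanable
    [ -n "$3" ] && opts="$opts,$3"     # append extras, e.g. "noac"
    mount -t nfs -o "$opts" "$1" "$2"
}
```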

Pranith
> 
> Thanks
> 
> Avati
> 


Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-15 Thread Anand Avati
On Thu, May 15, 2014 at 5:49 PM, Pranith Kumar Karampuri <
pkara...@redhat.com> wrote:

> hi,
> The latest build I fired for review.gluster.com/7766 (
> http://build.gluster.org/job/regression/4443/console) failed because of a
> spurious failure. The script doesn't wait for the nfs export to be available. I
> fixed that, but interestingly I found quite a few scripts with the same
> problem. Some of the scripts rely on 'sleep 5', which could also lead
> to spurious failures if the export is not available in 5 seconds. We found
> that waiting for 20 seconds is better, but 'sleep 20' would unnecessarily
> delay the build execution. So if you guys are going to write any scripts
> which has to do nfs mounts, please do it the following way:
>
> EXPECT_WITHIN 20 "1" is_nfs_export_available;
> TEST mount -t nfs -o vers=3 $H0:/$V0 $N0;
>

Please also always add `-o soft,intr` in the regression scripts when
mounting nfs; it becomes much easier to clean up any "hung" mess. We
probably need an NFS mounting helper function that can be called like:

TEST mount_nfs $H0:/$V0 $N0;

Thanks

Avati


[Gluster-devel] Need inputs for command deprecation output

2014-05-15 Thread Pranith Kumar Karampuri
Hi,
As part of changing the behaviour of the 'volume heal' commands, I want the commands 
to show the following output. Any feedback on making them better would be 
awesome :-).

root@pranithk-laptop - ~ 
06:20:10 :) ⚡ gluster volume heal r2 info healed
This command has been deprecated

root@pranithk-laptop - ~ 
06:20:13 :( ⚡ gluster volume heal r2 info heal-failed
This command has been deprecated

Pranith.


[Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-15 Thread Pranith Kumar Karampuri
hi,
The latest build I fired for review.gluster.com/7766 
(http://build.gluster.org/job/regression/4443/console) failed because of a 
spurious failure. The script doesn't wait for the nfs export to be available. I 
fixed that, but interestingly I found quite a few scripts with the same problem. 
Some of the scripts rely on 'sleep 5', which could also lead to spurious 
failures if the export is not available in 5 seconds. We found that waiting for 
20 seconds is better, but 'sleep 20' would unnecessarily delay the build 
execution. So if you guys are going to write any scripts which has to do nfs 
mounts, please do it the following way:

EXPECT_WITHIN 20 "1" is_nfs_export_available;
TEST mount -t nfs -o vers=3 $H0:/$V0 $N0;
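
For context, the wait check polled by EXPECT_WITHIN could be as simple as the following sketch (the real helper in the test framework may differ; $H0 and $V0 are the usual host and volume-name variables):

```shell
# Hypothetical is_nfs_export_available: prints "1" once /$V0 appears
# in the NFS export list of host $H0, "0" otherwise, so EXPECT_WITHIN
# can poll it until the export shows up.
is_nfs_export_available() {
    exp=$(showmount -e "$H0" 2>/dev/null | grep -cw "/$V0")
    if [ "$exp" -ge 1 ]; then echo 1; else echo 0; fi
}
```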

Please review http://review.gluster.com/7773 :-)

I saw one more spurious failure, in a snapshot-related script 
(tests/bugs/bug-1090042.t), on the next build fired by Niels.
Joseph (CCed) is debugging it. He agreed to reply with what he finds and share it 
with us so that we won't introduce similar bugs in the future.

I encourage you guys to share what you fix, to prevent spurious failures in 
the future.

Thanks
Pranith


Re: [Gluster-devel] [Gluster-infra] Progress report for regression tests in Rackspace

2014-05-15 Thread Vijay Bellur

On 05/15/2014 09:08 PM, Luis Pabon wrote:

Should we create bugs for each of these, and divide-and-conquer?


That could help. The first level of consolidation done by Justin (with 
test-failure frequencies) might be a good list to start with. If we 
observe more failures as part of ongoing regression runs, let us open 
new bugs and have them cleaned up.


-Vijay



- Luis

On 05/15/2014 10:27 AM, Niels de Vos wrote:

On Thu, May 15, 2014 at 06:05:00PM +0530, Vijay Bellur wrote:

On 04/30/2014 07:03 PM, Justin Clift wrote:

Hi us,

Was trying out the GlusterFS regression tests in Rackspace VMs last
night for each of the release-3.4, release-3.5, and master branches.

The regression test is just a run of "run-tests.sh", from a git
checkout of the appropriate branch.

The good news is we're adding a lot of testing code with each release:

  * release-3.4 -  6303 lines  (~30 mins to run test)
  * release-3.5 -  9776 lines  (~85 mins to run test)
  * master  - 11660 lines  (~90 mins to run test)

(lines counted using:
  $ find tests -type f -iname "*.t" -exec cat {} >> a \;; wc -l a;
rm -f a)

The bad news is the tests only "kind of" pass now.  I say kind of
because
although the regression run *can* pass for each of these branches, it's
inconsistent. :(

Results from testing overnight:

  * release-3.4 - 20 runs - 17 PASS, 3 FAIL. 85% success.
* bug-857330/normal.t failed in one run
* bug-887098-gmount-crash.t failed in one run
* bug-857330/normal.t failed in one run

  * release-3.5 - 20 runs, 18 PASS, 2 FAIL. 90% success.
* bug-857330/xml.t failed in one run
* bug-1004744.t failed in another run (same vm for both failures)

  * master - 20 runs, 6 PASS, 14 FAIL. 30% success.
* bug-1070734.t failed in one run
* bug-1087198.t & bug-860663.t failed in one run (same vm as
bug-1070734.t failure above)
* bug-1087198.t & bug-857330/normal.t failed in one run (new vm,
a subsequent run on same vm passed)
* bug-1087198.t & bug-948686.t failed in one run (new vm)
* bug-1070734.t & bug-1087198.t failed in one run (new vm)
* bug-860663.t failed in one run
* bug-1023974.t & bug-1087198.t & bug-948686.t failed in one run
(new vm)
* bug-1004744.t & bug-1023974.t & bug-1087198.t & bug-948686.t
failed in one run (new vm)
* bug-948686.t failed in one run (new vm)
* bug-1070734.t failed in one run (new vm)
* bug-1023974.t failed in one run (new vm)
* bug-1087198.t & bug-948686.t failed in one run (new vm)
* bug-1070734.t failed in one run (new vm)
* bug-1087198.t failed in one run (new vm)

The occasional failing tests aren't completely random, suggesting
something is going on.  Possible race conditions maybe? (no idea).

  * 8 failures - bug-1087198.t
  * 5 failures - bug-948686.t
  * 4 failures - bug-1070734.t
  * 3 failures - bug-1023974.t
  * 3 failures - bug-857330/normal.t
  * 2 failures - bug-860663.t
  * 2 failures - bug-1004744.t
  * 1 failures - bug-857330/xml.t
  * 1 failures - bug-887098-gmount-crash.t

Anyone have suggestions on how to make this work reliably?



I think it would be a good idea to arrive at a list of test cases that
are failing at random and assign owners to address them (default owner
being the submitter of the test case). In addition to these, I have
also seen tests like bd.t and xml.t fail pretty regularly.

Justin - can we publish a consolidated list of regression tests that
fail and owners for them on an etherpad or similar?

Fixing these test cases will enable us to bring in more jenkins
instances for parallel regression runs etc. and will also provide more
determinism for our regression tests. Your help to address the
regression test suite problems will be greatly appreciated!

Indeed, getting the regression tests stable seems like a blocker before
we can move to a scalable Jenkins solution. Unfortunately, it may not be
trivial to debug these test cases... Any suggestion on capturing useful
data that helps in figuring out why the test cases don't pass?

Thanks,
Niels


Re: [Gluster-devel] [Gluster-infra] Progress report for regression tests in Rackspace

2014-05-15 Thread Vijay Bellur

On 05/15/2014 07:57 PM, Niels de Vos wrote:

On Thu, May 15, 2014 at 06:05:00PM +0530, Vijay Bellur wrote:

On 04/30/2014 07:03 PM, Justin Clift wrote:

Hi us,

Was trying out the GlusterFS regression tests in Rackspace VMs last
night for each of the release-3.4, release-3.5, and master branches.

The regression test is just a run of "run-tests.sh", from a git
checkout of the appropriate branch.

The good news is we're adding a lot of testing code with each release:

  * release-3.4 -  6303 lines  (~30 mins to run test)
  * release-3.5 -  9776 lines  (~85 mins to run test)
  * master  - 11660 lines  (~90 mins to run test)

(lines counted using:
  $ find tests -type f -iname "*.t" -exec cat {} >> a \;; wc -l a; rm -f a)

The bad news is the tests only "kind of" pass now.  I say kind of because
although the regression run *can* pass for each of these branches, it's
inconsistent. :(

Results from testing overnight:

  * release-3.4 - 20 runs - 17 PASS, 3 FAIL. 85% success.
* bug-857330/normal.t failed in one run
* bug-887098-gmount-crash.t failed in one run
* bug-857330/normal.t failed in one run

  * release-3.5 - 20 runs, 18 PASS, 2 FAIL. 90% success.
* bug-857330/xml.t failed in one run
* bug-1004744.t failed in another run (same vm for both failures)

  * master - 20 runs, 6 PASS, 14 FAIL. 30% success.
* bug-1070734.t failed in one run
* bug-1087198.t & bug-860663.t failed in one run (same vm as bug-1070734.t 
failure above)
* bug-1087198.t & bug-857330/normal.t failed in one run (new vm, a 
subsequent run on same vm passed)
* bug-1087198.t & bug-948686.t failed in one run (new vm)
* bug-1070734.t & bug-1087198.t failed in one run (new vm)
* bug-860663.t failed in one run
* bug-1023974.t & bug-1087198.t & bug-948686.t failed in one run (new vm)
* bug-1004744.t & bug-1023974.t & bug-1087198.t & bug-948686.t failed in 
one run (new vm)
* bug-948686.t failed in one run (new vm)
* bug-1070734.t failed in one run (new vm)
* bug-1023974.t failed in one run (new vm)
* bug-1087198.t & bug-948686.t failed in one run (new vm)
* bug-1070734.t failed in one run (new vm)
* bug-1087198.t failed in one run (new vm)

The occasional failing tests aren't completely random, suggesting
something is going on.  Possible race conditions maybe? (no idea).

  * 8 failures - bug-1087198.t
  * 5 failures - bug-948686.t
  * 4 failures - bug-1070734.t
  * 3 failures - bug-1023974.t
  * 3 failures - bug-857330/normal.t
  * 2 failures - bug-860663.t
  * 2 failures - bug-1004744.t
  * 1 failures - bug-857330/xml.t
  * 1 failures - bug-887098-gmount-crash.t

Anyone have suggestions on how to make this work reliably?




I think it would be a good idea to arrive at a list of test cases that
are failing at random and assign owners to address them (default owner
being the submitter of the test case). In addition to these, I have
also seen tests like bd.t and xml.t fail pretty regularly.

Justin - can we publish a consolidated list of regression tests that
fail and owners for them on an etherpad or similar?

Fixing these test cases will enable us to bring in more jenkins
instances for parallel regression runs etc. and will also provide more
determinism for our regression tests. Your help to address the
regression test suite problems will be greatly appreciated!


Indeed, getting the regression tests stable seems like a blocker before
we can move to a scalable Jenkins solution. Unfortunately, it may not be
trivial to debug these test cases... Any suggestion on capturing useful
data that helps in figuring out why the test cases don't pass?



To start with, obtaining the logs and cores from a failed regression run 
(/d/logs/...) on build.gluster.org would be useful. Once we start 
debugging a few problems and notice the need for more information, 
we can start collecting it for failed regression runs.


-Vijay


Re: [Gluster-devel] [Gluster-infra] Progress report for regression tests in Rackspace

2014-05-15 Thread Luis Pabon

Should we create bugs for each of these, and divide-and-conquer?

- Luis

On 05/15/2014 10:27 AM, Niels de Vos wrote:

On Thu, May 15, 2014 at 06:05:00PM +0530, Vijay Bellur wrote:

On 04/30/2014 07:03 PM, Justin Clift wrote:

Hi us,

Was trying out the GlusterFS regression tests in Rackspace VMs last
night for each of the release-3.4, release-3.5, and master branches.

The regression test is just a run of "run-tests.sh", from a git
checkout of the appropriate branch.

The good news is we're adding a lot of testing code with each release:

  * release-3.4 -  6303 lines  (~30 mins to run test)
  * release-3.5 -  9776 lines  (~85 mins to run test)
  * master  - 11660 lines  (~90 mins to run test)

(lines counted using:
  $ find tests -type f -iname "*.t" -exec cat {} >> a \;; wc -l a; rm -f a)

The bad news is the tests only "kind of" pass now.  I say kind of because
although the regression run *can* pass for each of these branches, it's
inconsistent. :(

Results from testing overnight:

  * release-3.4 - 20 runs - 17 PASS, 3 FAIL. 85% success.
* bug-857330/normal.t failed in one run
* bug-887098-gmount-crash.t failed in one run
* bug-857330/normal.t failed in one run

  * release-3.5 - 20 runs, 18 PASS, 2 FAIL. 90% success.
* bug-857330/xml.t failed in one run
* bug-1004744.t failed in another run (same vm for both failures)

  * master - 20 runs, 6 PASS, 14 FAIL. 30% success.
* bug-1070734.t failed in one run
* bug-1087198.t & bug-860663.t failed in one run (same vm as bug-1070734.t 
failure above)
* bug-1087198.t & bug-857330/normal.t failed in one run (new vm, a 
subsequent run on same vm passed)
* bug-1087198.t & bug-948686.t failed in one run (new vm)
* bug-1070734.t & bug-1087198.t failed in one run (new vm)
* bug-860663.t failed in one run
* bug-1023974.t & bug-1087198.t & bug-948686.t failed in one run (new vm)
* bug-1004744.t & bug-1023974.t & bug-1087198.t & bug-948686.t failed in 
one run (new vm)
* bug-948686.t failed in one run (new vm)
* bug-1070734.t failed in one run (new vm)
* bug-1023974.t failed in one run (new vm)
* bug-1087198.t & bug-948686.t failed in one run (new vm)
* bug-1070734.t failed in one run (new vm)
* bug-1087198.t failed in one run (new vm)

The occasional failing tests aren't completely random, suggesting
something is going on.  Possible race conditions maybe? (no idea).

  * 8 failures - bug-1087198.t
  * 5 failures - bug-948686.t
  * 4 failures - bug-1070734.t
  * 3 failures - bug-1023974.t
  * 3 failures - bug-857330/normal.t
  * 2 failures - bug-860663.t
  * 2 failures - bug-1004744.t
  * 1 failures - bug-857330/xml.t
  * 1 failures - bug-887098-gmount-crash.t

Anyone have suggestions on how to make this work reliably?



I think it would be a good idea to arrive at a list of test cases that
are failing at random and assign owners to address them (default owner
being the submitter of the test case). In addition to these, I have
also seen tests like bd.t and xml.t fail pretty regularly.

Justin - can we publish a consolidated list of regression tests that
fail and owners for them on an etherpad or similar?

Fixing these test cases will enable us to bring in more jenkins
instances for parallel regression runs etc. and will also provide more
determinism for our regression tests. Your help to address the
regression test suite problems will be greatly appreciated!

Indeed, getting the regression tests stable seems like a blocker before
we can move to a scalable Jenkins solution. Unfortunately, it may not be
trivial to debug these test cases... Any suggestion on capturing useful
data that helps in figuring out why the test cases don't pass?

Thanks,
Niels


Re: [Gluster-devel] [Gluster-infra] Progress report for regression tests in Rackspace

2014-05-15 Thread Niels de Vos
On Thu, May 15, 2014 at 06:05:00PM +0530, Vijay Bellur wrote:
> On 04/30/2014 07:03 PM, Justin Clift wrote:
> >Hi us,
> >
> >Was trying out the GlusterFS regression tests in Rackspace VMs last
> >night for each of the release-3.4, release-3.5, and master branches.
> >
> >The regression test is just a run of "run-tests.sh", from a git
> >checkout of the appropriate branch.
> >
> >The good news is we're adding a lot of testing code with each release:
> >
> >  * release-3.4 -  6303 lines  (~30 mins to run test)
> >  * release-3.5 -  9776 lines  (~85 mins to run test)
> >  * master  - 11660 lines  (~90 mins to run test)
> >
> >(lines counted using:
> >  $ find tests -type f -iname "*.t" -exec cat {} >> a \;; wc -l a; rm -f a)
> >
> >The bad news is the tests only "kind of" pass now.  I say kind of because
> >although the regression run *can* pass for each of these branches, it's
> >inconsistent. :(
> >
> >Results from testing overnight:
> >
> >  * release-3.4 - 20 runs - 17 PASS, 3 FAIL. 85% success.
> >* bug-857330/normal.t failed in one run
> >* bug-887098-gmount-crash.t failed in one run
> >* bug-857330/normal.t failed in one run
> >
> >  * release-3.5 - 20 runs, 18 PASS, 2 FAIL. 90% success.
> >* bug-857330/xml.t failed in one run
> >* bug-1004744.t failed in another run (same vm for both failures)
> >
> >  * master - 20 runs, 6 PASS, 14 FAIL. 30% success.
> >* bug-1070734.t failed in one run
> >* bug-1087198.t & bug-860663.t failed in one run (same vm as 
> > bug-1070734.t failure above)
> >* bug-1087198.t & bug-857330/normal.t failed in one run (new vm, a 
> > subsequent run on same vm passed)
> >* bug-1087198.t & bug-948686.t failed in one run (new vm)
> >* bug-1070734.t & bug-1087198.t failed in one run (new vm)
> >* bug-860663.t failed in one run
> >* bug-1023974.t & bug-1087198.t & bug-948686.t failed in one run (new vm)
> >* bug-1004744.t & bug-1023974.t & bug-1087198.t & bug-948686.t failed in 
> > one run (new vm)
> >* bug-948686.t failed in one run (new vm)
> >* bug-1070734.t failed in one run (new vm)
> >* bug-1023974.t failed in one run (new vm)
> >* bug-1087198.t & bug-948686.t failed in one run (new vm)
> >* bug-1070734.t failed in one run (new vm)
> >* bug-1087198.t failed in one run (new vm)
> >
> >The occasional failing tests aren't completely random, suggesting
> >something is going on.  Possible race conditions maybe? (no idea).
> >
> >  * 8 failures - bug-1087198.t
> >  * 5 failures - bug-948686.t
> >  * 4 failures - bug-1070734.t
> >  * 3 failures - bug-1023974.t
> >  * 3 failures - bug-857330/normal.t
> >  * 2 failures - bug-860663.t
> >  * 2 failures - bug-1004744.t
> >  * 1 failures - bug-857330/xml.t
> >  * 1 failures - bug-887098-gmount-crash.t
> >
> >Anyone have suggestions on how to make this work reliably?
> 
> 
> 
> I think it would be a good idea to arrive at a list of test cases that
> are failing at random and assign owners to address them (default owner
> being the submitter of the test case). In addition to these, I have
> also seen tests like bd.t and xml.t fail pretty regularly.
> 
> Justin - can we publish a consolidated list of regression tests that
> fail and owners for them on an etherpad or similar?
> 
> Fixing these test cases will enable us to bring in more jenkins
> instances for parallel regression runs etc. and will also provide more
> determinism for our regression tests. Your help to address the
> regression test suite problems will be greatly appreciated!

Indeed, getting the regression tests stable seems like a blocker before 
we can move to a scalable Jenkins solution. Unfortunately, it may not be 
trivial to debug these test cases... Any suggestion on capturing useful 
data that helps in figuring out why the test cases don't pass?

Thanks,
Niels
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] portability

2014-05-15 Thread Emmanuel Dreyfus
Hi

I have not built master for a while, and now find GNU-specific extensions
that are not portable. Since this is not the first time I have addressed them,
I would like to send a reminder:

1) bash-specific syntax
Do not write:
test $foo == "bar" 
But instead write:
test $foo = "bar" 

The = operator is POSIX-compliant. The == operator only works in bash and ksh.
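A minimal runnable illustration of the portable form (quoting the variable is also worthwhile, since an unquoted empty $foo would make test fail with a syntax error):

```shell
#!/bin/sh
foo="bar"
# POSIX-compliant string comparison: works in dash, BSD sh, bash and ksh.
# Quoting "$foo" keeps the test valid even when the variable is empty.
if test "$foo" = "bar"; then
    echo "match"
fi
```

This prints `match` under any POSIX shell.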

2) GNU sed specific flag
Do not write:
sed -i 's/foo/bar/' buz
But instead write:
sed  's/foo/bar/' buz > buz.new && mv buz.new buz

The -i flag is a GNU extension that is not implemented in BSD sed.
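The temp-file dance can be wrapped in a small helper; here is a sketch (the name sed_inplace is only an illustration, not an existing script in the tree):

```shell
#!/bin/sh
# Portable stand-in for GNU "sed -i": write to a temporary file first,
# then move it over the original.  Works with both GNU and BSD sed.
sed_inplace() {
    script="$1"
    file="$2"
    sed "$script" "$file" > "$file.new" && mv "$file.new" "$file"
}

# Demo on a scratch file:
printf 'foo baz\n' > /tmp/demo.$$
sed_inplace 's/foo/bar/' /tmp/demo.$$
cat /tmp/demo.$$        # prints: bar baz
rm -f /tmp/demo.$$
```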

3) GNU make specific syntax
Do not write:
foo.c:  foo.x foo.h
	${RPCGEN} $<
But instead write:
foo.c:  foo.x foo.h
	${RPCGEN} foo.x
Or even:
foo.c:  foo.x foo.h
	${RPCGEN} ${@:.c=.x}

$< does not work everywhere in non-GNU make. If I understand the autoconf
documentation correctly, it is only portable inside inference rules (.c.o:).


-- 
Emmanuel Dreyfus
m...@netbsd.org


Re: [Gluster-devel] cluster/ec: Added the erasure code xlator

2014-05-15 Thread Xavier Hernandez
The cli changes for disperse volumes are ready for review:

http://review.gluster.org/7782/

Xavi

On Thursday 15 May 2014 10:11:49 Xavier Hernandez wrote:
> Hi Kaushal,
> 
> On Tuesday 13 May 2014 20:22:06 Kaushal M wrote:
> > The syntax looks good. If you need help with the cli and glusterd changes,
> > I'll be happy to help.
> 
> Thanks. It will be really appreciated.
> 
> I think I have a working modification. Not sure if there's something else that
> has to be modified. I'll push it for review very soon and add you as a
> reviewer.
> 
> I also decided to change the volume option 'size' to 'redundancy', and its
> format. It seems more intuitive now. The 'size' option had the format 'N:R',
> where N was the total number of subvolumes and R the redundancy. Since the
> number of subvolumes can be directly calculated from the 'subvolumes'
> keyword, only the redundancy is really needed.
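As background for the quote above: with a disperse count of N and redundancy R, each subvolume survives up to R simultaneous brick failures and offers the capacity of N-R bricks. A quick sh sketch of the arithmetic (illustrative only, not a Gluster tool):

```shell
#!/bin/sh
# Capacity of one dispersed subvolume: data occupies (disperse - redundancy)
# bricks; the remaining bricks hold the redundancy fragments.
usable_capacity() {
    brick_gb="$1"; disperse="$2"; redundancy="$3"
    echo $(( brick_gb * (disperse - redundancy) ))
}

usable_capacity 100 3 1   # 3 x 100GB bricks, redundancy 1 -> prints: 200
usable_capacity 100 6 2   # 6 x 100GB bricks, redundancy 2 -> prints: 400
```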
> 
> Xavi
> 
> > On Tue, May 13, 2014 at 8:08 PM, Xavier Hernandez
> 
> wrote:
> > > I'm trying to modify the cli to allow the creation of dispersed volumes.
> > > 
> > > Current syntax for volume creation is like this:
> > > volume create  [stripe ] \
> > > 
> > > [replica ] \
> > > [transport ] \
> > > ?... \
> > > [force]
> > > 
> > > I propose to use this modified syntax:
> > > volume create  [stripe ] \
> > > 
> > > [replica ] \
> > > [disperse ] \
> > > [redundancy ] \
> > > [transport ] \
> > > ?... \
> > > [force]
> > > 
> > > If 'disperse' is specified and 'redundancy' is not, 1 is assumed for
> > > redundancy.
> > > 
> > > If 'redundancy' is specified and 'disperse' is not, disperse count is
> > > taken
> > > from the number of bricks.
> > > 
> > > If 'disperse' is specified and the number of bricks is greater than (and a
> > > multiple of) the indicated number, a distributed-dispersed volume is
> > > created.
> > > 
> > > 'disperse' and 'redundancy' cannot be combined with 'stripe' or
> > > 'replica'.
> > > 
> > > Would this syntax be ok ?
> > > 
> > > Xavi
> > > 
> > > On Tuesday 13 May 2014 12:29:34 Xavier Hernandez wrote:
> > > > I forgot to say that performance is not good; however, there are some
> > > > optimizations not yet incorporated that may improve it. They will be
> > > 
> > > added
> > > 
> > > > in following patches.
> > > > 
> > > > Xavi
> > > > 
> > > > On Tuesday 13 May 2014 12:23:15 Xavier Hernandez wrote:
> > > > > Hello,
> > > > > 
> > > > > I've just added the cluster/ec translator for review [1].
> > > > > 
> > > > > It's a rewrite that does not use any additional translator or
> > > > > library.
> > > > > It's
> > > > > still a work in progress with some bugs to solve, but its
> > > > > architecture
> > > > > should be stable. The main missing feature is self-heal, that will
> > > > > be
> > > > > added
> > > > > once the main code is stabilized and reviewed.
> > > > > 
> > > > > Feel free to review it and send any comment you think appropriate.
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > Xavi
> > > > > 
> > > > > [1] http://review.gluster.org/7749


Re: [Gluster-devel] [Gluster-infra] Progress report for regression tests in Rackspace

2014-05-15 Thread Vijay Bellur

On 04/30/2014 07:03 PM, Justin Clift wrote:

Hi all,

Was trying out the GlusterFS regression tests in Rackspace VMs last
night for each of the release-3.4, release-3.5, and master branches.

The regression test is just a run of "run-tests.sh", from a git
checkout of the appropriate branch.

The good news is we're adding a lot of testing code with each release:

  * release-3.4 -  6303 lines  (~30 mins to run test)
  * release-3.5 -  9776 lines  (~85 mins to run test)
  * master  - 11660 lines  (~90 mins to run test)

(lines counted using:
  $ find tests -type f -iname "*.t" -exec cat {} >> a \;; wc -l a; rm -f a)

The bad news is the tests only "kind of" pass now.  I say kind of because
although the regression run *can* pass for each of these branches, it's
inconsistent. :(

Results from testing overnight:

  * release-3.4 - 20 runs - 17 PASS, 3 FAIL. 85% success.
* bug-857330/normal.t failed in one run
* bug-887098-gmount-crash.t failed in one run
* bug-857330/normal.t failed in one run

  * release-3.5 - 20 runs, 18 PASS, 2 FAIL. 90% success.
* bug-857330/xml.t failed in one run
* bug-1004744.t failed in another run (same vm for both failures)

  * master - 20 runs, 6 PASS, 14 FAIL. 30% success.
* bug-1070734.t failed in one run
* bug-1087198.t & bug-860663.t failed in one run (same vm as bug-1070734.t 
failure above)
* bug-1087198.t & bug-857330/normal.t failed in one run (new vm, a 
subsequent run on same vm passed)
* bug-1087198.t & bug-948686.t failed in one run (new vm)
* bug-1070734.t & bug-1087198.t failed in one run (new vm)
* bug-860663.t failed in one run
* bug-1023974.t & bug-1087198.t & bug-948686.t failed in one run (new vm)
* bug-1004744.t & bug-1023974.t & bug-1087198.t & bug-948686.t failed in 
one run (new vm)
* bug-948686.t failed in one run (new vm)
* bug-1070734.t failed in one run (new vm)
* bug-1023974.t failed in one run (new vm)
* bug-1087198.t & bug-948686.t failed in one run (new vm)
* bug-1070734.t failed in one run (new vm)
* bug-1087198.t failed in one run (new vm)

The occasionally failing tests aren't completely random, which suggests
something systematic is going on.  Possible race conditions, maybe?

  * 8 failures - bug-1087198.t
  * 5 failures - bug-948686.t
  * 4 failures - bug-1070734.t
  * 3 failures - bug-1023974.t
  * 3 failures - bug-857330/normal.t
  * 2 failures - bug-860663.t
  * 2 failures - bug-1004744.t
  * 1 failure  - bug-857330/xml.t
  * 1 failure  - bug-887098-gmount-crash.t

Anyone have suggestions on how to make this work reliably?
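One way to quantify the flakiness described above is to re-run a single test in a loop and count failures. A rough sketch; the prove(1) invocation for an individual .t file is an assumption, so adjust it to however your checkout runs single tests:

```shell
#!/bin/sh
# Run a command N times and report how often it fails -- handy for
# estimating the failure rate of one regression test.
flake_rate() {
    n="$1"; shift
    fail=0 i=0
    while [ "$i" -lt "$n" ]; do
        "$@" > /dev/null 2>&1 || fail=$((fail + 1))
        i=$((i + 1))
    done
    echo "$fail/$n runs failed"
}

# Hypothetical usage against one of the flaky tests:
#   flake_rate 20 prove tests/bugs/bug-1087198.t
flake_rate 5 true         # prints: 0/5 runs failed
```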




I think it would be a good idea to arrive at a list of test cases that
are failing at random and assign owners to address them (default owner
being the submitter of the test case). In addition to these, I have also 
seen tests like bd.t and xml.t fail pretty regularly.


Justin - can we publish a consolidated list of regression tests that
fail and owners for them on an etherpad or similar?

Fixing these test cases will enable us to bring in more jenkins
instances for parallel regression runs etc. and will also provide more
determinism for our regression tests. Your help to address the
regression test suite problems will be greatly appreciated!

Thanks,
Vijay





Re: [Gluster-devel] cluster/ec: Added the erasure code xlator

2014-05-15 Thread Xavier Hernandez
Hi Kaushal,

On Tuesday 13 May 2014 20:22:06 Kaushal M wrote:
> The syntax looks good. If you need help with the cli and glusterd changes,
> I'll be happy to help.
Thanks. It will be really appreciated.

I think I have a working modification. Not sure if there's something else that 
has to be modified. I'll push it for review very soon and add you as a 
reviewer.

I also decided to change the volume option 'size' to 'redundancy', and its 
format. It seems more intuitive now. The 'size' option had the format 'N:R', 
where N was the total number of subvolumes and R the redundancy. Since the 
number of subvolumes can be directly calculated from the 'subvolumes' keyword, 
only the redundancy is really needed.

Xavi

> 
> On Tue, May 13, 2014 at 8:08 PM, Xavier Hernandez 
wrote:
> > I'm trying to modify the cli to allow the creation of dispersed volumes.
> > 
> > Current syntax for volume creation is like this:
> > volume create  [stripe ] \
> > 
> > [replica ] \
> > [transport ] \
> > ?... \
> > [force]
> > 
> > I propose to use this modified syntax:
> > volume create  [stripe ] \
> > 
> > [replica ] \
> > [disperse ] \
> > [redundancy ] \
> > [transport ] \
> > ?... \
> > [force]
> > 
> > If 'disperse' is specified and 'redundancy' is not, 1 is assumed for
> > redundancy.
> > 
> > If 'redundancy' is specified and 'disperse' is not, disperse count is
> > taken
> > from the number of bricks.
> > 
> > If 'disperse' is specified and the number of bricks is greater than (and a
> > multiple of) the indicated number, a distributed-dispersed volume is
> > created.
> > 
> > 'disperse' and 'redundancy' cannot be combined with 'stripe' or 'replica'.
> > 
> > Would this syntax be ok ?
> > 
> > Xavi
> > 
> > On Tuesday 13 May 2014 12:29:34 Xavier Hernandez wrote:
> > > I forgot to say that performance is not good; however, there are some
> > > optimizations not yet incorporated that may improve it. They will be
> > 
> > added
> > 
> > > in following patches.
> > > 
> > > Xavi
> > > 
> > > On Tuesday 13 May 2014 12:23:15 Xavier Hernandez wrote:
> > > > Hello,
> > > > 
> > > > I've just added the cluster/ec translator for review [1].
> > > > 
> > > > It's a rewrite that does not use any additional translator or library.
> > > > It's
> > > > still a work in progress with some bugs to solve, but its architecture
> > > > should be stable. The main missing feature is self-heal, that will be
> > > > added
> > > > once the main code is stabilized and reviewed.
> > > > 
> > > > Feel free to review it and send any comment you think appropriate.
> > > > 
> > > > Thanks,
> > > > 
> > > > Xavi
> > > > 
> > > > [1] http://review.gluster.org/7749