Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-22 Thread Vijaikumar M

I have posted a patch that fixes this issue:
http://review.gluster.org/#/c/7842/

Thanks,
Vijay


On Thursday 22 May 2014 11:35 AM, Vijay Bellur wrote:

On 05/21/2014 08:50 PM, Vijaikumar M wrote:

KP, Atin and myself did some debugging and found that there was a
deadlock in glusterd.

When creating a volume snapshot, the back-end operation 'taking a
lvm_snapshot and starting brick' for the each brick
are executed in parallel using synctask framework.

brick_start was releasing a big_lock with brick_connect and does a lock
again.
This caused a deadlock in some race condition where main-thread waiting
for one of the synctask thread to finish and
synctask-thread waiting for the big_lock.


We are working on fixing this issue.



If this fix is going to take more time, can we please log a bug to 
track this problem and remove the test cases that need to be addressed 
from the test unit? This way other valid patches will not be blocked 
by the failure of the snapshot test unit.


We can introduce these tests again as part of the fix for the problem.

-Vijay



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-21 Thread Vijay Bellur

On 05/21/2014 08:50 PM, Vijaikumar M wrote:

KP, Atin and myself did some debugging and found that there was a
deadlock in glusterd.

When creating a volume snapshot, the back-end operation 'taking a
lvm_snapshot and starting brick' for the each brick
are executed in parallel using synctask framework.

brick_start was releasing a big_lock with brick_connect and does a lock
again.
This caused a deadlock in some race condition where main-thread waiting
for one of the synctask thread to finish and
synctask-thread waiting for the big_lock.


We are working on fixing this issue.



If this fix is going to take more time, can we please log a bug to track 
this problem and remove the test cases that need to be addressed from 
the test unit? This way other valid patches will not be blocked by the 
failure of the snapshot test unit.


We can introduce these tests again as part of the fix for the problem.

-Vijay

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-21 Thread Vijaikumar M
KP, Atin and myself did some debugging and found that there was a 
deadlock in glusterd.


When creating a volume snapshot, the back-end operation 'taking a 
lvm_snapshot and starting brick' for the each brick

are executed in parallel using synctask framework.

brick_start was releasing a big_lock with brick_connect and does a lock 
again.
This caused a deadlock in some race condition where main-thread waiting 
for one of the synctask thread to finish and

synctask-thread waiting for the big_lock.


We are working on fixing this issue.

Thanks,
Vijay


On Wednesday 21 May 2014 12:23 PM, Vijaikumar M wrote:
From the log: 
http://build.gluster.org:443/logs/glusterfs-logs-20140520%3a17%3a10%3a51.tgzit 
looks like glusterd was hung:


*Glusterd log:**
* 5305 [2014-05-20 20:08:55.040665] E 
[glusterd-snapshot.c:3805:glusterd_add_brick_to_snap_volume] 
0-management: Unable to fetch snap device (vol1.brick_snapdevice0). 
Leaving empty
 5306 [2014-05-20 20:08:55.649146] I 
[rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting 
frame-timeout to 600
 5307 [2014-05-20 20:08:55.663181] I 
[rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting 
frame-timeout to 600
 5308 [2014-05-20 20:16:55.541197] W 
[glusterfsd.c:1182:cleanup_and_exit] (--> 0-: received signum (15), 
shutting down


Glusterd was hung when executing the testcase ./tests/bugs/bug-1090042.t.

*Cli log:**
*72649 [2014-05-20 20:12:51.960765] T 
[rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
 72650 [2014-05-20 20:12:51.960850] T [socket.c:2689:socket_connect] 
(-->/build/install/lib/libglusterfs.so.0(gf_timer_proc+0x1a2) 
[0x7ff8b6609994] 
(-->/build/install/lib/libgfrpc.so.0(rpc_clnt_reconnect+0x137) 
[0x7ff8b5d3305b] (- 
->/build/install/lib/libgfrpc.so.0(rpc_transport_connect+0x74) 
[0x7ff8b5d30071]))) 0-glusterfs: connect () called on transport 
already connected
 72651 [2014-05-20 20:12:52.960943] T 
[rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
 72652 [2014-05-20 20:12:52.960999] T [socket.c:2697:socket_connect] 
0-glusterfs: connecting 0x1e0fcc0, state=0 gen=0 sock=-1
 72653 [2014-05-20 20:12:52.961038] W [dict.c:1059:data_to_str] 
(-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(+0xb5f3) 
[0x7ff8ad9e95f3] 
(-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(socket_clien 
t_get_remote_sockaddr+0x10a) [0x7ff8ad9ed568] 
(-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(client_fill_address_family+0xf1) 
[0x7ff8ad9ec7d0]))) 0-dict: data is NULL
 72654 [2014-05-20 20:12:52.961070] W [dict.c:1059:data_to_str] 
(-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(+0xb5f3) 
[0x7ff8ad9e95f3] 
(-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(socket_clien 
t_get_remote_sockaddr+0x10a) [0x7ff8ad9ed568] 
(-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(client_fill_address_family+0x100) 
[0x7ff8ad9ec7df]))) 0-dict: data is NULL
 72655 [2014-05-20 20:12:52.961079] E 
[name.c:140:client_fill_address_family] 0-glusterfs: 
transport.address-family not specified. Could not guess default value 
from (remote-host:(null) or transport.unix.connect-path:(null)) 
optio   ns
 72656 [2014-05-20 20:12:54.961273] T 
[rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
 72657 [2014-05-20 20:12:54.961404] T [socket.c:2689:socket_connect] 
(-->/build/install/lib/libglusterfs.so.0(gf_timer_proc+0x1a2) 
[0x7ff8b6609994] 
(-->/build/install/lib/libgfrpc.so.0(rpc_clnt_reconnect+0x137) 
[0x7ff8b5d3305b] (- 
->/build/install/lib/libgfrpc.so.0(rpc_transport_connect+0x74) 
[0x7ff8b5d30071]))) 0-glusterfs: connect () called on transport 
already connected
 72658 [2014-05-20 20:12:55.120645] D [cli-cmd.c:384:cli_cmd_submit] 
0-cli: Returning 110
 72659 [2014-05-20 20:12:55.120723] D 
[cli-rpc-ops.c:8716:gf_cli_snapshot] 0-cli: Returning 110



Now we need to find why glusterd was hung.


Thanks,
Vijay



On Wednesday 21 May 2014 06:46 AM, Pranith Kumar Karampuri wrote:

Hey,
 Seems like even after this fix is merged, the regression tests are failing 
for the same script. You can check the logs 
athttp://build.gluster.org:443/logs/glusterfs-logs-20140520%3a14%3a06%3a46.tgz

Relevant logs:
[2014-05-20 20:17:07.026045]  : volume create patchy 
build.gluster.org:/d/backends/patchy1 build.gluster.org:/d/backends/patchy2 : 
SUCCESS
[2014-05-20 20:17:08.030673]  : volume start patchy : SUCCESS
[2014-05-20 20:17:08.279148]  : volume barrier patchy enable : SUCCESS
[2014-05-20 20:17:08.476785]  : volume barrier patchy enable : FAILED : Failed 
to reconfigure barrier.
[2014-05-20 20:17:08.727429]  : volume barrier patchy disable : SUCCESS
[2014-05-20 20:17:08.926995]  : volume barrier patchy disable : FAILED : Failed 
to reconfigure barrier.

Pranith

- Original Message -

From: "Pranith Kumar Karampuri"
To: "Gluster Devel"
Cc: "Joseph Fernandes", "Vijaikumar M"
Sent: Tuesday, May 20, 2014 3:41:11 PM
Subject: Re: Spuriou

Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-21 Thread Pranith Kumar Karampuri


- Original Message -
> From: "Atin Mukherjee" 
> To: gluster-devel@gluster.org, "Pranith Kumar Karampuri" 
> Sent: Wednesday, May 21, 2014 3:39:21 PM
> Subject: Re: Fwd: Re: [Gluster-devel] Spurious failures because of nfs and 
> snapshots
> 
> 
> 
> On 05/21/2014 11:42 AM, Atin Mukherjee wrote:
> > 
> > 
> > On 05/21/2014 10:54 AM, SATHEESARAN wrote:
> >> Guys,
> >>
> >> This is the issue pointed out by Pranith with regard to Barrier.
> >> I was reading through it.
> >>
> >> But I wanted to bring it to concern
> >>
> >> -- S
> >>
> >>
> >>  Original Message 
> >> Subject:   Re: [Gluster-devel] Spurious failures because of nfs and
> >> snapshots
> >> Date:  Tue, 20 May 2014 21:16:57 -0400 (EDT)
> >> From:  Pranith Kumar Karampuri 
> >> To:Vijaikumar M , Joseph Fernandes
> >> 
> >> CC:Gluster Devel 
> >>
> >>
> >>
> >> Hey,
> >> Seems like even after this fix is merged, the regression tests are
> >> failing for the same script. You can check the logs at
> >> 
> >> http://build.gluster.org:443/logs/glusterfs-logs-20140520%3a14%3a06%3a46.tgz
> > Pranith,
> > 
> > Is this the correct link? I don't see any log having this sequence there.
> > Also looking at the log from this mail, this is expected as per the
> > barrier functionality, an enable request followed by another enable
> > should always fail and the same happens for disable.
> > 
> > Can you please confirm the link and which particular regression test is
> > causing this issue, is it bug-1090042.t?
> > 
> > --Atin
> >>
> >> Relevant logs:
> >> [2014-05-20 20:17:07.026045]  : volume create patchy
> >> build.gluster.org:/d/backends/patchy1
> >> build.gluster.org:/d/backends/patchy2 : SUCCESS
> >> [2014-05-20 20:17:08.030673]  : volume start patchy : SUCCESS
> >> [2014-05-20 20:17:08.279148]  : volume barrier patchy enable : SUCCESS
> >> [2014-05-20 20:17:08.476785]  : volume barrier patchy enable : FAILED :
> >> Failed to reconfigure barrier.
> >> [2014-05-20 20:17:08.727429]  : volume barrier patchy disable : SUCCESS
> >> [2014-05-20 20:17:08.926995]  : volume barrier patchy disable : FAILED :
> >> Failed to reconfigure barrier.
> >>
> This log is for bug-1092841.t and its expected.

Damn :-(. I think I screwed up the timestamps while checking Sorry about 
that :-(. But there are failures. Check 
http://build.gluster.org/job/regression/4501/consoleFull

Pranith

> 
> --Atin
> >> Pranith
> >>
> >> - Original Message -
> >>> From: "Pranith Kumar Karampuri" 
> >>> To: "Gluster Devel" 
> >>> Cc: "Joseph Fernandes" , "Vijaikumar M"
> >>> 
> >>> Sent: Tuesday, May 20, 2014 3:41:11 PM
> >>> Subject: Re: Spurious failures because of nfs and snapshots
> >>>
> >>> hi,
> >>> Please resubmit the patches on top of
> >>> http://review.gluster.com/#/c/7753
> >>> to prevent frequent regression failures.
> >>>
> >>> Pranith
> >>> - Original Message -
> >>>> From: "Vijaikumar M" 
> >>>> To: "Pranith Kumar Karampuri" 
> >>>> Cc: "Joseph Fernandes" , "Gluster Devel"
> >>>> 
> >>>> Sent: Monday, May 19, 2014 2:40:47 PM
> >>>> Subject: Re: Spurious failures because of nfs and snapshots
> >>>>
> >>>> Brick disconnected with ping-time out:
> >>>>
> >>>> Here is the log message
> >>>> [2014-05-19 04:29:38.133266] I [MSGID: 100030] [glusterfsd.c:1998:main]
> >>>> 0-/build/install/sbin/glusterfsd: Started running /build/install/sbi
> >>>> n/glusterfsd version 3.5qa2 (args: /build/install/sbin/glusterfsd -s
> >>>> build.gluster.org --volfile-id /snaps/patchy_snap1/3f2ae3fbb4a74587b1a9
> >>>> 1013f07d327f.build.gluster.org.var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3
> >>>> -p /var/lib/glusterd/snaps/patchy_snap1/3f2ae3f
> >>>> bb4a74587b1a91013f07d327f/run/build.gluster.org-var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.pid
> >>>> -S /var/run/51fe50a6faf0aae006c815da946caf3a.socket --brick-name
> >>>> /var/run/glu

Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-20 Thread Vijaikumar M
From the log: 
http://build.gluster.org:443/logs/glusterfs-logs-20140520%3a17%3a10%3a51.tgzit 
looks like glusterd was hung:


*Glusterd log:**
* 5305 [2014-05-20 20:08:55.040665] E 
[glusterd-snapshot.c:3805:glusterd_add_brick_to_snap_volume] 
0-management: Unable to fetch snap device (vol1.brick_snapdevice0). 
Leaving empty
 5306 [2014-05-20 20:08:55.649146] I 
[rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting 
frame-timeout to 600
 5307 [2014-05-20 20:08:55.663181] I 
[rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting 
frame-timeout to 600
 5308 [2014-05-20 20:16:55.541197] W 
[glusterfsd.c:1182:cleanup_and_exit] (--> 0-: received signum (15), 
shutting down


Glusterd was hung when executing the testcase ./tests/bugs/bug-1090042.t.

*Cli log:**
*72649 [2014-05-20 20:12:51.960765] T 
[rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
 72650 [2014-05-20 20:12:51.960850] T [socket.c:2689:socket_connect] 
(-->/build/install/lib/libglusterfs.so.0(gf_timer_proc+0x1a2) 
[0x7ff8b6609994] 
(-->/build/install/lib/libgfrpc.so.0(rpc_clnt_reconnect+0x137) 
[0x7ff8b5d3305b] (- 
->/build/install/lib/libgfrpc.so.0(rpc_transport_connect+0x74) 
[0x7ff8b5d30071]))) 0-glusterfs: connect () called on transport already 
connected
 72651 [2014-05-20 20:12:52.960943] T 
[rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
 72652 [2014-05-20 20:12:52.960999] T [socket.c:2697:socket_connect] 
0-glusterfs: connecting 0x1e0fcc0, state=0 gen=0 sock=-1
 72653 [2014-05-20 20:12:52.961038] W [dict.c:1059:data_to_str] 
(-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(+0xb5f3) 
[0x7ff8ad9e95f3] 
(-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(socket_clien 
t_get_remote_sockaddr+0x10a) [0x7ff8ad9ed568] 
(-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(client_fill_address_family+0xf1) 
[0x7ff8ad9ec7d0]))) 0-dict: data is NULL
 72654 [2014-05-20 20:12:52.961070] W [dict.c:1059:data_to_str] 
(-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(+0xb5f3) 
[0x7ff8ad9e95f3] 
(-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(socket_clien 
t_get_remote_sockaddr+0x10a) [0x7ff8ad9ed568] 
(-->/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(client_fill_address_family+0x100) 
[0x7ff8ad9ec7df]))) 0-dict: data is NULL
 72655 [2014-05-20 20:12:52.961079] E 
[name.c:140:client_fill_address_family] 0-glusterfs: 
transport.address-family not specified. Could not guess default value 
from (remote-host:(null) or transport.unix.connect-path:(null)) 
optio   ns
 72656 [2014-05-20 20:12:54.961273] T 
[rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
 72657 [2014-05-20 20:12:54.961404] T [socket.c:2689:socket_connect] 
(-->/build/install/lib/libglusterfs.so.0(gf_timer_proc+0x1a2) 
[0x7ff8b6609994] 
(-->/build/install/lib/libgfrpc.so.0(rpc_clnt_reconnect+0x137) 
[0x7ff8b5d3305b] (- 
->/build/install/lib/libgfrpc.so.0(rpc_transport_connect+0x74) 
[0x7ff8b5d30071]))) 0-glusterfs: connect () called on transport already 
connected
 72658 [2014-05-20 20:12:55.120645] D [cli-cmd.c:384:cli_cmd_submit] 
0-cli: Returning 110
 72659 [2014-05-20 20:12:55.120723] D 
[cli-rpc-ops.c:8716:gf_cli_snapshot] 0-cli: Returning 110



Now we need to find why glusterd was hung.


Thanks,
Vijay



On Wednesday 21 May 2014 06:46 AM, Pranith Kumar Karampuri wrote:

Hey,
 Seems like even after this fix is merged, the regression tests are failing 
for the same script. You can check the logs at 
http://build.gluster.org:443/logs/glusterfs-logs-20140520%3a14%3a06%3a46.tgz

Relevant logs:
[2014-05-20 20:17:07.026045]  : volume create patchy 
build.gluster.org:/d/backends/patchy1 build.gluster.org:/d/backends/patchy2 : 
SUCCESS
[2014-05-20 20:17:08.030673]  : volume start patchy : SUCCESS
[2014-05-20 20:17:08.279148]  : volume barrier patchy enable : SUCCESS
[2014-05-20 20:17:08.476785]  : volume barrier patchy enable : FAILED : Failed 
to reconfigure barrier.
[2014-05-20 20:17:08.727429]  : volume barrier patchy disable : SUCCESS
[2014-05-20 20:17:08.926995]  : volume barrier patchy disable : FAILED : Failed 
to reconfigure barrier.

Pranith

- Original Message -

From: "Pranith Kumar Karampuri" 
To: "Gluster Devel" 
Cc: "Joseph Fernandes" , "Vijaikumar M" 

Sent: Tuesday, May 20, 2014 3:41:11 PM
Subject: Re: Spurious failures because of nfs and snapshots

hi,
 Please resubmit the patches on top of http://review.gluster.com/#/c/7753
 to prevent frequent regression failures.

Pranith
- Original Message -

From: "Vijaikumar M" 
To: "Pranith Kumar Karampuri" 
Cc: "Joseph Fernandes" , "Gluster Devel"

Sent: Monday, May 19, 2014 2:40:47 PM
Subject: Re: Spurious failures because of nfs and snapshots

Brick disconnected with ping-time out:

Here is the log message
[2014-05-19 04:29:38.133266] I [MSGID: 100030] [glusterfsd.c:1998:main]
0-/build/install/sbin/glusterfsd: Started running /build/install/sbi
n/

Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-20 Thread Pranith Kumar Karampuri
Hey,
Seems like even after this fix is merged, the regression tests are failing 
for the same script. You can check the logs at 
http://build.gluster.org:443/logs/glusterfs-logs-20140520%3a14%3a06%3a46.tgz

Relevant logs:
[2014-05-20 20:17:07.026045]  : volume create patchy 
build.gluster.org:/d/backends/patchy1 build.gluster.org:/d/backends/patchy2 : 
SUCCESS
[2014-05-20 20:17:08.030673]  : volume start patchy : SUCCESS
[2014-05-20 20:17:08.279148]  : volume barrier patchy enable : SUCCESS
[2014-05-20 20:17:08.476785]  : volume barrier patchy enable : FAILED : Failed 
to reconfigure barrier.
[2014-05-20 20:17:08.727429]  : volume barrier patchy disable : SUCCESS
[2014-05-20 20:17:08.926995]  : volume barrier patchy disable : FAILED : Failed 
to reconfigure barrier.

Pranith

- Original Message -
> From: "Pranith Kumar Karampuri" 
> To: "Gluster Devel" 
> Cc: "Joseph Fernandes" , "Vijaikumar M" 
> 
> Sent: Tuesday, May 20, 2014 3:41:11 PM
> Subject: Re: Spurious failures because of nfs and snapshots
> 
> hi,
> Please resubmit the patches on top of http://review.gluster.com/#/c/7753
> to prevent frequent regression failures.
> 
> Pranith
> - Original Message -
> > From: "Vijaikumar M" 
> > To: "Pranith Kumar Karampuri" 
> > Cc: "Joseph Fernandes" , "Gluster Devel"
> > 
> > Sent: Monday, May 19, 2014 2:40:47 PM
> > Subject: Re: Spurious failures because of nfs and snapshots
> > 
> > Brick disconnected with ping-time out:
> > 
> > Here is the log message
> > [2014-05-19 04:29:38.133266] I [MSGID: 100030] [glusterfsd.c:1998:main]
> > 0-/build/install/sbin/glusterfsd: Started running /build/install/sbi
> > n/glusterfsd version 3.5qa2 (args: /build/install/sbin/glusterfsd -s
> > build.gluster.org --volfile-id /snaps/patchy_snap1/3f2ae3fbb4a74587b1a9
> > 1013f07d327f.build.gluster.org.var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3
> > -p /var/lib/glusterd/snaps/patchy_snap1/3f2ae3f
> > bb4a74587b1a91013f07d327f/run/build.gluster.org-var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.pid
> > -S /var/run/51fe50a6faf0aae006c815da946caf3a.socket --brick-name
> > /var/run/gluster/snaps/3f2ae3fbb4a74587b1a91013f07d327f/brick3 -l
> > /build/install/var/log/glusterfs/br
> > icks/var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.log
> > --xlator-option *-posix.glusterd-uuid=494ef3cd-15fc-4c8c-8751-2d441ba
> > 7b4b0 --brick-port 49164 --xlator-option
> > 3f2ae3fbb4a74587b1a91013f07d327f-server.listen-port=49164)
> >2 [2014-05-19 04:29:38.141118] I
> > [rpc-clnt.c:988:rpc_clnt_connection_init] 0-glusterfs: defaulting
> > ping-timeout to 30secs
> >3 [2014-05-19 04:30:09.139521] C
> > [rpc-clnt-ping.c:105:rpc_clnt_ping_timer_expired] 0-glusterfs: server
> > 10.3.129.13:24007 has not responded in the last 30 seconds, disconnecting.
> > 
> > 
> > 
> > Patch 'http://review.gluster.org/#/c/7753/' will fix the problem, where
> > ping-timer will be disabled by default for all the rpc connection except
> > for glusterd-glusterd (set to 30sec) and client-glusterd (set to 42sec).
> > 
> > 
> > Thanks,
> > Vijay
> > 
> > 
> > On Monday 19 May 2014 11:56 AM, Pranith Kumar Karampuri wrote:
> > > The latest build failure also has the same issue:
> > > Download it from here:
> > > http://build.gluster.org:443/logs/glusterfs-logs-20140518%3a22%3a27%3a31.tgz
> > >
> > > Pranith
> > >
> > > - Original Message -
> > >> From: "Vijaikumar M" 
> > >> To: "Joseph Fernandes" 
> > >> Cc: "Pranith Kumar Karampuri" , "Gluster Devel"
> > >> 
> > >> Sent: Monday, 19 May, 2014 11:41:28 AM
> > >> Subject: Re: Spurious failures because of nfs and snapshots
> > >>
> > >> Hi Joseph,
> > >>
> > >> In the log mentioned below, it say ping-time is set to default value
> > >> 30sec.I think issue is different.
> > >> Can you please point me to the logs where you where able to re-create
> > >> the problem.
> > >>
> > >> Thanks,
> > >> Vijay
> > >>
> > >>
> > >>
> > >> On Monday 19 May 2014 09:39 AM, Pranith Kumar Karampuri wrote:
> > >>> hi Vijai, Joseph,
> > >>>   In 2 of the last 3 build failures,
> > >>>   http://build.gluster.org/job/regression/4479/console,
> > >>>   http://build.gluster.org/job/regression/4478/console this
> > >>>   test(tests/bugs/bug-1090042.t) failed. Do you guys think it is
> > >>>   better
> > >>>   to revert this test until the fix is available? Please send a
> > >>>   patch
> > >>>   to revert the test case if you guys feel so. You can re-submit it
> > >>>   along with the fix to the bug mentioned by Joseph.
> > >>>
> > >>> Pranith.
> > >>>
> > >>> - Original Message -
> >  From: "Joseph Fernandes" 
> >  To: "Pranith Kumar Karampuri" 
> >  Cc: "Gluster Devel" 
> >  Sent: Friday, 16 May, 2014 5:13:57 PM
> >  Subject: Re: Spurious failures because of nfs and snapshots
> > 
> > 
> >  Hi All,
> > 
> >  tests/bugs/bug-1090042.t :
> > 
> >  I was able to reproduce th

Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-20 Thread Pranith Kumar Karampuri
hi,
Please resubmit the patches on top of http://review.gluster.com/#/c/7753 to 
prevent frequent regression failures.

Pranith
- Original Message -
> From: "Vijaikumar M" 
> To: "Pranith Kumar Karampuri" 
> Cc: "Joseph Fernandes" , "Gluster Devel" 
> 
> Sent: Monday, May 19, 2014 2:40:47 PM
> Subject: Re: Spurious failures because of nfs and snapshots
> 
> Brick disconnected with ping-time out:
> 
> Here is the log message
> [2014-05-19 04:29:38.133266] I [MSGID: 100030] [glusterfsd.c:1998:main]
> 0-/build/install/sbin/glusterfsd: Started running /build/install/sbi
> n/glusterfsd version 3.5qa2 (args: /build/install/sbin/glusterfsd -s
> build.gluster.org --volfile-id /snaps/patchy_snap1/3f2ae3fbb4a74587b1a9
> 1013f07d327f.build.gluster.org.var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3
> -p /var/lib/glusterd/snaps/patchy_snap1/3f2ae3f
> bb4a74587b1a91013f07d327f/run/build.gluster.org-var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.pid
> -S /var/run/51fe50a6faf0aae006c815da946caf3a.socket --brick-name
> /var/run/gluster/snaps/3f2ae3fbb4a74587b1a91013f07d327f/brick3 -l
> /build/install/var/log/glusterfs/br
> icks/var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.log
> --xlator-option *-posix.glusterd-uuid=494ef3cd-15fc-4c8c-8751-2d441ba
> 7b4b0 --brick-port 49164 --xlator-option
> 3f2ae3fbb4a74587b1a91013f07d327f-server.listen-port=49164)
>2 [2014-05-19 04:29:38.141118] I
> [rpc-clnt.c:988:rpc_clnt_connection_init] 0-glusterfs: defaulting
> ping-timeout to 30secs
>3 [2014-05-19 04:30:09.139521] C
> [rpc-clnt-ping.c:105:rpc_clnt_ping_timer_expired] 0-glusterfs: server
> 10.3.129.13:24007 has not responded in the last 30 seconds, disconnecting.
> 
> 
> 
> Patch 'http://review.gluster.org/#/c/7753/' will fix the problem, where
> ping-timer will be disabled by default for all the rpc connection except
> for glusterd-glusterd (set to 30sec) and client-glusterd (set to 42sec).
> 
> 
> Thanks,
> Vijay
> 
> 
> On Monday 19 May 2014 11:56 AM, Pranith Kumar Karampuri wrote:
> > The latest build failure also has the same issue:
> > Download it from here:
> > http://build.gluster.org:443/logs/glusterfs-logs-20140518%3a22%3a27%3a31.tgz
> >
> > Pranith
> >
> > - Original Message -
> >> From: "Vijaikumar M" 
> >> To: "Joseph Fernandes" 
> >> Cc: "Pranith Kumar Karampuri" , "Gluster Devel"
> >> 
> >> Sent: Monday, 19 May, 2014 11:41:28 AM
> >> Subject: Re: Spurious failures because of nfs and snapshots
> >>
> >> Hi Joseph,
> >>
> >> In the log mentioned below, it say ping-time is set to default value
> >> 30sec.I think issue is different.
> >> Can you please point me to the logs where you where able to re-create
> >> the problem.
> >>
> >> Thanks,
> >> Vijay
> >>
> >>
> >>
> >> On Monday 19 May 2014 09:39 AM, Pranith Kumar Karampuri wrote:
> >>> hi Vijai, Joseph,
> >>>   In 2 of the last 3 build failures,
> >>>   http://build.gluster.org/job/regression/4479/console,
> >>>   http://build.gluster.org/job/regression/4478/console this
> >>>   test(tests/bugs/bug-1090042.t) failed. Do you guys think it is
> >>>   better
> >>>   to revert this test until the fix is available? Please send a patch
> >>>   to revert the test case if you guys feel so. You can re-submit it
> >>>   along with the fix to the bug mentioned by Joseph.
> >>>
> >>> Pranith.
> >>>
> >>> - Original Message -
>  From: "Joseph Fernandes" 
>  To: "Pranith Kumar Karampuri" 
>  Cc: "Gluster Devel" 
>  Sent: Friday, 16 May, 2014 5:13:57 PM
>  Subject: Re: Spurious failures because of nfs and snapshots
> 
> 
>  Hi All,
> 
>  tests/bugs/bug-1090042.t :
> 
>  I was able to reproduce the issue i.e when this test is done in a loop
> 
>  for i in {1..135} ; do  ./bugs/bug-1090042.t
> 
>  When checked the logs
>  [2014-05-16 10:49:49.003978] I [rpc-clnt.c:973:rpc_clnt_connection_init]
>  0-management: setting frame-timeout to 600
>  [2014-05-16 10:49:49.004035] I [rpc-clnt.c:988:rpc_clnt_connection_init]
>  0-management: defaulting ping-timeout to 30secs
>  [2014-05-16 10:49:49.004303] I [rpc-clnt.c:973:rpc_clnt_connection_init]
>  0-management: setting frame-timeout to 600
>  [2014-05-16 10:49:49.004340] I [rpc-clnt.c:988:rpc_clnt_connection_init]
>  0-management: defaulting ping-timeout to 30secs
> 
>  The issue is with ping-timeout and is tracked under the bug
> 
>  https://bugzilla.redhat.com/show_bug.cgi?id=1096729
> 
> 
>  The workaround is mentioned in
>  https://bugzilla.redhat.com/show_bug.cgi?id=1096729#c8
> 
> 
>  Regards,
>  Joe
> 
>  - Original Message -
>  From: "Pranith Kumar Karampuri" 
>  To: "Gluster Devel" 
>  Cc: "Joseph Fernandes" 
>  Sent: Friday, May 16, 2014 6:19:54 AM
>  Subject: Spurious failures because of nfs and snapshots
> 
>  hi,
>    In

Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-19 Thread Pranith Kumar Karampuri


- Original Message -
> From: "Pranith Kumar Karampuri" 
> To: "Justin Clift" 
> Cc: "Gluster Devel" 
> Sent: Monday, May 19, 2014 11:20:21 AM
> Subject: Re: [Gluster-devel] Spurious failures because of nfs and snapshots
> 
> 
> 
> - Original Message -
> > From: "Justin Clift" 
> > To: "Pranith Kumar Karampuri" 
> > Cc: "Gluster Devel" 
> > Sent: Monday, 19 May, 2014 10:41:03 AM
> > Subject: Re: [Gluster-devel] Spurious failures because of nfs and snapshots
> > 
> > On 19/05/2014, at 6:00 AM, Pranith Kumar Karampuri wrote:
> > 
> > > This particular class is eliminated :-). Patch was merged on Friday.
> > 
> > 
> > Excellent.  I've just kicked off 10 instances in Rackspace to each run
> > the regression tests on master head.
> > 
> > Hopefully less than 1/2 of them fail this time.  Has been about 30%
> > pass rate recently. :)
> 
> I am working on one more patch about timeouts at the moment. Will be sending
> it shortly. That should help us manage waiting for timeouts easily.
> With the work kaushal, vijay did for providing logs, core files, we should be
> able to reduce the number of spurious regressions. Because now, we can debug
> them without stopping running of regressions :-).

The patch is now ready for review at http://review.gluster.com/7799. Intention 
of the patch is to make sure similar events wait for same time for those events 
to happen. i.e. EXPECT_WITHIN related ones. So next time some event stops 
completing the action in the expected time-limit and we want to increase the 
timeout, we only have to change it in one place.

Pranith

> 
> Pranith
> > 
> > + Justin
> > 
> > --
> > Open Source and Standards @ Red Hat
> > 
> > twitter.com/realjustinclift
> > 
> > 
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-19 Thread Vijaikumar M

Brick disconnected with ping-time out:

Here is the log message
[2014-05-19 04:29:38.133266] I [MSGID: 100030] [glusterfsd.c:1998:main] 
0-/build/install/sbin/glusterfsd: Started running /build/install/sbi
n/glusterfsd version 3.5qa2 (args: /build/install/sbin/glusterfsd -s 
build.gluster.org --volfile-id /snaps/patchy_snap1/3f2ae3fbb4a74587b1a9 
1013f07d327f.build.gluster.org.var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3 
-p /var/lib/glusterd/snaps/patchy_snap1/3f2ae3f 
bb4a74587b1a91013f07d327f/run/build.gluster.org-var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.pid 
-S /var/run/51fe50a6faf0aae006c815da946caf3a.socket --brick-name 
/var/run/gluster/snaps/3f2ae3fbb4a74587b1a91013f07d327f/brick3 -l 
/build/install/var/log/glusterfs/br 
icks/var-run-gluster-snaps-3f2ae3fbb4a74587b1a91013f07d327f-brick3.log 
--xlator-option *-posix.glusterd-uuid=494ef3cd-15fc-4c8c-8751-2d441ba
7b4b0 --brick-port 49164 --xlator-option 
3f2ae3fbb4a74587b1a91013f07d327f-server.listen-port=49164)
  2 [2014-05-19 04:29:38.141118] I 
[rpc-clnt.c:988:rpc_clnt_connection_init] 0-glusterfs: defaulting 
ping-timeout to 30secs
  3 [2014-05-19 04:30:09.139521] C 
[rpc-clnt-ping.c:105:rpc_clnt_ping_timer_expired] 0-glusterfs: server 
10.3.129.13:24007 has not responded in the last 30 seconds, disconnecting.




Patch 'http://review.gluster.org/#/c/7753/' will fix the problem, where 
ping-timer will be disabled by default for all the rpc connection except 
for glusterd-glusterd (set to 30sec) and client-glusterd (set to 42sec).



Thanks,
Vijay


On Monday 19 May 2014 11:56 AM, Pranith Kumar Karampuri wrote:

The latest build failure also has the same issue:
Download it from here:
http://build.gluster.org:443/logs/glusterfs-logs-20140518%3a22%3a27%3a31.tgz

Pranith

- Original Message -

From: "Vijaikumar M" 
To: "Joseph Fernandes" 
Cc: "Pranith Kumar Karampuri" , "Gluster Devel" 

Sent: Monday, 19 May, 2014 11:41:28 AM
Subject: Re: Spurious failures because of nfs and snapshots

Hi Joseph,

In the log mentioned below, it say ping-time is set to default value
30sec.I think issue is different.
Can you please point me to the logs where you where able to re-create
the problem.

Thanks,
Vijay



On Monday 19 May 2014 09:39 AM, Pranith Kumar Karampuri wrote:

hi Vijai, Joseph,
  In 2 of the last 3 build failures,
  http://build.gluster.org/job/regression/4479/console,
  http://build.gluster.org/job/regression/4478/console this
  test(tests/bugs/bug-1090042.t) failed. Do you guys think it is better
  to revert this test until the fix is available? Please send a patch
  to revert the test case if you guys feel so. You can re-submit it
  along with the fix to the bug mentioned by Joseph.

Pranith.

- Original Message -

From: "Joseph Fernandes" 
To: "Pranith Kumar Karampuri" 
Cc: "Gluster Devel" 
Sent: Friday, 16 May, 2014 5:13:57 PM
Subject: Re: Spurious failures because of nfs and snapshots


Hi All,

tests/bugs/bug-1090042.t :

I was able to reproduce the issue i.e when this test is done in a loop

for i in {1..135} ; do  ./bugs/bug-1090042.t

When checked the logs
[2014-05-16 10:49:49.003978] I [rpc-clnt.c:973:rpc_clnt_connection_init]
0-management: setting frame-timeout to 600
[2014-05-16 10:49:49.004035] I [rpc-clnt.c:988:rpc_clnt_connection_init]
0-management: defaulting ping-timeout to 30secs
[2014-05-16 10:49:49.004303] I [rpc-clnt.c:973:rpc_clnt_connection_init]
0-management: setting frame-timeout to 600
[2014-05-16 10:49:49.004340] I [rpc-clnt.c:988:rpc_clnt_connection_init]
0-management: defaulting ping-timeout to 30secs

The issue is with ping-timeout and is tracked under the bug

https://bugzilla.redhat.com/show_bug.cgi?id=1096729


The workaround is mentioned in
https://bugzilla.redhat.com/show_bug.cgi?id=1096729#c8


Regards,
Joe

- Original Message -
From: "Pranith Kumar Karampuri" 
To: "Gluster Devel" 
Cc: "Joseph Fernandes" 
Sent: Friday, May 16, 2014 6:19:54 AM
Subject: Spurious failures because of nfs and snapshots

hi,
  In the latest build I fired for review.gluster.com/7766
  (http://build.gluster.org/job/regression/4443/console) failed because
  of
  spurious failure. The script doesn't wait for nfs export to be
  available. I fixed that, but interestingly I found quite a few
  scripts
  with same problem. Some of the scripts are relying on 'sleep 5' which
  also could lead to spurious failures if the export is not available
  in 5
  seconds. We found that waiting for 20 seconds is better, but 'sleep
  20'
  would unnecessarily delay the build execution. So if you guys are
  going
  to write any scripts which has to do nfs mounts, please do it the
  following way:

EXPECT_WITHIN 20 "1" is_nfs_export_available;
TEST mount -t nfs -o vers=3 $H0:/$V0 $N0;

Please review http://review.gluster.com/7773 :-)

I saw one more spurious failure in a snapshot related script
tests/b

Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-18 Thread Pranith Kumar Karampuri
The latest build failure also has the same issue:
Download it from here:
http://build.gluster.org:443/logs/glusterfs-logs-20140518%3a22%3a27%3a31.tgz

Pranith

- Original Message -
> From: "Vijaikumar M" 
> To: "Joseph Fernandes" 
> Cc: "Pranith Kumar Karampuri" , "Gluster Devel" 
> 
> Sent: Monday, 19 May, 2014 11:41:28 AM
> Subject: Re: Spurious failures because of nfs and snapshots
> 
> Hi Joseph,
> 
> In the log mentioned below, it say ping-time is set to default value
> 30sec.I think issue is different.
> Can you please point me to the logs where you where able to re-create
> the problem.
> 
> Thanks,
> Vijay
> 
> 
> 
> On Monday 19 May 2014 09:39 AM, Pranith Kumar Karampuri wrote:
> > hi Vijai, Joseph,
> >  In 2 of the last 3 build failures,
> >  http://build.gluster.org/job/regression/4479/console,
> >  http://build.gluster.org/job/regression/4478/console this
> >  test(tests/bugs/bug-1090042.t) failed. Do you guys think it is better
> >  to revert this test until the fix is available? Please send a patch
> >  to revert the test case if you guys feel so. You can re-submit it
> >  along with the fix to the bug mentioned by Joseph.
> >
> > Pranith.
> >
> > - Original Message -
> >> From: "Joseph Fernandes" 
> >> To: "Pranith Kumar Karampuri" 
> >> Cc: "Gluster Devel" 
> >> Sent: Friday, 16 May, 2014 5:13:57 PM
> >> Subject: Re: Spurious failures because of nfs and snapshots
> >>
> >>
> >> Hi All,
> >>
> >> tests/bugs/bug-1090042.t :
> >>
> >> I was able to reproduce the issue i.e when this test is done in a loop
> >>
> >> for i in {1..135} ; do  ./bugs/bug-1090042.t
> >>
> >> When checked the logs
> >> [2014-05-16 10:49:49.003978] I [rpc-clnt.c:973:rpc_clnt_connection_init]
> >> 0-management: setting frame-timeout to 600
> >> [2014-05-16 10:49:49.004035] I [rpc-clnt.c:988:rpc_clnt_connection_init]
> >> 0-management: defaulting ping-timeout to 30secs
> >> [2014-05-16 10:49:49.004303] I [rpc-clnt.c:973:rpc_clnt_connection_init]
> >> 0-management: setting frame-timeout to 600
> >> [2014-05-16 10:49:49.004340] I [rpc-clnt.c:988:rpc_clnt_connection_init]
> >> 0-management: defaulting ping-timeout to 30secs
> >>
> >> The issue is with ping-timeout and is tracked under the bug
> >>
> >> https://bugzilla.redhat.com/show_bug.cgi?id=1096729
> >>
> >>
> >> The workaround is mentioned in
> >> https://bugzilla.redhat.com/show_bug.cgi?id=1096729#c8
> >>
> >>
> >> Regards,
> >> Joe
> >>
> >> - Original Message -
> >> From: "Pranith Kumar Karampuri" 
> >> To: "Gluster Devel" 
> >> Cc: "Joseph Fernandes" 
> >> Sent: Friday, May 16, 2014 6:19:54 AM
> >> Subject: Spurious failures because of nfs and snapshots
> >>
> >> hi,
> >>  In the latest build I fired for review.gluster.com/7766
> >>  (http://build.gluster.org/job/regression/4443/console) failed because
> >>  of
> >>  spurious failure. The script doesn't wait for nfs export to be
> >>  available. I fixed that, but interestingly I found quite a few
> >>  scripts
> >>  with same problem. Some of the scripts are relying on 'sleep 5' which
> >>  also could lead to spurious failures if the export is not available
> >>  in 5
> >>  seconds. We found that waiting for 20 seconds is better, but 'sleep
> >>  20'
> >>  would unnecessarily delay the build execution. So if you guys are
> >>  going
> >>  to write any scripts which has to do nfs mounts, please do it the
> >>  following way:
> >>
> >> EXPECT_WITHIN 20 "1" is_nfs_export_available;
> >> TEST mount -t nfs -o vers=3 $H0:/$V0 $N0;
> >>
> >> Please review http://review.gluster.com/7773 :-)
> >>
> >> I saw one more spurious failure in a snapshot related script
> >> tests/bugs/bug-1090042.t on the next build fired by Niels.
> >> Joesph (CCed) is debugging it. He agreed to reply what he finds and share
> >> it
> >> with us so that we won't introduce similar bugs in future.
> >>
> >> I encourage you guys to share what you fix to prevent spurious failures in
> >> future.
> >>
> >> Thanks
> >> Pranith
> >>
> 
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-18 Thread Vijaikumar M

Hi Joseph,

In the log mentioned below, it say ping-time is set to default value 
30sec.I think issue is different.
Can you please point me to the logs where you where able to re-create 
the problem.


Thanks,
Vijay



On Monday 19 May 2014 09:39 AM, Pranith Kumar Karampuri wrote:

hi Vijai, Joseph,
 In 2 of the last 3 build failures, 
http://build.gluster.org/job/regression/4479/console, 
http://build.gluster.org/job/regression/4478/console this 
test(tests/bugs/bug-1090042.t) failed. Do you guys think it is better to revert 
this test until the fix is available? Please send a patch to revert the test 
case if you guys feel so. You can re-submit it along with the fix to the bug 
mentioned by Joseph.

Pranith.

- Original Message -

From: "Joseph Fernandes" 
To: "Pranith Kumar Karampuri" 
Cc: "Gluster Devel" 
Sent: Friday, 16 May, 2014 5:13:57 PM
Subject: Re: Spurious failures because of nfs and snapshots


Hi All,

tests/bugs/bug-1090042.t :

I was able to reproduce the issue i.e when this test is done in a loop

for i in {1..135} ; do  ./bugs/bug-1090042.t

When checked the logs
[2014-05-16 10:49:49.003978] I [rpc-clnt.c:973:rpc_clnt_connection_init]
0-management: setting frame-timeout to 600
[2014-05-16 10:49:49.004035] I [rpc-clnt.c:988:rpc_clnt_connection_init]
0-management: defaulting ping-timeout to 30secs
[2014-05-16 10:49:49.004303] I [rpc-clnt.c:973:rpc_clnt_connection_init]
0-management: setting frame-timeout to 600
[2014-05-16 10:49:49.004340] I [rpc-clnt.c:988:rpc_clnt_connection_init]
0-management: defaulting ping-timeout to 30secs

The issue is with ping-timeout and is tracked under the bug

https://bugzilla.redhat.com/show_bug.cgi?id=1096729


The workaround is mentioned in
https://bugzilla.redhat.com/show_bug.cgi?id=1096729#c8


Regards,
Joe

- Original Message -
From: "Pranith Kumar Karampuri" 
To: "Gluster Devel" 
Cc: "Joseph Fernandes" 
Sent: Friday, May 16, 2014 6:19:54 AM
Subject: Spurious failures because of nfs and snapshots

hi,
 In the latest build I fired for review.gluster.com/7766
 (http://build.gluster.org/job/regression/4443/console) failed because of
 spurious failure. The script doesn't wait for nfs export to be
 available. I fixed that, but interestingly I found quite a few scripts
 with same problem. Some of the scripts are relying on 'sleep 5' which
 also could lead to spurious failures if the export is not available in 5
 seconds. We found that waiting for 20 seconds is better, but 'sleep 20'
 would unnecessarily delay the build execution. So if you guys are going
 to write any scripts which has to do nfs mounts, please do it the
 following way:

EXPECT_WITHIN 20 "1" is_nfs_export_available;
TEST mount -t nfs -o vers=3 $H0:/$V0 $N0;

Please review http://review.gluster.com/7773 :-)

I saw one more spurious failure in a snapshot related script
tests/bugs/bug-1090042.t on the next build fired by Niels.
Joesph (CCed) is debugging it. He agreed to reply what he finds and share it
with us so that we won't introduce similar bugs in future.

I encourage you guys to share what you fix to prevent spurious failures in
future.

Thanks
Pranith



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-18 Thread Pranith Kumar Karampuri


- Original Message -
> From: "Justin Clift" 
> To: "Pranith Kumar Karampuri" 
> Cc: "Gluster Devel" 
> Sent: Monday, 19 May, 2014 10:41:03 AM
> Subject: Re: [Gluster-devel] Spurious failures because of nfs and snapshots
> 
> On 19/05/2014, at 6:00 AM, Pranith Kumar Karampuri wrote:
> 
> > This particular class is eliminated :-). Patch was merged on Friday.
> 
> 
> Excellent.  I've just kicked off 10 instances in Rackspace to each run
> the regression tests on master head.
> 
> Hopefully less than 1/2 of them fail this time.  Has been about 30%
> pass rate recently. :)

I am working on one more patch about timeouts at the moment. Will be sending it 
shortly. That should help us manage waiting for timeouts easily.
With the work kaushal, vijay did for providing logs, core files, we should be 
able to reduce the number of spurious regressions. Because now, we can debug 
them without stopping running of regressions :-).

Pranith
> 
> + Justin
> 
> --
> Open Source and Standards @ Red Hat
> 
> twitter.com/realjustinclift
> 
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-18 Thread Justin Clift
On 19/05/2014, at 6:00 AM, Pranith Kumar Karampuri wrote:

> This particular class is eliminated :-). Patch was merged on Friday.


Excellent.  I've just kicked off 10 instances in Rackspace to each run
the regression tests on master head.

Hopefully less than 1/2 of them fail this time.  Has been about 30%
pass rate recently. :)

+ Justin

--
Open Source and Standards @ Red Hat

twitter.com/realjustinclift

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-18 Thread Pranith Kumar Karampuri


- Original Message -
> From: "Justin Clift" 
> To: "Pranith Kumar Karampuri" 
> Cc: "Gluster Devel" 
> Sent: Monday, 19 May, 2014 10:26:04 AM
> Subject: Re: [Gluster-devel] Spurious failures because of nfs and snapshots
> 
> On 16/05/2014, at 1:49 AM, Pranith Kumar Karampuri wrote:
> > hi,
> >In the latest build I fired for review.gluster.com/7766
> >(http://build.gluster.org/job/regression/4443/console) failed because
> >of spurious failure. The script doesn't wait for nfs export to be
> >available. I fixed that, but interestingly I found quite a few scripts
> >with same problem. Some of the scripts are relying on 'sleep 5' which
> >also could lead to spurious failures if the export is not available in
> >5 seconds.
> 
> Cool.  Fixing this NFS problem across all of the tests would be really
> welcome.  That specific failed test (bug-1087198.t) is the most common
> one I've seen over the last few weeks, causing about half of all
> failures in master.
> 
> Eliminating this class of regression failure would be really helpful. :)

This particular class is eliminated :-). Patch was merged on Friday.

Pranith
> 
> + Justin
> 
> --
> Open Source and Standards @ Red Hat
> 
> twitter.com/realjustinclift
> 
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-18 Thread Justin Clift
On 16/05/2014, at 1:49 AM, Pranith Kumar Karampuri wrote:
> hi,
>In the latest build I fired for review.gluster.com/7766 
> (http://build.gluster.org/job/regression/4443/console) failed because of 
> spurious failure. The script doesn't wait for nfs export to be available. I 
> fixed that, but interestingly I found quite a few scripts with same problem. 
> Some of the scripts are relying on 'sleep 5' which also could lead to 
> spurious failures if the export is not available in 5 seconds.

Cool.  Fixing this NFS problem across all of the tests would be really
welcome.  That specific failed test (bug-1087198.t) is the most common
one I've seen over the last few weeks, causing about half of all
failures in master.

Eliminating this class of regression failure would be really helpful. :)

+ Justin

--
Open Source and Standards @ Red Hat

twitter.com/realjustinclift

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-18 Thread Pranith Kumar Karampuri
hi Vijai, Joseph,
In 2 of the last 3 build failures, 
http://build.gluster.org/job/regression/4479/console, 
http://build.gluster.org/job/regression/4478/console this 
test(tests/bugs/bug-1090042.t) failed. Do you guys think it is better to revert 
this test until the fix is available? Please send a patch to revert the test 
case if you guys feel so. You can re-submit it along with the fix to the bug 
mentioned by Joseph.

Pranith.

- Original Message -
> From: "Joseph Fernandes" 
> To: "Pranith Kumar Karampuri" 
> Cc: "Gluster Devel" 
> Sent: Friday, 16 May, 2014 5:13:57 PM
> Subject: Re: Spurious failures because of nfs and snapshots
> 
> 
> Hi All,
> 
> tests/bugs/bug-1090042.t :
> 
> I was able to reproduce the issue i.e when this test is done in a loop
> 
> for i in {1..135} ; do  ./bugs/bug-1090042.t
> 
> When checked the logs
> [2014-05-16 10:49:49.003978] I [rpc-clnt.c:973:rpc_clnt_connection_init]
> 0-management: setting frame-timeout to 600
> [2014-05-16 10:49:49.004035] I [rpc-clnt.c:988:rpc_clnt_connection_init]
> 0-management: defaulting ping-timeout to 30secs
> [2014-05-16 10:49:49.004303] I [rpc-clnt.c:973:rpc_clnt_connection_init]
> 0-management: setting frame-timeout to 600
> [2014-05-16 10:49:49.004340] I [rpc-clnt.c:988:rpc_clnt_connection_init]
> 0-management: defaulting ping-timeout to 30secs
> 
> The issue is with ping-timeout and is tracked under the bug
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1096729
> 
> 
> The workaround is mentioned in
> https://bugzilla.redhat.com/show_bug.cgi?id=1096729#c8
> 
> 
> Regards,
> Joe
> 
> - Original Message -
> From: "Pranith Kumar Karampuri" 
> To: "Gluster Devel" 
> Cc: "Joseph Fernandes" 
> Sent: Friday, May 16, 2014 6:19:54 AM
> Subject: Spurious failures because of nfs and snapshots
> 
> hi,
> In the latest build I fired for review.gluster.com/7766
> (http://build.gluster.org/job/regression/4443/console) failed because of
> spurious failure. The script doesn't wait for nfs export to be
> available. I fixed that, but interestingly I found quite a few scripts
> with same problem. Some of the scripts are relying on 'sleep 5' which
> also could lead to spurious failures if the export is not available in 5
> seconds. We found that waiting for 20 seconds is better, but 'sleep 20'
> would unnecessarily delay the build execution. So if you guys are going
> to write any scripts which has to do nfs mounts, please do it the
> following way:
> 
> EXPECT_WITHIN 20 "1" is_nfs_export_available;
> TEST mount -t nfs -o vers=3 $H0:/$V0 $N0;
> 
> Please review http://review.gluster.com/7773 :-)
> 
> I saw one more spurious failure in a snapshot related script
> tests/bugs/bug-1090042.t on the next build fired by Niels.
> Joesph (CCed) is debugging it. He agreed to reply what he finds and share it
> with us so that we won't introduce similar bugs in future.
> 
> I encourage you guys to share what you fix to prevent spurious failures in
> future.
> 
> Thanks
> Pranith
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-16 Thread Joseph Fernandes

Hi All,

tests/bugs/bug-1090042.t : 

I was able to reproduce the issue i.e when this test is done in a loop 

for i in {1..135} ; do  ./bugs/bug-1090042.t

When checked the logs 
[2014-05-16 10:49:49.003978] I [rpc-clnt.c:973:rpc_clnt_connection_init] 
0-management: setting frame-timeout to 600
[2014-05-16 10:49:49.004035] I [rpc-clnt.c:988:rpc_clnt_connection_init] 
0-management: defaulting ping-timeout to 30secs
[2014-05-16 10:49:49.004303] I [rpc-clnt.c:973:rpc_clnt_connection_init] 
0-management: setting frame-timeout to 600
[2014-05-16 10:49:49.004340] I [rpc-clnt.c:988:rpc_clnt_connection_init] 
0-management: defaulting ping-timeout to 30secs

The issue is with ping-timeout and is tracked under the bug 

https://bugzilla.redhat.com/show_bug.cgi?id=1096729


The workaround is mentioned in 
https://bugzilla.redhat.com/show_bug.cgi?id=1096729#c8


Regards,
Joe

- Original Message -
From: "Pranith Kumar Karampuri" 
To: "Gluster Devel" 
Cc: "Joseph Fernandes" 
Sent: Friday, May 16, 2014 6:19:54 AM
Subject: Spurious failures because of nfs and snapshots

hi,
In the latest build I fired for review.gluster.com/7766 
(http://build.gluster.org/job/regression/4443/console) failed because of 
spurious failure. The script doesn't wait for nfs export to be available. I 
fixed that, but interestingly I found quite a few scripts with same problem. 
Some of the scripts are relying on 'sleep 5' which also could lead to spurious 
failures if the export is not available in 5 seconds. We found that waiting for 
20 seconds is better, but 'sleep 20' would unnecessarily delay the build 
execution. So if you guys are going to write any scripts which has to do nfs 
mounts, please do it the following way:

EXPECT_WITHIN 20 "1" is_nfs_export_available;
TEST mount -t nfs -o vers=3 $H0:/$V0 $N0;

Please review http://review.gluster.com/7773 :-)

I saw one more spurious failure in a snapshot related script 
tests/bugs/bug-1090042.t on the next build fired by Niels.
Joesph (CCed) is debugging it. He agreed to reply what he finds and share it 
with us so that we won't introduce similar bugs in future.

I encourage you guys to share what you fix to prevent spurious failures in 
future.

Thanks
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-15 Thread Pranith Kumar Karampuri


- Original Message -
> From: "Anand Avati" 
> To: "Pranith Kumar Karampuri" 
> Cc: "Gluster Devel" 
> Sent: Friday, May 16, 2014 6:30:44 AM
> Subject: Re: [Gluster-devel] Spurious failures because of nfs and snapshots
> 
> On Thu, May 15, 2014 at 5:49 PM, Pranith Kumar Karampuri <
> pkara...@redhat.com> wrote:
> 
> > hi,
> > In the latest build I fired for review.gluster.com/7766 (
> > http://build.gluster.org/job/regression/4443/console) failed because of
> > spurious failure. The script doesn't wait for nfs export to be available. I
> > fixed that, but interestingly I found quite a few scripts with same
> > problem. Some of the scripts are relying on 'sleep 5' which also could lead
> > to spurious failures if the export is not available in 5 seconds. We found
> > that waiting for 20 seconds is better, but 'sleep 20' would unnecessarily
> > delay the build execution. So if you guys are going to write any scripts
> > which has to do nfs mounts, please do it the following way:
> >
> > EXPECT_WITHIN 20 "1" is_nfs_export_available;
> > TEST mount -t nfs -o vers=3 $H0:/$V0 $N0;
> >
> 
> Always please also add mount -o soft,intr in the regression scripts for
> mounting nfs. Becomes so much easier to cleanup any "hung" mess. We
> probably need an NFS mounting helper function which can be called like:
> 
> TEST mount_nfs $H0:/$V0 $N0;

Will do, there seems to be some extra-options(noac etc) for some of these, so 
will add one more argument for any extra options for nfs mount.

Pranith
> 
> Thanks
> 
> Avati
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-15 Thread Anand Avati
On Thu, May 15, 2014 at 5:49 PM, Pranith Kumar Karampuri <
pkara...@redhat.com> wrote:

> hi,
> In the latest build I fired for review.gluster.com/7766 (
> http://build.gluster.org/job/regression/4443/console) failed because of
> spurious failure. The script doesn't wait for nfs export to be available. I
> fixed that, but interestingly I found quite a few scripts with same
> problem. Some of the scripts are relying on 'sleep 5' which also could lead
> to spurious failures if the export is not available in 5 seconds. We found
> that waiting for 20 seconds is better, but 'sleep 20' would unnecessarily
> delay the build execution. So if you guys are going to write any scripts
> which has to do nfs mounts, please do it the following way:
>
> EXPECT_WITHIN 20 "1" is_nfs_export_available;
> TEST mount -t nfs -o vers=3 $H0:/$V0 $N0;
>

Always please also add mount -o soft,intr in the regression scripts for
mounting nfs. Becomes so much easier to cleanup any "hung" mess. We
probably need an NFS mounting helper function which can be called like:

TEST mount_nfs $H0:/$V0 $N0;

Thanks

Avati
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-15 Thread Pranith Kumar Karampuri
hi,
In the latest build I fired for review.gluster.com/7766 
(http://build.gluster.org/job/regression/4443/console) failed because of 
spurious failure. The script doesn't wait for nfs export to be available. I 
fixed that, but interestingly I found quite a few scripts with same problem. 
Some of the scripts are relying on 'sleep 5' which also could lead to spurious 
failures if the export is not available in 5 seconds. We found that waiting for 
20 seconds is better, but 'sleep 20' would unnecessarily delay the build 
execution. So if you guys are going to write any scripts which has to do nfs 
mounts, please do it the following way:

EXPECT_WITHIN 20 "1" is_nfs_export_available;
TEST mount -t nfs -o vers=3 $H0:/$V0 $N0;

Please review http://review.gluster.com/7773 :-)

I saw one more spurious failure in a snapshot related script 
tests/bugs/bug-1090042.t on the next build fired by Niels.
Joesph (CCed) is debugging it. He agreed to reply what he finds and share it 
with us so that we won't introduce similar bugs in future.

I encourage you guys to share what you fix to prevent spurious failures in 
future.

Thanks
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel