Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]

2014-07-24 Thread Jeff Darcy
> 2) N3 tries to start the brick B2. Now the problem lies here. N3 uses
>    glusterd_resolve_brick() to resolve the UUID of B2->hostname (N2).
>    In glusterd_resolve_brick(), it cannot find N2 in the peerinfo
>    list. It then checks whether N2 is a local loopback address. Since
>    N2 (127.1.1.2) starts with "127", it decides that it is a local
>    loopback address, so glusterd_resolve_brick() fills brickinfo->uuid
>    with [UUID3]. Now, as brickinfo->uuid == MY_UUID is true, N3 starts
>    the brick process B2 with -s 127.1.1.2 and
>    *-posix.glusterd-uuid=[UUID3]. This process dies off immediately,
>    but for a short time it holds on to the --brick-port, say 49155.

This is the part that seems "off" to me.  If an address doesn't
*exactly* match that on some local interface, it's not local.  When we
implemented the cluster.rc infrastructure so that we could simulate
multi-node testing, we had to root out a bunch of stuff like this, but
apparently some crept back in.  If we just fixed the "127.* == local"
mistake, would that be adequate to prevent these errors?
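
For illustration, a check that only accepts exact matches against the
addresses actually configured on local interfaces (a minimal sketch, not
the actual glusterd code) could look something like this:

/* Minimal sketch (not the actual glusterd code): decide "is this address
 * local?" by exact comparison against configured interface addresses,
 * instead of the "starts with 127" shortcut. */
#include <ifaddrs.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>

static int
address_is_local (const char *addr_str)
{
        struct ifaddrs  *ifaddr = NULL, *ifa = NULL;
        struct in_addr   target;
        int              local = 0;

        if (inet_pton (AF_INET, addr_str, &target) != 1)
                return 0;
        if (getifaddrs (&ifaddr) == -1)
                return 0;

        for (ifa = ifaddr; ifa; ifa = ifa->ifa_next) {
                if (!ifa->ifa_addr || ifa->ifa_addr->sa_family != AF_INET)
                        continue;
                struct sockaddr_in *sin = (struct sockaddr_in *) ifa->ifa_addr;
                /* exact match only: 127.1.1.2 counts as local only if some
                 * interface actually carries 127.1.1.2 */
                if (sin->sin_addr.s_addr == target.s_addr) {
                        local = 1;
                        break;
                }
        }
        freeifaddrs (ifaddr);
        return local;
}

int
main (void)
{
        printf ("127.0.0.1 local? %d\n", address_is_local ("127.0.0.1"));
        printf ("127.1.1.2 local? %d\n", address_is_local ("127.1.1.2"));
        return 0;
}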


Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]

2014-07-24 Thread Anders Blomdell
On 2014-07-24 08:21, Joseph Fernandes wrote:
> Hi All,
> 
> After further investigation we have the root cause for this issue. 
> The root cause is the way in which a new node is added to the cluster.
> 
> Now we have N1(127.1.1.1) and N2(127.1.1.2) as two nodes in the cluster, each 
> having a brick N1:B1 (127.1.1.1 : 49146) and N2:B2 (127.1.1.2 : 49147)
> 
> Now let's peer probe N3 (127.1.1.3) from N1:
> 
> 1) A friend request is sent from N1 to N3. N3 adds N1 to its peerinfo
>    list, i.e. N1 and its UUID, say [UUID1].
> 2) N3 gets the brick infos from N1.
> 3) N3 tries to start the bricks:
>    1) N3 tries to start the brick B1 and finds it is not a local brick,
>       using the logic MY_UUID == brickinfo->uuid, which is false in this
>       case, as the UUID of brickinfo->hostname (N1) is [UUID1] (as given
>       by the peerinfo list) and MY_UUID is [UUID3]. Hence it doesn't
>       start it.
>    2) N3 tries to start the brick B2. Now the problem lies here. N3 uses
>       glusterd_resolve_brick() to resolve the UUID of B2->hostname (N2).
>       In glusterd_resolve_brick(), it cannot find N2 in the peerinfo
>       list. It then checks whether N2 is a local loopback address. Since
>       N2 (127.1.1.2) starts with "127", it decides that it is a local
>       loopback address, so glusterd_resolve_brick() fills brickinfo->uuid
>       with [UUID3]. Now, as brickinfo->uuid == MY_UUID is true, N3 starts
>       the brick process B2 with -s 127.1.1.2 and
>       *-posix.glusterd-uuid=[UUID3]. This process dies off immediately,
>       but for a short time it holds on to the --brick-port, say 49155.
> 
> All of the above was observed and inferred from the glusterd logs on N3
> (with some extra debug messages added).
> 
> Now coming back to our test case, i.e. firing snapshot create and peer
> probe together: if N2 has been assigned 49155 as the --brick-port for
> the snapshot brick, it finds that 49155 is already held by another
> process (the faulty brick process N3:B2 (127.1.1.2:49155), which has
> -s 127.1.1.2 and *-posix.glusterd-uuid=[UUID3]) and hence fails to
> start the snapshot brick process.
> 
> 1) The error is spurious, as it is purely a matter of chance whether N2
>    and N3 pick the same port for their brick processes.
> 2) This issue is possible only in a regression test scenario, as all
>    the nodes are on the same machine, differentiated only by different
>    loopback addresses (127.1.1.*).
> 3) Also, the logic that "127" means a local loopback address is not
>    wrong in itself, as glusterds are supposed to run on different
>    machines in real usage.
> 
> Please do share your thoughts on this, and what a possible fix might be.

Possible solutions (many/all of them probably break important assumptions):

* Use some alias address range instead of 127.*.*.* for testing purposes
* Stop treating localhost as special
* Adopt the systemd LISTEN_FDS approach and have a special program that
  tries to bind to the ports and then hands them over to the proper
  daemon (a rough sketch of such a wrapper follows below)
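
To make the last idea concrete, a tiny wrapper along these lines (purely
illustrative -- no such tool exists in gluster today) could reserve the
port and then exec the real daemon with the socket handed over on fd 3,
following the systemd convention:

/* Illustrative only: reserve a TCP port, then exec the real daemon with
 * the listening socket on fd 3 and LISTEN_FDS/LISTEN_PID set. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

int
main (int argc, char *argv[])
{
        if (argc < 4) {
                fprintf (stderr, "usage: %s <addr> <port> <daemon> [args...]\n",
                         argv[0]);
                return 1;
        }

        int fd = socket (AF_INET, SOCK_STREAM, 0);
        int one = 1;
        setsockopt (fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof (one));

        struct sockaddr_in sin = { .sin_family = AF_INET };
        sin.sin_port = htons ((unsigned short) atoi (argv[2]));
        inet_pton (AF_INET, argv[1], &sin.sin_addr);

        if (bind (fd, (struct sockaddr *) &sin, sizeof (sin)) < 0 ||
            listen (fd, 128) < 0) {
                perror ("reserve port");
                return 1;
        }

        /* systemd convention: the first passed socket lives on fd 3 */
        if (fd != 3) {
                dup2 (fd, 3);
                close (fd);
        }

        char pid[16];
        snprintf (pid, sizeof (pid), "%ld", (long) getpid ());
        setenv ("LISTEN_FDS", "1", 1);
        setenv ("LISTEN_PID", pid, 1);  /* exec keeps the PID, so this stays valid */

        execvp (argv[3], &argv[3]);
        perror ("execvp");
        return 1;
}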

/Anders
-- 
Anders Blomdell  Email: anders.blomd...@control.lth.se
Department of Automatic Control
Lund University  Phone:+46 46 222 4625
P.O. Box 118 Fax:  +46 46 138118
SE-221 00 Lund, Sweden



Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]

2014-07-24 Thread Justin Clift
On Thu, 24 Jul 2014 02:21:37 -0400 (EDT)
Joseph Fernandes  wrote:

> Please do share your thoughts on this, and what a possible fix might be.

Any idea if there is a cross-platform way to check whether a port is already
in use? If so, that sounds like the first thing to add, and then some way to
try a different port number. :)
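
For what it's worth, a rough POSIX sketch of such a probe (not gluster
code) might look like:

/* Probe whether a TCP port looks free by attempting to bind it.
 * Inherently racy -- another process can grab the port between this
 * check and the real bind -- so it only narrows the window.
 * Returns 1 if the port appears free, 0 if in use, -1 on error. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <errno.h>

static int
port_looks_free (const char *addr, unsigned short port)
{
        struct sockaddr_in sin = { .sin_family = AF_INET };
        int fd, ret, saved;

        sin.sin_port = htons (port);
        if (inet_pton (AF_INET, addr, &sin.sin_addr) != 1)
                return -1;

        fd = socket (AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
                return -1;

        ret = bind (fd, (struct sockaddr *) &sin, sizeof (sin));
        saved = errno;          /* close() may clobber errno */
        close (fd);

        if (ret == 0)
                return 1;
        return (saved == EADDRINUSE) ? 0 : -1;
}

Combined with retrying on a different port when the probe (or the real
bind) fails, as suggested above, this would at least make the failure
non-fatal.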

+ Justin



Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]

2014-07-23 Thread Joseph Fernandes
Hi All,

After further investigation we have the root cause for this issue. 
The root cause is the way in which a new node is added to the cluster.

Now we have N1 (127.1.1.1) and N2 (127.1.1.2) as two nodes in the cluster,
each having a brick: N1:B1 (127.1.1.1:49146) and N2:B2 (127.1.1.2:49147).

Now let's peer probe N3 (127.1.1.3) from N1:

1) A friend request is sent from N1 to N3. N3 adds N1 to its peerinfo
   list, i.e. N1 and its UUID, say [UUID1].
2) N3 gets the brick infos from N1.
3) N3 tries to start the bricks:
   1) N3 tries to start the brick B1 and finds it is not a local brick,
      using the logic MY_UUID == brickinfo->uuid, which is false in this
      case, as the UUID of brickinfo->hostname (N1) is [UUID1] (as given
      by the peerinfo list) and MY_UUID is [UUID3]. Hence it doesn't
      start it.
   2) N3 tries to start the brick B2. Now the problem lies here. N3 uses
      glusterd_resolve_brick() to resolve the UUID of B2->hostname (N2).
      In glusterd_resolve_brick(), it cannot find N2 in the peerinfo
      list. It then checks whether N2 is a local loopback address. Since
      N2 (127.1.1.2) starts with "127", it decides that it is a local
      loopback address, so glusterd_resolve_brick() fills brickinfo->uuid
      with [UUID3]. Now, as brickinfo->uuid == MY_UUID is true, N3 starts
      the brick process B2 with -s 127.1.1.2 and
      *-posix.glusterd-uuid=[UUID3]. This process dies off immediately,
      but for a short time it holds on to the --brick-port, say 49155.

All of the above was observed and inferred from the glusterd logs on N3
(with some extra debug messages added).
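
For reference, the shortcut described in step 3.2 boils down to something
like the following sketch (a paraphrase, not the literal glusterd source;
find_peer_uuid() is a hypothetical stand-in for the peerinfo lookup):

/* The problematic part is the bare "starts with 127" test: under
 * cluster.rc every simulated node is 127.1.1.x, so a brick belonging to
 * another node gets stamped with MY_UUID. */
#include <string.h>

struct brickinfo_sketch {
        char          hostname[256];
        unsigned char uuid[16];
};

extern const unsigned char *MY_UUID;                               /* this node's UUID    */
extern int find_peer_uuid (const char *host, unsigned char *uuid); /* hypothetical helper */

static void
resolve_brick_sketch (struct brickinfo_sketch *brick)
{
        /* 1. known peer?  then use its UUID */
        if (find_peer_uuid (brick->hostname, brick->uuid) == 0)
                return;

        /* 2. problematic shortcut: anything starting with "127" is
         *    treated as a local loopback address and claimed as ours */
        if (strncmp (brick->hostname, "127", 3) == 0)
                memcpy (brick->uuid, MY_UUID, 16);
}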

Now coming back to our test case, i.e. firing snapshot create and peer
probe together: if N2 has been assigned 49155 as the --brick-port for the
snapshot brick, it finds that 49155 is already held by another process
(the faulty brick process N3:B2 (127.1.1.2:49155), which has -s 127.1.1.2
and *-posix.glusterd-uuid=[UUID3]) and hence fails to start the snapshot
brick process.

1) The error is spurious, as it is purely a matter of chance whether N2
   and N3 pick the same port for their brick processes.
2) This issue is possible only in a regression test scenario, as all the
   nodes are on the same machine, differentiated only by different
   loopback addresses (127.1.1.*).
3) Also, the logic that "127" means a local loopback address is not wrong
   in itself, as glusterds are supposed to run on different machines in
   real usage.

Please do share your thoughts on this, and what a possible fix might be.

Regards,
Joe
 
- Original Message -
From: "Joseph Fernandes" 
To: "Avra Sengupta" , "Gluster Devel" 

Sent: Tuesday, July 22, 2014 6:42:02 PM
Subject: Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]

Hi All,

With further investigation I found the following:

1) I was able to reproduce the issue without running the complete
regression, just by running bug-1112559.t alone on slave30 (which had been
rebooted and given a clean gluster setup).
   This rules out any involvement of previous failures from other spurious
errors like mgmt_v3-locks.t.
2) I added some debug messages and ran a script (netstat and ps -ef | grep
gluster) whenever binding to a port fails (in
rpc/rpc-transport/socket/src/socket.c), and found the following:

It is always the snapshot brick on the second node (127.1.1.2) that fails
to acquire the port (e.g. 127.1.1.2:49155).

Netstat output shows: 
tcp0  0 127.1.1.2:49155 0.0.0.0:*   
LISTEN  3555/glusterfsd

and the process that is holding the port 49155 is 

root  3555 1  0 12:38 ?00:00:00 
/usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id 
patchy.127.1.1.2.d-backends-2-patchy_snap_mnt -p 
/d/backends/3/glusterd/vols/patchy/run/127.1.1.2-d-backends-2-patchy_snap_mnt.pid
 -S /var/run/ff772f1ff85950660f389b0ed43ba2b7.socket --brick-name 
/d/backends/2/patchy_snap_mnt -l 
/usr/local/var/log/glusterfs/bricks/d-backends-2-patchy_snap_mnt.log 
--xlator-option *-posix.glusterd-uuid=3af134ec-5552-440f-ad24-1811308ca3a8 
--brick-port 49155 --xlator-option patchy-server.listen-port=49155

Please note that even though it says 127.1.1.2, it shows the glusterd-uuid
of the 3rd node, which was being probed when the snapshot was created:
"3af134ec-5552-440f-ad24-1811308ca3a8"

To clarify: there is already a volume brick on 127.1.1.2:

root  3446 1  0 12:38 ?00:00:00 
/usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id 
patchy.127.1.1.2.d-backends-2-patchy_snap_mnt -p 
/d/backends/2/glusterd/vols/patchy/run/127.1.1.2-d-backends-2-patchy_snap_mnt.pid
 -S /var/run/e667c69aa7a1481c7bd567b917cd1b05.socket --brick-name 
/d/backends/2/patchy_snap_mnt -l 
/usr/local/var/log/glusterfs/bricks/d-backends-2-patchy_snap_mnt.log 
--xlator-option *-posix.glusterd-uuid=a7f4

Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]

2014-07-22 Thread Pranith Kumar Karampuri

Thanks for the update :-)

Pranith
On 07/22/2014 06:42 PM, Joseph Fernandes wrote:

Hi All,

With further investigation I found the following:

1) I was able to reproduce the issue without running the complete
regression, just by running bug-1112559.t alone on slave30 (which had been
rebooted and given a clean gluster setup).
This rules out any involvement of previous failures from other spurious
errors like mgmt_v3-locks.t.
2) I added some debug messages and ran a script (netstat and ps -ef | grep
gluster) whenever binding to a port fails (in
rpc/rpc-transport/socket/src/socket.c), and found the following:

It is always the snapshot brick on the second node (127.1.1.2) that fails
to acquire the port (e.g. 127.1.1.2:49155).

 Netstat output shows:
 tcp0  0 127.1.1.2:49155 0.0.0.0:*  
 LISTEN  3555/glusterfsd

 and the process that is holding the port 49155 is

 root  3555 1  0 12:38 ?00:00:00 
/usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id 
patchy.127.1.1.2.d-backends-2-patchy_snap_mnt -p 
/d/backends/3/glusterd/vols/patchy/run/127.1.1.2-d-backends-2-patchy_snap_mnt.pid
 -S /var/run/ff772f1ff85950660f389b0ed43ba2b7.socket --brick-name 
/d/backends/2/patchy_snap_mnt -l 
/usr/local/var/log/glusterfs/bricks/d-backends-2-patchy_snap_mnt.log 
--xlator-option *-posix.glusterd-uuid=3af134ec-5552-440f-ad24-1811308ca3a8 
--brick-port 49155 --xlator-option patchy-server.listen-port=49155

Please note that even though it says 127.1.1.2, it shows the glusterd-uuid
of the 3rd node, which was being probed when the snapshot was created:
"3af134ec-5552-440f-ad24-1811308ca3a8"

To clarify: there is already a volume brick on 127.1.1.2:

 root  3446 1  0 12:38 ?00:00:00 
/usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id 
patchy.127.1.1.2.d-backends-2-patchy_snap_mnt -p 
/d/backends/2/glusterd/vols/patchy/run/127.1.1.2-d-backends-2-patchy_snap_mnt.pid
 -S /var/run/e667c69aa7a1481c7bd567b917cd1b05.socket --brick-name 
/d/backends/2/patchy_snap_mnt -l 
/usr/local/var/log/glusterfs/bricks/d-backends-2-patchy_snap_mnt.log 
--xlator-option *-posix.glusterd-uuid=a7f461d0-5ea7-4b25-b6c5-388d8eb1893f 
--brick-port 49153 --xlator-option patchy-server.listen-port=49153

And the above brick process (3555) is not visible before the snap creation
or after the failure to start the snap brick on 127.1.1.2.
This means that this process was spawned and died during the creation of
the snapshot and the probe of the 3rd node (which happen simultaneously).

In addition to these processes, we can see multiple snap brick processes
for the second brick on the second node, which are not seen after the
failure to start the snap brick on 127.1.1.2:

 root  3582 1  0 12:38 ?00:00:00 
/usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id 
/snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d.127.1.1.2.var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2
 -p 
/d/backends/2/glusterd/snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d/run/127.1.1.2-var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.pid
 -S /var/run/668f3d4b1c55477fd5ad1ae381de0447.socket --brick-name 
/var/run/gluster/snaps/66ac70130be84b5c9695df8252a56a6d/brick2 -l 
/usr/local/var/log/glusterfs/bricks/var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.log
 --xlator-option *-posix.glusterd-uuid=a7f461d0-5ea7-4b25-b6c5-388d8eb1893f 
--brick-port 49155 --xlator-option 
66ac70130be84b5c9695df8252a56a6d-server.listen-port=49155
 root  3583  3582  0 12:38 ?00:00:00 
/usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id 
/snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d.127.1.1.2.var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2
 -p 
/d/backends/2/glusterd/snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d/run/127.1.1.2-var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.pid
 -S /var/run/668f3d4b1c55477fd5ad1ae381de0447.socket --brick-name 
/var/run/gluster/snaps/66ac70130be84b5c9695df8252a56a6d/brick2 -l 
/usr/local/var/log/glusterfs/bricks/var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.log
 --xlator-option *-posix.glusterd-uuid=a7f461d0-5ea7-4b25-b6c5-388d8eb1893f 
--brick-port 49155 --xlator-option 
66ac70130be84b5c9695df8252a56a6d-server.listen-port=49155



It looks like the second node tries to start the snap brick:
1) with the wrong brickinfo and peerinfo (process 3555)
2) multiple times with the correct brickinfo (processes 3582, 3583)

3) This issue is not seen when snapshots are created and peer probe is NOT
done simultaneously.

I will continue the investigation and keep you posted.


Regards,
Joe




- Original Message -
From: "Joseph Fernandes" 
To: "Avra Sengupta" 
Cc: "Pranith Kumar Karampuri" , "Gluster Devel" , "Varun 
Shastry" , "Justin Clift" 
Sent: Thursday, July 17, 2014 10:58:14 AM
Subject: Re: [G

Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]

2014-07-22 Thread Anders Blomdell
On 2014-07-22 16:44, Justin Clift wrote:
> On 22/07/2014, at 3:28 PM, Joe Julian wrote:
>> On 07/22/2014 07:19 AM, Anders Blomdell wrote:
>>> Could this be the time to propose that gluster understand port
>>> reservation à la systemd (LISTEN_FDS), and that the test harness make
>>> sure random ports do not collide with the set of expected ports?
>>> This would be beneficial when starting from systemd as well.
>> Wouldn't that only work for Fedora and RHEL7?
> 
> Probably depends how it's done.  Maybe make it a conditional
> thing that's compiled in or not, depending on the platform?
Don't think so, LISTEN_FDS is dead simple: if LISTEN_FDS is set in the
environment, fd #3 up to fd #3+LISTEN_FDS-1 are sockets opened by the
calling process, their function has to be deduced via getsockname(), and
sockets should not be opened by the process itself. If LISTEN_FDS is not
set, proceed to open sockets just like before.

The good thing about this is that systemd can reserve the ports used very
early during boot, and no other process can steal them away. For testing
purposes, this could be used to ensure that all ports are available before
starting tests (if random port stealing is the true problem here, that is
still an unverified shot in the dark).
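
A minimal sketch of that consumer-side behaviour (illustrative only, not
gluster code):

/* If LISTEN_FDS is set and LISTEN_PID matches us, fds 3..3+LISTEN_FDS-1
 * are already-bound sockets whose role is worked out via getsockname();
 * otherwise fall back to binding our own sockets as before. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

#define SD_LISTEN_FDS_START 3

int
main (void)
{
        const char *pid_s = getenv ("LISTEN_PID");
        const char *nfd_s = getenv ("LISTEN_FDS");

        if (!nfd_s || !pid_s || (long) getpid () != atol (pid_s)) {
                printf ("no inherited sockets; open and bind as before\n");
                return 0;
        }

        int n = atoi (nfd_s);
        for (int fd = SD_LISTEN_FDS_START; fd < SD_LISTEN_FDS_START + n; fd++) {
                struct sockaddr_in sin;
                socklen_t len = sizeof (sin);

                if (getsockname (fd, (struct sockaddr *) &sin, &len) == 0 &&
                    sin.sin_family == AF_INET)
                        printf ("fd %d is pre-bound to port %d\n",
                                fd, ntohs (sin.sin_port));
        }
        return 0;
}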

> 
> Unless there's a better, cross platform approach of course. :)
> 
> Regards and best wishes,
> 
> Justin Clift
> 
/Anders


-- 
Anders Blomdell  Email: anders.blomd...@control.lth.se
Department of Automatic Control
Lund University  Phone:+46 46 222 4625
P.O. Box 118 Fax:  +46 46 138118
SE-221 00 Lund, Sweden



Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]

2014-07-22 Thread Justin Clift
On 22/07/2014, at 3:28 PM, Joe Julian wrote:
> On 07/22/2014 07:19 AM, Anders Blomdell wrote:
>> Could this be the time to propose that gluster understand port
>> reservation à la systemd (LISTEN_FDS), and that the test harness make
>> sure random ports do not collide with the set of expected ports?
>> This would be beneficial when starting from systemd as well.
> Wouldn't that only work for Fedora and RHEL7?

Probably depends how it's done.  Maybe make it a conditional
thing that's compiled in or not, depending on the platform?

Unless there's a better, cross-platform approach of course. :)

Regards and best wishes,

Justin Clift

--
GlusterFS - http://www.gluster.org

An open source, distributed file system scaling to several
petabytes, and handling thousands of clients.

My personal twitter: twitter.com/realjustinclift



Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]

2014-07-22 Thread Joe Julian

On 07/22/2014 07:19 AM, Anders Blomdell wrote:

Could this be the time to propose that gluster understand port reservation
à la systemd (LISTEN_FDS), and that the test harness make sure random
ports do not collide with the set of expected ports? This would be
beneficial when starting from systemd as well.

Wouldn't that only work for Fedora and RHEL7?


Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]

2014-07-22 Thread Anders Blomdell
On 2014-07-22 15:12, Joseph Fernandes wrote:
> Hi All,
> 
> With further investigation I found the following:
> 
> 1) I was able to reproduce the issue without running the complete
> regression, just by running bug-1112559.t alone on slave30 (which had
> been rebooted and given a clean gluster setup).
>    This rules out any involvement of previous failures from other
> spurious errors like mgmt_v3-locks.t.
> 2) I added some debug messages and ran a script (netstat and ps -ef |
> grep gluster) whenever binding to a port fails (in
> rpc/rpc-transport/socket/src/socket.c), and found the following:
> 
> It is always the snapshot brick on the second node (127.1.1.2) that
> fails to acquire the port (e.g. 127.1.1.2:49155).
> 
> Netstat output shows: 
> tcp0  0 127.1.1.2:49155 0.0.0.0:* 
>   LISTEN  3555/glusterfsd
Could this be the time to propose that gluster understand port reservation
à la systemd (LISTEN_FDS), and that the test harness make sure random
ports do not collide with the set of expected ports? This would be
beneficial when starting from systemd as well.


> 
> and the process that is holding the port 49155 is 
> 
> root  3555 1  0 12:38 ?00:00:00 
> /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id 
> patchy.127.1.1.2.d-backends-2-patchy_snap_mnt -p 
> /d/backends/3/glusterd/vols/patchy/run/127.1.1.2-d-backends-2-patchy_snap_mnt.pid
>  -S /var/run/ff772f1ff85950660f389b0ed43ba2b7.socket --brick-name 
> /d/backends/2/patchy_snap_mnt -l 
> /usr/local/var/log/glusterfs/bricks/d-backends-2-patchy_snap_mnt.log 
> --xlator-option *-posix.glusterd-uuid=3af134ec-5552-440f-ad24-1811308ca3a8 
> --brick-port 49155 --xlator-option patchy-server.listen-port=49155
> 
> Please note that even though it says 127.1.1.2, it shows the
> glusterd-uuid of the 3rd node, which was being probed when the snapshot
> was created: "3af134ec-5552-440f-ad24-1811308ca3a8"
> 
> To clarify: there is already a volume brick on 127.1.1.2:
> 
> root  3446 1  0 12:38 ?00:00:00 
> /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id 
> patchy.127.1.1.2.d-backends-2-patchy_snap_mnt -p 
> /d/backends/2/glusterd/vols/patchy/run/127.1.1.2-d-backends-2-patchy_snap_mnt.pid
>  -S /var/run/e667c69aa7a1481c7bd567b917cd1b05.socket --brick-name 
> /d/backends/2/patchy_snap_mnt -l 
> /usr/local/var/log/glusterfs/bricks/d-backends-2-patchy_snap_mnt.log 
> --xlator-option *-posix.glusterd-uuid=a7f461d0-5ea7-4b25-b6c5-388d8eb1893f 
> --brick-port 49153 --xlator-option patchy-server.listen-port=49153
> 
> And the above brick process (3555) is not visible before the snap
> creation or after the failure to start the snap brick on 127.1.1.2.
> This means that this process was spawned and died during the creation
> of the snapshot and the probe of the 3rd node (which happen
> simultaneously).
> 
> In addition to these processes, we can see multiple snap brick
> processes for the second brick on the second node, which are not seen
> after the failure to start the snap brick on 127.1.1.2:
> 
> root  3582 1  0 12:38 ?00:00:00 
> /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id 
> /snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d.127.1.1.2.var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2
>  -p 
> /d/backends/2/glusterd/snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d/run/127.1.1.2-var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.pid
>  -S /var/run/668f3d4b1c55477fd5ad1ae381de0447.socket --brick-name 
> /var/run/gluster/snaps/66ac70130be84b5c9695df8252a56a6d/brick2 -l 
> /usr/local/var/log/glusterfs/bricks/var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.log
>  --xlator-option *-posix.glusterd-uuid=a7f461d0-5ea7-4b25-b6c5-388d8eb1893f 
> --brick-port 49155 --xlator-option 
> 66ac70130be84b5c9695df8252a56a6d-server.listen-port=49155
> root  3583  3582  0 12:38 ?00:00:00 
> /usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id 
> /snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d.127.1.1.2.var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2
>  -p 
> /d/backends/2/glusterd/snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d/run/127.1.1.2-var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.pid
>  -S /var/run/668f3d4b1c55477fd5ad1ae381de0447.socket --brick-name 
> /var/run/gluster/snaps/66ac70130be84b5c9695df8252a56a6d/brick2 -l 
> /usr/local/var/log/glusterfs/bricks/var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.log
>  --xlator-option *-posix.glusterd-uuid=a7f461d0-5ea7-4b25-b6c5-388d8eb1893f 
> --brick-port 49155 --xlator-option 
> 66ac70130be84b5c9695df8252a56a6d-server.listen-port=49155
> 
> 
> 
> It looks like the second node tries to start the snap brick:
> 1) with the wrong brickinfo and peerinfo (process 3555)
> 2) multiple times with the correct brickinfo (processes 3582, 3583)
3583 is a su

Re: [Gluster-devel] spurious regression failures again! [bug-1112559.t]

2014-07-22 Thread Joseph Fernandes
Hi All,

With further investigation I found the following:

1) I was able to reproduce the issue without running the complete
regression, just by running bug-1112559.t alone on slave30 (which had been
rebooted and given a clean gluster setup).
   This rules out any involvement of previous failures from other spurious
errors like mgmt_v3-locks.t.
2) I added some debug messages and ran a script (netstat and ps -ef | grep
gluster) whenever binding to a port fails (in
rpc/rpc-transport/socket/src/socket.c), and found the following:

It is always the snapshot brick on the second node (127.1.1.2) that fails
to acquire the port (e.g. 127.1.1.2:49155).

Netstat output shows: 
tcp0  0 127.1.1.2:49155 0.0.0.0:*   
LISTEN  3555/glusterfsd

and the process that is holding the port 49155 is 

root  3555 1  0 12:38 ?00:00:00 
/usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id 
patchy.127.1.1.2.d-backends-2-patchy_snap_mnt -p 
/d/backends/3/glusterd/vols/patchy/run/127.1.1.2-d-backends-2-patchy_snap_mnt.pid
 -S /var/run/ff772f1ff85950660f389b0ed43ba2b7.socket --brick-name 
/d/backends/2/patchy_snap_mnt -l 
/usr/local/var/log/glusterfs/bricks/d-backends-2-patchy_snap_mnt.log 
--xlator-option *-posix.glusterd-uuid=3af134ec-5552-440f-ad24-1811308ca3a8 
--brick-port 49155 --xlator-option patchy-server.listen-port=49155

Please note that even though it says 127.1.1.2, it shows the glusterd-uuid
of the 3rd node, which was being probed when the snapshot was created:
"3af134ec-5552-440f-ad24-1811308ca3a8"

To clarify: there is already a volume brick on 127.1.1.2:

root  3446 1  0 12:38 ?00:00:00 
/usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id 
patchy.127.1.1.2.d-backends-2-patchy_snap_mnt -p 
/d/backends/2/glusterd/vols/patchy/run/127.1.1.2-d-backends-2-patchy_snap_mnt.pid
 -S /var/run/e667c69aa7a1481c7bd567b917cd1b05.socket --brick-name 
/d/backends/2/patchy_snap_mnt -l 
/usr/local/var/log/glusterfs/bricks/d-backends-2-patchy_snap_mnt.log 
--xlator-option *-posix.glusterd-uuid=a7f461d0-5ea7-4b25-b6c5-388d8eb1893f 
--brick-port 49153 --xlator-option patchy-server.listen-port=49153

And the above brick process (3555) is not visible before the snap creation
or after the failure to start the snap brick on 127.1.1.2.
This means that this process was spawned and died during the creation of
the snapshot and the probe of the 3rd node (which happen simultaneously).

In addition to these processes, we can see multiple snap brick processes
for the second brick on the second node, which are not seen after the
failure to start the snap brick on 127.1.1.2:

root  3582 1  0 12:38 ?00:00:00 
/usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id 
/snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d.127.1.1.2.var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2
 -p 
/d/backends/2/glusterd/snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d/run/127.1.1.2-var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.pid
 -S /var/run/668f3d4b1c55477fd5ad1ae381de0447.socket --brick-name 
/var/run/gluster/snaps/66ac70130be84b5c9695df8252a56a6d/brick2 -l 
/usr/local/var/log/glusterfs/bricks/var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.log
 --xlator-option *-posix.glusterd-uuid=a7f461d0-5ea7-4b25-b6c5-388d8eb1893f 
--brick-port 49155 --xlator-option 
66ac70130be84b5c9695df8252a56a6d-server.listen-port=49155
root  3583  3582  0 12:38 ?00:00:00 
/usr/local/sbin/glusterfsd -s 127.1.1.2 --volfile-id 
/snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d.127.1.1.2.var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2
 -p 
/d/backends/2/glusterd/snaps/patchy_snap1/66ac70130be84b5c9695df8252a56a6d/run/127.1.1.2-var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.pid
 -S /var/run/668f3d4b1c55477fd5ad1ae381de0447.socket --brick-name 
/var/run/gluster/snaps/66ac70130be84b5c9695df8252a56a6d/brick2 -l 
/usr/local/var/log/glusterfs/bricks/var-run-gluster-snaps-66ac70130be84b5c9695df8252a56a6d-brick2.log
 --xlator-option *-posix.glusterd-uuid=a7f461d0-5ea7-4b25-b6c5-388d8eb1893f 
--brick-port 49155 --xlator-option 
66ac70130be84b5c9695df8252a56a6d-server.listen-port=49155



It looks like the second node tries to start the snap brick:
1) with the wrong brickinfo and peerinfo (process 3555)
2) multiple times with the correct brickinfo (processes 3582, 3583)

3) This issue is not seen when snapshots are created and peer probe is NOT
done simultaneously.

I will continue the investigation and keep you posted.


Regards,
Joe




- Original Message -
From: "Joseph Fernandes" 
To: "Avra Sengupta" 
Cc: "Pranith Kumar Karampuri" , "Gluster Devel" 
, "Varun Shastry" , "Justin 
Clift" 
Sent: Thursday, July 17, 2014 10:58:14 AM
Subject: Re: [Gluster-devel] spurious regression failures again!

Hi Avra,

Just clarifying things here,
1) Wh