----- Original Message -----
> From: "Vijay Bellur" <vbel...@redhat.com>
> To: "Atin Mukherjee" <amukh...@redhat.com>
> Cc: "Oleksandr Natalenko" <oleksa...@natalenko.name>, "Nithya Balachandran" 
> <nbala...@redhat.com>, "Raghavendra
> Gowdappa" <rgowd...@redhat.com>, "Shyam Ranganathan" <srang...@redhat.com>, 
> "Gluster Devel"
> <gluster-devel@gluster.org>
> Sent: Tuesday, October 18, 2016 11:07:39 PM
> Subject: Re: [Gluster-devel] Spurious failure of 
> ./tests/bugs/glusterd/bug-913555.t
> 
> On Tue, Oct 18, 2016 at 12:28 PM, Atin Mukherjee <amukh...@redhat.com> wrote:
> > Final reminder before I take out the test case from the test file.
> >
> >
> > On Thursday 13 October 2016, Atin Mukherjee <amukh...@redhat.com> wrote:
> >>
> >>
> >>
> >> On Wednesday 12 October 2016, Atin Mukherjee <amukh...@redhat.com> wrote:
> >>>
> >>> So the test fails (intermittently) in check_fs which tries to do a df on
> >>> the mount point for a volume which is carved out of three bricks from 3
> >>> nodes and one node is completely down. A quick look at the mount log
> >>> reveals
> >>> the following:
> >>>
> >>> [2016-10-10 13:58:59.279446]:++++++++++
> >>> G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs
> >>> /mnt/glusterfs/0 ++++++++++
> >>> [2016-10-10 13:58:59.287973] W [MSGID: 114031]
> >>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:
> >>> remote
> >>> operation failed. Path: / (00000000-0000-0000-0000-000000000001)
> >>> [Transport
> >>> endpoint is not connected]
> >>> [2016-10-10 13:58:59.288326] I [MSGID: 109063]
> >>> [dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies in
> >>> /
> >>> (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0
> >>> [2016-10-10 13:58:59.288352] W [MSGID: 109005]
> >>> [dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory
> >>> selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =
> >>> [2016-10-10 13:58:59.288643] W [MSGID: 114031]
> >>> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:
> >>> remote
> >>> operation failed. Path: / (00000000-0000-0000-0000-000000000001)
> >>> [Transport
> >>> endpoint is not connected]
> >>> [2016-10-10 13:58:59.288927] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]
> >>> 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve
> >>> (Stale file handle)
> >>> [2016-10-10 13:58:59.288949] W [fuse-bridge.c:2597:fuse_opendir_resume]
> >>> 0-glusterfs-fuse: 7: OPENDIR (00000000-0000-0000-0000-000000000001)
> >>> resolution failed
> >>> [2016-10-10 13:58:59.289505] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]
> >>> 0-fuse: 00000000-0000-0000-0000-000000000001: failed to resolve
> >>> (Stale file handle)
> >>> [2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]
> >>> 0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)
> >>> resolution fail
> >>>
> >>> DHT team - are these anomalies expected here? I also see opendir and
> >>> statfs failing.
> >>
> >>
> >> Any luck with this? I don't see any relevance of having a check_fs test
> >> w.r.t. the bug this test case is tagged to. If I don't hear back on this
> >> in a few days, I'll go ahead and remove this check from the test to avoid
> >> the spurious failure.
> >>
> 
> 
> Looks like dht was not aware of a subvolume being down. In dht we pick
> first_up_subvolume for winding the lookup on the root gfid, and in this
> case we picked the subvolume referring to the brick that was brought
> down, hence the failure.

I hadn't observed that DHT treats nameless lookups on root differently from 
other paths. Thanks for pointing it out. My initial code reading didn't reveal 
any reason why the lookup fails in DHT (since the other two subvols were up). 
I will dig more into it and report back the findings.

> 
> The test has this snippet:
> 
> <snippet>
> # Kill one pseudo-node, make sure the others survive and volume stays up.
> TEST kill_node 3;
> EXPECT_WITHIN $PROBE_TIMEOUT 1 check_peers;
> EXPECT 0 check_fs $M0;
> </snippet>
> 
> Maybe we should change EXPECT to an EXPECT_WITHIN to let CHILD_DOWN
> percolate to dht?
> 
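That would work. As a sketch of the proposed change (assuming 
$PROCESS_UP_TIMEOUT from tests/include.rc is a reasonable bound here; any of 
the standard timeouts would do):

```shell
# Kill one pseudo-node, make sure the others survive and volume stays up.
TEST kill_node 3;
EXPECT_WITHIN $PROBE_TIMEOUT 1 check_peers;
# Retry check_fs until it succeeds or the timeout expires, instead of
# checking exactly once, so dht has time to process CHILD_DOWN.
EXPECT_WITHIN $PROCESS_UP_TIMEOUT 0 check_fs $M0;
```

Since EXPECT_WITHIN polls the command until it returns the expected value or 
the timeout expires, a short delay in CHILD_DOWN propagation would no longer 
fail the test.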
> Logs indicate that dht was not aware of the subvolume being down for
> at least 1 second after protocol/client sensed the disconnection.
> 
> [2016-10-10 13:58:58.235700] I [MSGID: 114018]
> [client.c:2276:client_rpc_notify] 0-patchy-client-2: disconnected from
> patchy-client-2. Client process will keep trying to connect to
> glusterd until brick's port is available
> [2016-10-10 13:58:58.245060]:++++++++++
> G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 47 3
> online_brick_count ++++++++++
> [2016-10-10 13:58:59.279446]:++++++++++
> G_LOG:./tests/bugs/glusterd/bug-913555.t: TEST: 48 0 check_fs
> /mnt/glusterfs/0 ++++++++++
> [2016-10-10 13:58:59.287973] W [MSGID: 114031]
> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:
> remote operation failed. Path: /
> (00000000-0000-0000-0000-000000000001) [Transport endpoint is not
> connected]
> [2016-10-10 13:58:59.288326] I [MSGID: 109063]
> [dht-layout.c:713:dht_layout_normalize] 0-patchy-dht: Found anomalies
> in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0
> [2016-10-10 13:58:59.288352] W [MSGID: 109005]
> [dht-selfheal.c:2102:dht_selfheal_directory] 0-patchy-dht: Directory
> selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid =
> [2016-10-10 13:58:59.288643] W [MSGID: 114031]
> [client-rpc-fops.c:2930:client3_3_lookup_cbk] 0-patchy-client-2:
> remote operation failed. Path: /
> (00000000-0000-0000-0000-000000000001) [Transport endpoint is not
> connected]
> [2016-10-10 13:58:59.288927] W
> [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:
> 00000000-0000-0000-0000-000000000001: failed to resolve (Stale file
> handle)
> [2016-10-10 13:58:59.288949] W
> [fuse-bridge.c:2597:fuse_opendir_resume] 0-glusterfs-fuse: 7: OPENDIR
> (00000000-0000-0000-0000-000000000001) resolution failed
> [2016-10-10 13:58:59.289505] W
> [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse:
> 00000000-0000-0000-0000-000000000001: failed to resolve (Stale file
> handle)
> [2016-10-10 13:58:59.289524] W [fuse-bridge.c:3137:fuse_statfs_resume]
> 0-glusterfs-fuse: 8: STATFS (00000000-0000-0000-0000-000000000001)
> resolution fail
> 
> Regards,
> Vijay
> 
_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel