Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?
Deepshikha,

I see the failure here[1], which ran on builder206. So, we are good.

[1] https://build.gluster.org/job/centos7-regression/5901/consoleFull

On Wed, May 8, 2019 at 12:23 AM Deepshikha Khandelwal wrote:

> Sanju, can you please give us more info about the failures?
>
> I see the failures occurring on just one of the builders (builder206). I'm
> taking it back offline for now.
>
> On Tue, May 7, 2019 at 9:42 PM Michael Scherer wrote:
>
>> On Tuesday, 7 May 2019 at 20:04 +0530, Sanju Rakonde wrote:
>> > Looks like is_nfs_export_available started failing again in recent
>> > centos-regressions.
>> >
>> > Michael, can you please check?
>>
>> I will try, but I am leaving for vacation tonight, so if I find nothing
>> before I leave, I guess Deepshikha will have to look.
Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?
Looks like is_nfs_export_available started failing again in recent
centos-regressions.

Michael, can you please check?

On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul wrote:

> On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer wrote:
>
>> On Monday, 22 April 2019 at 22:57 +0530, Atin Mukherjee wrote:
>> > Is this back again? The recent patches are failing regression :-\ .
>>
>> So, on builder206, it took me a while to find that the issue is that
>> nfs (the service) was running.
>>
>> ./tests/basic/afr/tarissue.t failed, because the nfs initialisation
>> failed with a rather cryptic message:
>>
>> [2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0-
>> socket.nfs-server: process started listening on port (38465)
>> [2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0-
>> socket.nfs-server: binding to failed: Address already in use
>> [2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0-
>> socket.nfs-server: Port is already in use
>> [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0-
>> socket.nfs-server: __socket_server_bind failed;closing socket 14
>>
>> I found where this came from, but a few things surprised me:
>>
>> - the order of the printed messages is different from the order in the
>> code
>
> Indeed strange...
>
>> - the "started listening" message didn't take into account the fact
>> that bind failed on:
>
> Shouldn't it bail out if it failed to bind?
> Some missing 'goto out' around line 975/976?
> Y.
>
>> https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967
>>
>> The message about port 38465 also threw me off the track. The real
>> issue is that the nfs service was already running, and I couldn't find
>> anything listening on port 38465.
>>
>> Once I ran "service nfs stop", it no longer failed.
>>
>> So far, I do not know why nfs.service was activated.
>>
>> But at least 206 should be fixed, and we know a bit more about what
>> could be causing some of the failures.
>>
>> > On Wed, 3 Apr 2019 at 19:26, Michael Scherer wrote:
>> >
>> > > On Wednesday, 3 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
>> > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan <jthot...@redhat.com>
>> > > > wrote:
>> > > >
>> > > > > Hi,
>> > > > >
>> > > > > is_nfs_export_available is just a wrapper around the "showmount"
>> > > > > command, AFAIR.
>> > > > > I saw the following messages in the console output:
>> > > > > mount.nfs: rpc.statd is not running but is required for remote
>> > > > > locking.
>> > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local,
>> > > > > or start statd.
>> > > > > 05:06:55 mount.nfs: an incorrect mount option was specified
>> > > > >
>> > > > > To me it looks like rpcbind may not be running on the machine.
>> > > > > Usually rpcbind starts automatically on machines; I don't know
>> > > > > whether this can happen or not.
>> > > >
>> > > > That's precisely the question: why are we suddenly seeing this
>> > > > happen so frequently? Today I saw at least 4 to 5 such failures
>> > > > already.
>> > > >
>> > > > Deepshikha - can you please help in inspecting this?
>> > >
>> > > So we think (we are not sure) that the issue is a bit complex.
>> > >
>> > > What we were investigating was the nightly run failing on AWS. When
>> > > the build crashes, the builder is restarted, since that's the easiest
>> > > way to clean everything (even with a perfect test suite that cleans
>> > > up after itself, we could always end up in a corrupt state on the
>> > > system, WRT mount, fs, etc).
>> > >
>> > > In turn, this seems to cause trouble on AWS, since cloud-init or
>> > > something renames the eth0 interface to ens5 without cleaning up the
>> > > network configuration.
>> > >
>> > > So the network init script fails (because the image says "start eth0"
>> > > and that's not present), but fails in a weird way. The network is
>> > > initialised and working (we can connect), but the dhclient process is
>> > > not in the right cgroup, and network.service is in a failed state.
>> > > Restarting network didn't work. In turn, this means that rpc-statd
>> > > refuses to start (due to systemd dependencies), which seems to impact
>> > > various NFS tests.
>> > >
>> > > We have also seen that on some builders, rpcbind picks up some IPv6
>> > > autoconfiguration, but we can't reproduce that, and there is no IPv6
>> > > set up anywhere. I suspect the network.service failure is somehow
>> > > involved, but fail to see how. In turn, rpcbind.socket not starting
>> > > could cause NFS test troubles.
>> > >
>> > > Our current stopgap fix was to fix all the builders one by one:
>> > > remove the config, kill the rogue dhclient, restart the network
>> > > service.
>> > >
>> > > However, we can't be sure this is going to fix the problem long term,
>> > > since this only manifests after a crash of the test suite, and it
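A rough way to check a builder for the states described in this thread is sketched below in Python. It assumes systemd's `systemctl is-active` is available, and the unit names are simply the ones mentioned in the messages above (they may differ on other distributions). The helper names and the choice of which states count as a problem are assumptions made for illustration; this is not part of the Gluster infrastructure tooling.

```python
import subprocess

# Unit -> states that would be suspicious on a builder, based only on what
# the thread reports. Both the unit names and the "bad" states are assumptions.
SUSPICIOUS_STATES = {
    "network.service": {"failed"},                # reported as failed after reboot
    "rpcbind.socket": {"inactive", "failed"},     # reported as sometimes not starting
    "rpc-statd.service": {"failed"},              # refuses to start if dependencies fail
    "nfs.service": {"active"},                    # kernel NFS running broke builder206
}


def systemctl_state(unit):
    """Return the state word printed by `systemctl is-active UNIT`."""
    try:
        result = subprocess.run(
            ["systemctl", "is-active", unit],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return "unknown"  # systemctl is not installed on this machine
    return result.stdout.strip() or "unknown"


def preflight(get_state=systemctl_state):
    """Return a list of problems found; an empty list means nothing suspicious."""
    problems = []
    for unit, bad_states in SUSPICIOUS_STATES.items():
        state = get_state(unit)
        if state in bad_states:
            problems.append(f"{unit} is {state}")
    return problems


if __name__ == "__main__":
    for line in preflight() or ["no suspicious unit states found"]:
        print(line)
```

The state lookup is passed in as a parameter so the check can be exercised with canned states on a machine that has no systemd; on a real builder, calling preflight() with no arguments queries systemctl directly.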
Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?
I took a quick look at the builders and noticed that both show the same
'Cannot allocate memory' error, which comes up every time the builder is
rebooted after a build abort. It happens in the same pattern each time,
even though there is no corresponding memory consumption on the builders.
I'm investigating this further.

On Thu, May 9, 2019 at 10:02 AM Atin Mukherjee wrote:

> On Wed, May 8, 2019 at 7:38 PM Atin Mukherjee wrote:
>
>> builder204 needs to be fixed: too many failures, and almost none of the
>> patches are passing regression.
>
> And with that builder201 joins the pool,
> https://build.gluster.org/job/centos7-regression/5943/consoleFull
Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?
On Wed, May 8, 2019 at 7:38 PM Atin Mukherjee wrote:

> builder204 needs to be fixed: too many failures, and almost none of the
> patches are passing regression.

And with that builder201 joins the pool,
https://build.gluster.org/job/centos7-regression/5943/consoleFull
Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?
builder204 needs to be fixed: too many failures, and almost none of the
patches are passing regression.

On Wed, May 8, 2019 at 9:53 AM Atin Mukherjee wrote:

> On Wed, May 8, 2019 at 7:16 AM Sanju Rakonde wrote:
>
>> Deepshikha,
>>
>> I see the failure here[1], which ran on builder206. So, we are good.
>
> Not really,
> https://build.gluster.org/job/centos7-regression/5909/consoleFull failed
> on builder204 for similar reasons, I believe?
>
> I am a bit more worried about this issue resurfacing more often these
> days. What can we do to fix this permanently?
>
>> [1] https://build.gluster.org/job/centos7-regression/5901/consoleFull
Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?
On Wed, May 8, 2019 at 7:16 AM Sanju Rakonde wrote:

> Deepshikha,
>
> I see the failure here[1], which ran on builder206. So, we are good.

Not really,
https://build.gluster.org/job/centos7-regression/5909/consoleFull failed
on builder204 for similar reasons, I believe?

I am a bit more worried about this issue resurfacing more often these
days. What can we do to fix this permanently?

> [1] https://build.gluster.org/job/centos7-regression/5901/consoleFull
>
> On Wed, May 8, 2019 at 12:23 AM Deepshikha Khandelwal wrote:
>
>> Sanju, can you please give us more info about the failures?
>>
>> I see the failures occurring on just one of the builders (builder206). I'm
>> taking it back offline for now.
Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?
Sanju, can you please give us more info about the failures?

I see the failures occurring on just one of the builders (builder206). I'm
taking it back offline for now.

On Tue, May 7, 2019 at 9:42 PM Michael Scherer wrote:

> On Tuesday, 7 May 2019 at 20:04 +0530, Sanju Rakonde wrote:
> > Looks like is_nfs_export_available started failing again in recent
> > centos-regressions.
> >
> > Michael, can you please check?
>
> I will try, but I am leaving for vacation tonight, so if I find nothing
> before I leave, I guess Deepshikha will have to look.
Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?
On Tuesday, 07 May 2019 at 20:04 +0530, Sanju Rakonde wrote:
> Looks like is_nfs_export_available started failing again in recent
> centos-regressions.
>
> Michael, can you please check?

I will try, but I am leaving for vacation tonight, so if I find nothing
before I leave, I guess Deepshikha will have to look.
Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?
On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer wrote:

> On Monday, 22 April 2019 at 22:57 +0530, Atin Mukherjee wrote:
> > Is this back again? The recent patches are failing regression :-\ .
>
> So, on builder206, it took me a while to find that the issue is that
> nfs (the service) was running.
>
> ./tests/basic/afr/tarissue.t failed, because the nfs initialisation
> failed with a rather cryptic message:
>
> [2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0-
> socket.nfs-server: process started listening on port (38465)
> [2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0-
> socket.nfs-server: binding to failed: Address already in use
> [2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0-
> socket.nfs-server: Port is already in use
> [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0-
> socket.nfs-server: __socket_server_bind failed;closing socket 14
>
> I found where this came from, but a few things surprised me:
>
> - the order of the printed messages is different from the order in the
>   code

Indeed strange...

> - the "started listening" message did not take into account the fact
>   that bind failed at:

Shouldn't it bail out if it failed to bind?
Some missing 'goto out' around line 975/976?
Y.

> https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967
>
> The message about port 38465 also threw me off the track. The real
> issue is that the nfs service was already running, and I couldn't find
> anything listening on port 38465.
>
> Once I ran service nfs stop, it no longer failed.
>
> So far, I do not know why nfs.service was activated.
>
> But at least 206 should be fixed, and we know a bit more about what
> could be causing some of the failures.
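The "bail out if bind fails" behaviour asked about above can be sketched as follows. This is an illustrative Python sketch, not the glusterfs C code in socket.c: the helper name bind_and_listen and the log wording are assumptions made for the example. It only shows the control flow in which the success message cannot be logged once bind has failed.

```python
import errno
import socket


def bind_and_listen(port, host="127.0.0.1"):
    """Return (listening socket, log lines); the socket is None if bind failed.

    Hypothetical helper for illustration only; it mirrors the idea of an
    early exit on bind failure, not the actual glusterfs implementation.
    """
    log = []
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        log.append(f"E binding to {host}:{port} failed: {exc.strerror}")
        if exc.errno == errno.EADDRINUSE:
            log.append("E Port is already in use")
        sock.close()
        # Bail out here, so the success message below is never logged.
        return None, log
    sock.listen(16)
    log.append(f"I process started listening on port ({sock.getsockname()[1]})")
    return sock, log


if __name__ == "__main__":
    first, first_log = bind_and_listen(0)       # port 0: the kernel picks a free port
    taken_port = first.getsockname()[1]
    second, second_log = bind_and_listen(taken_port)  # same port again: bind fails
    print(*first_log, *second_log, sep="\n")
    first.close()
```

With the early return in place, the second call reports only the two error lines and never a "started listening" line for the port it could not bind.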
Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?
From the output below, it looks like kernel NFS was started (it may be
enabled on the machine).

Did you start rpcbind manually on that machine? If yes, can you please
check the kernel NFS status before and after starting that service?

--
Jiffin

----- Original Message -----
From: "Michael Scherer"
To: "Atin Mukherjee"
Cc: "Deepshikha Khandelwal", "Gluster Devel", "Jiffin Thottan", "gluster-infra"
Sent: Tuesday, April 23, 2019 7:44:49 PM
Subject: Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?

On Monday, 22 April 2019 at 22:57 +0530, Atin Mukherjee wrote:
> Is this back again? The recent patches are failing regression :-\ .

So, on builder206, it took me a while to find that the issue is that
nfs (the service) was running.

./tests/basic/afr/tarissue.t failed, because the nfs initialisation
failed with a rather cryptic message:

[2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0-
socket.nfs-server: process started listening on port (38465)
[2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0-
socket.nfs-server: binding to failed: Address already in use
[2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0-
socket.nfs-server: Port is already in use
[2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0-
socket.nfs-server: __socket_server_bind failed;closing socket 14

I found where this came from, but a few things surprised me:

- the order of the printed messages is different from the order in the
  code
- the "started listening" message did not take into account the fact
  that bind failed at:

https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967

The message about port 38465 also threw me off the track. The real
issue is that the nfs service was already running, and I couldn't find
anything listening on port 38465.

Once I ran service nfs stop, it no longer failed.

So far, I do not know why nfs.service was activated.

But at least 206 should be fixed, and we know a bit more about what
could be causing some of the failures.
Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?
On Monday, 22 April 2019 at 22:57 +0530, Atin Mukherjee wrote:
> Is this back again? The recent patches are failing regression :-\ .

So, on builder206, it took me a while to find that the issue is that
nfs (the service) was running.

./tests/basic/afr/tarissue.t failed, because the nfs initialisation
failed with a rather cryptic message:

[2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0-
socket.nfs-server: process started listening on port (38465)
[2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0-
socket.nfs-server: binding to failed: Address already in use
[2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0-
socket.nfs-server: Port is already in use
[2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0-
socket.nfs-server: __socket_server_bind failed;closing socket 14

I found where this came from, but a few things surprised me:

- the order of the printed messages is different from the order in the
  code
- the "started listening" message did not take into account the fact
  that bind failed at:

https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967

The message about port 38465 also threw me off the track. The real
issue is that the nfs service was already running, and I couldn't find
anything listening on port 38465.

Once I ran service nfs stop, it no longer failed.

So far, I do not know why nfs.service was activated.

But at least 206 should be fixed, and we know a bit more about what
could be causing some of the failures.

--
Michael Scherer
Sysadmin, Community Infrastructure
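The observation about message order in the log excerpt above can be checked mechanically. The following is a minimal Python sketch that parses the four quoted log lines and compares their timestamps with the socket.c line numbers they report. The log text is copied from the message, rejoined on the assumption that the mail client wrapped each entry after "0-"; the regular expression and the function name parse are made up for this example and are not part of any gluster tooling.

```python
import re
from datetime import datetime

# The four log lines quoted in the message, one entry per line
# (assumption: the mail client wrapped each entry right after "0-").
LOG = """\
[2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0-socket.nfs-server: process started listening on port (38465)
[2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0-socket.nfs-server: binding to failed: Address already in use
[2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0-socket.nfs-server: Port is already in use
[2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0-socket.nfs-server: __socket_server_bind failed;closing socket 14
"""

# Hypothetical pattern for "[timestamp] LEVEL [file:line:function] message".
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\] (?P<msg>.*)$"
)


def parse(text):
    """Turn each matching log line into a dict with a parsed timestamp."""
    entries = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw)
        if match is None:
            continue  # skip anything that is not a log entry
        entries.append({
            "time": datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f"),
            "level": match["level"],
            "source_line": int(match["line"]),
            "func": match["func"],
            "msg": match["msg"],
        })
    return entries


if __name__ == "__main__":
    entries = parse(LOG)
    for entry in entries:
        print(entry["time"].time(), entry["level"], entry["source_line"], entry["func"])
    gap = entries[1]["time"] - entries[0]["time"]
    print("gap between first two entries (microseconds):", gap.microseconds)
```

The timestamps are already in increasing order, so the informational message from socket.c line 991 was written roughly 14 ms before the error from line 972. One possible reading, which the thread does not confirm, is that the two messages belong to separate bind attempts rather than to a single attempt logging out of order.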
Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?
Is this back again? The recent patches are failing regression :-\ .

On Wed, 3 Apr 2019 at 19:26, Michael Scherer wrote:

> So we are still looking at it to get a complete understanding of the
> issue, but so far, we hacked our way to make it work (or so I think).
>
> Deepshikha is working to fix it long term, by fixing the eth0/ens5
> issue with a new base image.
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS

--
- Atin (atinm)
_______________________________________________
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra
Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?
On Wednesday, 03 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
> That's precisely what the question is. Why suddenly we're seeing this
> happening too frequently. Today I saw atleast 4 to 5 such failures
> already.
>
> Deepshika - Can you please help in inspecting this?

So we think (we are not sure) that the issue is a bit complex.

What we were investigating was the nightly run failing on AWS. When the
build crashes, the builder is restarted, since that's the easiest way to
clean everything (even with a perfect test suite that cleans up after
itself, we could always end up in a corrupt state on the system, WRT
mounts, fs, etc.).

In turn, this seems to cause trouble on AWS, since cloud-init or
something renames the eth0 interface to ens5 without cleaning up the
network configuration.

So the network init script fails (because the image says "start eth0"
and that's not present), but it fails in a weird way. The network is
initialised and working (we can connect), but the dhclient process is
not in the right cgroup, and network.service is in a failed state.
Restarting the network didn't work. In turn, this means that rpc-statd
refuses to start (due to systemd dependencies), which seems to impact
various NFS tests.

We have also seen that on some builders, rpcbind picks up some IPv6
autoconfiguration, but we can't reproduce that, and there is no IPv6
set up anywhere. I suspect the network.service failure is somehow
involved, but I fail to see how. In turn, rpcbind.socket not starting
could cause trouble for the NFS tests.

Our current stopgap fix was to fix all the builders one by one: remove
the config, kill the rogue dhclient, restart the network service.

However, we can't be sure this is going to fix the problem long term,
since this only manifests after a crash of the test suite, and that
doesn't happen so often. (Plus, it was working at some point in the
past, before something made it fail, and I do not know if that was a
system upgrade, a test change, or both.)

So we are still looking at it to get a complete understanding of the
issue, but so far, we hacked our way to make it work (or so I think).

Deepshikha is working to fix it long term, by fixing the eth0/ens5
issue with a new base image.

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
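The dependency chain described above (network.service failed, so rpc-statd does not start, and rpcbind.socket may not start either) suggests a quick health check before NFS tests run. The sketch below is a hedged Python illustration, not something the infra team is known to use. It relies on the real `systemctl is-active UNIT` command, but the unit names rpc-statd.service and rpcbind.socket are assumed spellings for a CentOS 7 builder, and the function names are invented for the example. The `run` parameter exists only so the check can be exercised without systemd.

```python
import subprocess

# Units named in the message; the exact unit file names are an assumption.
UNITS = ("network.service", "rpc-statd.service", "rpcbind.socket")


def unit_state(unit, run=subprocess.run):
    """Return the state word printed by `systemctl is-active UNIT`."""
    result = run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit only means "not active"
    )
    return result.stdout.strip() or "unknown"


def unhealthy_units(units=UNITS, run=subprocess.run):
    """Map every unit that is not 'active' to the state systemctl reported."""
    states = {unit: unit_state(unit, run) for unit in units}
    return {unit: state for unit, state in states.items() if state != "active"}


if __name__ == "__main__":
    try:
        problems = unhealthy_units()
    except FileNotFoundError:
        print("systemctl not found; this check only applies to systemd hosts")
    else:
        for unit, state in problems.items():
            print(f"{unit}: {state}")
```

On a builder in the state described in the message, one would expect network.service to be reported as failed; whether the other two units show up depends on the host.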
Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?
On Wednesday, 03 April 2019 at 15:12 +0300, Yaniv Kaul wrote:
> On Wed, Apr 3, 2019 at 2:53 PM Michael Scherer wrote:
>
> > So in the past, this kind of stuff did happen with ipv6, so this
> > could be a change on AWS and/or a upgrade.
>
> We need to enable IPv6, for two reasons:
> 1. IPv6 is common these days, even if we don't test with it, it
>    should be there.
> 2. We should test with IPv6...
>
> I'm not sure, but I suspect we do disable IPv6 here and there.
> Example[1].
> Y.
>
> [1]
> https://github.com/gluster/centosci/blob/master/jobs/scripts/glusto/setup-glusto.yml

We do disable IPv6 for sure. Nigel spent 3 days just on that for the
AWS migration, and we have a dedicated playbook, applied on all
builders, that tries to disable everything in every possible way:
https://github.com/gluster/gluster.org_ansible_configuration/blob/master/roles/jenkins_builder/tasks/disable_ipv6_linux.yml

According to the comment, that's from 2016, and I am sure this goes
further back in the past, since it simply wasn't documented before.

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
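Whether IPv6 is actually disabled on a given builder can be read back from the kernel, independently of what the playbook intends. The sketch below is a small Python illustration that reads the standard Linux `disable_ipv6` sysctl file under /proc; the function name ipv6_disabled and the `proc_root` parameter are assumptions added so the check can be pointed at a test directory. It is not taken from the playbook linked above.

```python
from pathlib import Path


def ipv6_disabled(interface="all", proc_root="/proc/sys/net/ipv6/conf"):
    """Return True/False from the disable_ipv6 sysctl, or None if unreadable.

    `interface` is a directory name under proc_root, for example "all",
    "default", or a specific interface name.
    """
    path = Path(proc_root) / interface / "disable_ipv6"
    try:
        return path.read_text().strip() == "1"
    except OSError:
        # The file is missing, for example when the kernel has no IPv6
        # support loaded, so there is no setting to report.
        return None


if __name__ == "__main__":
    for name in ("all", "default"):
        print(name, ipv6_disabled(name))
```

A result of None for "all" would mean the sysctl tree is absent, which is a different situation from IPv6 being present but switched off.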
Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?
On Wed, Apr 3, 2019 at 2:53 PM Michael Scherer wrote:

> On Wednesday, 03 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
> > That's precisely what the question is. Why suddenly we're seeing
> > this happening too frequently. Today I saw atleast 4 to 5 such
> > failures already.
> >
> > Deepshika - Can you please help in inspecting this?
>
> So in the past, this kind of stuff did happen with ipv6, so this could
> be a change on AWS and/or a upgrade.

We need to enable IPv6, for two reasons:
1. IPv6 is common these days; even if we don't test with it, it should
   be there.
2. We should test with IPv6...

I'm not sure, but I suspect we do disable IPv6 here and there.
Example[1].
Y.

[1] https://github.com/gluster/centosci/blob/master/jobs/scripts/glusto/setup-glusto.yml

> We are currently investigating a set of failure that happen after
> reboot (resulting in partial network bring up, causing all kind of
> weird issue), but it take some time to verify it, and since we lost
> 33% of the team with Nigel departure, stuff do not move as fast as
> before.
>
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
> _______________________________________________
> Gluster-devel mailing list
> gluster-de...@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?
On Wednesday, 03 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
> That's precisely what the question is. Why suddenly we're seeing this
> happening too frequently. Today I saw atleast 4 to 5 such failures
> already.
>
> Deepshika - Can you please help in inspecting this?

So in the past, this kind of thing did happen with IPv6, so this could
be a change on AWS and/or an upgrade.

We are currently investigating a set of failures that happen after
reboot (resulting in a partial network bring-up, causing all kinds of
weird issues), but it takes some time to verify, and since we lost 33%
of the team with Nigel's departure, things do not move as fast as
before.

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?
On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan wrote:

> Hi,
>
> is_nfs_export_available is just a wrapper around the "showmount"
> command AFAIR.
> I saw the following messages in the console output:
> mount.nfs: rpc.statd is not running but is required for remote locking.
> 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
> 05:06:55 mount.nfs: an incorrect mount option was specified
>
> To me it looks like rpcbind may not be running on the machine.
> Usually rpcbind starts automatically on machines; I don't know
> whether this can happen or not.

That's precisely what the question is: why are we suddenly seeing this
happen so frequently? Today I saw at least 4 to 5 such failures already.

Deepshikha - Can you please help in inspecting this?

> Regards,
> Jiffin
>
> - Original Message -
> From: "Atin Mukherjee"
> To: "gluster-infra", "Gluster Devel" <gluster-de...@gluster.org>
> Sent: Wednesday, April 3, 2019 10:46:51 AM
> Subject: [Gluster-devel] is_nfs_export_available from nfs.rc failing
> too often?
>
> I'm observing the above test function failing too often, because of
> which the arbiter-mount.t test fails in many regression jobs. Failures
> were not this frequent earlier. Does anyone know what has changed
> recently to cause these failures in regression? I also hear that when
> such a failure happens a reboot is required; is that true, and if so,
> why?
>
> One reference:
> https://build.gluster.org/job/centos7-regression/5340/consoleFull
Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?
Hi,

is_nfs_export_available is just a wrapper around the "showmount" command
AFAIR.
I saw the following messages in the console output:
mount.nfs: rpc.statd is not running but is required for remote locking.
05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
05:06:55 mount.nfs: an incorrect mount option was specified

To me it looks like rpcbind may not be running on the machine.
Usually rpcbind starts automatically on machines; I don't know whether
this can happen or not.

Regards,
Jiffin

- Original Message -
From: "Atin Mukherjee"
To: "gluster-infra", "Gluster Devel"
Sent: Wednesday, April 3, 2019 10:46:51 AM
Subject: [Gluster-devel] is_nfs_export_available from nfs.rc failing too often?

I'm observing the above test function failing too often, because of
which the arbiter-mount.t test fails in many regression jobs. Failures
were not this frequent earlier. Does anyone know what has changed
recently to cause these failures in regression? I also hear that when
such a failure happens a reboot is required; is that true, and if so,
why?

One reference:
https://build.gluster.org/job/centos7-regression/5340/consoleFull
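To make the diagnosis in this message easier to reproduce on a builder, here is a rough sketch of the two checks it implies: is rpcbind reachable, and does `showmount -e` list the volume? This is not the code from nfs.rc. The function names are hypothetical, and it assumes that rpcbind listens on its well-known TCP port 111 and that the volume is exported as `/<volume>`.

```python
import socket
import subprocess


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def count_exports(showmount_output: str, volume: str) -> int:
    """Count lines of `showmount -e` output whose path is /<volume>."""
    count = 0
    for line in showmount_output.splitlines():
        fields = line.split()
        # Export lines start with the exported path; the header line
        # ("Export list for <host>:") does not start with "/".
        if fields and fields[0] == "/" + volume:
            count += 1
    return count


def export_available(host: str, volume: str) -> bool:
    """Hypothetical helper: check rpcbind first, then ask showmount."""
    if not tcp_port_open(host, 111):  # assumed rpcbind port
        return False
    try:
        result = subprocess.run(
            ["showmount", "-e", host],
            capture_output=True, text=True, timeout=30, check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # showmount is not installed, or it hung waiting on RPC.
        return False
    return result.returncode == 0 and count_exports(result.stdout, volume) > 0
```

Splitting the check this way separates "rpcbind is not answering" from "rpcbind answers but the export is missing", which the single showmount call cannot tell apart.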