Is this back again? The recent patches are failing regression. :-\

On Wed, 3 Apr 2019 at 19:26, Michael Scherer <[email protected]> wrote:
> On Wednesday, 3 April 2019 at 16:30 +0530, Atin Mukherjee wrote:
> > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > is_nfs_export_available is just a wrapper around the "showmount"
> > > command, AFAIR.
> > > I saw the following messages in the console output:
> > > mount.nfs: rpc.statd is not running but is required for remote locking.
> > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
> > > 05:06:55 mount.nfs: an incorrect mount option was specified
> > >
> > > To me it looks like rpcbind may not be running on the machine.
> > > Usually rpcbind starts automatically on machines; I don't know
> > > whether it can fail to do so.
> >
> > That's precisely the question: why are we suddenly seeing this happen
> > so frequently? Today I saw at least 4 to 5 such failures already.
> >
> > Deepshika - can you please help inspect this?
>
> So we think (we are not sure) that the issue is a bit complex.
>
> What we were investigating was a nightly run failing on AWS. When the
> build crashes, the builder is restarted, since that's the easiest way
> to clean everything (even with a perfect test suite that cleaned up
> after itself, we could still end up with the system in a corrupt state
> WRT mounts, filesystems, etc.).
>
> In turn, this seems to cause trouble on AWS, since cloud-init or
> something renames the eth0 interface to ens5 without cleaning up the
> network configuration.
>
> So the network init script fails (because the image says "start eth0"
> and that interface is not present), but it fails in a weird way: the
> network is initialised and working (we can connect), but the dhclient
> process is not in the right cgroup, and network.service is in a failed
> state. Restarting the network didn't work. In turn, this means that
> rpc-statd refuses to start (due to systemd dependencies), which seems
> to impact various NFS tests.
>
> We have also seen that on some builders rpcbind picks up some IPv6
> autoconfiguration, but we can't reproduce that, and there is no IPv6
> set up anywhere. I suspect the network.service failure is somehow
> involved, but I fail to see how. In turn, rpcbind.socket not starting
> could cause NFS test trouble.
>
> Our current stop-gap fix was to repair all the builders one by one:
> remove the stale config, kill the rogue dhclient, and restart the
> network service.
>
> However, we can't be sure this is going to fix the problem long term,
> since it only manifests after a crash of the test suite, and that
> doesn't happen often. (Plus, it was working until some point in the
> past, when something made it start failing, and I do not know whether
> that was a system upgrade, a test change, or both.)
>
> So we are still looking at it to get a complete understanding of the
> issue, but so far we have hacked our way to make it work (or so I
> think).
>
> Deepshika is working on the long-term fix, addressing the eth0/ens5
> issue with a new base image.
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS


--
- Atin (atinm)
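For reference, a minimal sketch of what a showmount-based
is_nfs_export_available wrapper could look like (the real helper lives in
the Gluster test framework; the argument handling and export path format
here are assumptions for illustration):

#!/bin/bash
# Hedged sketch of a showmount wrapper like the one described above.
# NOTE: argument handling and the export path are assumptions; the real
# is_nfs_export_available lives in the Gluster test harness.
function is_nfs_export_available () {
    local vol=$1
    # showmount -e asks mountd (via rpcbind) for the export list, so if
    # rpcbind is not running this fails even when the export exists --
    # matching the symptom discussed in the thread.
    showmount -e 127.0.0.1 2>/dev/null | grep -qw "/${vol}"
}

# Usage: returns 0 when the volume is exported, non-zero otherwise.
is_nfs_export_available patchy && echo "export visible"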
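The per-builder stop-gap described above amounts to something like the
following (a sketch assuming RHEL/CentOS-style builders using
network-scripts; exact paths and unit names may differ):

#!/bin/bash
# Sketch of the one-off builder cleanup described in the thread. Paths
# and unit names are assumptions for a RHEL/CentOS-style image.

# Confirm the symptoms: failed network.service, dhclient in the wrong cgroup.
systemctl status network.service
ps -o pid,cgroup -C dhclient

# 1. Remove the stale config for the interface cloud-init renamed away.
rm -f /etc/sysconfig/network-scripts/ifcfg-eth0

# 2. Kill the rogue dhclient left over from the failed init.
pkill dhclient

# 3. Restart networking so network.service leaves the failed state.
systemctl restart network.service

# 4. rpc-statd should now start, since its systemd dependencies are met.
systemctl restart rpcbind.socket rpc-statd.service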
_______________________________________________
Gluster-infra mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-infra
