Hello all,

To diagnose the underlying cause of the NFS issues we've been having, I will be copying the shared storage to a non-thin-provisioned filesystem and rolling back the NFS server kernel to a version known to work with the controller hardware without issue (i.e., the same one used in production on identical hardware).

In practice, this means a short outage of the NFS service (~30 minutes) during the switch, after which the filesystem will return without the timetravel snapshot features (which is the reason we were using the newer kernel).

Annoyingly, due to a technical constraint of NFS, this probably means that instances with NFS filesystems mounted will have to be rebooted after the switch (as the FSID will change). If your instance gives you errors about a "stale NFS handle" after the switch, this is what happened, and a reboot will fix it.

If the problem persists with the older kernel and driver, then we have an actual hardware issue and will switch hardware around to solve it (which will require another outage in the following days). If the switch to the older kernel /does/ fix the issue, then we will keep using that configuration (no snapshots) until the driver regression has been fixed upstream or by the vendor.

I am planning the outage for 20:00 UTC, provided the copy takes roughly the estimated amount of time. If the stalls themselves slow the copy down and I need to push the outage back, I'll send another update to the mailing list.

-- Marc

_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l
