It's not too often we see good news on the zfs-discuss list, so here's some:

We at the High Performance Computing Center at MSU have finally worked out the root cause of a long-standing issue with our OpenSolaris NFS servers. It was a minor configuration issue, involving a ZFS file system property.

A little backstory: We chose Sun X4540s running OpenSolaris and ZFS for our home directory space. We initially implemented 100TB of usable space. All was well for a while, but then some mostly annoying issues started popping up:

1. 0-byte files named '4913' were appearing in user directories. We discovered that vi was doing:

open("4913")
close("4913")
remove("4913")

The remove() operation would fail intermittently. With assistance from the helpful folks at SGI (we originally thought this was a Linux NFSv4 client problem), testing revealed that the NFS server on Solaris occasionally returns NFS4ERR_FILE_OPEN, which the client does not handle. According to a Linux NFS kernel developer, "the error is usually due to ordering issues with asynchronous RPC calls." (http://www.linux-nfs.org/Linux-2.6.x/2.6.18/linux-2.6.18-068-handle_nfs4err_file_open.dif) We applied a patch to the Linux NFSv4 client that tells it to wait and retry when it receives that error.
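For anyone who wants to check their own mounts, here is a rough Python sketch of the same create/close/remove sequence, run in a loop so an intermittent failure has a chance to show up. The function name, the default attempt count, and the choice to report errno names are mine, not anything vi does:

```python
import errno
import os


def probe_create_remove(directory, name="4913", attempts=100):
    """Repeat create/close/remove and collect any remove() failures.

    Returns a list of (attempt_number, errno_name) tuples; an empty
    list means every remove() succeeded.
    """
    path = os.path.join(directory, name)
    failures = []
    for attempt in range(attempts):
        # Create (or reopen, if a previous remove failed) the 0-byte file.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        os.close(fd)
        try:
            os.remove(path)
        except OSError as exc:
            failures.append((attempt, errno.errorcode.get(exc.errno, str(exc.errno))))
    return failures
```

Point it at a directory on the NFS mount, e.g. probe_create_remove("/path/to/nfs/home"). A non-empty result, or a leftover 0-byte file, would match what we saw.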

2. There was also an issue with gedit. When opening and then saving an existing file, it did:

open("file")
rename("file","file~")

rename() returned "Input/Output Error." After applying the fix for #1, rename() hung indefinitely. We also noticed a similar problem with gcc.

Interestingly, running this test locally on the OpenSolaris server, on the same file system, resulted in a "permission denied" error. When we mounted the same file system over NFSv4 on another OpenSolaris system, we received the same "permission denied" error.
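A minimal Python sketch of the rename test, assuming (as the sequence above suggests) that the file is still held open when rename() is called. The function name and the return convention are made up for illustration:

```python
import errno
import os


def rename_while_open(path):
    """Open `path`, then rename it to `path~` while it is still open.

    Returns None if rename() succeeds, otherwise the symbolic errno
    name (for example "EIO" or "EACCES").
    """
    backup = path + "~"
    fd = os.open(path, os.O_RDONLY)
    try:
        os.rename(path, backup)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        # Release the descriptor whether or not the rename worked.
        os.close(fd)
    return None
```

On a file system without mandatory locking this should return None and leave only the "~" copy behind. Note that a call that hangs, as ours did over NFSv4 after the client patch, will simply never return.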


Yesterday, we discovered the property 'nbmand' was set on the ZFS file systems in question. This was a leftover from our initial testing with Solaris CIFS. It was set because the documentation at http://dlc.sun.com/osol/docs/content/SSMBAG/managingsmbsharestm.html and http://204.152.191.100/wiki/index.php/Getting_Started_With_the_Solaris_CIFS_Service instructed that nbmand should be turned on when using CIFS. What isn't mentioned, however, is that nbmand can adversely affect the behavior of NFSv4 and even local file systems. The ZFS admin guide also states that nbmand applies only to CIFS clients, when it actually applies to NFSv4 clients as well as local file system access.
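If you want to check whether any of your own file systems still have the property set, something like the following Python sketch may help. It assumes the zfs command is on the PATH and relies on "zfs get -H -o name,value nbmand" printing one tab-separated name/value pair per line; the function names are hypothetical:

```python
import subprocess


def parse_nbmand(output):
    """Return dataset names whose nbmand value is "on".

    `output` is the text printed by `zfs get -H -o name,value nbmand`:
    one "name<TAB>value" pair per line, no header.
    """
    enabled = []
    for line in output.splitlines():
        if not line.strip():
            continue
        name, _, value = line.partition("\t")
        if value.strip() == "on":
            enabled.append(name)
    return enabled


def datasets_with_nbmand():
    """Run zfs and report every dataset that has nbmand=on."""
    result = subprocess.run(
        ["zfs", "get", "-H", "-o", "name,value", "nbmand"],
        capture_output=True, text=True, check=True,
    )
    return parse_nbmand(result.stdout)
```

Anything it reports can be switched off with "zfs set nbmand=off <dataset>". As far as I can tell from the zfs man page, the change only takes effect after the file system is unmounted and remounted.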

I think nbmand is also a bit slow in releasing its locks, which would explain the behavior in issue 1. The only tests we've run so far show that the "slow" locking behavior goes away when nbmand is turned off. Would filing a bug about this slow behavior of nbmand be the correct thing to do at this point? If so, where is the proper place to file it? I've been told these bug reports go to the OpenSolaris Bugzilla, but I'm not sure whether this should be filed at bugs.opensolaris.org or not.

Disabling nbmand on a test file system resolved both bugs, as well as other known issues that our users have been running into. All the various known issues this caused can be found at the MSU HPCC wiki: https://wiki.hpcc.msu.edu/display/Issues/Known+Issues, under "Home Directory file system."

-Greg


--
Greg Mason
System Administrator
High Performance Computing Center
Michigan State University
