Hello Wolfgang,

Wolfgang Rosner:
> short story before
> aufs on server, nfs export as nfsroot for clients
>
> the symptom on the client (first login after boot):
>
> root@blade-008:~# la
> ls: cannot open directory .: Stale NFS file handle
> root@blade-008:~# cd .
> root@blade-008:~# la
        :::
> I'd rather suspect a latency problem, since aufs mount and nfs eports are 
> generated within the same perl scripts. May it be that aufs mount is forked 
> from the script and not completed, before nfs export is called?
> How can I find out? how can I avoid? Other explanations?

Did the client boot correctly and reach the login prompt?
If so, I don't think the problem is the latency you describe, because
a booted client is strong evidence that aufs was mounted correctly
and exported correctly.
The mount should complete synchronously, and nfsd should be able to use
it as soon as mount returns. But that is entirely up to your mount(8)
command. Is it the ordinary one from util-linux?

If ESTALE happened at a very early stage of mounting the nfsroot, then
your guess ("latency problem") might be plausible.

As a first step, I'd suggest you look at what happened between the
completion of system boot and the "ls" where ESTALE occurred. In other
words, is there anything unusual around "getty", "login" or
"~user/.profile"? That covers your nfs-client.
And, just to make sure, ask the same question about your nfs-server. Is
there anything unusual on your nfs-server after the client has finished
booting?

To investigate further, you need to find out which system call and which
module returns ESTALE. The candidates are
- open(2) in ls(1)
- nfs client
- nfs server
- aufs on the server
- branch fs in aufs

The debugging tools for those are
- strace
- wireshark
- aufs module parameter "debug"

With these tools and this parameter, you can observe how each of these
modules behaves.
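
Before running strace, a quick probe on the client can show which of the
calls behind ls(1) reports the error. The following is only a minimal
sketch: the function name probe_directory is made up for illustration,
and it tells you which client-side call fails, not which layer (nfs
client, nfs server, aufs or the branch fs) produced the error.

```python
import errno
import os


def probe_directory(path):
    """Run the calls ls(1) relies on, one by one, and report each result.

    Returns a list of (step, outcome) pairs where outcome is "ok" or the
    symbolic errno name such as "ESTALE".
    """
    steps = [
        ("stat", lambda: os.stat(path)),
        ("open", lambda: os.close(os.open(path, os.O_RDONLY | os.O_DIRECTORY))),
        ("listdir", lambda: os.listdir(path)),
    ]
    results = []
    for name, call in steps:
        try:
            call()
            results.append((name, "ok"))
        except OSError as exc:
            # Translate the numeric errno into its name, e.g. ESTALE.
            results.append((name, errno.errorcode.get(exc.errno, str(exc.errno))))
    return results


if __name__ == "__main__":
    for step, outcome in probe_directory("."):
        print(f"{step:8s} {outcome}")
```

Running it from the freshly logged-in shell, before any "cd .", should
show whether the stale handle is already visible at stat or only at
open or directory read.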


> - layer 2 is masking among others /root/some.files and /var/log
> it's mounted as "ro+wh" (not visible in  /sys/fs/aufs/)

Do you mean
- you specified "=ro+wh" as a branch permission
- but /sys/fs/aufs/si_*/br2 shows "ro"
right?
If so, something is wrong. But I don't know whether it is related to
your "latency" problem.
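
To compare what you specified with what aufs reports, the branch
entries under sysfs can be dumped for every aufs mount. This is a
sketch under the assumption that each /sys/fs/aufs/si_*/brN file holds
a single "path=permission" line; read_aufs_branches is a hypothetical
helper name, and the sysfs root is a parameter only so the function can
be tried on a copy.

```python
import os
import re


def read_aufs_branches(sysfs_root="/sys/fs/aufs"):
    """Return {si_name: [(index, path, perm), ...]} for each aufs mount.

    Assumes every brN file contains one line of the form "path=perm".
    """
    branches = {}
    for si_name in sorted(os.listdir(sysfs_root)):
        si_dir = os.path.join(sysfs_root, si_name)
        if not (si_name.startswith("si_") and os.path.isdir(si_dir)):
            continue
        entries = []
        for name in os.listdir(si_dir):
            match = re.fullmatch(r"br(\d+)", name)
            if not match:
                continue  # skip anything that is not a plain brN entry
            with open(os.path.join(si_dir, name)) as fh:
                line = fh.read().strip()
            # Split on the last "=" so paths containing "=" survive.
            path, _, perm = line.rpartition("=")
            entries.append((int(match.group(1)), path, perm))
        branches[si_name] = sorted(entries)
    return branches
```

If br2 comes back as "ro" although the mount option said "ro+wh", that
is the mismatch asked about above.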


> root@cruncher:/cluster/etc/scripts/available# exportfs -v
> ....
> /cluster/mp/nfsr/aufs_008                192.168.130.8
> (rw,wdelay,crossmnt,no_root_squash,no_subtree_check,fsid=158,sec=sys,rw,no_root_squash,no_all_squash)

Are you setting a different fsid value for each client?
Such as
- fsid=158 for /cluster/mp/nfsr/aufs_008
- fsid=159 for /cluster/mp/nfsr/aufs_009
        :::
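
A quick way to check this is to scan the "exportfs -v" output for fsid
values that are used by more than one export. The following is a rough
sketch, not a complete exports parser: find_duplicate_fsids is a
hypothetical name, and the regular expression assumes each entry looks
like "path client(options)", with the client possibly wrapped onto the
next line.

```python
import re
from collections import defaultdict

# Path, then client, then the parenthesised option list. Whitespace
# between them may include a newline when exportfs wraps long paths.
EXPORT_RE = re.compile(r"(/\S*)\s+([^\s(]+)\s*\(([^)]*)\)")


def find_duplicate_fsids(exportfs_output):
    """Return {fsid: [(path, client), ...]} for fsids used more than once."""
    by_fsid = defaultdict(list)
    for path, client, options in EXPORT_RE.findall(exportfs_output):
        for opt in options.split(","):
            key, _, value = opt.strip().partition("=")
            if key == "fsid":
                by_fsid[value].append((path, client))
    return {fsid: uses for fsid, uses in by_fsid.items() if len(uses) > 1}
```

Feed it the text captured from "exportfs -v" on the server. An empty
result means no two exports share an fsid.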


J. R. Okajima
