Hi Mark,

> On 3 Nov 2017, at 15:51, Mark Vitale <[email protected]> wrote:
> 
> Ragge,
> 
>> On Nov 3, 2017, at 9:46 AM, Ragnar Sundblad <[email protected]> wrote:
>> 
>> We have compute clusters where the nodes have almost everything of their 
>> roots in afs; most things in /, as /etc and /usr, are soft links into a 
>> complete os installation in afs. To be able to have some writable files and 
>> directories, such as /etc/adjtime or /var/tmp, we bind mount files and 
>> directories in the tree which is actually in afs (mainly using the rwtab 
>> functionality), and a lustre client that also gets mounted in the afs tree.
>> 
>> When we upgraded from CentOS 7.3 to 7.4, kernel 3.10.0-693.5.2.el7.x86_64, 
>> and using OpenAFS client 1.6.21.1 or 1.6.20.1, when users having home 
>> directories in afs log in and start accessing their data, mounts in the afs 
>> tree start to get randomly unmounted. In the lustre case, the lustre client 
>> nicely reports that it unmounts, so the unmounts seem to be handled in an 
>> orderly manner.
>> 
>> We have a suspicion this may be related to the problem reported in the 
>> thread “getcwd() error for RHEL 7.4 kernel”, and that the kernel for 
>> some reason decides that the path to the mount point is no good and unmounts it.
>> In addition, when this has started to happen, we are not able to mount 
>> anything more into afs, mount returns ENOENT.
>> 
>> This is pretty easy to repeat.
> Thank you for your detailed report.
> I have an idea about what this may be, but I will try to duplicate it on my 
> test system first.

Thanks for investigating! :-)

>> Our workaround for now is to use the tmpfs based root all the way down to 
>> the mount points, and have soft links into afs further down for the rest, 
>> which seems to work.
> It’s good that you have a workaround; thank you for sharing that as well.
> 
>> Please let us know if we can provide any help debugging this.
> For now I would like to see your afsd options, and also the output from 
> ‘cmdebug <client> -cache’ for an affected client.  

We start it like so:
/bin/chroot /sysimage /usr/vice/etc/afsd -memcache -verbose -nosettime -dynroot \
    -mountdir /afs
(Before systemd is started, we set up the runtime root in /sysimage, then 
chroot there, and start systemd to let it bring up the system.)

Here is a cmdebug:
# cmdebug tegner-login-2 -cache
Chunk files:   1562
Stat caches:   2343
Data caches:   1562
Volume caches: 200
Chunk size:    65536
Cache size:    100000 kB
Set time:      no
Cache type:    memory

I now see that I forgot to mention that we use memory cache (since the nodes 
are diskless).

> Although you haven’t reported the getcwd() problem, could you please 
> confirm if you’ve seen it or not?

We have not seen it, but we haven’t really looked for it either. Is there some 
test we could try?

> And finally, just to confirm, you have seen bind mounts in /afs unmounted at 
> CentOS 7.4 with both OpenAFS 1.6.21.1 and 1.6.20.1, but _not_ with CentOS 7.3 
> and those same OpenAFS client releases - correct?

With 7.3 (kernel 3.10.0-514.26.2.el7.x86_64) we actually used OpenAFS client 
1.6.20.2, and with that combination this mount-within-afs thing worked just 
fine.
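
In case it helps when reproducing this, here is a rough sketch of how the 
disappearing mounts could be caught as they happen. It only polls 
/proc/self/mounts and reports mount points below /afs that were present in 
one snapshot and gone in the next; it does not explain why they go away. The 
function names (mounts_under, vanished, watch) and the 5 second interval are 
made up for illustration, and /afs is assumed from our -mountdir setting.

```python
import time


def mounts_under(mounts_text, prefix="/afs"):
    """Return the set of mount points strictly below `prefix`.

    `mounts_text` is the content of /proc/self/mounts (or /proc/mounts);
    the second whitespace-separated field of each line is the mount point.
    """
    base = prefix.rstrip("/")
    found = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        # The kernel writes spaces in paths as the octal escape \040.
        mount_point = fields[1].replace("\\040", " ")
        if mount_point.startswith(base + "/"):
            found.add(mount_point)
    return found


def vanished(before, after):
    """Return, sorted, the mount points present in `before` but not in `after`."""
    return sorted(before - after)


def watch(prefix="/afs", interval=5):
    """Poll /proc/self/mounts forever and print mounts that disappear.

    Hypothetical helper: the interval and the output format are arbitrary.
    """
    def snapshot():
        with open("/proc/self/mounts") as f:
            return mounts_under(f.read(), prefix)

    previous = snapshot()
    while True:
        time.sleep(interval)
        current = snapshot()
        for mount_point in vanished(previous, current):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(stamp, "unmounted:", mount_point)
        previous = current
```

Calling watch() on an affected node while a user logs in and touches their 
afs home directory should print one line per mount that goes away, which 
could then be matched against the timestamps in the lustre client log.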

Thanks!

/ragge

_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel
