I have some more information that we just discovered. It seems to happen only in the top level of an affected user's home directory. My colleague tried running the affected commands in a subdirectory of his home directory and didn't have any issues. This explains why users get that message at login, since the login shell starts in the top level of their home directory.
I was also able to get similar errors when running commands from his home directory, while authenticated as myself, on an updated system. I didn't get any login message when I logged into the system, and I don't have the issue in any other AFS directory (home directory or otherwise) on the same system. For example, I ran the same pip2 command indicated below while in his home directory and got the same error message (see output below). I am able to run these same commands just fine in my colleague's home directory on other systems that are running an older kernel version (before the RHEL 7.4 base kernel). It seems to happen with any command that makes a getcwd() call, and we've seen it with the qsub/qsub.orig, tcsh, tmux, and pip2 commands, to name a few. Multiple people were able to replicate what I found.

[mvanderw@02 ~]$ uname -r
3.10.0-693.5.2.el7.x86_64
[mvanderw@02 shampton]$ pwd
/afs/crc.nd.edu/user/s/shampton
[mvanderw@02 shampton]$ pip2 search PyYaml
The folder you are executing pip from can no longer be found.
[mvanderw@02 shampton]$ cd support/
[mvanderw@02 support]$ pip2 search PyYaml
aspy.yaml (1.0.0) - A few extensions to pyyaml.
...
[mvanderw@02 shampton]$ cd ~
[mvanderw@02 ~]$ pip2 search PyYaml
aspy.yaml (1.0.0) - A few extensions to pyyaml.
...

[mvanderw@01 shampton]$ uname -r
3.10.0-514.26.2.el7.x86_64
[mvanderw@01 shampton]$ pwd
/afs/crc.nd.edu/user/s/shampton
[mvanderw@01 shampton]$ pip2 search PyYaml
aspy.yaml (1.0.0) - A few extensions to pyyaml.
...

We tried running 'fs flushvolume' on the user volume (containing his home directory) on the updated system, and also tried 'fs flushall' on that system; neither had any impact. We also tried moving my colleague's user volume to another file server, and that didn't have any impact either.

Any ideas what might be the issue? Is there anything else we can try that might help diagnose this?

Hopefully this additional information can shed some light!
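In case a standalone check is useful to anyone trying to reproduce this without pip or qsub, here is a minimal Python sketch. It assumes the failure really is in getcwd() itself, as the strace output suggests. The function name check_getcwd is made up for illustration and is not part of any of the tools mentioned above. It changes into each directory given on the command line, calls getcwd(), and reports whether the call succeeded.

```python
import os
import sys


def check_getcwd(path):
    """chdir into `path` and report whether getcwd() succeeds there.

    Returns (ok, detail): detail is the working directory reported by
    getcwd() on success, or the error text on failure.
    (Hypothetical helper, for illustration only.)
    """
    # Keep a file descriptor for the starting directory so we can return
    # to it afterwards without calling getcwd() ourselves.
    start_fd = os.open(".", os.O_RDONLY)
    try:
        os.chdir(path)
        return True, os.getcwd()
    except OSError as exc:
        return False, "%s: %s" % (type(exc).__name__, exc)
    finally:
        os.fchdir(start_fd)
        os.close(start_fd)


if __name__ == "__main__":
    for directory in sys.argv[1:]:
        ok, detail = check_getcwd(directory)
        print("%-4s %s (%s)" % ("OK" if ok else "FAIL", directory, detail))
```

Saved under some name such as check_getcwd.py (again, just an example name), it could be pointed at both the top-level home directory and a subdirectory of it, on both the updated and the older-kernel systems, to compare results side by side.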
Thanks in advance for any help!

--
Matt Vander Werf
HPC System Administrator
University of Notre Dame
Center for Research Computing - Union Station
506 W. South Street
South Bend, IN 46601
Phone: (574) 631-0692

On Thu, Oct 26, 2017 at 2:17 PM, Matt Vander Werf <[email protected]> wrote:

> Hello all,
>
> One of my colleagues was able to reproduce the issue today and so I have
> some additional information to share. We upgraded one of our main
> interactive systems to the latest RHEL 7 kernel to use as a test system to
> try and reproduce the issue. This was done early morning yesterday,
> Wednesday, 10/25 (Eastern), and we didn't get any reports of any issues
> until my colleague was able to reproduce it this afternoon.
>
> They got the error message at login, as specified below. And also got the
> 'getcwd() failed' message for several applications.
>
> I had them run an strace on a couple of commands where they got the
> '<application name>: getcwd() failed' error message, as I thought that
> might be useful potentially. I've attached the strace output from two
> commands, 'qsub' and 'qsub.orig'. qsub is simply a wrapper bash script
> that performs some checks and then ultimately calls the qsub.orig command
> with certain options. We weren't sure if it was the qsub or qsub.orig that
> was causing the issue to be triggered, but he got the error message on
> both commands. qsub is what our users use to submit jobs to our batch
> system
>
> In addition, he also got a related message when trying to run the 'pip2'
> command, which we haven't seen with this issue before. The error message
> from doing a simple 'pip2 search PyYaml' is:
>
> The folder you are executing pip from can no longer be found.
>
> From the strace (also attached), you can see it is failing on a getcwd()
> call. This is using Python 2.7.11, built with gcc 4.9.2.
>
> If there's anything else that would be useful for my colleague to gather,
> please let us know!
>
> I've also ran a 'cmdebug <server> -long' command on the system in
> question, and can send the output of that as well, if anyone thinks that'd
> be useful.
>
> Hope this helps at least somewhat!
>
> Thanks!
>
> --
> Matt Vander Werf
> HPC System Administrator
> University of Notre Dame
> Center for Research Computing - Union Station
> 506 W. South Street
> South Bend, IN 46601
> Phone: (574) 631-0692
>
> On Fri, Oct 20, 2017 at 5:24 PM, Matt Vander Werf <[email protected]> wrote:
>
>> Okay, I've attached a config.log.
>>
>> Our afsd options are:
>>
>> AFSD_ARGS="-fakestat -chunksize 20 -daemons 6 -afsdb".
>>
>> Unfortunately, I've been unable to replicate this myself (despite
>> numerous attempts). It only seems to happen on systems that are in heavy
>> use with a lot of users, most likely since it doesn't happen for every user
>> (takes enough user usage for us to get complaints). Most of the reports
>> about this have been with our job submission software, specifically with
>> the qsub command, but the error has also shown up right after users log in
>> (as I indicated before):
>>
>> shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
>> tcsh: No such file or directory
>> tcsh: Trying to start from "<user AFS home directory>"
>>
>>
>> We have not tried falling back to 1.6.20.2 or earlier, but we may try to
>> do this yet. Unfortunately, until we are able to replicate it ourselves,
>> it's going to be hard to test this without potentially causing disruption
>> to our users again.
>>
>> Thanks.
>>
>> --
>> Matt Vander Werf
>> HPC System Administrator
>> University of Notre Dame
>> Center for Research Computing - Union Station
>> 506 W. South Street
>> South Bend, IN 46601
>> Phone: (574) 631-0692
>>
>> On Fri, Oct 20, 2017 at 3:16 PM, Mark Vitale <[email protected]>
>> wrote:
>>
>>>
>>> > On Oct 20, 2017, at 11:03 AM, Matt Vander Werf <[email protected]>
>>> wrote:
>>> >
>>> > We can still do a manual configure on a system and attach the
>>> config.log as well, if you'd still like that. Just let us know.
>>>
>>> Yes, I believe seeing a config log from your site could still be helpful
>>> in tracking this down.
>>>
>>> I know the problem has been intermittent for you, but could you provide
>>> any more information about your environment (especially afsd options) or
>>> application(s) that trigger this? So far I’ve not been able to reproduce
>>> any getcwd issues in my tests with your kernel level and OpenAFS release,
>>> so any tips on what triggers this for you would be helpful.
>>>
>>> Also, have you tried falling back to OpenAFS 1.6.20.2 or earlier? If
>>> so, what were your results? If not, could you try it and let us know?
>>>
>>> Thanks,
>>> —
>>> Mark Vitale
>>> OpenAFS release team
>>>
>>>
>>
>
