Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-12-20 Thread Matt Vander Werf
Hi Mark, Thanks for the info. Can you elaborate on what exactly would result if it "unsafely continue its “walk” of the d_alias list after dropping the i_lock"? Kernel panic/crash? Segfault? Data corruption? We've been running the current 1.6.x patch (12796 with 1.6.22) on a production system

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-12-20 Thread Mark Vitale
> On Dec 5, 2017, at 11:28 AM, Matt Vander Werf wrote: > > I've created RPMs using the source (1.6.21.1) with this patch and have > installed it on several systems running the latest RHEL 7.4 kernel. I haven’t > noticed any issues from the fixes (can't say my testing has been

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-12-05 Thread Matt Vander Werf
I've created RPMs using the source (1.6.21.1) with this patch and have installed it on several systems running the latest RHEL 7.4 kernel. I haven't noticed any issues from the fixes (can't say my testing has been exhaustive though), but these also aren't very busy systems and I also haven't ever

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-12-01 Thread Mark Vitale
> On Dec 1, 2017, at 1:48 PM, Matt Vander Werf wrote: > > I noticed you added your patch(es) to gerrit for the RHEL 7.4 getcwd issue > (Thanks!). > > Responding to your comment on the latest commit, "I can submit an equivalent, > but simpler, "emergency" 1.6.x backport of

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-12-01 Thread Matt Vander Werf
I noticed you added your patch(es) to gerrit for the RHEL 7.4 getcwd issue (Thanks!). Responding to your comment on the latest commit, "I can submit an equivalent, but simpler, "emergency" 1.6.x backport of just this top commit on request.": This definitely would be preferred from our end! (Would

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-11-19 Thread Benjamin Kaduk
On Fri, Nov 17, 2017 at 05:35:25PM -0500, Garance A Drosehn wrote: > > I wonder if it has to do with the home directory being an AFS mount > point > (as opposed to a standard directory somewhere inside an AFS volume), but > I > have not had the time to do any tests of that idea. > > The fact

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-11-19 Thread Mark Vitale
> On Nov 16, 2017, at 12:26 PM, Stephan Wiesand wrote: > > > On Nov 16, 2017, at 07:06 , Benjamin Kaduk wrote: > >> On Wed, Nov 15, 2017 at 01:02:15PM -0500, Matt Vander Werf wrote: >>> Hello, >>> >>> Are there any updates or progress on a potential fix for this

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-11-17 Thread Garance A Drosehn
On 18 Oct 2017, at 19:21, Benjamin Kaduk wrote: On Tue, Oct 17, 2017 at 11:55:27AM -0400, Jacob Bonek wrote: This is a major issue that has caused us to have to stay at the latest pre-RHEL 7.4 kernel for a long time now while this issue has existed. This may be related to previous issues

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-11-16 Thread Stephan Wiesand
On Nov 16, 2017, at 07:06 , Benjamin Kaduk wrote: > On Wed, Nov 15, 2017 at 01:02:15PM -0500, Matt Vander Werf wrote: >> Hello, >> >> Are there any updates or progress on a potential fix for this issue? >> Anything we can do to help figure things out? > > This topic was on the agenda for our

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-11-15 Thread Benjamin Kaduk
On Wed, Nov 15, 2017 at 01:02:15PM -0500, Matt Vander Werf wrote: > Hello, > > Are there any updates or progress on a potential fix for this issue? > Anything we can do to help figure things out? This topic was on the agenda for our release-team meeting yesterday. If I remember correctly,

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-11-15 Thread Jason Edgecombe
I'm seeing this on some CentOS 7.4 systems that don't have AFS installed at all. It tends to happen in SMB network folders. --- Jason Edgecombe | Linux Administrator UNC Charlotte | The William States Lee College of

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-11-15 Thread Matt Vander Werf
Hello, Are there any updates or progress on a potential fix for this issue? Anything we can do to help figure things out? More and more users are encountering the issue on systems we have updated, forcing us to downgrade the kernel on them as well (including the

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-11-15 Thread Fabien Wernli
Hi, We're experiencing the exact same issue as nd.edu, namely random getcwd() error messages, especially from users with tcsh as their login shell. I've been trying to reproduce it using pwd loops in different accounts' home directories, without success. It's really very random. Our config: ##

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-11-09 Thread Matt Vander Werf
Hi Ben, Attached is the output from running the command 'lsof /afs' after running both 'echo 2 > /proc/sys/vm/drop_caches' and 'fs flushall' on the system we're testing with the updated kernel. Is this what you were looking for? Let me know if you were wanting something different at all. We do

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-11-01 Thread Matt Vander Werf
Thanks for the update! Let us know if there's anything else you need from us. We're happy to test out any potential fixes, if you'd like more testing done. Thanks. -- Matt Vander Werf HPC System Administrator University of Notre Dame Center for Research Computing - Union Station 506 W. South

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-10-30 Thread Mark Vitale
Matt, > On Oct 28, 2017, at 9:38 AM, Matt Vander Werf wrote: > > Attached is the output from running the command 'lsof /afs' after running > both 'echo 2 > /proc/sys/vm/drop_caches' and 'fs flushall' on the system > we're testing with the updated kernel. Is this what you were

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-10-28 Thread Benjamin Kaduk
On Fri, Oct 27, 2017 at 12:34:08PM -0400, Matt Vander Werf wrote: > Hi Ben, > > Following > https://unix.stackexchange.com/questions/17936/setting-proc-sys-vm-drop-caches-to-clear-cache, > I dropped the pagecache (echo 1 > /proc/sys/vm/drop_caches) and that didn't > make any difference. I then

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-10-27 Thread Matt Vander Werf
Hi Ben, Following https://unix.stackexchange.com/questions/17936/setting-proc-sys-vm-drop-caches-to-clear-cache, I dropped the pagecache (echo 1 > /proc/sys/vm/drop_caches) and that didn't make any difference. I then freed the dentries and inodes (echo 2 > /proc/sys/vm/drop_caches) and that

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-10-27 Thread Benjamin Kaduk
On Fri, Oct 27, 2017 at 12:13:32PM -0400, Matt Vander Werf wrote: > > Any ideas what might be the issue? Anything else we can try that might help > diagnose this? I forget if this was covered in the initial report, but did you try writing to /proc/sys/vm/drop_caches (IIRC the usable values are a

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-10-27 Thread Matt Vander Werf
I have some more information that we just discovered. It seems to only happen when in the top level of the affected users' home directory. My colleague tried running (affected) commands in a subdirectory of his home directory and didn't have any issues. This makes sense why the users get that

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-10-20 Thread Mark Vitale
> On Oct 20, 2017, at 11:03 AM, Matt Vander Werf wrote: > > We can still do a manual configure on a system and attach the config.log as > well, if you'd still like that. Just let us know. Yes, I believe seeing a config log from your site could still be helpful in tracking

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-10-20 Thread Stephan Wiesand
> On 20. Oct 2017, at 03:41, Benjamin Kaduk wrote: > > Hi Matt, > > On Thu, Oct 19, 2017 at 09:18:56AM -0400, Matt Vander Werf wrote: >> Hi Ben, >> >> What do you mean by an openafs config.log? Where would this be at? Would it >> be on the client or the AFS file server? Or is

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-10-19 Thread Benjamin Kaduk
Hi Matt, On Thu, Oct 19, 2017 at 09:18:56AM -0400, Matt Vander Werf wrote: > Hi Ben, > > What do you mean by an openafs config.log? Where would this be at? Would it > be on the client or the AFS file server? Or is there something that needs > to be done to generate this log file? This is the

Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-10-19 Thread Matt Vander Werf
Hi Ben, What do you mean by an openafs config.log? Where would this be at? Would it be on the client or the AFS file server? Or is there something that needs to be done to generate this log file? Thanks. -- Matt Vander Werf HPC System Administrator University of Notre Dame Center for Research

[OpenAFS] getcwd() error for RHEL 7.4 kernel

2017-10-17 Thread Jacob Bonek
Hello, We're having some strange issues with OpenAFS lately. It started after installing the base RHEL 7.4 kernel, 3.10.0-693.el7.x86_64 back in August, with the latest version of OpenAFS client at the time, 1.6.21. We've tried using the now latest version, 1.6.21.1, and still have the same