Coming back to this issue after a break. I reviewed the code carefully and I cannot find anything in the application code that could affect the RTS file-locking mechanism. Let me walk through the steps of the test up to the failure, and what the code does at each step. The test output looks like this:

    Ensuring dir: /tmp/fsevent_common-3725879775a96dfe/watch-root
    createFile: Creating file: /tmp/fsevent_common-3725879775a96dfe/watch-root/file1
    FileSystem.Event: /tmp/fsevent_common-3725879775a96dfe/watch-root/file1: openFile: resource busy (file is locked)

These steps run serially, before the file watching tests start. We first create a "watch-root" directory using createDirectory, and then use openFile to create a file named "file1" inside the directory we just created; it is this openFile call that fails with the "file is locked" error. The directory path is unique: it has never been used before, in this test or in any other test, and this is the first time we create this file.
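Distilled to code, the failing sequence amounts to no more than this (a minimal sketch, not the actual test code: the per-run hash in the path is replaced by an "XXXX" placeholder, and I am assuming createDirectoryIfMissing for the "Ensuring dir" step and WriteMode for the open):

    import System.Directory (createDirectoryIfMissing)
    import System.IO (IOMode (WriteMode), hClose, openFile)

    main :: IO ()
    main = do
        -- The real path carries a unique per-run hash; "XXXX" is a placeholder.
        let root = "/tmp/fsevent_common-XXXX/watch-root"
        -- "Ensuring dir": create the never-before-used watch root.
        createDirectoryIfMissing True root
        -- "createFile": this is the open that intermittently fails with
        -- "openFile: resource busy (file is locked)".
        h <- openFile (root ++ "/file1") WriteMode
        hClose h

Nothing here touches locking explicitly; the lock on "file1" that openFile runs into must already be sitting in the RTS lock table when the open happens.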
Now let me explain a bit about the custom file opening/closing logic in the program. In each test, we first create a watch fd using inotify_init and create a Handle from it:

    watch_fd <- c_inotify_init

Note that watch_fd does not belong to an actual file on the file system, so locking is irrelevant for this fd. FWIW, we avoid the lockFile code path when creating the Handle from this fd (sketched below).

To watch a file or directory, we add it to the watch using something like:

    watch_descriptor <- c_inotify_add_watch watch_fd path flags

Note that we never create a GHC Handle from watch_descriptor. To remove a file from the watch we do something like this:

    rm = c_inotify_rm_watch watch_fd watch_descriptor

This removes the watch_descriptor returned by inotify_add_watch.

None of these operations inserts anything into the GHC RTS locking table. In fact, we do not create a Handle in any case other than for watch_fd, which does not even represent a file on the file system, and we skip the lockFile code when creating that Handle.
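For reference, the Handle creation for the watch fd looks roughly like this (a condensed sketch, not the exact code in the Linux.hs module linked below; c_inotify_init is our FFI import, and I am using base's GHC.IO.Handle.FD.fdToHandle' here to show the relevant point: passing (Just Stream) as the device type means GHC never classifies the fd as a regular file, and only regular files go through lockFile):

    {-# LANGUAGE ForeignFunctionInterface #-}

    import Foreign.C.Types (CInt (..))
    import GHC.IO.Device (IODeviceType (Stream))
    import GHC.IO.Handle.FD (fdToHandle')
    import System.IO (Handle, IOMode (ReadMode))

    foreign import ccall unsafe "sys/inotify.h inotify_init"
        c_inotify_init :: IO CInt

    -- Make a Handle from the inotify fd. Declaring the fd a Stream up front
    -- (instead of letting GHC fstat and classify it) keeps the RTS lockFile
    -- path, which applies only to regular files, from ever being entered.
    openWatchHandle :: IO Handle
    openWatchHandle = do
        watch_fd <- c_inotify_init
        fdToHandle' watch_fd (Just Stream) False "<inotify>" ReadMode True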
After adding the file to the watch, we open the file, perform read/write operations on it using the regular GHC file operations, and then close it. I am not sure whether removing the watch_descriptor interferes in any way with GHC's file descriptor that is open on the file for normal reading/writing. I assume the watch_descriptor returned by the inotify_add_watch operation on the file and the regular fd in the GHC Handle returned by the openFile operation on the file are different, so our removing the watch_descriptor should not affect the removal of the lock entry by the RTS when the file Handle is closed.

Regarding how to debug this: I have not been able to reproduce the problem on a local Linux machine, even when using ext4 as the temp directory's file system. On the local system I can see that inodes are being reused all the time on ext4, but still the problem does not reproduce; I ran the tests for several days and did not get a single instance of the issue. The only way forward seems to be to create a GHC build that logs lock table insertions and removals, and then run the CI using that GHC build. Even in CI I can reproduce the problem only on lucky days.

-harendra

On Thu, 10 Oct 2024 at 04:32, Viktor Dukhovni <ietf-d...@dukhovni.org> wrote:
>
> On Wed, Oct 09, 2024 at 12:15:32PM +0530, Harendra Kumar wrote:
> > We do use low level C APIs and GHC APIs to create a Handle in the
> > event watching module. But that is for the watch-root and not for the
> > file that is experiencing this problem. So here is how it works. We
> > have a top level directory which is watched for events using inotify.
> > We first create this directory, this directory is opened using
> > inotify_init which returns a C file descriptor. We then create a
> > Handle from this fd, this Handle is used for watching inotify events.
> > We are then creating a file inside this directory which is being
> > watched while we are reading events from the parent directory. The
> > resource-busy issue occurs when creating a file inside this directory.
> > So we are not creating the Handle for the file in question in a
> > non-standard manner, but the parent directory Handle is being created
> > in that manner. I do not know if that somehow affects anything. Or if
> > the fact that the directory is being watched using inotify makes any
> > difference?
> >
> > The code for creating the watch Handle is here:
> > https://github.com/composewell/streamly/blob/bbac52d9e09fa5ad760ab6ee5572c701e198d4ee/src/Streamly/Internal/FileSystem/Event/Linux.hs#L589
> > Viktor, you may want to take a quick look at this to see if it can
> > make any difference to the issue at hand.
>
> I don't have the cycles to isolate the problem. I still suspect that
> your code is somehow directly closing file descriptors associated with
> a Handle. This then orphans the associated logical reader/writer lock,
> which then gets inherited by the next incarnation of the same
> (dev, ino) pair. However, if the filesystem underlying "/tmp" were
> actually "tmpfs", inode reuse would be quite unlikely, because tmpfs
> inodes are assigned from a strictly incrementing counter:
>
>     $ for i in {1..10}; do touch /tmp/foobar; ls -i /tmp/foobar; rm /tmp/foobar; done
>     3830 /tmp/foobar
>     3831 /tmp/foobar
>     3832 /tmp/foobar
>     3833 /tmp/foobar
>     3834 /tmp/foobar
>     3835 /tmp/foobar
>     3836 /tmp/foobar
>     3837 /tmp/foobar
>     3838 /tmp/foobar
>     3839 /tmp/foobar
>
> but IIRC you mentioned that on Github "/tmp" is ext4, not "tmpfs"
> (perhaps RAM-backed storage is a more scarce resource), in which case
> indeed inode reuse is quite likely:
>
>     $ for i in {1..10}; do touch /var/tmp/foobar; ls -i /var/tmp/foobar; rm /var/tmp/foobar; done
>     25854141 /var/tmp/foobar
>     25854142 /var/tmp/foobar
>     25854141 /var/tmp/foobar
>     25854142 /var/tmp/foobar
>     25854141 /var/tmp/foobar
>     25854142 /var/tmp/foobar
>     25854141 /var/tmp/foobar
>     25854142 /var/tmp/foobar
>     25854141 /var/tmp/foobar
>     25854142 /var/tmp/foobar
>
> But since normal open/close of Handles acquires the lock after open,
> and releases it before close, the evidence points to a bypass of the
> normal open file lifecycle.
>
> Your codebase contains a bunch of custom file management logic, which
> could be the source of the problem. To find the problem code path,
> you'd probably need to instrument the RTS lock/unlock code to log its
> activity: (mode, descriptor, dev, ino) tuples being added and removed.
> And strace execution to be able to identify descriptor open and close
> events. Ideally the problem will be reproducible even with strace.
>
> Good luck.
>
> --
> Viktor.

_______________________________________________
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs