Hi Ross,

----- Original Message -----
(snip)
> We haven't observed any problems that can be directly attributed to this
> without KASAN, although it is hard to tell what a stray write may do. We
> have hit sporadic asserts and filesystem corruption during testing.
>
> When I added tracing, the time between freeing a glock and writing to it
> varied but could be up to hundreds of milliseconds so I would guess that
> this could easily happen without KASAN. It is relatively easy to
> reproduce in our test environment.
>
> Do you have any suggestions for tracking down the root cause?

In the past, I've debugged problems with glock reference counting by
using kernel tracing and instrumentation. Unfortunately, the "glock_put"
trace point only shows you when the glock ref count goes to 0. It doesn't
show when or how the glock was first created, so it can't show whether
the glock was created and destroyed multiple times. That is often
important to figuring these problems out; otherwise the trace is just a
lot of chaos.

To get around that, I've added my own temporary kernel trace point for
when new glocks are created, and called it "glock_new". You probably also
want to modify the glock put functions, such as gfs2_glock_put and
gfs2_glock_queue_put, to call a trace point so you can see those calls
too, and have it save the gl_lockref reference count in the trace. Then
recreate the problem with the trace running.

I attached a script I often use for these purposes. The script contains
several bogus trace point references for various sets of temporary trace
points I've added and deleted over the years, like a generic "debug"
trace point where I can add generic messages about what's happening. So
don't be surprised if you get errors about trying to cat values into
non-existent debugfs files. Just ignore them. The script DOES contain a
trigger for a "glock_new" trace point for just this purpose.

I can try to dig out whether I still have that trace point (glock_new)
and the generic debug trace point lying around somewhere in my many git
repositories, but it might take longer than just writing them again from
scratch. I know it pre-dates the concept of a "queued_put", so things
will need to be tweaked anyway.

The script has a bunch of declares at the top for which trace points to
monitor and collect. I modified it for glock_new and glock_put, but you
can play with it.

To run the script and collect the trace, just do this:

    ./gfs2trace.sh &
    (recreate the problem)
    rm /var/run/gfs2-tracepoints.pid

Removing that file triggers the trace script to stop tracing and save the
results to a file in /tmp/ named after the machine's name (so we can keep
them straight in clustered situations).

Then, of course, someone needs to analyze the resulting trace file and
figure out where the count is getting off.

I hope this helps.

Regards,

Bob Peterson
Red Hat File Systems
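
P.S. For the analysis step, here is a rough sketch of the kind of pass I
would make over the trace file. It is only an illustration: the line
format (an event name, then "glock TYPE:NUMBER", then "ref N") is an
assumption about what the temporary glock_new and modified glock_put
trace points would print, and find_suspect_events is a made-up helper
name. Adjust the regular expression to whatever your trace points
actually emit.

```python
import re

# ASSUMPTION: hypothetical line format. Real ftrace output carries more
# fields, and "ref N" only exists if the temporary trace points record
# the gl_lockref count. Here "ref" is taken to be the count after the event.
EVENT_RE = re.compile(
    r"(?P<event>glock_new|glock_put):.*?glock (?P<glock>\d+:\d+).*?ref (?P<ref>-?\d+)"
)


def find_suspect_events(lines):
    """Return (line_number, glock, reason) for events that look inconsistent."""
    live = set()      # glocks created and not yet dropped to a zero count
    suspects = []
    for lineno, line in enumerate(lines, start=1):
        m = EVENT_RE.search(line)
        if m is None:
            continue  # not one of the two events we care about
        glock, ref = m["glock"], int(m["ref"])
        if m["event"] == "glock_new":
            if glock in live:
                suspects.append((lineno, glock, "created while still live"))
            live.add(glock)
        else:  # glock_put
            if glock not in live:
                suspects.append((lineno, glock, "put without a live glock"))
            if ref <= 0:
                live.discard(glock)  # count reached zero: treat as freed
    return suspects


if __name__ == "__main__":
    sample = [
        "glock_new: glock 2:1234 ref 1",
        "glock_put: glock 2:1234 ref 0",
        "glock_put: glock 2:1234 ref -1",
    ]
    for lineno, glock, reason in find_suspect_events(sample):
        print(f"line {lineno}: glock {glock}: {reason}")
```

A put flagged as "put without a live glock" may simply mean the glock was
created before tracing started, so treat the output as a list of places
to look, not as proof of a bug.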