Exactly, I missed this definition of d_lock and treated it as a standalone member of struct dentry:
#define d_lock  d_lockref.lock

Thanks for the explanation. I will post a new patch following your suggestion.

Regards,
Liangyan

On 20/12/22 1:35 AM, Al Viro wrote:
On Tue, Dec 22, 2020 at 12:51:27AM +0800, Liangyan wrote:
This is the race scenario, based on the call trace we captured, which causes the dentry leak.


      CPU 0                                CPU 1
ovl_set_redirect                       lookup_fast
   ovl_get_redirect                       __d_lookup
     dget_dlock
       //no lock protection here            spin_lock(&dentry->d_lock)
       dentry->d_lockref.count++            dentry->d_lockref.count++
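
For reference, these are the two increments involved, abridged from mainline include/linux/dcache.h and fs/dcache.c (the surrounding hash-walk code is trimmed). dget_dlock() assumes the caller already holds ->d_lock, which ovl_get_redirect() does not:

static inline struct dentry *dget_dlock(struct dentry *dentry)
{
	/* plain increment: only safe with dentry->d_lock already held */
	if (dentry)
		dentry->d_lockref.count++;
	return dentry;
}

struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
{
	...
	spin_lock(&dentry->d_lock);
	...
	dentry->d_lockref.count++;	/* same plain increment, but under ->d_lock */
	found = dentry;
	spin_unlock(&dentry->d_lock);
	...
}

So one CPU bumping ->count under the lock and another bumping it with no lock at all can lose an update.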


If we use dget_parent instead, we may have this race.


      CPU 0                                    CPU 1
ovl_set_redirect                           lookup_fast
   ovl_get_redirect                           __d_lookup
     dget_parent
       raw_seqcount_begin(&dentry->d_seq)      spin_lock(&dentry->d_lock)
       lockref_get_not_zero(&ret->d_lockref)   dentry->d_lockref.count++
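
For reference, the CPU 0 column corresponds to this part of dget_parent() in fs/dcache.c (abridged; the locked retry path taken when the optimistic attempt fails is trimmed):

struct dentry *dget_parent(struct dentry *dentry)
{
	int gotref;
	struct dentry *ret;
	unsigned seq;

	/* optimistic parent lookup without any locking */
	rcu_read_lock();
	seq = raw_seqcount_begin(&dentry->d_seq);
	ret = READ_ONCE(dentry->d_parent);
	gotref = lockref_get_not_zero(&ret->d_lockref);
	rcu_read_unlock();
	if (likely(gotref)) {
		if (!read_seqcount_retry(&dentry->d_seq, seq))
			return ret;
		dput(ret);
	}
	...
}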

And?

lockref_get_not_zero() will observe ->d_lock held and fall back to
taking it.

The whole point of lockref is that counter and spinlock are next to each
other.  Fastpath in lockref_get_not_zero is cmpxchg on both, and
it is taken only if ->d_lock is *NOT* locked.  And the slow path
there will do spin_lock() around the manipulations of ->count.
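
As a sketch of what that looks like, condensed from lib/lockref.c with the CMPXCHG_LOOP macro expanded by hand (the retry limit and the CONFIG_ARCH_USE_CMPXCHG_LOCKREF gating are omitted, so treat this as a sketch rather than the exact source):

struct lockref {
	union {
		aligned_u64 lock_count;	/* lock and count, updated as one word */
		struct {
			spinlock_t lock;
			int count;
		};
	};
};

int lockref_get_not_zero(struct lockref *lockref)
{
	struct lockref old, new, prev;
	int retval;

	old.lock_count = READ_ONCE(lockref->lock_count);
	/* fastpath: attempted only while the embedded spinlock is unlocked */
	while (arch_spin_value_unlocked(old.lock.rlock.raw_lock)) {
		prev = old;
		new = old;
		new.count++;
		if (old.count <= 0)
			return 0;
		/* bump ->count and recheck ->lock in a single 64-bit cmpxchg */
		old.lock_count = cmpxchg64_relaxed(&lockref->lock_count,
						   old.lock_count,
						   new.lock_count);
		if (old.lock_count == prev.lock_count)
			return 1;
		cpu_relax();
	}

	/* slowpath: ->lock is held; manipulate ->count under it */
	spin_lock(&lockref->lock);
	retval = 0;
	if (lockref->count > 0) {
		lockref->count++;
		retval = 1;
	}
	spin_unlock(&lockref->lock);
	return retval;
}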

Note that ->d_lock is simply ->d_lockref.lock; ->d_seq has nothing
to do with the whole thing.

The race in mainline is real; if you can observe anything of that
sort with dget_parent(), we have a much worse problem.  Consider
dget() vs. lookup_fast() - no overlayfs weirdness in sight and the
same kind of concurrent access.
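
For comparison, dget() is just this (include/linux/dcache.h, abridged):

static inline struct dentry *dget(struct dentry *dentry)
{
	if (dentry)
		lockref_get(&dentry->d_lockref);
	return dentry;
}

lockref_get() has the same cmpxchg-fastpath/spin_lock-slowpath shape as
lockref_get_not_zero() above, so it is safe against __d_lookup() bumping
->count under ->d_lock.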

Again, lockref primitives can be safely mixed with other threads
doing operations on ->count while holding ->lock.
