Hi SeongJae,
Apologies for the delayed response.
Thank you for the detailed review and for actually testing the patch
on a kernel with the fix reverted.
You are right that the userspace selftest cannot reliably trigger this
race. The race window between kdamond cancelling requests and
unsetting the pointer is microseconds wide. Since userspace calls have
system call overhead, they cannot hit that window reliably — which is
why the test always passed even on a broken kernel.
Regarding the error condition question:
When kdamond stops naturally, any subsequent write to the state file
(which update_schemes_tried_regions() does internally) fails because
kdamond is no longer running. So receiving an error indicates kdamond
exited without deadlock — the expected success condition. I should
have explained this more clearly in the comment.
I agree that a KUnit test is the right approach since it runs inside
the kernel and can directly set walk_control_obsolete to simulate the
race condition reliably without timing dependencies.
I will drop the userspace selftest approach and write a KUnit test for
damos_walk() functionality including walk_control_obsolete. I will
verify it fails on a kernel with commit 33c3f6c2b48c ("mm/damon/core:
fix damos_walk() vs kdamond_fn() exit race") reverted and passes with
the fix present before sending v3.
Regarding other feedback:
- Will use correct commit description format with subject in brackets
- Will remove the Fixes: tag
- Will add blank line before _damon_sysfs import
- Will add changelog and link previous versions
- Will share revision plan and wait at least one day
before sending new version
Please give me some time to study mm/damon/tests/core-kunit.h and
implement this properly.
Thanks,
Sailesh Nandanavanam
On Fri, Jun 5, 2026 at 12:44 PM Sailesh Nandanavanam
<[email protected]> wrote:
>
> Hi SeongJae,
>
> Apologies for the delayed response.
>
> Thank you for the detailed review and for actually testing the patch on a
> kernel with the fix reverted.
>
> You are right that the userspace selftest cannot reliably trigger this race.
> The race window between kdamond cancelling requests and unsetting the pointer
> is microseconds wide. Since userspace calls have system call overhead, they
> cannot hit that window reliably — which is why the test always passed even on
> a broken kernel.
>
> Regarding the error condition question:
>
> When kdamond stops naturally, any subsequent write to the state file (which
> update_schemes_tried_regions() does internally) fails because kdamond is no
> longer running. So receiving an error indicates kdamond exited without
> deadlock — the expected success condition. I should have explained this more
> clearly in the comment.
>
> I agree that a KUnit test is the right approach since it runs inside the
> kernel and can directly set walk_control_obsolete to simulate the race
> condition reliably without timing dependencies.
>
> I will drop the userspace selftest approach and write a KUnit test for
> damos_walk() functionality including walk_control_obsolete. I will verify it
> fails on a kernel with commit 33c3f6c2b48c ("mm/damon/core: fix damos_walk()
> vs kdamond_fn() exit race") reverted and passes with the fix present before
> sending v3.
>
> Regarding other feedback:
> - Will use correct commit description format with subject in brackets
> - Will remove the Fixes: tag
> - Will add blank line before _damon_sysfs import
> - Will add changelog and link previous versions
> - Will share revision plan and wait at least one day
> before sending new version
>
> Please give me some time to study mm/damon/tests/core-kunit.h and implement
> this properly.
>
> Thanks,
> Sailesh Nandanavanam
>
> On Mon, May 25, 2026 at 12:08 AM SeongJae Park <[email protected]> wrote:
>>
>> Hello Sailesh,
>>
>> On Sun, 24 May 2026 15:32:58 +0530 Sailesh Nandanavanam
>> <[email protected]> wrote:
>>
>> > Add a regression test that verifies damos_walk() does not deadlock
>> > when racing with kdamond_fn() exit.
>> >
>> > When kdamond_fn() finishes its main loop, it cancels remaining
>> > damos_walk() requests and unsets damon_ctx->kdamond. Without the fix
>> > in commit 33c3f6c2b48c, damos_walk() could be called right after
>>
>> Please use more common commit description format that has the commit subject.
>> E.g., commit 33c3f6c2b48c ("mm/damon/core: fix damos_walk() vs kdamond_fn()
>> exit race").
>>
>> > cancellation but before kdamond pointer unset, causing it to wait
>> > forever for handling that never comes.
>> >
>> > The test starts kdamond monitoring a short-lived process, waits for
>> > the process to exit naturally triggering kdamond termination, then
>> > rapidly calls update_schemes_tried_regions in a separate thread to
>> > hit the race window. Using a thread with join timeout ensures the
>> > test can detect kernel-level deadlocks where the system call blocks
>> > in uninterruptible state.
>> >
>> > The sysfs state path is resolved dynamically via the kdamonds object
>> > instead of being hardcoded, and exceptions are handled specifically
>> > as OSError rather than using a bare except block.
>>
>> Thank you for this patch. But, is this test reliable? I ran the test more
>> than 100 times on my system running a kernel that has commit 33c3f6c2b48c is
>> reverted. But test always passed.
>>
>> >
>> > Fixes: 33c3f6c2b48c ("mm/damon/core: fix damos_walk() vs kdamond_fn() exit
>> > race")
>>
>> This is not fixing a bug in the commit, so the above 'Fixes:' tag is
>> inappropriate and may only confuse people.
>>
>> > Signed-off-by: Sailesh Nandanavanam <[email protected]>
>> > ---
>>
>> From next time, please add changelog on the commentary area [1], with links
>> to
>> the previous revisions.
>>
>> I was able to find the previous version on the mailing list [2], so putting
>> the
>> link here for others.
>>
>> Also, before sending a new version, please share your revisioning plan and
>> give
>> time (about a daay) for others to comment about.
>>
>> > tools/testing/selftests/damon/Makefile | 1 +
>> > .../sysfs_damos_walk_kdamond_exit_race.py | 82 +++++++++++++++++++
>> > 2 files changed, 83 insertions(+)
>> > create mode 100755
>> > tools/testing/selftests/damon/sysfs_damos_walk_kdamond_exit_race.py
>> >
>> > diff --git a/tools/testing/selftests/damon/Makefile
>> > b/tools/testing/selftests/damon/Makefile
>> > index 2180c328a825..60c83d6c318e 100644
>> > --- a/tools/testing/selftests/damon/Makefile
>> > +++ b/tools/testing/selftests/damon/Makefile
>> > @@ -20,6 +20,7 @@ TEST_PROGS += sysfs_update_removed_scheme_dir.sh
>> > TEST_PROGS += sysfs_update_schemes_tried_regions_hang.py
>> > TEST_PROGS += sysfs_memcg_path_leak.sh
>> > TEST_PROGS += sysfs_no_op_commit_break.py
>> > +TEST_PROGS += sysfs_damos_walk_kdamond_exit_race.py
>> >
>> > EXTRA_CLEAN = __pycache__
>> >
>> > diff --git
>> > a/tools/testing/selftests/damon/sysfs_damos_walk_kdamond_exit_race.py
>> > b/tools/testing/selftests/damon/sysfs_damos_walk_kdamond_exit_race.py
>> > new file mode 100755
>> > index 000000000000..8e8006d63926
>> > --- /dev/null
>> > +++ b/tools/testing/selftests/damon/sysfs_damos_walk_kdamond_exit_race.py
>> > @@ -0,0 +1,82 @@
>> > +#!/usr/bin/env python3
>> > +# SPDX-License-Identifier: GPL-2.0
>> > +#
>> > +# Regression test for damos_walk() vs kdamond_fn() exit race.
>> > +#
>> > +# When kdamond_fn() finishes its main loop, it cancels remaining
>> > damos_walk()
>> > +# requests and unsets damon_ctx->kdamond. If damos_walk() is called right
>> > +# after cancellation but before kdamond pointer unset, it could wait
>> > forever
>> > +# for handling that never comes, causing a deadlock.
>> > +#
>> > +# This test verifies the fix by rapidly calling
>> > update_schemes_tried_regions
>> > +# while kdamond is naturally terminating (monitored process exits).
>> > +# Without the fix (commit 33c3f6c2b48c), this would hang indefinitely.
>> > +
>> > +import os
>> > +import subprocess
>> > +import threading
>> > +import time
>> > +import _damon_sysfs
>>
>> Let's add a blank line before _damon_sysfs import, to be consistent with
>> other
>> test files, like sysfs_update_schemes_tried_regions_hang.py.
>>
>> > +
>> > +def call_update(kdamond, result):
>> > + err = kdamond.update_schemes_tried_regions()
>> > + result['err'] = err
>> > + result['done'] = True
>> > +
>> > +def main():
>> > + proc = subprocess.Popen(['sleep', '0.3'])
>> > +
>> > + kdamonds = _damon_sysfs.Kdamonds([_damon_sysfs.Kdamond(
>> > + contexts=[_damon_sysfs.DamonCtx(
>> > + ops='vaddr',
>> > + targets=[_damon_sysfs.DamonTarget(pid=proc.pid)],
>> > + schemes=[_damon_sysfs.Damos(
>> > + action='stat',
>> > + access_pattern=_damon_sysfs.DamosAccessPattern(
>> > + nr_accesses=[0, 200]))]
>> > + )]
>> > + )])
>> > +
>> > + err = kdamonds.start()
>> > + if err is not None:
>> > + print('kdamond start failed: %s' % err)
>> > + exit(1)
>> > +
>> > + # Wait for monitored process to die naturally
>> > + proc.wait()
>> > +
>> > + # Rapidly call damos_walk() while kdamond is exiting
>> > + # Use a thread with real timeout to detect kernel-level deadlock
>> > + deadline = time.time() + 5
>> > + while time.time() < deadline:
>> > + result = {'done': False, 'err': None}
>> > + t = threading.Thread(target=call_update,
>> > + args=(kdamonds.kdamonds[0], result))
>> > + t.daemon = True
>> > + t.start()
>> > + t.join(timeout=5)
>>
>> I'm not sure if this is reliable to trigger the exact race. As I mentioned
>> abovely, I tried this test more than 100 times on a kernel that having the
>> fix
>> reverted, but I was unable to make the test fail. If it is that unreliable,
>> I'm not very sure if having this test is beneficial or just make people
>> confused.
>>
>> If the test has no false positive, maybe having this make sense to
>> opportunistically finding the bug. But I think the 5 seconds timeout is
>> still
>> not very reliable on some case, and therefore it seems false positive test
>> failure is available. If that is correct, I think having this test might
>> only
>> confuse people.
>>
>> I think having damos_walk() kunit test for its functionalities including the
>> walk_control_obsolete might make more sense.
>>
>> > +
>> > + if not result['done']:
>> > + print('FAIL: update_schemes_tried_regions hung - '
>> > + 'possible damos_walk/kdamond exit race deadlock')
>> > + exit(1)
>> > +
>> > + if result['err'] is not None:
>> > + # kdamond stopped cleanly - expected
>> > + break
>>
>> Is the above if condition correct? Could you please explain why having an
>> error here is expected?
>>
>> > +
>> > + # Check kdamond state via sysfs using dynamic path
>> > + state_path = os.path.join(
>> > + kdamonds.kdamonds[0].sysfs_dir(), 'state')
>> > + try:
>> > + with open(state_path) as f:
>> > + if f.read().strip() == 'off':
>> > + break
>> > + except OSError as e:
>> > + print('failed to read kdamond state: %s' % e)
>> > + exit(1)
>> > +
>> > + print('PASS: damos_walk() vs kdamond exit race not triggered')
>> > +
>> > +if __name__ == '__main__':
>> > + main()
>> > --
>> > 2.34.1
>>
>> [1] https://docs.kernel.org/process/submitting-patches.html#commentary
>> [2]
>> https://lore.kernel.org/[email protected]
>>
>>
>> Thanks,
>> SJ