On Wed, Jun 11, 2025 at 10:04:35AM +0200, Thomas Weißschuh wrote:
> On Wed, Jun 11, 2025 at 12:05:25AM -0700, Nicolin Chen wrote:
> > On Tue, Jun 10, 2025 at 08:46:57PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Jun 10, 2025 at 11:48:44AM -0700, Nicolin Chen wrote:
> > > > On Tue, Jun 10, 2025 at 09:09:02AM -0300, Jason Gunthorpe wrote:
> > > > > On Tue, Jun 10, 2025 at 01:38:22PM +0200, Thomas Weißschuh wrote:
> > > > > > > ------------------------------------------------------------------
> > > > > > > # RUN iommufd_dirty_tracking.domain_dirty128M_huge.enforce_dirty ...
> > > > > > > # enforce_dirty: Test terminated unexpectedly by signal 11
> > > > >
> > > > > Sig 11 is weird..
> > > > >
> > > > > > On another note, the selftest should use the kselftest_harness'
> > > > > > ASSERT_*() macros instead of plain assert().
> > > > >
> > > > > IIRC the kselftest stuff explodes if you try to use its assert
> > > > > functions within a fixture setup/teardown context.
> > > > >
> > > > > I also wasn't able to reproduce this (x86 ubuntu 24 LTS OS). Maybe
> > > > > it is ARM specific, I think Nicolin is running on ARM..
> > > >
> > > > Yes. And I was running with 64KB page size. I just quickly retried
> > > > with 4KB page size (matching x86), and all failed tests pass now.
> > >
> > > That's a weird thing to be sensitive to. Can you get a backtrace from
> > > the crash, what function/line is crashing?
> >
> > I think I am getting what's going on. Here the harness code has a
> > parent process and a child process:
> >
> > --------------------------------------------------------------
> > 428-		/* _metadata and potentially self are shared with all forks. */ \
> > 429:		child = fork(); \
> > 430:		if (child == 0) { \
> > 431-			fixture_name##_setup(_metadata, self, variant->data); \
> > 432-			/* Let setup failure terminate early. */ \
> > 433-			if (_metadata->exit_code) \
> > 434-				_exit(0); \
> > 435-			*_metadata->no_teardown = false; \
> > 436-			fixture_name##_##test_name(_metadata, self, variant->data); \
> > 437-			_metadata->teardown_fn(false, _metadata, self, variant->data); \
> > 438-			_exit(0); \
> > 439:		} else if (child < 0 || child != waitpid(child, &status, 0)) { \
> > 440-			ksft_print_msg("ERROR SPAWNING TEST GRANDCHILD\n"); \
> > 441-			_metadata->exit_code = KSFT_FAIL; \
> > 442-		} \
> > 443-		_metadata->teardown_fn(true, _metadata, self, variant->data); \
> > 444-		munmap(_metadata->no_teardown, sizeof(*_metadata->no_teardown)); \
> > 445-		_metadata->no_teardown = NULL; \
> > 446-		if (self && fixture_name##_teardown_parent) \
> > 447-			munmap(self, sizeof(*self)); \
> > 448-		if (WIFEXITED(status)) { \
> > 449-			if (WEXITSTATUS(status)) \
> > 450-				_metadata->exit_code = WEXITSTATUS(status); \
> > 451-		} else if (WIFSIGNALED(status)) { \
> > 452-			/* Forward signal to __wait_for_test(). */ \
> > 453-			kill(getpid(), WTERMSIG(status)); \
> > 454-		} \
> > ....
> > 456-	static void wrapper_##fixture_name##_##test_name##_teardown( \
> > 457-		bool in_parent, struct __test_metadata *_metadata, \
> > 458-		void *self, const void *variant) \
> > 459-	{ \
> > 460-		if (fixture_name##_teardown_parent == in_parent && \
> > 461-		    !__atomic_test_and_set(_metadata->no_teardown, __ATOMIC_RELAXED)) \
> > 462-			fixture_name##_teardown(_metadata, self, variant); \
> > 463-	} \
> > --------------------------------------------------------------
> >
> > I found there is a race between those two processes, resulting in
> > teardown() not getting invoked. When I added some ksft_print_msg()
> > calls in between those lines to debug, the tests mostly passed,
> > as teardown() did get invoked.
> >
> > I think the reason those huge page cases fail is simply that the
> > huge version of setup() takes a longer time..
>
> Can you try to recreate this issue with changes to
> tools/testing/selftests/kselftest_harness/harness-selftest.c ?
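For anyone following along, the teardown-once handshake in the lines quoted
above boils down to roughly the sketch below. It is a standalone illustration
with made-up names, not the harness code itself: one flag in shared memory,
armed by the child after setup, and __atomic_test_and_set() deciding which
process actually runs teardown.

/*
 * Minimal sketch of the pattern under discussion (illustration only):
 *  - *no_teardown starts as true ("setup not done, skip teardown")
 *  - the child clears it once setup has succeeded
 *  - both processes call maybe_teardown(); __atomic_test_and_set()
 *    returns the previous value, so only the first caller that sees
 *    "false" runs the teardown, and it runs at most once
 */
#include <stdbool.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static bool *no_teardown;

static void maybe_teardown(const char *who)
{
	if (!__atomic_test_and_set(no_teardown, __ATOMIC_RELAXED))
		printf("teardown run by %s\n", who);
}

int main(void)
{
	pid_t child;
	int status;

	no_teardown = mmap(NULL, sizeof(*no_teardown), PROT_READ | PROT_WRITE,
			   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (no_teardown == MAP_FAILED)
		return 1;
	*no_teardown = true;		/* setup not done yet */

	child = fork();
	if (child == 0) {
		/* "setup" succeeded, so arm exactly one teardown */
		*no_teardown = false;
		/* the test body would run here */
		maybe_teardown("child");
		_exit(0);
	}

	waitpid(child, &status, 0);
	maybe_teardown("parent");	/* no-op if the child already did it */
	munmap(no_teardown, sizeof(*no_teardown));
	return 0;
}

If the flag never gets cleared, or the cleared value is somehow not observed
by whichever process reaches maybe_teardown() first, teardown is skipped
entirely, which would match the missing teardown prints in the runs below.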
Hmm, I assume all 9 cases should pass? Mine only passes 4 on rc1, without
any change (perhaps we should start from here?):

TAP version 13
1..9
# Starting 9 tests from 4 test cases.
# RUN global.standalone_pass ...
# harness-selftest.c:19:standalone_pass:before
# harness-selftest.c:23:standalone_pass:after
# OK global.standalone_pass
ok 1 global.standalone_pass
# RUN global.standalone_fail ...
# harness-selftest.c:27:standalone_fail:before
# harness-selftest.c:29:standalone_fail:Expected 0 (0) == 1 (1)
# harness-selftest.c:30:standalone_fail:Expected 0 (0) == 1 (1)
# standalone_fail: Test terminated by assertion
# FAIL global.standalone_fail
not ok 2 global.standalone_fail
# RUN global.signal_pass ...
# harness-selftest.c:35:signal_pass:before
# harness-selftest.c:37:signal_pass:after
# OK global.signal_pass
ok 3 global.signal_pass
# RUN global.signal_fail ...
# harness-selftest.c:42:signal_fail:before
# harness-selftest.c:43:signal_fail:Expected 0 (0) == 1 (1)
# signal_fail: Test terminated by assertion
# FAIL global.signal_fail
not ok 4 global.signal_fail
# RUN fixture.pass ...
# harness-selftest.c:53:pass:setup
# harness-selftest.c:62:pass:before
# harness-selftest.c:19:pass:before
# harness-selftest.c:23:pass:after
# harness-selftest.c:66:pass:after
# harness-selftest.c:58:pass:teardown same-process=1
# OK fixture.pass
ok 5 fixture.pass
# RUN fixture.fail ...
# harness-selftest.c:53:fail:setup
# harness-selftest.c:70:fail:before
# harness-selftest.c:71:fail:Expected 0 (0) == 1 (1)
# harness-selftest.c:58:fail:teardown same-process=1
# fail: Test terminated by assertion
# FAIL fixture.fail
not ok 6 fixture.fail
# RUN fixture.timeout ...
# harness-selftest.c:53:timeout:setup
# harness-selftest.c:77:timeout:before
# timeout: Test terminated by timeout
# FAIL fixture.timeout
not ok 7 fixture.timeout
# RUN fixture_parent.pass ...
# harness-selftest.c:87:pass:setup
# harness-selftest.c:96:pass:before
# harness-selftest.c:98:pass:after
# harness-selftest.c:92:pass:teardown same-process=0
# OK fixture_parent.pass
ok 8 fixture_parent.pass
# RUN fixture_setup_failure.pass ...
# harness-selftest.c:106:pass:setup
# harness-selftest.c:108:pass:Expected 0 (0) == 1 (1)
# pass: Test terminated by assertion
# FAIL fixture_setup_failure.pass
not ok 9 fixture_setup_failure.pass
# FAILED: 4 / 9 tests passed.
# Totals: pass:4 fail:5 xfail:0 xpass:0 skip:0 error:0

> > I haven't figured out a proper fix yet, but something smells bad:
> > 1) *no_teardown is set non-atomically, while both processes call
> >    __atomic_test_and_set()
>
> Does this make a difference?
>
> diff --git a/tools/testing/selftests/kselftest_harness.h b/tools/testing/selftests/kselftest_harness.h
> index 2925e47db995..89fb37a21d9d 100644
> --- a/tools/testing/selftests/kselftest_harness.h
> +++ b/tools/testing/selftests/kselftest_harness.h
> @@ -410,7 +410,7 @@
>  	/* Makes sure there is only one teardown, even when child forks again. */ \
>  	_metadata->no_teardown = mmap(NULL, sizeof(*_metadata->no_teardown), \
>  				      PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); \
> -	*_metadata->no_teardown = true; \
> +	__atomic_store_n(_metadata->no_teardown, true, __ATOMIC_SEQ_CST); \
>  	if (sizeof(*self) > 0) { \
>  		if (fixture_name##_teardown_parent) { \
>  			self = mmap(NULL, sizeof(*self), PROT_READ | PROT_WRITE, \
> @@ -429,7 +429,7 @@
>  		/* Let setup failure terminate early. */ \
>  		if (_metadata->exit_code) \
>  			_exit(0); \
> -		*_metadata->no_teardown = false; \
> +		__atomic_store_n(_metadata->no_teardown, false, __ATOMIC_SEQ_CST); \
>  		fixture_name##_##test_name(_metadata, self, variant->data); \
>  		_metadata->teardown_fn(false, _metadata, self, variant->data); \
>  		_exit(0); \
> @@ -455,7 +455,7 @@
>  		void *self, const void *variant) \
>  	{ \
>  		if (fixture_name##_teardown_parent == in_parent && \
> -		    !__atomic_test_and_set(_metadata->no_teardown, __ATOMIC_RELAXED)) \
> +		    !__atomic_test_and_set(_metadata->no_teardown, __ATOMIC_SEQ_CST)) \
>  			fixture_name##_teardown(_metadata, self, variant); \
>  	} \
>  	static struct __test_metadata *_##fixture_name##_##test_name##_object; \

Unfortunately, no :(

> > 2) parent doesn't seem to wait for the setup() to complete..
>
> setup() is called in the child (L431) right before the testcase itself is
> called (L436). The parent waits for the child to exit (L439) before unmapping.
>
> > 3) when parent runs faster than the child that is still running
> >    setup(), the parent unmaps the no_teardown and sets it to NULL,
> >    then UAF in the child, i.e. signal 11?
>
> That should never happen as the waitpid() will block until the child running
> setup() and the testcase itself have exited.

Ah, maybe I was wrong about these narratives. But the results show that
iommufd_dirty_tracking_teardown() was not called in the failed cases:

// I added a huge.c file to run only 4 cases on one variant
// And I added two fprintf to its FIXTURE_SETUP/TEARDOWN().

TAP version 13
1..4
# Starting 4 tests from 1 test cases.
# RUN iommufd_dirty_tracking.domain_dirty64M_huge.set_dirty_tracking ...
---------iommufd_dirty_tracking_setup
---------iommufd_dirty_tracking_teardown
# OK iommufd_dirty_tracking.domain_dirty64M_huge.set_dirty_tracking
ok 1 iommufd_dirty_tracking.domain_dirty64M_huge.set_dirty_tracking
# RUN iommufd_dirty_tracking.domain_dirty64M_huge.device_dirty_capability ...
---------iommufd_dirty_tracking_setup
# device_dirty_capability: Test terminated unexpectedly by signal 11
# FAIL iommufd_dirty_tracking.domain_dirty64M_huge.device_dirty_capability
not ok 2 iommufd_dirty_tracking.domain_dirty64M_huge.device_dirty_capability
# RUN iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap ...
---------iommufd_dirty_tracking_setup
# get_dirty_bitmap: Test terminated unexpectedly by signal 11
# FAIL iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap
not ok 3 iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap
# RUN iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap_no_clear ...
---------iommufd_dirty_tracking_setup
# get_dirty_bitmap_no_clear: Test terminated unexpectedly by signal 11
# FAIL iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap_no_clear
not ok 4 iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap_no_clear
# FAILED: 1 / 4 tests passed.

> Does the issue also happen when you only execute a single testcase?
> $ ./iommufd -r iommufd_dirty_tracking.domain_dirty128M_huge.enforce_dirty

$ sudo tools/testing/selftests/iommu/iommufd -r iommufd_dirty_tracking.domain_dirty128M.enforce_dirty
TAP version 13
1..1
# Starting 1 tests from 1 test cases.
# RUN iommufd_dirty_tracking.domain_dirty128M.enforce_dirty ...
# OK iommufd_dirty_tracking.domain_dirty128M.enforce_dirty
ok 1 iommufd_dirty_tracking.domain_dirty128M.enforce_dirty
# PASSED: 1 / 1 tests passed.
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0

This one passes.
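For reference, the huge.c reproducer mentioned above is essentially the shape
below: a single fixture with several back-to-back cases and prints in the
fixture hooks. This is only a sketch with made-up names (the real file uses
the iommufd_dirty_tracking fixture and its dirty-tracking cases, and the
include path depends on where the file lives):

#include <stdio.h>

#include "../kselftest_harness.h"	/* adjust path to the harness header */

FIXTURE(demo)
{
	int dummy;
};

FIXTURE_SETUP(demo)
{
	fprintf(stderr, "---------demo_setup\n");
	self->dummy = 1;
}

FIXTURE_TEARDOWN(demo)
{
	fprintf(stderr, "---------demo_teardown\n");
}

/* Two back-to-back cases on the same fixture, run sequentially */
TEST_F(demo, first)
{
	ASSERT_EQ(1, self->dummy);
}

TEST_F(demo, second)
{
	ASSERT_EQ(1, self->dummy);
}

TEST_HARNESS_MAIN

With prints like these it is easy to see in the TAP output whether teardown
actually ran between two consecutive cases.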
Looks like the first hugepage case passes, but the following ones fail when
running them sequentially..

Thanks
Nicolin