Bug#1057251: librocfft0-tests: nondeterministic failures in random_real_3d/random_params.vs_fftw

2023-12-04 Cordell Bloor
The test failures are reproducible on my Radeon VII (gfx906) workstation 
when the failing seed is specified. So, the problem is not specific to 
gfx1032:


$ ROCFFT_LAYER=1 /usr/libexec/rocm/librocfft0-tests/rocfft-test \
    --gtest_filter='random_real_3d/random_params.vs_fftw/0:random_real_3d/random_params.vs_fftw/1' \
    --seed 190206186

single epsilon: 3.75e-05    double epsilon: 1e-15
Random seed: 190206186
rocfft_setup
rocfft_get_version_string,buf,0x7ffe3b83ebf0,len,256
rocFFT version: 1.0.21.
Note: Google Test filter = random_real_3d/random_params.vs_fftw/0:random_real_3d/random_params.vs_fftw/1

[==] Running 2 tests from 1 test suite.
[--] Global test environment set-up.
[--] 2 tests from random_real_3d/random_params
[ RUN  ] random_real_3d/random_params.vs_fftw/0
rocfft_plan_description_create,description,0x55fbee94f950
rocfft_plan_description_set_data_layout,description,0x55fbee94f950,in_array_type,real,out_array_type,hermitian_interleaved,in_offsets,[0],out_offsets,[0],in_strides,[1,34,2142],in_distance,154224,out_strides,[1,18,1134],out_distance,81648
rocfft_plan_create,plan,0x55fbe38234e0,placement,notinplace,transform_type,real_forward,precision,double,dimensions,3,lengths,[34,63,72],number_of_transforms,1,description,0x55fbee94f950
rocfft_execution_info_create,info,0x55fbe297e7a0
rocfft_plan_get_work_buffer_size,plan,0x55fbe38234e0,size_in_bytes ptr,0x7ffe3b83d1a8,val,0
rocfft_plan_get_work_buffer_size,plan,0x55fbe38234e0,size_in_bytes ptr,0x7ffe3b83d1a8,val,0

rocfft_execute,plan,0x55fbe38234e0,in_buffer,0x55fbdef54720,out_buffer,0x55fbe286e600,info,0x55fbe297e7a0
hipModuleLaunchKernel failure
rocfft_execution_info_destroy,info,0x55fbe297e7a0
rocfft_plan_description_destroy,description,0x55fbee94f950
unknown file: Failure
C++ exception with description "rocFFT plan execution failure" thrown in the test body.


[  FAILED  ] random_real_3d/random_params.vs_fftw/0, where GetParam() = (0, 3, 1, 1, 2) (582 ms)

[ RUN  ] random_real_3d/random_params.vs_fftw/1
rocfft_plan_description_create,description,0x55fbe3ccf780
rocfft_plan_description_set_data_layout,description,0x55fbe3ccf780,in_array_type,hermitian_interleaved,out_array_type,real,in_offsets,[0],out_offsets,[0],in_strides,[1,18,1134],in_distance,81648,out_strides,[1,34,2142],out_distance,154224
rocfft_plan_create,plan,0x55fbe3ccac60,placement,notinplace,transform_type,real_inverse,precision,double,dimensions,3,lengths,[34,63,72],number_of_transforms,1,description,0x55fbe3ccf780
rocfft_execution_info_create,info,0x55fbdf1a84e0
rocfft_plan_get_work_buffer_size,plan,0x55fbe3ccac60,size_in_bytes ptr,0x7ffe3b83d1a8,val,0
rocfft_plan_get_work_buffer_size,plan,0x55fbe3ccac60,size_in_bytes ptr,0x7ffe3b83d1a8,val,0

rocfft_execute,plan,0x55fbe3ccac60,in_buffer,0x55fbdeecc7b0,out_buffer,0x55fbedb2e280,info,0x55fbdf1a84e0
hipModuleLaunchKernel failure
rocfft_execution_info_destroy,info,0x55fbdf1a84e0
rocfft_plan_description_destroy,description,0x55fbe3ccf780
unknown file: Failure
C++ exception with description "rocFFT plan execution failure" thrown in the test body.


[  FAILED  ] random_real_3d/random_params.vs_fftw/1, where GetParam() = (0, 3, 1, 1, 3) (368 ms)

[--] 2 tests from random_real_3d/random_params (951 ms total)

[--] Global test environment tear-down
[==] 2 tests from 1 test suite ran. (955 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] random_real_3d/random_params.vs_fftw/0, where GetParam() = (0, 3, 1, 1, 2)
[  FAILED  ] random_real_3d/random_params.vs_fftw/1, where GetParam() = (0, 3, 1, 1, 3)


 2 FAILED TESTS
rocfft_cleanup
single precision max l-inf epsilon: 0
single precision max l2 epsilon: 0
double precision max l-inf epsilon: 0
double precision max l2 epsilon: 0



Bug#1057251: librocfft0-tests: nondeterministic failures in random_real_3d/random_params.vs_fftw

2023-12-01 Cordell Bloor
Package: librocfft0-tests
Version: 5.5.0-6
Severity: normal

Dear Maintainer,

The rocfft tests passed in one autopkgtest run and failed in a later one on
amd64+gfx1032 with an identical set of dependencies. The failing log contained:

 55s Random seed: 190206186
<...>
14657s [ RUN  ] random_real_3d/random_params.vs_fftw/0
14658s unknown file: Failure
14658s C++ exception with description "rocFFT plan execution failure" thrown in the test body.
14658s
14658s [  FAILED  ] random_real_3d/random_params.vs_fftw/0, where GetParam() = (0, 3, 1, 1, 2) (877 ms)
14658s [ RUN  ] random_real_3d/random_params.vs_fftw/1
14659s unknown file: Failure
14659s C++ exception with description "rocFFT plan execution failure" thrown in the test body.
14659s
14659s [  FAILED  ] random_real_3d/random_params.vs_fftw/1, where GetParam() = (0, 3, 1, 1, 3) (960 ms)

https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1032/r/rocfft/1341/log.gz

The earlier passing log contained:


 57s Random seed: 1459638283
<...>
14609s [ RUN  ] random_real_3d/random_params.vs_fftw/0
14609s [   OK ] random_real_3d/random_params.vs_fftw/0 (42 ms)
14609s [ RUN  ] random_real_3d/random_params.vs_fftw/1
14609s [   OK ] random_real_3d/random_params.vs_fftw/1 (45 ms)
14609s [ RUN  ] random_real_3d/random_params.vs_fftw/2

https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1032/r/rocfft/894/log.gz

I've discussed this with the upstream rocFFT developers and they plan to change
rocfft-test to only run deterministic tests by default. That will ensure that
end users verifying their installation only run tests that have already been
run by the upstream developers. There will be an option to enable the
nondeterministic tests, which the developers will use during their own
development.

In the meantime, I would suggest that `--seed N` be added to the arguments
passed to rocfft-test in the autopkgtests. Disabling the nondeterminism in the
test suite makes it easier to compare results when the autopkgtests are
triggered by dependency updates and to compare results between different GPU
architectures.

We should also investigate the underlying cause of the failure with
`--seed 190206186`.
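
As a starting point for that investigation, here is a minimal sketch of how a
failing run could be turned into a reproduction command. It is an illustration,
not part of the package: the helper name `repro_command` is hypothetical, the
binary path is the one used in the reproduction command in this thread, and the
log format is assumed to match the excerpts quoted here (an optional
autopkgtest timestamp prefix, a `Random seed:` line, and gtest `[  FAILED  ]`
lines). Only the `--gtest_filter` and `--seed` options shown in this thread
are used.

```python
import re
import shlex

# Assumed binary path, taken from the reproduction command in this thread.
ROCFFT_TEST = "/usr/libexec/rocm/librocfft0-tests/rocfft-test"


def repro_command(log_text, binary=ROCFFT_TEST):
    """Build a rocfft-test command that reruns the failed tests with the logged seed.

    Returns None if the log has no seed line or no failed tests.
    """
    seed = None
    failed = []
    for raw in log_text.splitlines():
        # autopkgtest logs prefix each line with a timestamp such as "14658s ".
        line = re.sub(r"^\s*\d+s ?", "", raw)

        m = re.match(r"Random seed: (\d+)", line)
        if m:
            seed = m.group(1)

        # Test names look like suite.test/index; requiring the dot skips the
        # summary line "[  FAILED  ] 2 tests, listed below:".
        m = re.match(r"\[\s*FAILED\s*\] ([\w/]+\.[\w/]+)", line)
        if m and m.group(1) not in failed:
            failed.append(m.group(1))

    if seed is None or not failed:
        return None

    args = [binary, "--gtest_filter=" + ":".join(failed), "--seed", seed]
    return " ".join(shlex.quote(a) for a in args)
```

Feeding it the failing log above should give a command equivalent to the one
used to reproduce the failure on gfx906, without the `ROCFFT_LAYER=1`
environment variable, which would have to be added by hand if the API trace is
wanted.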

Regards,
Cory Bloor


-- System Information:
Debian Release: trixie/sid
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: amd64 (x86_64)

Kernel: Linux 6.5.0-4-amd64 (SMP w/32 CPU threads; PREEMPT)
Locale: LANG=en_CA.UTF-8, LC_CTYPE=en_CA.UTF-8 (charmap=UTF-8), LANGUAGE=en_CA:en
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages librocfft0-tests depends on:
ii  libamdhip64-5                   5.2.3-13
ii  libboost-program-options1.74.0  1.74.0+ds1-23
ii  libc6                           2.37-12
ii  libfftw3-double3                3.3.10-1
ii  libfftw3-single3                3.3.10-1
ii  libgcc-s1                       13.2.0-7
ii  librocfft0                      5.5.0-6
ii  librocrand1                     5.5.1-2
ii  libstdc++6                      13.2.0-7

librocfft0-tests recommends no packages.

librocfft0-tests suggests no packages.

-- no debconf information