Hi,
  I'm looking for suggestions on how to diagnose a large, confusing performance difference between Mesa and ROCm.
The code, https://github.com/penguin42/vhs-teletext/blob/vectorised/teletext/vbi/patternopencl.py, is pyopencl driven and part of a large program. (Running it is pretty easy, but you do need a chunk of data, which is normally collected off received video; I could provide a sample privately.)

There are three kernels in there, correlate, minerr1 and minerr2, that are chained together, and it runs multiple _processes_, each running all the kernels.

On one set of data (tapedg004), which mostly uses a smaller data set, Mesa beats ROCm; on the bigger data set, the simple correlate kernel takes about twice as long on Mesa (* marks the faster of the two):

64k dataset:
  kernel:  correlate | minerr1   | minerr2
  MESA     2.3ms     | 0.84ms *  | 0.15ms *
  ROCM     1.1ms *   | 0.88ms    | 0.18ms

8k dataset:
  kernel:  correlate | minerr1   | minerr2
  MESA     0.127ms * | 0.078ms * | 0.059ms *
  ROCM     0.166ms   | 0.120ms   | 0.090ms

So generally the minerr kernels are faster on Mesa (sometimes a lot), but in the large data case the correlate kernel is more than twice as slow. What surprises me is that correlate is, to my mind, the simplest of the kernels!

Host is Fedora 38 on a Ryzen 3950X; clinfo from Mesa is below.

All pointers welcome.
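For reference, the relative speeds implied by the timings above can be worked out with a short script. This is only a sketch of the arithmetic: the numbers are copied from the tables, while the names `timings_ms` and `mesa_over_rocm` are made up for illustration and are not part of patternopencl.py.

```python
# Kernel times in milliseconds, copied from the tables above.
timings_ms = {
    "64k": {
        "MESA": {"correlate": 2.3, "minerr1": 0.84, "minerr2": 0.15},
        "ROCM": {"correlate": 1.1, "minerr1": 0.88, "minerr2": 0.18},
    },
    "8k": {
        "MESA": {"correlate": 0.127, "minerr1": 0.078, "minerr2": 0.059},
        "ROCM": {"correlate": 0.166, "minerr1": 0.120, "minerr2": 0.090},
    },
}


def mesa_over_rocm(timings):
    """Return {dataset: {kernel: Mesa time / ROCm time}}.

    A ratio above 1 means Mesa is slower for that kernel; below 1, faster.
    (Hypothetical helper, for illustration only.)
    """
    return {
        dataset: {
            kernel: platforms["MESA"][kernel] / platforms["ROCM"][kernel]
            for kernel in platforms["MESA"]
        }
        for dataset, platforms in timings.items()
    }


ratios = mesa_over_rocm(timings_ms)
for dataset, per_kernel in ratios.items():
    for kernel, ratio in per_kernel.items():
        print(f"{dataset:>3} {kernel:<9} Mesa/ROCm = {ratio:.2f}")
```

The only ratio above 1 is correlate on the 64k dataset, at roughly 2.1; every other kernel/dataset pair comes out below 1, i.e. faster on Mesa.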
Dave

Clinfo from Mesa:

Number of platforms                               2
  Platform Name                                   Clover
  Platform Vendor                                 Mesa
  Platform Version                                OpenCL 1.1 Mesa 23.0.3
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd
  Platform Extensions function suffix             MESA

  Platform Name                                   rusticl
  Platform Vendor                                 Mesa/X.org
  Platform Version                                OpenCL 3.0
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd
  Platform Extensions with Version                cl_khr_icd 0x400000 (1.0.0)
  Platform Numeric Version                        0xc00000 (3.0.0)
  Platform Extensions function suffix             MESA
  Platform Host timer resolution                  0ns

  Platform Name                                   Clover
Number of devices                                 1
  Device Name                                     AMD Radeon RX 550 / 550 Series (polaris12, LLVM 16.0.1, DRM 3.49, 6.2.14-300.fc38.x86_64)
  Device Vendor                                   AMD
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 1.1 Mesa 23.0.3
  Device Numeric Version                          0x401000 (1.1.0)
  Driver Version                                  23.0.3
  Device OpenCL C Version                         OpenCL C 1.1
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Max compute units                               8
  Max clock frequency                             1206MHz
  Max work item dimensions                        3
  Max work item sizes                             256x256x256
  Max work group size                             256
  Preferred work group size multiple (kernel)     64
  Preferred / native vector sizes
    char                                          16 / 16
    short                                         8 / 8
    int                                           4 / 4
    long                                          2 / 2
    half                                          0 / 0 (n/a)
    float                                         4 / 4
    double                                        2 / 2 (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              2147483648 (2GiB)
  Error Correction support                        No
  Max memory allocation                           536870912 (512MiB)
  Unified memory for Host and Device              No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       32768 bits (4096 bytes)
  Global Memory cache type                        None
  Image support                                   No
  Local memory type                               Local
  Local memory size                               65536 (64KiB)
  Max number of constant args                     16
  Max constant buffer size                        67108864 (64MiB)
  Max size of kernel argument                     1024
  Queue properties
    Out-of-order execution                        No
    Profiling                                     Yes
  Profiling timer resolution                      0ns
  Execution capabilities
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    ILs with version                              SPIR-V 0x400000 (1.0.0)
  Built-in kernels with version                   (n/a)
  Device Extensions                               cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp64 cl_khr_extended_versioning
  Device Extensions with Version                  cl_khr_byte_addressable_store 0x400000 (1.0.0)
                                                  cl_khr_global_int32_base_atomics 0x400000 (1.0.0)
                                                  cl_khr_global_int32_extended_atomics 0x400000 (1.0.0)
                                                  cl_khr_local_int32_base_atomics 0x400000 (1.0.0)
                                                  cl_khr_local_int32_extended_atomics 0x400000 (1.0.0)
                                                  cl_khr_int64_base_atomics 0x400000 (1.0.0)
                                                  cl_khr_int64_extended_atomics 0x400000 (1.0.0)
                                                  cl_khr_fp64 0x400000 (1.0.0)
                                                  cl_khr_extended_versioning 0x400000 (1.0.0)

  Platform Name                                   rusticl
Number of devices                                 0

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  Clover
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [MESA]
  clCreateContext(NULL, ...) [default]            Success [MESA]
  clCreateContext(NULL, ...) [other]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 Clover
    Device Name                                   AMD Radeon RX 550 / 550 Series (polaris12, LLVM 16.0.1, DRM 3.49, 6.2.14-300.fc38.x86_64)
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 Clover
    Device Name                                   AMD Radeon RX 550 / 550 Series (polaris12, LLVM 16.0.1, DRM 3.49, 6.2.14-300.fc38.x86_64)
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 Clover
    Device Name                                   AMD Radeon RX 550 / 550 Series (polaris12, LLVM 16.0.1, DRM 3.49, 6.2.14-300.fc38.x86_64)

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.3.1
  ICD loader Profile                              OpenCL 3.0

-- 
 -----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/