Bug#1070446: rocm-hipamd: arm64 FTBFS with glibc 2.38

2024-05-05 Thread Cordell Bloor

Tags: patch

Hi Graham,

On 2024-05-05 07:31, Graham Inggs wrote:

> As can be seen in reproducible builds [1], rocm-hipamd FTBFS on arm64
> with glibc 2.38.  I've copied what I hope is the relevant part of the
> log below.
>
> A bug was filed against glibc [2], but it seems glibc upstream do not
> consider it a bug in glibc.


There is nothing that can be done in rocm-hipamd to address this bug, 
aside from removing arm64 from the rocm-hipamd architecture list. The 
incompatibility is not with the HIP runtime, but with the HIP language. 
This is a disagreement that glibc and llvm will need to resolve between 
themselves.



[1]https://tests.reproducible-builds.org/debian/rb-pkg/rocm-hipamd.html
[2]https://sourceware.org/bugzilla/show_bug.cgi?id=30909


In file included from /tmp/hip_pch.724714/hip_pch.h:1:
In file included from
/build/reproducible-path/rocm-hipamd-5.7.1/hip/include/hip/hip_runtime.h:62:
In file included from
/build/reproducible-path/rocm-hipamd-5.7.1/hipamd/include/hip/amd_detail/amd_hip_runtime.h:76:
In file included from
/usr/lib/gcc/aarch64-linux-gnu/13/../../../../include/c++/13/cmath:47:
In file included from /usr/include/math.h:40:
/usr/include/aarch64-linux-gnu/bits/math-vector.h:40:9: error: unknown
type name '__SVFloat32_t'
40 | typedef __SVFloat32_t __sv_f32_t;
   | ^
/usr/include/aarch64-linux-gnu/bits/math-vector.h:41:9: error: unknown
type name '__SVFloat64_t'
41 | typedef __SVFloat64_t __sv_f64_t;
   | ^
/usr/include/aarch64-linux-gnu/bits/math-vector.h:42:9: error: unknown
type name '__SVBool_t'
42 | typedef __SVBool_t __sv_bool_t;
   | ^


This compilation error occurs when building device code on an aarch64 
host. LLVM defines __SVFloat32_t, __SVFloat64_t and __SVBool_t only when 
building host code, not when building device code. To me this seems 
reasonable because GPUs do not support SVE instructions.


However, the math.h header (on aarch64 at least) does not distinguish 
between host code and device code. As such, it fails when compiling 
device code. The glibc argument is that GCC always supports these types, 
but I'm not convinced. I'm curious how GCC handles the math headers for 
OpenMP GPU offloading [3].


In any case, I've attached a patch for glibc that would fix this bug. 
Perhaps my suggestion would be more palatable to upstream than the 
previously rejected patch. If not, it's up to glibc or LLVM to find a 
solution. If they cannot, then we will have to drop arm64 support for 
the HIP language.
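
The effect of the proposed guard change can be modeled as a small truth
table. This is only a sketch in Python of the C preprocessor condition:
the function names and the three boolean inputs are hypothetical
stand-ins for __GNUC_PREREQ(10, 0), __glibc_clang_prereq(11, 0) and
defined(__HIP_DEVICE_COMPILE__), not anything from glibc itself.

```python
from itertools import product


def guard_before(gnuc_ok: bool, clang_ok: bool, hip_device: bool) -> bool:
    """Original guard: __GNUC_PREREQ(10, 0) || __glibc_clang_prereq(11, 0)."""
    # hip_device is ignored: the original header never looks at it.
    return gnuc_ok or clang_ok


def guard_after(gnuc_ok: bool, clang_ok: bool, hip_device: bool) -> bool:
    """Patched guard: original condition && !defined(__HIP_DEVICE_COMPILE__)."""
    return (gnuc_ok or clang_ok) and not hip_device


def changed_cases():
    """Return every input combination where the patch changes the outcome."""
    return [
        case
        for case in product([False, True], repeat=3)
        if guard_before(*case) != guard_after(*case)
    ]


if __name__ == "__main__":
    for gnuc_ok, clang_ok, hip_device in changed_cases():
        print(f"gnuc_ok={gnuc_ok} clang_ok={clang_ok} hip_device={hip_device}")
```

Under this model, every combination whose outcome changes has
hip_device set, so host-only compiles should see the same
__SVE_VEC_MATH_SUPPORTED definition as before.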


Sincerely,
Cory Bloor

[3]: https://gcc.gnu.org/wiki/Offloading
From: Cordell Bloor 
Date: Wed, 10 Apr 2024 16:49:24 -0600
Subject: [PATCH] arm64/math-vec.h: drop SVE vector types in device code

These headers get included when building HIP libraries on the aarch64
platform. The headers are used when building both CPU code and GPU
code, but the SVE vector types are not supported on the GPU.

The clang compiler sets __HIP_DEVICE_COMPILE__ when it is building
code for the GPU, so disable __SVE_VEC_MATH_SUPPORTED when that macro
is detected.

Bug-Debian: https://bugs.debian.org/1070446
Bug-Ubuntu: https://bugs.launchpad.net/glibc/+bug/2032624
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=30909
Forwarded: No

Index: glibc/sysdeps/aarch64/fpu/bits/math-vector.h
===
--- glibc.orig/sysdeps/aarch64/fpu/bits/math-vector.h
+++ glibc/sysdeps/aarch64/fpu/bits/math-vector.h
@@ -101,7 +101,8 @@ typedef __attribute__ ((__neon_vector_ty
 typedef __attribute__ ((__neon_vector_type__ (2))) double __f64x2_t;
 #endif
 
-#if __GNUC_PREREQ(10, 0) || __glibc_clang_prereq(11, 0)
+#if (__GNUC_PREREQ(10, 0) || __glibc_clang_prereq(11, 0)) \
+  && !defined(__HIP_DEVICE_COMPILE__)
 #  define __SVE_VEC_MATH_SUPPORTED
 typedef __SVFloat32_t __sv_f32_t;
 typedef __SVFloat64_t __sv_f64_t;


Bug#1064730: stdgpu: FTBFS: type_traits.h:736:1: error: expected type-specifier before ‘template’

2024-04-15 Thread Cordell Bloor

Hi Timo,

On Sat, 2 Mar 2024 09:21:43 +0100 Timo Röhling wrote:


>
> On Sun, 25 Feb 2024 20:28:53 +0100 Lucas Nussbaum 
> wrote:
> > > /usr/include/thrust/detail/type_traits.h:736:1: error: expected
> > > type-specifier before ‘template’
>
> This bug is caused by a #ifdef cascade for different
> THRUST_DEVICE_SYSTEM values, which sadly no longer works with
> THRUST_DEVICE_SYSTEM_OMP. This makes it effectively impossible to
> build the HIP backend and the OpenMP backend from the same source.

Am I understanding correctly that this was broken in a rocthrust update? 
Should this be treated as a rocthrust bug? [1]


Sincerely,
Cory Bloor

[1]: https://bugs.debian.org/1064730



Bug#1067956: rocalution: FTBFS on armhf (test failure with memory allocation)

2024-04-07 Thread Cordell Bloor

Control: severity 1067956 important

The rocalution package has never successfully built for armhf, so I 
don't think this qualifies as release-critical.


It's great to see that the rocalution package gets all the way into the 
tests before failing, though. The upstream project only officially 
supports amd64, so that's better than I was expecting. The tests should 
probably skip anything requiring more than ~2 GB of memory when running 
on 32-bit architectures. Patches are welcome.
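
The skip rule suggested above could look roughly like the following.
This is a sketch of the decision logic only, written in Python for
brevity; the function names and the 2 GiB default are assumptions taken
from the "~2 GB" figure above, and the logic would need to be ported to
whatever framework the rocalution tests actually use.

```python
import struct
from typing import Optional

TWO_GIB = 2 * 1024**3  # assumed threshold, from the "~2 GB" estimate


def pointer_bits() -> int:
    """Width of a native pointer in bits (32 on armhf, 64 on amd64)."""
    return struct.calcsize("P") * 8


def should_skip(required_bytes: int,
                bits: Optional[int] = None,
                limit: int = TWO_GIB) -> bool:
    """Skip a test that needs more than `limit` bytes on a 32-bit platform."""
    if bits is None:
        bits = pointer_bits()
    return bits <= 32 and required_bytes > limit


if __name__ == "__main__":
    needed = 3 * 1024**3  # hypothetical test that allocates 3 GiB
    print("skip" if should_skip(needed) else "run")
```

Passing bits explicitly makes the rule easy to check on a 64-bit
machine without access to armhf hardware.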


Sincerely,
Cory Bloor



Bug#1067356: hipsolver: FTBFS: make[1]: *** [debian/rules:17: override_dh_auto_configure-arch] Error 2

2024-03-20 Thread Cordell Bloor

Control: reassign 1067356 libamdhip64-dev 5.7.1-1
Control: affects 1067356 hipsolver
Control: fixed 1067356 5.7.1-2

On 2024-03-20 15:00, Lucas Nussbaum wrote:

> During a rebuild of all packages in sid, your package failed to build
> on amd64.
>
> Relevant part (hopefully):

make[1]: Entering directory '/<>'
dh_auto_configure -- -DCMAKE_BUILD_TYPE=Release -DROCM_SYMLINK_LIBS=OFF 
-DBUILD_FILE_REORG_BACKWARD_COMPATIBILITY=OFF -DBUILD_CLIENTS_TESTS=ON
cd obj-x86_64-linux-gnu && DEB_PYTHON_INSTALL_LAYOUT=deb cmake 
-DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_BUILD_TYPE=None -DCMAKE_INSTALL_SYSCONFDIR=/etc 
-DCMAKE_INSTALL_LOCALSTATEDIR=/var -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON 
-DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF -DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON 
-DFETCHCONTENT_FULLY_DISCONNECTED=ON -DCMAKE_INSTALL_RUNSTATEDIR=/run 
-DCMAKE_SKIP_INSTALL_ALL_DEPENDENCY=ON "-GUnix Makefiles" -DCMAKE_VERBOSE_MAKEFILE=ON 
-DCMAKE_INSTALL_LIBDIR=lib/x86_64-linux-gnu -DCMAKE_BUILD_TYPE=Release -DROCM_SYMLINK_LIBS=OFF 
-DBUILD_FILE_REORG_BACKWARD_COMPATIBILITY=OFF -DBUILD_CLIENTS_TESTS=ON ..
Re-run cmake no build system arguments
-- The CXX compiler identification is GNU 13.2.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- The Fortran compiler identification is GNU 13.2.0
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Check for working Fortran compiler: /usr/bin/gfortran - skipped
CMake Error at /usr/lib/x86_64-linux-gnu/cmake/hip/hip-config.cmake:170 
(message):
   Unexpected HIP_PLATFORM:
Call Stack (most recent call first):
   CMakeLists.txt:145 (find_package)


Thanks Lucas. I uploaded a fix for this yesterday, but it was too late 
for this build.


Sincerely,
Cory Bloor


Bug#1042036: rocblas: FTBFS: AttributeError: 'KernelWriterAssembly' object has no attribute 'language'

2023-07-25 Thread Cordell Bloor

Thanks Lucas,

On 2023-07-25 14:56, Lucas Nussbaum wrote:

# Writing Kernels...
Generating kernels: Launching 8 threads...
Traceback (most recent call last):
   File "/<>/tensile/Tensile/Parallel.py", line 54, in 
apply_print_exception
 return func(*args)
^^^
   File "/<>/tensile/Tensile/TensileCreateLibrary.py", line 67, in 
processKernelSource
 header = kernelWriter.getHeaderFileString(kernel)
  
   File "/<>/tensile/Tensile/KernelWriter.py", line 5065, in 
getHeaderFileString
 if self.language == "HIP" or self.language == "OCL":
^
AttributeError: 'KernelWriterAssembly' object has no attribute 'language'
Custom kernel filename 
/<>/obj-x86_64-linux-gnu/library/src/build_tmp/TENSILE/assembly/DGEMM_Aldebaran_NN_MT128x128x16_MI16x16x4x1_GRVW2_SU4_SUS128_WGM4.s
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
   File "/usr/lib/python3.11/multiprocessing/pool.py", line 125, in worker
 result = (True, func(*args, **kwds))
 ^^^
   File "/usr/lib/python3.11/multiprocessing/pool.py", line 51, in starmapstar
 return list(itertools.starmap(args[0], args[1]))
^
   File "/<>/tensile/Tensile/Parallel.py", line 54, in 
apply_print_exception
 return func(*args)
^^^
   File "/<>/tensile/Tensile/TensileCreateLibrary.py", line 67, in 
processKernelSource
 header = kernelWriter.getHeaderFileString(kernel)
  
   File "/<>/tensile/Tensile/KernelWriter.py", line 5065, in 
getHeaderFileString
 if self.language == "HIP" or self.language == "OCL":
^
AttributeError: 'KernelWriterAssembly' object has no attribute 'language'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
   File "/<>/tensile/Tensile/bin/TensileCreateLibrary", line 43, in 

 TensileCreateLibrary()
   File "/<>/tensile/Tensile/TensileCreateLibrary.py", line 1303, 
in TensileCreateLibrary
 codeObjectFiles = writeSolutionsAndKernels(outputPath, CxxCompiler, None, 
solutions,
   
^^
   File "/<>/tensile/Tensile/TensileCreateLibrary.py", line 482, 
in writeSolutionsAndKernels
 results = Common.ParallelMap(processKernelSource, kIter, "Generating 
kernels", method=lambda x: x.starmap, maxTasksPerChild=1)
   

   File "/<>/tensile/Tensile/Parallel.py", line 134, in ParallelMap
 rv = mapFunc(function, objects)
  ^^
   File "/usr/lib/python3.11/multiprocessing/pool.py", line 375, in starmap
 return self._map_async(func, iterable, starmapstar, chunksize).get()
^
   File "/usr/lib/python3.11/multiprocessing/pool.py", line 774, in get
 raise self._value
AttributeError: 'KernelWriterAssembly' object has no attribute 'language'
make[3]: *** [library/src/CMakeFiles/TENSILE_LIBRARY_TARGET.dir/build.make:92: 
Tensile/library/TensileLibrary.dat] Error 1


This build failure is non-deterministic. I've seen it before, but I had 
thought it only occurred when specifying the AMDGPU_TARGETS property in 
the rocBLAS build. It seems it can occur even without that. Specifying a 
reduced set of AMDGPU_TARGETS may merely increase the probability of 
failure.


The missing language attribute is an indication that the 
KernelWriterAssembly object was not initialized before it was used. I 
have never seen this when building the upstream project, so I suspect 
that this is related to the removal of the replacement kernels that 
were excluded on DFSG grounds during Debian packaging. I suspect that 
this build failure is just one symptom, and that the test failures we 
see on the gfx900 and gfx906 architectures may also be caused by 
incorrectly generated assembly related to the replacement kernels.
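
For readers unfamiliar with this failure mode, the following minimal
sketch shows how an object that skips its initializer produces the same
kind of AttributeError. The classes here are hypothetical and are not
Tensile's code; how the real KernelWriterAssembly ends up in this state
is not established in this thread.

```python
class BaseWriter:
    """Hypothetical base class that sets `language` in its initializer."""

    def __init__(self):
        self.language = "ASM"

    def header(self) -> str:
        # Mirrors the shape of the check seen in the traceback.
        if self.language == "HIP" or self.language == "OCL":
            return "// source kernel header"
        return ""


class AssemblyWriter(BaseWriter):
    """Hypothetical subclass that may skip the base initializer."""

    def __init__(self, run_base_init: bool = True):
        if run_base_init:
            super().__init__()


def try_header(writer) -> str:
    """Return the header, or a description of the AttributeError raised."""
    try:
        return writer.header()
    except AttributeError as exc:
        return f"AttributeError: {exc}"


if __name__ == "__main__":
    print(repr(try_header(AssemblyWriter(run_base_init=True))))
    print(try_header(AssemblyWriter(run_base_init=False)))
```

The attribute lookup only fails at the point of use, which is why the
traceback points at getHeaderFileString rather than at wherever the
initialization was skipped.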


We could run a test build with the replacement kernels restored to 
verify if this is the case. Even if the replacement kernels cannot be 
packaged in Debian, a local build with them restored may help us to 
confirm or falsify my theory as to the cause of this failure.


We can also take a look at the YAML specification that drives the 
generation of 
DGEMM_Aldebaran_NN_MT128x128x16_MI16x16x4x1_GRVW2_SU4_SUS128_WGM4.s. A 
scorched-earth approach to dealing with this issue would be to delete 
the YAML of problematic assembly kernels until the rocBLAS build and 
tests stop failing. That may have a serious adverse effect on 
performance, but it could restore correctness as the library would fall 
back to using source kernels. We should avoid doing that if possible.

Bug#1031252: hipsparse: FTBFS (c++: error: -E or -x required when input is from standard input)

2023-02-13 Thread Cordell Bloor



On 2023-02-13 17:22, Santiago Vila wrote:

[  8%] Linking CXX shared library libhipsparse.so
cd /<>/obj-x86_64-linux-gnu/library && /usr/bin/cmake -E 
cmake_link_script CMakeFiles/hipsparse.dir/link.txt --verbose=1
/usr/bin/c++ -fPIC -g -O2 -ffile-prefix-map=/<>=. 
-fstack-protector-strong -Wformat -Werror=format-security -Wdate-time 
-D_FORTIFY_SOURCE=2 -O3 -DNDEBUG -Wl,-z,relro -shared 
-Wl,-soname,libhipsparse.so.0 -o libhipsparse.so.0.1 
CMakeFiles/hipsparse.dir/src/hcc_detail/hipsparse.cpp.o 
/usr/lib/x86_64-linux-gnu/librocsparse.so.0.1 
/usr/lib/x86_64-linux-gnu/libamdhip64.so.5.2.21153- 
-lCLANGRT_BUILTINS-NOTFOUND

c++: error: -E or -x required when input is from standard input
make[3]: *** [library/CMakeFiles/hipsparse.dir/build.make:102: 
library/libhipsparse.so.0.1] Error 1


This is a bug in libamdhip64-5:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1021643

It was fixed in rocm-hipamd 5.2.3-2, but no versions since rocm-hipamd 
5.2.3-1 have migrated to bookworm. This problem will affect all 
libraries and executables that link against libamdhip64-5 using the GCC 
toolchain.



> If this is really a bug in one of the build-depends, please use
> reassign and affects, so that this is still visible in the BTS web
> page for this package.


I'm still learning how to use the Debian bug reporting tools. 
Perhaps another maintainer could help set these properties.


Apologies for the incomplete handling, but I hope that this information 
is helpful.


Sincerely,
Cory Bloor