Hi Frederik,
On 20/01/2020 06:57, Harwath, Frederik wrote:
Hi,
this patch implements a runtime ISA check for amdgcn offloading.
The check verifies that the ISA of the GPU to which we try to offload matches
the ISA for which the code to be offloaded has been compiled. If it detects
a mismatch, it emits an error message which contains a hint at the correct
compilation
parameters for the GPU. For instance:
"libgomp: GCN fatal error: GCN code object ISA 'gfx906' does not match GPU
ISA 'gfx900'.
Try to recompile with '-foffload=-march=gfx900'."
or
"libgomp: GCN fatal error: GCN code object ISA 'gfx900' does not match agent
ISA 'gfx803'.
Try to recompile with '-foffload=-march=fiji'."
(By the way, the names that we use for the ISAs are a bit inconsistent. Perhaps
we should just
use the gfx-names for all ISAs everywhere?.)
The -march=?? names match those used by LLVM, since they are passed
straight through to the assembler and linker we import from that
project. There seems to have been some inconsistency there also, with
older devices having names, and newer devices numbers. Unless we want
specs files with a full set of mappings then it's best to stick with
what we have, for now, I think.
Without this patch, the user only gets an confusing error message from the HSA
runtime which
fails to load the GCN object code.
This has been annoying indeed. The new message will be very welcome. :-)
I have checked that the code does not lead to any regressions when running
the test suite correctly, i.e. with the "-foffload=-march=..." option
given to the compiler matching the architecture of the GPU.
It seems difficult to implement an automated test that triggers an ISA mismatch.
I have tested manually (for different combinations of the compilation flags
and offloading GPU ISAs) that the runtime ISA check produces the expected error
messages.
I think you said that overriding the multilib options wasn't
straight-forward, and I don't have any idea how to fix that. Nor can we
know what devices the test environment does or does not have. I think
we'll have to do without a negative test.
Is it ok to commit this patch to the master branch?
I can't see anything significantly wrong with the code of the patch,
however I have some minor issues I'd like fixed in the text.
@@ -396,6 +396,88 @@ struct gcn_image_desc
struct global_var_info *global_variables;
};
+/* This enum mirrors the corresponding LLVM enum's values for all ISAs that we
+ support.
+ See https://llvm.org/docs/AMDGPUUsage.html#amdgpu-ef-amdgpu-mach-table */
+
+typedef enum {
+ EF_AMDGPU_MACH_AMDGCN_GFX801 = 0x028,
+ EF_AMDGPU_MACH_AMDGCN_GFX803 = 0x02a,
+ EF_AMDGPU_MACH_AMDGCN_GFX900 = 0x02c,
+ EF_AMDGPU_MACH_AMDGCN_GFX906 = 0x02f,
+} EF_AMDGPU_MACH;
+
+const static int EF_AMDGPU_MACH_MASK = 0x000000ff;
+typedef EF_AMDGPU_MACH gcn_isa;
+
+const static char* gcn_gfx801_s = "gfx801";
+const static char* gcn_gfx803_s = "gfx803";
+const static char* gcn_gfx900_s = "gfx900";
+const static char* gcn_gfx906_s = "gfx906";
+const static int gcn_isa_name_len = 6;
+
+static int
+elf_gcn_isa_field (Elf64_Ehdr *image)
+{
+ return image->e_flags & EF_AMDGPU_MACH_MASK;
+}
The whole file has been carefully grouped into categories, separated
with vim fold markers {{{ }}} but your patch inserts functions amongst
the types. Please move the functions down into the "Utility functions"
group. The const static variables should probably go with them.
@@ -2257,6 +2342,39 @@ find_load_offset (Elf64_Addr *load_offset, struct
agent_info *agent,
return res;
}
+/* Check that the GCN ISA of the given image matches the ISA of the agent. */
+
+static bool
+isa_matches_agent (struct agent_info *agent, Elf64_Ehdr *image)
+{
+ int isa_field = elf_gcn_isa_field (image);
+ const char* isa_s = isa_hsa_name (isa_field);
+ if (!isa_s)
+
Please use gnu-style indentation.
@@ -3294,7 +3415,11 @@ GOMP_OFFLOAD_init_device (int n)
&buf);
if (status != HSA_STATUS_SUCCESS)
return hsa_error ("Error querying the name of the agent", status);
- agent->gfx900_p = (strncmp (buf, "gfx900", 6) == 0);
+ agent->gfx900_p = (strncmp (buf, gcn_gfx900_s, gcn_isa_name_len) == 0);
+
+ agent->device_isa = isa_code (buf);
+ if (agent->device_isa < 0)
+ return hsa_error ("Unknown GCN agent architecture.", HSA_STATUS_ERROR);
Can device_isa not just replace gfx900_p? I think it's only tested in
one place, and that would be easily substituted.
Thanks
Andrew