List myself as "nvptx port" maintainer (was: Thomas Schwinge appointed co-maintainer of the nvptx backend)
Hi! On 2023-07-19T23:41:47+0200, Gerald Pfeifer wrote: > It's my pleasure to announce Thomas Schwinge as co-maintainer of the > nvptx backend. > > Congratulations and Happy Hacking, Thomas! Please go ahead and update > MAINTAINERS accordingly. > > Gerald (on behalf of the steering committee) Thanks! I've pushed commit 28e3d361ba0cfa7ea2f90706159a144eaf4b650e 'List myself as "nvptx port" maintainer', see attached. Grüße Thomas - Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955 >From 28e3d361ba0cfa7ea2f90706159a144eaf4b650e Mon Sep 17 00:00:00 2001 From: Thomas Schwinge Date: Tue, 25 Jul 2023 21:17:52 +0200 Subject: [PATCH] List myself as "nvptx port" maintainer * MAINTAINERS: List myself as "nvptx port" maintainer. --- MAINTAINERS | 1 + 1 file changed, 1 insertion(+) diff --git a/MAINTAINERS b/MAINTAINERS index b626d89fe34..e9b11b43a0f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -102,6 +102,7 @@ nds32 port Shiva Chen nios2 port Chung-Lin Tang nios2 port Sandra Loosemore nvptx port Tom de Vries +nvptx port Thomas Schwinge or1k port Stafford Horne pdp11 port Paul Koning powerpcspe port Andrew Jenner -- 2.34.1
Flip the nvptx port to LRA (was: [PATCH] Turn on LRA on all targets)
Hi! On 2023-04-29T09:06:54-0600, Jeff Law via Gcc-patches wrote: > On 4/29/23 07:37, Roger Sayle wrote: >> >> Segher Boessenkool wrote: >>> I send this patch now so that people can start testing. >>> >>> --- a/gcc/config/nvptx/nvptx.cc >>> +++ b/gcc/config/nvptx/nvptx.cc >>> @@ -7601,9 +7601,6 @@ nvptx_asm_output_def_from_decls (FILE *stream, tree >>> name, tree value) >>> #undef TARGET_ATTRIBUTE_TABLE >>> #define TARGET_ATTRIBUTE_TABLE nvptx_attribute_table >>> >>> -#undef TARGET_LRA_P >>> -#define TARGET_LRA_P hook_bool_void_false >>> - >>> #undef TARGET_LEGITIMATE_ADDRESS_P >>> #define TARGET_LEGITIMATE_ADDRESS_P nvptx_legitimate_address_p >> >> I've tested Segher's patch on nvptx-none with make and make -k check and >> can confirm there are no new regressions. Confirmed. Also, no change in nvptx target libraries built. As expected. >> Nvptx is unique in that it >> doesn't >> use register allocation, i.e. GCC's only TARGET_NO_REGISTER_ALLOCATION >> target, >> so it's a little odd that it specifies which register allocator it doesn't >> use. >> >> I hope this helps, > > It does. Consider a patch which flips the nvptx port to LRA as > pre-approved. Pushed to master branch commit f7e3123638712773e8c01e17aae9dc64d9342016 "Flip the nvptx port to LRA", see attached. Grüße Thomas - Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955 >From f7e3123638712773e8c01e17aae9dc64d9342016 Mon Sep 17 00:00:00 2001 From: Segher Boessenkool Date: Sun, 23 Apr 2023 16:47:52 + Subject: [PATCH] Flip the nvptx port to LRA ... understanding that "turn on LRA" is an exaggeration here, given that nvptx isn't actually doing register allocation ('TARGET_NO_REGISTER_ALLOCATION'). gcc/ * config/nvptx/nvptx.cc (TARGET_LRA_P): Remove. Co-authored-by: Thomas Schwinge --- gcc/config/nvptx/nvptx.cc | 3 --- 1 file changed, 3 deletions(-) diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc index e3b0304d5376..16ed78030d73 100644 --- a/gcc/config/nvptx/nvptx.cc +++ b/gcc/config/nvptx/nvptx.cc @@ -7633,9 +7633,6 @@ nvptx_asm_output_def_from_decls (FILE *stream, tree name, tree value) #undef TARGET_ATTRIBUTE_TABLE #define TARGET_ATTRIBUTE_TABLE nvptx_attribute_table -#undef TARGET_LRA_P -#define TARGET_LRA_P hook_bool_void_false - #undef TARGET_LEGITIMATE_ADDRESS_P #define TARGET_LEGITIMATE_ADDRESS_P nvptx_legitimate_address_p -- 2.39.2
Re: The nvptx port [0/11+]
Hi! On Mon, 20 Oct 2014 16:17:56 +0200, Bernd Schmidt ber...@codesourcery.com wrote: This is a patch kit that adds the nvptx port to gcc. Committed to trunk in r220781: commit 0f7695734890f93fe58179e36ac2f41bf4147d78 Author: tschwinge tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4 Date: Wed Feb 18 08:01:03 2015 + nvptx-none: Disable the lto-plugin. config/ * elf.m4 (ACX_ELF_TARGET_IFELSE): nvptx-*-none isn't ELF. / * configure: Regenerate. git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@220781 138bc75d-0d04-0410-961f-82ee72b054a4 --- ChangeLog|4 config/ChangeLog |4 config/elf.m4|7 +-- configure|3 ++- 4 files changed, 15 insertions(+), 3 deletions(-) diff --git ChangeLog ChangeLog index 0969af5..a9e4437 100644 --- ChangeLog +++ ChangeLog @@ -1,3 +1,7 @@ +2015-02-18 Thomas Schwinge tho...@codesourcery.com + + * configure: Regenerate. + 2015-02-06 Diego Novillo dnovi...@google.com * MAINTAINERS (Global Reviewers, Plugin, LTO, tree-ssa, diff --git config/ChangeLog config/ChangeLog index 2cbc885..c9ed121 100644 --- config/ChangeLog +++ config/ChangeLog @@ -1,3 +1,7 @@ +2015-02-18 Thomas Schwinge tho...@codesourcery.com + + * elf.m4 (ACX_ELF_TARGET_IFELSE): nvptx-*-none isn't ELF. + 2014-11-17 Bob Dunlop bob.dun...@xyzzy.org.uk * mt-ospace (CFLAGS_FOR_TARGET): Append -g -Os rather than diff --git config/elf.m4 config/elf.m4 index da051cb..1772a44 100644 --- config/elf.m4 +++ config/elf.m4 @@ -1,4 +1,4 @@ -dnl Copyright (C) 2010, 2011 Free Software Foundation, Inc. +dnl Copyright (C) 2010, 2011, 2015 Free Software Foundation, Inc. dnl This file is free software, distributed under the terms of the GNU dnl General Public License. As a special exception to the GNU General dnl Public License, this file may be distributed as part of a program @@ -7,6 +7,8 @@ dnl the same distribution terms as the rest of that program. dnl From Paolo Bonzini. +dnl Is this an ELF target supporting the LTO plugin? + dnl usage: ACX_ELF_TARGET_IFELSE([if-elf], [if-not-elf]) AC_DEFUN([ACX_ELF_TARGET_IFELSE], [ AC_REQUIRE([AC_CANONICAL_TARGET]) @@ -15,7 +17,8 @@ target_elf=no case $target in *-darwin* | *-aix* | *-cygwin* | *-mingw* | *-aout* | *-*coff* | \ *-msdosdjgpp* | *-vms* | *-wince* | *-*-pe* | \ - alpha*-dec-osf* | *-interix* | hppa[[12]]*-*-hpux*) + alpha*-dec-osf* | *-interix* | hppa[[12]]*-*-hpux* | \ + nvptx-*-none) target_elf=no ;; *) diff --git configure configure index dd794db..f20a6ab 100755 --- configure +++ configure @@ -6047,7 +6047,8 @@ target_elf=no case $target in *-darwin* | *-aix* | *-cygwin* | *-mingw* | *-aout* | *-*coff* | \ *-msdosdjgpp* | *-vms* | *-wince* | *-*-pe* | \ - alpha*-dec-osf* | *-interix* | hppa[12]*-*-hpux*) + alpha*-dec-osf* | *-interix* | hppa[12]*-*-hpux* | \ + nvptx-*-none) target_elf=no ;; *) Grüße, Thomas signature.asc Description: PGP signature
Re: nvptx-tools and nvptx-newlib (was: The nvptx port [10/11+] Target files)
Hi! On Wed, 4 Feb 2015 10:43:14 +0100, Jakub Jelinek ja...@redhat.com wrote: On Mon, Feb 02, 2015 at 04:32:34PM +0100, Thomas Schwinge wrote: Hi! On Tue, 23 Dec 2014 19:49:35 +0100, I wrote: On Mon, 10 Nov 2014 17:19:57 +0100, Bernd Schmidt ber...@codesourcery.com wrote: The scripts (11/11) I've put up on github, along with a hacked up newlib. These are at [...] They are likely to migrate to MentorEmbedded from bernds, but that had some permissions problems last week. That has recently been done: https://github.com/MentorEmbedded/nvptx-tools and https://github.com/MentorEmbedded/nvptx-newlib are now available. (I'm aware that we still are to write up how to actually build and test all this.) I just updated https://gcc.gnu.org/wiki/Offloading?action=diffrev2=26rev1=25. Can you please update the gmane URLs to corresponding https://gcc.gnu.org/ml/gcc-patches/ URLs? We have our own mailing list archives, no need to use third party ones. It's convenient for me (Message-IDs falls out of my mailer automatically, and Gmane happens to support retrieving message by Message-ID), and the sourceware mailing list archives software doesn't interlink articles between different -MM, which I find rather limiting. OK to check in the following to trunk? Committed to trunk in r220783. --- gcc/config/nvptx/nvptx.opt +++ gcc/config/nvptx/nvptx.opt @@ -17,13 +17,13 @@ ; along with GCC; see the file COPYING3. If not see ; http://www.gnu.org/licenses/. -m64 -Target Report RejectNegative Mask(ABI64) -Generate code for a 64 bit ABI - m32 Target Report RejectNegative InverseMask(ABI64) -Generate code for a 32 bit ABI +Generate code for a 32-bit ABI + +m64 +Target Report RejectNegative Mask(ABI64) +Generate code for a 64-bit ABI I'd expect you want also Negative(m64) on the m32 option and Negative(m32) on the m64 option. +@table @gcctabopt + +@item -m32 +@itemx -m64 +@opindex m32 +@opindex m64 +Generate code for 32-bit or 64-bit ABI. I guess you should mention which one of those is the default (if it isn't configure time configurable). Have taken a note to look into these, later. What about multilibs, is newlib built for both -m32 and -m64, or just the default option? So far, we have concentrated only on the 64-bit x86_64 configuration; 32-bit has several known issues to be resolved. https://gcc.gnu.org/PR65099 filed. Grüße, Thomas signature.asc Description: PGP signature
Re: nvptx-tools and nvptx-newlib (was: The nvptx port [10/11+] Target files)
On Wed, Feb 18, 2015 at 09:50:15AM +0100, Thomas Schwinge wrote: What about multilibs, is newlib built for both -m32 and -m64, or just the default option? So far, we have concentrated only on the 64-bit x86_64 configuration; 32-bit has several known issues to be resolved. https://gcc.gnu.org/PR65099 filed. I meant 64-bit and 32-bit PTX. Jakub
nvptx-none: Define empty GOMP_SELF_SPECS (was: The nvptx port [0/11+])
Hi! On Mon, 20 Oct 2014 16:17:56 +0200, Bernd Schmidt ber...@codesourcery.com wrote: This is a patch kit that adds the nvptx port to gcc. I wonder why we haven't been seeing this in our internal development branch -- maybe because on that branch we're still discarding more compiler options in the offloading path? Committed to trunk in r220780: commit 2fdc66a9fcfbc5b77c1c03d7c34893a0a086e8f8 Author: tschwinge tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4 Date: Wed Feb 18 07:45:42 2015 + nvptx-none: Define empty GOMP_SELF_SPECS. Otherwise, offloading with -fopenacc or -fopenmp active will run into: x86_64-unknown-linux-gnu-accel-nvptx-none-gcc: error: unrecognized command line option '-pthread' gcc/ * config/nvptx/nvptx.h (GOMP_SELF_SPECS): Define macro. git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@220780 138bc75d-0d04-0410-961f-82ee72b054a4 --- gcc/ChangeLog|4 gcc/config/nvptx/nvptx.h |4 2 files changed, 8 insertions(+) diff --git gcc/ChangeLog gcc/ChangeLog index 2c75df6..180a605 100644 --- gcc/ChangeLog +++ gcc/ChangeLog @@ -1,3 +1,7 @@ +2015-02-18 Thomas Schwinge tho...@codesourcery.com + + * config/nvptx/nvptx.h (GOMP_SELF_SPECS): Define macro. + 2015-02-18 Andrew Pinski apin...@cavium.com Naveen H.S naveen.hurugalaw...@caviumnetworks.com diff --git gcc/config/nvptx/nvptx.h gcc/config/nvptx/nvptx.h index 9a9954b..e74d16f 100644 --- gcc/config/nvptx/nvptx.h +++ gcc/config/nvptx/nvptx.h @@ -33,6 +33,10 @@ builtin_define (__nvptx__);\ } while (0) +/* Avoid the default in ../../gcc.c, which adds -pthread, which is not + supported for nvptx. */ +#define GOMP_SELF_SPECS + /* Storage Layout. */ #define BITS_BIG_ENDIAN 0 Grüße, Thomas signature.asc Description: PGP signature
Re: nvptx-tools and nvptx-newlib (was: The nvptx port [10/11+] Target files)
On Mon, Feb 02, 2015 at 04:32:34PM +0100, Thomas Schwinge wrote: Hi! On Tue, 23 Dec 2014 19:49:35 +0100, I wrote: On Mon, 10 Nov 2014 17:19:57 +0100, Bernd Schmidt ber...@codesourcery.com wrote: The scripts (11/11) I've put up on github, along with a hacked up newlib. These are at [...] They are likely to migrate to MentorEmbedded from bernds, but that had some permissions problems last week. That has recently been done: https://github.com/MentorEmbedded/nvptx-tools and https://github.com/MentorEmbedded/nvptx-newlib are now available. (I'm aware that we still are to write up how to actually build and test all this.) I just updated https://gcc.gnu.org/wiki/Offloading?action=diffrev2=26rev1=25. Can you please update the gmane URLs to corresponding https://gcc.gnu.org/ml/gcc-patches/ URLs? We have our own mailing list archives, no need to use third party ones. OK to check in the following to trunk? --- gcc/config/nvptx/nvptx.opt +++ gcc/config/nvptx/nvptx.opt @@ -17,13 +17,13 @@ ; along with GCC; see the file COPYING3. If not see ; http://www.gnu.org/licenses/. -m64 -Target Report RejectNegative Mask(ABI64) -Generate code for a 64 bit ABI - m32 Target Report RejectNegative InverseMask(ABI64) -Generate code for a 32 bit ABI +Generate code for a 32-bit ABI + +m64 +Target Report RejectNegative Mask(ABI64) +Generate code for a 64-bit ABI I'd expect you want also Negative(m64) on the m32 option and Negative(m32) on the m64 option. +@table @gcctabopt + +@item -m32 +@itemx -m64 +@opindex m32 +@opindex m64 +Generate code for 32-bit or 64-bit ABI. I guess you should mention which one of those is the default (if it isn't configure time configurable). What about multilibs, is newlib built for both -m32 and -m64, or just the default option? Jakub
Re: nvptx-tools and nvptx-newlib (was: The nvptx port [10/11+] Target files)
Hi! On Tue, 23 Dec 2014 19:49:35 +0100, I wrote: On Mon, 10 Nov 2014 17:19:57 +0100, Bernd Schmidt ber...@codesourcery.com wrote: The scripts (11/11) I've put up on github, along with a hacked up newlib. These are at [...] They are likely to migrate to MentorEmbedded from bernds, but that had some permissions problems last week. That has recently been done: https://github.com/MentorEmbedded/nvptx-tools and https://github.com/MentorEmbedded/nvptx-newlib are now available. (I'm aware that we still are to write up how to actually build and test all this.) I just updated https://gcc.gnu.org/wiki/Offloading?action=diffrev2=26rev1=25. OK to check in the following to trunk? commit a0c73cb76d1f13642df7725d64bc618ee0909abc Author: Thomas Schwinge tho...@codesourcery.com Date: Mon Feb 2 16:29:36 2015 +0100 Begin documenting the nvptx backend. gcc/ * doc/install.texi (nvptx-*-none): New section. * doc/invoke.texi (Nvidia PTX Options): Likewise. * config/nvptx/nvptx.opt: Update. --- gcc/config/nvptx/nvptx.opt | 10 +- gcc/doc/install.texi | 23 +++ gcc/doc/invoke.texi| 26 ++ 3 files changed, 54 insertions(+), 5 deletions(-) diff --git gcc/config/nvptx/nvptx.opt gcc/config/nvptx/nvptx.opt index 1448dfc..249a61d 100644 --- gcc/config/nvptx/nvptx.opt +++ gcc/config/nvptx/nvptx.opt @@ -17,13 +17,13 @@ ; along with GCC; see the file COPYING3. If not see ; http://www.gnu.org/licenses/. -m64 -Target Report RejectNegative Mask(ABI64) -Generate code for a 64 bit ABI - m32 Target Report RejectNegative InverseMask(ABI64) -Generate code for a 32 bit ABI +Generate code for a 32-bit ABI + +m64 +Target Report RejectNegative Mask(ABI64) +Generate code for a 64-bit ABI mmainkernel Target Report RejectNegative diff --git gcc/doc/install.texi gcc/doc/install.texi index c9e3bf1..b31f9b6 100644 --- gcc/doc/install.texi +++ gcc/doc/install.texi @@ -3302,6 +3302,8 @@ information have to. @item @uref{#nds32be-x-elf,,nds32be-*-elf} @item +@uref{#nvptx-x-none,,nvptx-*-none} +@item @uref{#powerpc-x-x,,powerpc*-*-*} @item @uref{#powerpc-x-darwin,,powerpc-*-darwin*} @@ -4269,6 +4271,27 @@ Andes NDS32 target in big endian mode. @html hr / @end html +@anchor{nvptx-x-none} +@heading nvptx-*-none +Nvidia PTX target. + +Instead of GNU binutils, you will need to install +@uref{https://github.com/MentorEmbedded/nvptx-tools/,,nvptx-tools}. +Tell GCC where to find it: +@option{--with-build-time-tools=[install-nvptx-tools]/nvptx-none/bin}. + +A nvptx port of newlib is available at +@uref{https://github.com/MentorEmbedded/nvptx-newlib/,,nvptx-newlib}. +It can be automatically built together with GCC@. For this, add a +symbolic link to nvptx-newlib's @file{newlib} directory to the +directory containing the GCC sources. + +Use the @option{--disable-sjlj-exceptions} and +@option{--enable-newlib-io-long-long} options when configuring. + +@html +hr / +@end html @anchor{powerpc-x-x} @heading powerpc-*-* You can specify a default version for the @option{-mcpu=@var{cpu_type}} diff --git gcc/doc/invoke.texi gcc/doc/invoke.texi index ba81ec7..1fb329e 100644 --- gcc/doc/invoke.texi +++ gcc/doc/invoke.texi @@ -840,6 +840,9 @@ Objective-C and Objective-C++ Dialects}. -mcustom-fpu-cfg=@var{name} @gol -mhal -msmallc -msys-crt0=@var{name} -msys-lib=@var{name}} +@emph{Nvidia PTX Options} +@gccoptlist{-m32 -m64 -mmainkernel} + @emph{PDP-11 Options} @gccoptlist{-mfpu -msoft-float -mac0 -mno-ac0 -m40 -m45 -m10 @gol -mbcopy -mbcopy-builtin -mint32 -mno-int16 @gol @@ -11967,6 +11970,7 @@ platform. * MSP430 Options:: * NDS32 Options:: * Nios II Options:: +* Nvidia PTX Options:: * PDP-11 Options:: * picoChip Options:: * PowerPC Options:: @@ -18277,6 +18281,28 @@ This option is typically used to link with a library provided by a HAL BSP. @end table +@node Nvidia PTX Options +@subsection Nvidia PTX Options +@cindex Nvidia PTX options +@cindex nvptx options + +These options are defined for Nvidia PTX: + +@table @gcctabopt + +@item -m32 +@itemx -m64 +@opindex m32 +@opindex m64 +Generate code for 32-bit or 64-bit ABI. + +@item -mmainkernel +@opindex mmainkernel +Link in code for a __main kernel. This is for stand-alone instead of +offloading execution. + +@end table + @node PDP-11 Options @subsection PDP-11 Options @cindex PDP-11 Options Grüße, Thomas pgp0CHeeOXpKu.pgp Description: PGP signature
nvptx-tools and nvptx-newlib (was: The nvptx port [10/11+] Target files)
Hi! On Mon, 10 Nov 2014 17:19:57 +0100, Bernd Schmidt ber...@codesourcery.com wrote: The scripts (11/11) I've put up on github, along with a hacked up newlib. These are at https://github.com/bernds/nvptx-tools https://github.com/bernds/nvptx-newlib They are likely to migrate to MentorEmbedded from bernds, but that had some permissions problems last week. That has recently been done: https://github.com/MentorEmbedded/nvptx-tools and https://github.com/MentorEmbedded/nvptx-newlib are now available. (I'm aware that we still are to write up how to actually build and test all this.) Grüße, Thomas signature.asc Description: PGP signature
Re: The nvptx port [10/11+] Target files
Hi! On Mon, 10 Nov 2014 17:19:57 +0100, Bernd Schmidt ber...@codesourcery.com wrote: I've now committed it, in the following form. --- /dev/null +++ b/gcc/config/nvptx/nvptx.h @@ -0,0 +1,356 @@ +#define ASM_OUTPUT_ALIGN(FILE, POWER) Committed to trunk in r218689: commit 61f8a1bd770ded96fcff88f3cbc426a23c413992 Author: tschwinge tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4 Date: Fri Dec 12 20:14:10 2014 + nvptx: Define valid ASM_OUTPUT_ALIGN. gcc/ * config/nvptx/nvptx.h (ASM_OUTPUT_ALIGN): Define as a C statment. gcc/doc/tm.texi:@defmac ASM_OUTPUT_ALIGN (@var{stream}, @var{power}) gcc/doc/tm.texi-A C statement to output to the stdio stream @var{stream} an assembler gcc/doc/tm.texi-command to advance the location counter to a multiple of 2 to the gcc/doc/tm.texi-@var{power} bytes. @var{power} will be a C expression of type @code{int}. gcc/doc/tm.texi-@end defmac gcc/config/nvptx/nvptx.h:#define ASM_OUTPUT_ALIGN(FILE, POWER) Empty is not a C statement, and so in code such as: gcc/dwarf2out.c- if (lsda_encoding == DW_EH_PE_aligned) gcc/dwarf2out.c:ASM_OUTPUT_ALIGN (asm_out_file, floor_log2 (PTR_SIZE)); gcc/dwarf2out.c- dw2_asm_output_data (size_of_encoded_value (lsda_encoding), 0, gcc/dwarf2out.c- Language Specific Data Area (none)); gcc/varasm.c- if (align BITS_PER_UNIT) gcc/varasm.c:ASM_OUTPUT_ALIGN (asm_out_file, floor_log2 (align / BITS_PER_UNIT)); gcc/varasm.c- assemble_variable_contents (decl, name, dont_output_data); gcc/varasm.c- if (align 0) gcc/varasm.c:ASM_OUTPUT_ALIGN (asm_out_file, align); gcc/varasm.c- gcc/varasm.c- targetm.asm_out.internal_label (asm_out_file, LTRAMP, 0); gcc/varasm.c- if (align BITS_PER_UNIT) gcc/varasm.c:ASM_OUTPUT_ALIGN (asm_out_file, floor_log2 (align / BITS_PER_UNIT)); gcc/varasm.c- assemble_constant_contents (exp, XSTR (symbol, 0), align); ..., GCC warns: [...]/source-gcc/gcc/dwarf2out.c: In function 'void output_fde(dw_fde_ref, bool, bool, char*, int, char*, bool, int)': [...]/source-gcc/gcc/dwarf2out.c:665:3: warning: suggest braces around empty body in an 'if' statement [-Wempty-body] ASM_OUTPUT_ALIGN (asm_out_file, floor_log2 (PTR_SIZE)); ^ [...]/source-gcc/gcc/varasm.c: In function 'void assemble_variable(tree, int, int, int)': [...]/source-gcc/gcc/varasm.c:2217:2: warning: suggest braces around empty body in an 'if' statement [-Wempty-body] ASM_OUTPUT_ALIGN (asm_out_file, floor_log2 (align / BITS_PER_UNIT)); ^ [...]/source-gcc/gcc/varasm.c: In function 'rtx_def* assemble_trampoline_template()': [...]/source-gcc/gcc/varasm.c:2603:5: warning: suggest braces around empty body in an 'if' statement [-Wempty-body] ASM_OUTPUT_ALIGN (asm_out_file, align); ^ [...]/source-gcc/gcc/varasm.c: In function 'void output_constant_def_contents(rtx)': [...]/source-gcc/gcc/varasm.c:3413:2: warning: suggest braces around empty body in an 'if' statement [-Wempty-body] ASM_OUTPUT_ALIGN (asm_out_file, floor_log2 (align / BITS_PER_UNIT)); ^ Also, use the values, to get rid of that one: [...]/source-gcc/gcc/final.c: In function 'rtx_insn* final_scan_insn(rtx_insn*, FILE*, int, int, int*)': [...]/source-gcc/gcc/final.c:2450:12: warning: variable 'log_align' set but not used [-Wunused-but-set-variable] int log_align; ^ git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@218689 138bc75d-0d04-0410-961f-82ee72b054a4 --- gcc/ChangeLog| 4 gcc/config/nvptx/nvptx.h | 10 +- 2 files changed, 13 insertions(+), 1 deletion(-) diff --git gcc/ChangeLog gcc/ChangeLog index 689c4fd..e5de2c6 100644 --- gcc/ChangeLog +++ gcc/ChangeLog @@ -1,3 +1,7 @@ +2014-12-12 Thomas Schwinge tho...@codesourcery.com + + * config/nvptx/nvptx.h (ASM_OUTPUT_ALIGN): Define as a C statment. + 2014-12-12 Vladimir Makarov vmaka...@redhat.com PR target/64110 diff --git gcc/config/nvptx/nvptx.h gcc/config/nvptx/nvptx.h index c222375..5f08ba7 100644 --- gcc/config/nvptx/nvptx.h +++ gcc/config/nvptx/nvptx.h @@ -281,9 +281,17 @@ struct GTY(()) machine_function } \ while (0) -#define ASM_OUTPUT_ALIGN(FILE, POWER) +#define ASM_OUTPUT_ALIGN(FILE, POWER) \ + do \ +{ \ + (void) (FILE); \ + (void) (POWER); \ +}
Re: The nvptx port
On 11/14/14 11:04, Jeff Law wrote: On 11/14/14 05:36, Jakub Jelinek wrote: So, for a warp, if some threads perform one branch of an if and other threads another one, all threads perform the first one first (with some maybe not doing anything), then all the threads the others (again, other threads not doing anything)? Nobody ever specified exactly what happens in this case to me, but I gathered from reading the docs that once you have some threads in one path and others in a different path, things slow down to a horrid crawl. So you try to avoid that :-) this is correct. Don't do that.
Re: The nvptx port
On 11/14/14 10:43, Jeff Law wrote: On 11/14/14 04:09, Bernd Schmidt wrote: Hi Jakub, I have some questions about nvptx: 1) you've said that alloca isn't supported, but it seems Yes, it's unimplemented. There's an internal declaration for it but that seems to be as far as it goes, and that declaration is 32-bit only anyway. Right. My recollection is it's defined in the vISA, but unimplemented. yup, all PTX docs I've seen (which is up to 3.2) say: 'Note: The current version of PTX does not support alloca.' and as Bernd says, the associated text only talks about a declaration for 32-bit land. nathan
The nvptx port
Hi! I have some questions about nvptx: 1) you've said that alloca isn't supported, but it seems to be wired up and uses the %alloca documented in the PTX manual, what is the issue with that? %alloca not being actually implemented by the current PTX assembler or translator? Or some local vs. global address space issues? If the latter, could at least VLAs be supported? 2) what is the reason why TLS isn't supported by the port (well, __emutls is emitted, but I doubt pthread_[gs]etspecific is implementable and thus it will not really do anything. Can't the port just emit all DECL_THREAD_LOCAL_P variables into .local instead of .global address space? Would one need to convert those pointers to generic any way? I'm asking because e.g. libgomp uses __thread heavily and it would be nice to be able to use that. 3) in assembly emitted by the nvptx port, I've noticed: .visible .func (.param.u32 %out_retval)foo(.param.u64 %in_ar1, .param.u32 %in_ar2) { .reg.u64 %ar1; .reg.u32 %ar2; .reg.u32 %retval; .reg.u64 %hr10; .reg.u32 %r22; .reg.u64 %r25; is the missing \t before the %retval line intentional? 4) I had a brief look at what it would take to port libgomp to PTX, which is needed for OpenMP offloading. OpenMP offloaded kernels should start with 1 team and 1 thread in it, if we ignore GOMP_teams for now, I think the major things are: - right now libgomp is heavily pthread_* based, which is a no-go for nvptx I assume, I think we'll need some ifdefs in the sources - the main thing is that I believe we just have to replace gomp_team_start for nvptx; seems there are cudaLaunchDevice (and cudaGetParameterBuffer) functions one can use to spawn selected kernel in selected number of threads (and teams), from the docs it isn't exactly clear what the calling thread will do, if it is suspended and the HW core given to it is reused by something else (e.g. one of the newly spawned threads), then I think it should be usable. Not sure what happens with .local memory of the parent task, if the children all have different .local memory, then perhaps one could just copy over what is needed from the invoking to the first invoked thread at start. The question is how to figure out what to pass to cudeLaunchDevice (e.g. how to get handle of the current stream), and how to query how many teams and/or threads it is reasonable to ask for if the program wants defaults (and how many teams/threads are hard limits beyond which one can't go) - is it worth to reuse cudaLaunchDevice threads or are they cheap enough to start that any thread pooling should be removed for nvptx? - we'll need some synchronization primitives, I see atomic support is there, we need mutexes and semaphores I think, is that implementable using bar instruction? - the library uses __attribute__((constructor)) in 3 places or so, initialize_team is pthread specific and can be probably ifdefed out, we won't support dlclose in nvptx anyway, but at least we need some way to initialize the nvptx libgomp; if the initialization is done in global memory, would it persist in between different kernels, so can the initialization as separate kernel be run once, something else? - is there any way to do any affinity management, or shall we just ignore affinity strategies? - the target/offloading stuff should be most likely stubbed in the library for nvptx, target data/target regions inside of target regions are undefined behavior in OpenMP, no need to bloat things - any way how to query time? Other thoughts? Jakub
Re: The nvptx port
On Fri, Nov 14, 2014 at 09:29:48AM +0100, Jakub Jelinek wrote: I have some questions about nvptx: Oh, and 5) I have noticed gcc doesn't generate the .uni suffixes anywhere, while llvm generates them; are those appropriate only when a function is guaranteed to be run unconditionally from the toplevel kernel, or even in spots in arbitrary functions which might not be run unconditionally by all threads in thread block, but all threads that encounter the particular function will run the specific spot unconditionally? I mean, if we have arbitrary function: void foo (void) { something; bar (); something; } then the call is unconditional in there, but there is no guarantee somebody will not do void baz (int x) { if (x 20) foo (); } and run foo only in a subset of the threads. Jakub
Re: The nvptx port
Hi Jakub, I have some questions about nvptx: 1) you've said that alloca isn't supported, but it seems to be wired up and uses the %alloca documented in the PTX manual, what is the issue with that? %alloca not being actually implemented by the current PTX assembler or translator? Yes, it's unimplemented. There's an internal declaration for it but that seems to be as far as it goes, and that declaration is 32-bit only anyway. 2) what is the reason why TLS isn't supported by the port (well, __emutls is emitted, but I doubt pthread_[gs]etspecific is implementable and thus it will not really do anything. Can't the port just emit all DECL_THREAD_LOCAL_P variables into .local instead of .global address space? .local is stack frame memory, not TLS. The ptx docs mention the use of .local at file-scope as occurring only in legacy ptx code and I get the impression it's discouraged. (As an aside, there's a question of how to represent a different concept, gang-local memory, in gcc. That would be .shared memory. We're currently going with just using an internal attribute) 3) in assembly emitted by the nvptx port, I've noticed: .visible .func (.param.u32 %out_retval)foo(.param.u64 %in_ar1, .param.u32 %in_ar2) { .reg.u64 %ar1; .reg.u32 %ar2; .reg.u32 %retval; .reg.u64 %hr10; .reg.u32 %r22; .reg.u64 %r25; is the missing \t before the %retval line intentional? No, I can fix that up. 4) I had a brief look at what it would take to port libgomp to PTX, which is needed for OpenMP offloading. OpenMP offloaded kernels should start with 1 team and 1 thread in it, if we ignore GOMP_teams for now, I think the major things are: - right now libgomp is heavily pthread_* based, which is a no-go for nvptx I assume, I think we'll need some ifdefs in the sources I haven't looked into whether libpthread is doable. I suspect it's a poor match. I also haven't really looked into OpenMP, so I'm feeling a bit uncertain about answering your further questions. - the main thing is that I believe we just have to replace gomp_team_start for nvptx; seems there are cudaLaunchDevice (and cudaGetParameterBuffer) functions one can use to spawn selected kernel in selected number of threads (and teams), from the docs it isn't exactly clear what the calling thread will do, if it is suspended and the HW core given to it is reused by something else (e.g. one of the newly spawned threads), then I think it should be usable. Not sure what happens with .local memory of the parent task, if the children all have different .local memory, then perhaps one could just copy over what is needed from the invoking to the first invoked thread at start. I'm a bit confused here, it sounds as if you want to call cudaLaunchDevice from ptx code? These are called from the host. As mentioned above, .local is probably not useful for what you want. - is it worth to reuse cudaLaunchDevice threads or are they cheap enough to start that any thread pooling should be removed for nvptx? Sorry, I don't understand the question. - we'll need some synchronization primitives, I see atomic support is there, we need mutexes and semaphores I think, is that implementable using bar instruction? It's probably membar you need. - the library uses __attribute__((constructor)) in 3 places or so, initialize_team is pthread specific and can be probably ifdefed out, we won't support dlclose in nvptx anyway, but at least we need some way to initialize the nvptx libgomp; if the initialization is done in global memory, would it persist in between different kernels, so can the initialization as separate kernel be run once, something else? I think that it would persist, and this would be my scheme for implementing constructors, but I haven't actually tried. - is there any way to do any affinity management, or shall we just ignore affinity strategies? Not sure what they do in libgomp. It's probably not a match for GPU architectures. - any way how to query time? There are %clock and %clock64 cycle counters. Bernd
Re: The nvptx port
On 11/14/2014 11:01 AM, Jakub Jelinek wrote: On Fri, Nov 14, 2014 at 09:29:48AM +0100, Jakub Jelinek wrote: I have some questions about nvptx: Oh, and 5) I have noticed gcc doesn't generate the .uni suffixes anywhere, while llvm generates them; are those appropriate only when a function is guaranteed to be run unconditionally from the toplevel kernel, or even in spots in arbitrary functions which might not be run unconditionally by all threads in thread block, but all threads that encounter the particular function will run the specific spot unconditionally? I mean, if we have arbitrary function: void foo (void) { something; bar (); something; } then the call is unconditional in there, but there is no guarantee somebody will not do void baz (int x) { if (x 20) foo (); } and run foo only in a subset of the threads. It's unclear to me what the .uni suffix even does on calls. Google finds this: http://divmap.wordpress.com/home/divopt/ which suggests that it says that the call's predicate will evaluate to the same value on all threads. So I think for an unconditional call instruction it's just meaningless. Bernd
Re: The nvptx port
On Fri, Nov 14, 2014 at 12:09:03PM +0100, Bernd Schmidt wrote: I have some questions about nvptx: 1) you've said that alloca isn't supported, but it seems to be wired up and uses the %alloca documented in the PTX manual, what is the issue with that? %alloca not being actually implemented by the current PTX assembler or translator? Yes, it's unimplemented. There's an internal declaration for it but that seems to be as far as it goes, and that declaration is 32-bit only anyway. :(. Does NVidia plan to fix that in next version? 2) what is the reason why TLS isn't supported by the port (well, __emutls is emitted, but I doubt pthread_[gs]etspecific is implementable and thus it will not really do anything. Can't the port just emit all DECL_THREAD_LOCAL_P variables into .local instead of .global address space? .local is stack frame memory, not TLS. The ptx docs mention the use of .local at file-scope as occurring only in legacy ptx code and I get the impression it's discouraged. :(. So what other option one has to implement something like TLS, even using inline asm or similar? There is %tid, so perhaps indexing some array with %tid? The trouble with that is that some thread can do #pragma omp parallel again, and I bet the %tid afterwards would be again 0-(n-1), and if it is an index into a global array, it wouldn't work well then. Maybe without anything like TLS we can't really support nested parallelism, only one level of #pragma omp parallel inside of nvptx regions. But, if we add support for #pragma omp team, we'd either need the array in gang-local memory, or some other special register to give us gang id. BTW, one can still invoke OpenMP target regions (even OpenACC regions) from multiple host threads, so the question is how without local TLS we can actually do anything at all. Sure, we can pass parameters to the kernel, but we'd need to propagate it through all functions. Or can cudaGetParameterBuffer be used for that? 4) I had a brief look at what it would take to port libgomp to PTX, which is needed for OpenMP offloading. OpenMP offloaded kernels should start with 1 team and 1 thread in it, if we ignore GOMP_teams for now, I think the major things are: - right now libgomp is heavily pthread_* based, which is a no-go for nvptx I assume, I think we'll need some ifdefs in the sources I haven't looked into whether libpthread is doable. I suspect it's a poor match. I also haven't really looked into OpenMP, so I'm feeling a bit uncertain about answering your further questions. What OpenMP needs is essentially: - some way to spawn multiple threads (fork-join model), where the parent thread is the first one among those other threads, or, if that isn't possible, the first thread pretends to be the same as the first thread and the parent thread sleeps - something like pthread_mutex_lock/unlock (only basic; or say atomic ops + futex we use for Linux) - something like sem_* semaphore - and some TLS or something similar (pthread_[gs]etspecific etc.) - the main thing is that I believe we just have to replace gomp_team_start for nvptx; seems there are cudaLaunchDevice (and cudaGetParameterBuffer) functions one can use to spawn selected kernel in selected number of threads (and teams), from the docs it isn't exactly clear what the calling thread will do, if it is suspended and the HW core given to it is reused by something else (e.g. one of the newly spawned threads), then I think it should be usable. Not sure what happens with .local memory of the parent task, if the children all have different .local memory, then perhaps one could just copy over what is needed from the invoking to the first invoked thread at start. I'm a bit confused here, it sounds as if you want to call cudaLaunchDevice from ptx code? These are called from the host. As mentioned above, .local is probably not useful for what you want. In CUDA_Dynamic_Parallelism_Programming_Guide.pdf in C.3.2 it is mentioned it should be possible, there is: .extern .func(.param .b32 func_retval0) cudaLaunchDevice ( .param .b64 func, .param .b64 parameterBuffer, .param .align 4 .b8 gridDimension[12], .param .align 4 .b8 blockDimension[12], .param .b32 sharedMemSize, .param .b64 stream ) ; (or s/.b64/.b32/ for -m32) that should be usable from within PTX. The Liao-OpenMP-Accelerator-Model-2013.pdf paper also mentions using dynamic parallelism (because all other variants are just bad for OpenMP, you'd need to preallocate all the gangs/threads (without knowing how many you'll need), and perhaps let them sleep on some barrier until you have work for them. - is it worth to reuse cudaLaunchDevice threads or are they cheap enough to start that any thread pooling should be removed for nvptx? Sorry, I don't understand the question. I meant what is the cost of cudaLaunchDevice
Re: The nvptx port
I'm adding Thomas and Cesar to the Cc list, they may have more insight into CUDA library questions as I haven't really looked into that part all that much. On 11/14/2014 12:39 PM, Jakub Jelinek wrote: On Fri, Nov 14, 2014 at 12:09:03PM +0100, Bernd Schmidt wrote: I have some questions about nvptx: 1) you've said that alloca isn't supported, but it seems to be wired up and uses the %alloca documented in the PTX manual, what is the issue with that? %alloca not being actually implemented by the current PTX assembler or translator? Yes, it's unimplemented. There's an internal declaration for it but that seems to be as far as it goes, and that declaration is 32-bit only anyway. :(. Does NVidia plan to fix that in next version? I very much doubt it. It was like this in CUDA 5.0 when we started working on it, and it's still like this in CUDA 6.5. 2) what is the reason why TLS isn't supported by the port (well, __emutls is emitted, but I doubt pthread_[gs]etspecific is implementable and thus it will not really do anything. Can't the port just emit all DECL_THREAD_LOCAL_P variables into .local instead of .global address space? .local is stack frame memory, not TLS. The ptx docs mention the use of .local at file-scope as occurring only in legacy ptx code and I get the impression it's discouraged. :(. So what other option one has to implement something like TLS, even using inline asm or similar? There is %tid, so perhaps indexing some array with %tid? That ought to work. For performance you'd want that array in .shared memory but I believe that's limited in size. BTW, one can still invoke OpenMP target regions (even OpenACC regions) from multiple host threads, so the question is how without local TLS we can actually do anything at all. Sure, we can pass parameters to the kernel, but we'd need to propagate it through all functions. Or can cudaGetParameterBuffer be used for that? Presumably a kernel could copy its arguments out to memory somewhere when it's called? 4) I had a brief look at what it would take to port libgomp to PTX, which is needed for OpenMP offloading. OpenMP offloaded kernels should start with 1 team and 1 thread in it, if we ignore GOMP_teams for now, I think the major things are: - right now libgomp is heavily pthread_* based, which is a no-go for nvptx I assume, I think we'll need some ifdefs in the sources I haven't looked into whether libpthread is doable. I suspect it's a poor match. I also haven't really looked into OpenMP, so I'm feeling a bit uncertain about answering your further questions. What OpenMP needs is essentially: - some way to spawn multiple threads (fork-join model), where the parent thread is the first one among those other threads, or, if that isn't possible, the first thread pretends to be the same as the first thread and the parent thread sleeps - something like pthread_mutex_lock/unlock (only basic; or say atomic ops + futex we use for Linux) - something like sem_* semaphore - and some TLS or something similar (pthread_[gs]etspecific etc.) - the main thing is that I believe we just have to replace gomp_team_start for nvptx; seems there are cudaLaunchDevice (and cudaGetParameterBuffer) functions one can use to spawn selected kernel in selected number of threads (and teams), from the docs it isn't exactly clear what the calling thread will do, if it is suspended and the HW core given to it is reused by something else (e.g. one of the newly spawned threads), then I think it should be usable. Not sure what happens with .local memory of the parent task, if the children all have different .local memory, then perhaps one could just copy over what is needed from the invoking to the first invoked thread at start. I'm a bit confused here, it sounds as if you want to call cudaLaunchDevice from ptx code? These are called from the host. As mentioned above, .local is probably not useful for what you want. In CUDA_Dynamic_Parallelism_Programming_Guide.pdf in C.3.2 it is mentioned it should be possible, there is: .extern .func(.param .b32 func_retval0) cudaLaunchDevice ( .param .b64 func, .param .b64 parameterBuffer, .param .align 4 .b8 gridDimension[12], .param .align 4 .b8 blockDimension[12], .param .b32 sharedMemSize, .param .b64 stream ) ; (or s/.b64/.b32/ for -m32) that should be usable from within PTX. The Liao-OpenMP-Accelerator-Model-2013.pdf paper also mentions using dynamic parallelism (because all other variants are just bad for OpenMP, you'd need to preallocate all the gangs/threads (without knowing how many you'll need), and perhaps let them sleep on some barrier until you have work for them. The latter would have been essentially the model I'd have tried to use (instead of sleeping, conditionalize on %tid==0). I didn't know there was a way to launch kernels from ptx code and haven't thought about
Re: The nvptx port
On Fri, Nov 14, 2014 at 01:12:40PM +0100, Bernd Schmidt wrote: :(. So what other option one has to implement something like TLS, even using inline asm or similar? There is %tid, so perhaps indexing some array with %tid? That ought to work. For performance you'd want that array in .shared memory but I believe that's limited in size. Any way to query those limits? Size of .shared memory, number of threads in warp, number of warps, etc.? In OpenACC, are all workers in a single gang the same warp? BTW, one can still invoke OpenMP target regions (even OpenACC regions) from multiple host threads, so the question is how without local TLS we can actually do anything at all. Sure, we can pass parameters to the kernel, but we'd need to propagate it through all functions. Or can cudaGetParameterBuffer be used for that? Presumably a kernel could copy its arguments out to memory somewhere when it's called? The question is where. If it is global memory, then how would you find out what value is for your team and what value is for some other team? - we'll need some synchronization primitives, I see atomic support is there, we need mutexes and semaphores I think, is that implementable using bar instruction? It's probably membar you need. That is a memory barrier, I need threads to wait on each other, wake up one another etc. Hmm. It's worthwhile to keep in mind that GPU threads really behave somewhat differently from CPUs (they don't really execute independently); the OMP model may just be a poor match for the architecture in general. One could busywait on a spinlock, but AFAIK there isn't really a way to put a thread to sleep. By not executing independently, I mean this: I believe if one thread in a warp is waiting on the spinlock, all the other ones are also busywaiting. There may be other effects that seem odd if one approaches it from a CPU perspective - for example you probably want only one thread in a warp to try to take the spinlock. So, for a warp, if some threads perform one branch of an if and other threads another one, all threads perform the first one first (with some maybe not doing anything), then all the threads the others (again, other threads not doing anything)? As for the match, OpenMP isn't written for a particular accelerator, though supposedly the addition of #pragma omp teams construct was done for NVidia. So, some OpenMP code may be efficient on PTX, while other code might not be that much (e.g. if all threads in a warp need to execute the same thing, supposedly #pragma omp task isn't very good idea for the devices). Jakub
Re: The nvptx port
On 11/14/2014 01:36 PM, Jakub Jelinek wrote: Any way to query those limits? Size of .shared memory, number of threads in warp, number of warps, etc.? I'd have to google most of that. There seems to be a WARP_SZ constant available in ptx to get the size of the warp. In OpenACC, are all workers in a single gang the same warp? No, warps are a relatively small size (32 threads). So, for a warp, if some threads perform one branch of an if and other threads another one, all threads perform the first one first (with some maybe not doing anything), then all the threads the others (again, other threads not doing anything)? I believe that's what happens. Bernd
Re: The nvptx port
On 11/14/2014 04:12 AM, Bernd Schmidt wrote: - we'll need some synchronization primitives, I see atomic support is there, we need mutexes and semaphores I think, is that implementable using bar instruction? It's probably membar you need. That is a memory barrier, I need threads to wait on each other, wake up one another etc. Hmm. It's worthwhile to keep in mind that GPU threads really behave somewhat differently from CPUs (they don't really execute independently); the OMP model may just be a poor match for the architecture in general. One could busywait on a spinlock, but AFAIK there isn't really a way to put a thread to sleep. By not executing independently, I mean this: I believe if one thread in a warp is waiting on the spinlock, all the other ones are also busywaiting. There may be other effects that seem odd if one approaches it from a CPU perspective - for example you probably want only one thread in a warp to try to take the spinlock. Thread synchronization in CUDA is different from conventional CPUs. Using the gang/thread terminology, there's no way to synchronize two threads in two different gangs in PTX without invoking separate kernels. Basically, after a kernel is invoked, the host/accelerator (the later using dynamic parallelism) waits for the kernel to finish, and that effectively creates a barrier. PTX does have an intra-gang synchronization primitive, which is helpful if the control flow diverges within a gang. Also, unless I'm mistaken, the PTX atomic operations only work within a gang. Also, keep in mind that PTX doesn't have a global TID. The user needs to calculate it using ctaid/tid and friends. Cesar
Re: The nvptx port
On Fri, Nov 14, 2014 at 07:37:49AM -0800, Cesar Philippidis wrote: Hmm. It's worthwhile to keep in mind that GPU threads really behave somewhat differently from CPUs (they don't really execute independently); the OMP model may just be a poor match for the architecture in general. One could busywait on a spinlock, but AFAIK there isn't really a way to put a thread to sleep. By not executing independently, I mean this: I believe if one thread in a warp is waiting on the spinlock, all the other ones are also busywaiting. There may be other effects that seem odd if one approaches it from a CPU perspective - for example you probably want only one thread in a warp to try to take the spinlock. Thread synchronization in CUDA is different from conventional CPUs. Using the gang/thread terminology, there's no way to synchronize two threads in two different gangs in PTX without invoking separate kernels. Basically, after a kernel is invoked, the host/accelerator (the later using dynamic parallelism) waits for the kernel to finish, and that effectively creates a barrier. I believe in OpenMP terminology a gang is a team, and inter-teams barriers are not supposed to work etc. (though, I think locks and atomic instructions still are, so is critical region, so I really hope atomics are atomic even inter-gang). So for synchronization (mutexes and semaphores, from which barriers are implemented; but perhaps could also use bar.arrive and bar.sync) we mainly need synchronization within the gang. Also, keep in mind that PTX doesn't have a global TID. The user needs to calculate it using ctaid/tid and friends. Ok. Is %gridid needed for that combo too? Jakub
Re: The nvptx port
On 11/14/2014 08:18 AM, Jakub Jelinek wrote: Also, keep in mind that PTX doesn't have a global TID. The user needs to calculate it using ctaid/tid and friends. Ok. Is %gridid needed for that combo too? Eventually, probably. Currently, we're launching all of our kernels with cuLaunchKernel, and that function doesn't take grids into account. Nvidia's documentation is kind of confusing. They use different terminology for their high level CUDA stuff and the low level PTX. E.g., what CUDA refers to blocks/warps, PTX calls CTAs. I'm not sure what grids corresponds to, but I think it might be devices. If that's the case, the runtime does have the capability to select which device to run a kernel on. But, it can't run a single kernel on multiple devices unless you use asynchronous kernel invocations. Cesar
Re: The nvptx port
On 11/14/14 04:09, Bernd Schmidt wrote: Hi Jakub, I have some questions about nvptx: 1) you've said that alloca isn't supported, but it seems to be wired up and uses the %alloca documented in the PTX manual, what is the issue with that? %alloca not being actually implemented by the current PTX assembler or translator? Yes, it's unimplemented. There's an internal declaration for it but that seems to be as far as it goes, and that declaration is 32-bit only anyway. Right. My recollection is it's defined in the vISA, but unimplemented. Jeff
Re: The nvptx port
On 11/14/14 04:39, Jakub Jelinek wrote: On Fri, Nov 14, 2014 at 12:09:03PM +0100, Bernd Schmidt wrote: I have some questions about nvptx: 1) you've said that alloca isn't supported, but it seems to be wired up and uses the %alloca documented in the PTX manual, what is the issue with that? %alloca not being actually implemented by the current PTX assembler or translator? Yes, it's unimplemented. There's an internal declaration for it but that seems to be as far as it goes, and that declaration is 32-bit only anyway. :(. Does NVidia plan to fix that in next version? They haven't indicated any such plans to me directly. However, there's a clear direction to support arbitrary C/C++ over time. jeff
Re: The nvptx port
On Fri, Nov 14, 2014 at 08:37:52AM -0800, Cesar Philippidis wrote: On 11/14/2014 08:18 AM, Jakub Jelinek wrote: Also, keep in mind that PTX doesn't have a global TID. The user needs to calculate it using ctaid/tid and friends. Ok. Is %gridid needed for that combo too? Eventually, probably. Currently, we're launching all of our kernels with cuLaunchKernel, and that function doesn't take grids into account. I wonder if cudaLaunchDevice called from PTX will result in a different %gridid or not, will see next week if I manage to get the HW and SW stack Jakub
Re: The nvptx port
On 11/14/14 04:39, Jakub Jelinek wrote: :(. So what other option one has to implement something like TLS, even using inline asm or similar? There is %tid, so perhaps indexing some array with %tid? The trouble with that is that some thread can do #pragma omp parallel again, and I bet the %tid afterwards would be again 0-(n-1), and if it is an index into a global array, it wouldn't work well then. Maybe without anything like TLS we can't really support nested parallelism, only one level of #pragma omp parallel inside of nvptx regions. But, if we add support for #pragma omp team, we'd either need the array in gang-local memory, or some other special register to give us gang id. Does the interface to the hardware even allow a model where we can launch another offload task while one is in progress? Jeff
Re: The nvptx port
On 11/14/14 05:36, Jakub Jelinek wrote: So, for a warp, if some threads perform one branch of an if and other threads another one, all threads perform the first one first (with some maybe not doing anything), then all the threads the others (again, other threads not doing anything)? Nobody ever specified exactly what happens in this case to me, but I gathered from reading the docs that once you have some threads in one path and others in a different path, things slow down to a horrid crawl. So you try to avoid that :-) jeff
Re: The nvptx port [0/11+]
On Mon, Oct 20, 2014 at 4:17 PM, Bernd Schmidt ber...@codesourcery.com wrote: This is a patch kit that adds the nvptx port to gcc. It contains preliminary patches to add needed functionality, the target files, and one somewhat optional patch with additional target tools. There'll be more patch series, one for the testsuite, and one to make the offload functionality work with this port. Also required are the previous four rtl patches, two of which weren't entirely approved yet. For the moment, I've stripped out all the address space support that got bogged down in review by brokenness in our representation of address spaces. The ptx address spaces are of course still defined and used inside the backend. Ptx really isn't a usual target - it is a virtual target which is then translated by another compiler (ptxas) to the final code that runs on the GPU. There are many restrictions, some imposed by the GPU hardware, and some by the fact that not everything you'd want can be represented in ptx. Here are some of the highlights: * Everything is typed - variables, functions, registers. This can cause problems with KR style C or anything else that doesn't have a proper type internally. * Declarations are needed, even for undefined variables. * Can't emit initializers referring to their variable's address since you can't write forward declarations for variables. * Variables can be declared only as scalars or arrays, not structures. Initializers must be in the variable's declared type, which requires some code in the backend, and it means that packed pointer values are not representable. * Since it's a virtual target, we skip register allocation - no good can probably come from doing that twice. This means asm statements aren't fixed up and will fail if they use matching constraints. * No support for indirect jumps, label values, nonlocal gotos. * No alloca - ptx defines it, but it's not implemented. * No trampolines. * No debugging (at all, for now - we may add line number directives). * Limited C library support - I have a hacked up copy of newlib that provides a reasonable subset. * malloc and free are defined by ptx (these appear to be undocumented), but there isn't a realloc. I have one patch for Fortran to use a malloc/memcpy helper function in cases where we know the old size. All in all, this is not intended to be used as a C (or any other source language) compiler. I've gone through a lot of effort to make it work reasonably well, but only in order to get sufficient test coverage from the testsuites. The intended use for this is only to build it as an offload compiler, and use it through OpenACC by way of lto1. That leaves the question of how we should document it - does it need the usual constraint and option documentation, given that user's aren't expected to use any of it? A slightly earlier version of the entire patch kit was bootstrapped and tested on x86_64-linux. Ok for trunk? Now that this has been committed - I notice that there is no entry in MAINTAINERS for the port. I propose Bernd. Thanks, Richard. Bernd
Re: The nvptx port [0/11+]
On 11/12/14 05:34, Richard Biener wrote: Now that this has been committed - I notice that there is no entry in MAINTAINERS for the port. I propose Bernd. Well, ahead of you there. I proposed Bernd to the steering committee as the maintainer a little while ago. I need to go back and count votes :-) jeff
Re: The nvptx port [10/11+] Target files
On 10/30/2014 12:35 AM, Jeff Law wrote: A nit -- Richard S. recently removed the need to include the enum for enum machine_mode. I believe he had a script to handle the mundane parts of that change. Please make sure to update the nvptx port to conform to that new convention, obviously feel free to use the script if you want. You may need to update with James Greenhalgh's changes to MOVE_BY_PIECES_P and friends. With those two issues addressed as needed, this is OK for the trunk. I've now committed it, in the following form. Other than the enum thing, this also adds some atomic instructions. The scripts (11/11) I've put up on github, along with a hacked up newlib. These are at https://github.com/bernds/nvptx-tools https://github.com/bernds/nvptx-newlib They are likely to migrate to MentorEmbedded from bernds, but that had some permissions problems last week. Bernd commit 659744a99d815b168716b4460e32f6a21593e494 Author: Bernd Schmidt ber...@codesourcery.com Date: Thu Nov 6 19:03:57 2014 +0100 Add the nvptx port. * configure.ac: Handle nvptx-*-*. * configure: Regenerate. gcc/ * config/nvptx/nvptx.c: New file. * config/nvptx/nvptx.h: New file. * config/nvptx/nvptx-protos.h: New file. * config/nvptx/nvptx.md: New file. * config/nvptx/t-nvptx: New file. * config/nvptx/nvptx.opt: New file. * common/config/nvptx/nvptx-common.c: New file. * config.gcc: Handle nvptx-*-*. libgcc/ * config.host: Handle nvptx-*-*. * shared-object.mk (as-flags-$o): Define. ($(base)$(objext), $(base)_s$(objext)): Use it instead of -xassembler-with-cpp. * static-object.mk: Identical changes. * config/nvptx/t-nvptx: New file. * config/nvptx/crt0.s: New file. * config/nvptx/free.asm: New file. * config/nvptx/malloc.asm: New file. * config/nvptx/realloc.c: New file. diff --git a/ChangeLog b/ChangeLog index fd6172a..e83d1e6 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,8 @@ +2014-11-06 Bernd Schmidt ber...@codesourcery.com + + * configure.ac: Handle nvptx-*-*. + * configure: Regenerate. + 2014-11-06 Prachi Godbole prachi.godb...@imgtec.com * MAINTAINERS (Write After Approval): Add myself. diff --git a/configure b/configure index d0c760b..0e014a3 100755 --- a/configure +++ b/configure @@ -3779,6 +3779,10 @@ case ${target} in mips*-*-*) noconfigdirs=$noconfigdirs gprof ;; + nvptx*-*-*) +# nvptx is just a compiler +noconfigdirs=$noconfigdirs target-libssp target-libstdc++-v3 target-libobjc +;; sh-*-* | sh64-*-*) case ${target} in sh*-*-elf) diff --git a/configure.ac b/configure.ac index 2f0af4a..b1ef069 100644 --- a/configure.ac +++ b/configure.ac @@ -1138,6 +1138,10 @@ case ${target} in mips*-*-*) noconfigdirs=$noconfigdirs gprof ;; + nvptx*-*-*) +# nvptx is just a compiler +noconfigdirs=$noconfigdirs target-libssp target-libstdc++-v3 target-libobjc +;; sh-*-* | sh64-*-*) case ${target} in sh*-*-elf) diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 731a7bc8b..c170e69 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,14 @@ +2014-11-10 Bernd Schmidt ber...@codesourcery.com + + * config/nvptx/nvptx.c: New file. + * config/nvptx/nvptx.h: New file. + * config/nvptx/nvptx-protos.h: New file. + * config/nvptx/nvptx.md: New file. + * config/nvptx/t-nvptx: New file. + * config/nvptx/nvptx.opt: New file. + * common/config/nvptx/nvptx-common.c: New file. + * config.gcc: Handle nvptx-*-*. + 2014-11-10 Richard Biener rguent...@suse.de * tree-ssa-operands.c (finalize_ssa_uses): Properly put diff --git a/gcc/common/config/nvptx/nvptx-common.c b/gcc/common/config/nvptx/nvptx-common.c new file mode 100644 index 000..80ab076 --- /dev/null +++ b/gcc/common/config/nvptx/nvptx-common.c @@ -0,0 +1,38 @@ +/* NVPTX common hooks. + Copyright (C) 2014 Free Software Foundation, Inc. + Contributed by Bernd Schmidt ber...@codesourcery.com + +This file is part of GCC. + +GCC is free software; you can redistribute it and/or modify +it under the terms of the GNU General Public License as published by +the Free Software Foundation; either version 3, or (at your option) +any later version. + +GCC is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + +You should have received a copy of the GNU General Public License +along with GCC; see the file COPYING3. If not see +http://www.gnu.org/licenses/. */ + +#include config.h +#include system.h +#include coretypes.h +#include diagnostic-core.h +#include tm.h +#include tm_p.h +#include common/common-target.h +#include common/common-target-def.h +#include opts.h +#include flags.h + +#undef TARGET_HAVE_NAMED_SECTIONS +#define TARGET_HAVE_NAMED_SECTIONS false + +#undef TARGET_DEFAULT_TARGET_FLAGS
Re: The nvptx port [10/11+] Target files
On Mon, Nov 10, 2014 at 05:19:57PM +0100, Bernd Schmidt wrote: commit 659744a99d815b168716b4460e32f6a21593e494 Author: Bernd Schmidt ber...@codesourcery.com Date: Thu Nov 6 19:03:57 2014 +0100 Note, in r217301 you've committed a change to pr35468.c, not mentioned in the ChangeLog, that uses no_const_addr_space effective target that is never defined. Can you please revert or commit a patch that adds support for that to gcc/testsuite/lib/ ? +ERROR: gcc.c-torture/compile/pr35468.c -O0 : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -O0 : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.c-torture/compile/pr35468.c -O1 : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -O1 : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.c-torture/compile/pr35468.c -O2 -flto -flto-partition=none : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -O2 -flto -flto-partition=none : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.c-torture/compile/pr35468.c -O2 -flto : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -O2 -flto : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.c-torture/compile/pr35468.c -O2 : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -O2 : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.c-torture/compile/pr35468.c -O3 -fomit-frame-pointer : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -O3 -fomit-frame-pointer : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.c-torture/compile/pr35468.c -O3 -g : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -O3 -g : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.c-torture/compile/pr35468.c -Os : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -Os : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.dg/pr44194-1.c: syntax error in target selector target { { { { { i?86-*-* x86_64-*-* } x32 } || lp64 } { ! s390*-*-* } } { ! hppa*64*-*-* } } { ! alpha*-*-* }{ ! powerpc*-*-linux* } || powerpc_elfv2! nvptx-*-* for dg-do 1 compile { target { { { { { { { i?86-*-* x86_64-*-* } x32 } || lp64 } { ! s390*-*-* } } { ! hppa*64*-*-* } } { ! alpha*-*-* } } { { ! powerpc*-*-linux* } || powerpc_elfv2 } { ! nvptx-*-* } } } +UNRESOLVED: gcc.dg/pr44194-1.c: syntax error in target selector target { { { { { i?86-*-* x86_64-*-* } x32 } || lp64 } { ! s390*-*-* } } { ! hppa*64*-*-* } } { ! alpha*-*-* }{ ! powerpc*-*-linux* } || powerpc_elfv2! nvptx-*-* for dg-do 1 compile { target { { { { { { { i?86-*-* x86_64-*-* } x32 } || lp64 } { ! s390*-*-* } } { ! hppa*64*-*-* } } { ! alpha*-*-* } } { { ! powerpc*-*-linux* } || powerpc_elfv2 } { ! nvptx-*-* } } } +FAIL: gcc.dg/pr45352-1.c (test for excess errors) Jakub
Re: The nvptx port [10/11+] Target files
On Mon, Nov 10, 2014 at 12:04 PM, Jakub Jelinek ja...@redhat.com wrote: On Mon, Nov 10, 2014 at 05:19:57PM +0100, Bernd Schmidt wrote: commit 659744a99d815b168716b4460e32f6a21593e494 Author: Bernd Schmidt ber...@codesourcery.com Date: Thu Nov 6 19:03:57 2014 +0100 Note, in r217301 you've committed a change to pr35468.c, not mentioned in the ChangeLog, that uses no_const_addr_space effective target that is never defined. Can you please revert or commit a patch that adds support for that to gcc/testsuite/lib/ ? +ERROR: gcc.c-torture/compile/pr35468.c -O0 : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -O0 : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.c-torture/compile/pr35468.c -O1 : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -O1 : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.c-torture/compile/pr35468.c -O2 -flto -flto-partition=none : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -O2 -flto -flto-partition=none : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.c-torture/compile/pr35468.c -O2 -flto : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -O2 -flto : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.c-torture/compile/pr35468.c -O2 : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -O2 : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.c-torture/compile/pr35468.c -O3 -fomit-frame-pointer : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -O3 -fomit-frame-pointer : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.c-torture/compile/pr35468.c -O3 -g : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -O3 -g : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.c-torture/compile/pr35468.c -Os : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -Os : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.dg/pr44194-1.c: syntax error in target selector target { { { { { i?86-*-* x86_64-*-* } x32 } || lp64 } { ! s390*-*-* } } { ! hppa*64*-*-* } } { ! alpha*-*-* }{ ! powerpc*-*-linux* } || powerpc_elfv2! nvptx-*-* for dg-do 1 compile { target { { { { { { { i?86-*-* x86_64-*-* } x32 } || lp64 } { ! s390*-*-* } } { ! hppa*64*-*-* } } { ! alpha*-*-* } } { { ! powerpc*-*-linux* } || powerpc_elfv2 } { ! nvptx-*-* } } } +UNRESOLVED: gcc.dg/pr44194-1.c: syntax error in target selector target { { { { { i?86-*-* x86_64-*-* } x32 } || lp64 } { ! s390*-*-* } } { ! hppa*64*-*-* } } { ! alpha*-*-* }{ ! powerpc*-*-linux* } || powerpc_elfv2! nvptx-*-* for dg-do 1 compile { target { { { { { { { i?86-*-* x86_64-*-* } x32 } || lp64 } { ! s390*-*-* } } { ! hppa*64*-*-* } } { ! alpha*-*-* } } { { ! powerpc*-*-linux* } || powerpc_elfv2 } { ! nvptx-*-* } } } +FAIL: gcc.dg/pr45352-1.c (test for excess errors) Jakub -- H.J.
Re: The nvptx port [10/11+] Target files
On Mon, Nov 10, 2014 at 12:04 PM, Jakub Jelinek ja...@redhat.com wrote: On Mon, Nov 10, 2014 at 05:19:57PM +0100, Bernd Schmidt wrote: commit 659744a99d815b168716b4460e32f6a21593e494 Author: Bernd Schmidt ber...@codesourcery.com Date: Thu Nov 6 19:03:57 2014 +0100 Note, in r217301 you've committed a change to pr35468.c, not mentioned in the ChangeLog, that uses no_const_addr_space effective target that is never defined. Can you please revert or commit a patch that adds support for that to gcc/testsuite/lib/ ? +ERROR: gcc.c-torture/compile/pr35468.c -O0 : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -O0 : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.c-torture/compile/pr35468.c -O1 : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -O1 : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.c-torture/compile/pr35468.c -O2 -flto -flto-partition=none : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -O2 -flto -flto-partition=none : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.c-torture/compile/pr35468.c -O2 -flto : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -O2 -flto : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.c-torture/compile/pr35468.c -O2 : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -O2 : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.c-torture/compile/pr35468.c -O3 -fomit-frame-pointer : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -O3 -fomit-frame-pointer : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.c-torture/compile/pr35468.c -O3 -g : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -O3 -g : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.c-torture/compile/pr35468.c -Os : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +UNRESOLVED: gcc.c-torture/compile/pr35468.c -Os : unknown effective target keyword \`no_const_addr_space' for dg-require-effective-target 2 no_const_addr_space +ERROR: gcc.dg/pr44194-1.c: syntax error in target selector target { { { { { i?86-*-* x86_64-*-* } x32 } || lp64 } { ! s390*-*-* } } { ! hppa*64*-*-* } } { ! alpha*-*-* }{ ! powerpc*-*-linux* } || powerpc_elfv2! nvptx-*-* for dg-do 1 compile { target { { { { { { { i?86-*-* x86_64-*-* } x32 } || lp64 } { ! s390*-*-* } } { ! hppa*64*-*-* } } { ! alpha*-*-* } } { { ! powerpc*-*-linux* } || powerpc_elfv2 } { ! nvptx-*-* } } } +UNRESOLVED: gcc.dg/pr44194-1.c: syntax error in target selector target { { { { { i?86-*-* x86_64-*-* } x32 } || lp64 } { ! s390*-*-* } } { ! hppa*64*-*-* } } { ! alpha*-*-* }{ ! powerpc*-*-linux* } || powerpc_elfv2! nvptx-*-* for dg-do 1 compile { target { { { { { { { i?86-*-* x86_64-*-* } x32 } || lp64 } { ! s390*-*-* } } { ! hppa*64*-*-* } } { ! alpha*-*-* } } { { ! powerpc*-*-linux* } || powerpc_elfv2 } { ! nvptx-*-* } } } +FAIL: gcc.dg/pr45352-1.c (test for excess errors) Jakub I reverted the change in gcc.c-torture/compile/pr35468.c. I also checked in this patch to add missing braces in gcc.dg/pr44194-1.c. -- H.J. - Index: ChangeLog === --- ChangeLog (revision 217315) +++ ChangeLog (working copy) @@ -1,3 +1,7 @@ +2014-11-10 H.J. Lu hongjiu...@intel.com + + * gcc.dg/pr44194-1.c (dg-do): Add missing braces. + 2014-11-10 Roman Gareev gareevro...@gmail.com * gcc.dg/graphite/isl-ast-gen-blocks-1.c: Remove using of Index: gcc.dg/pr44194-1.c === --- gcc.dg/pr44194-1.c (revision 217315) +++ gcc.dg/pr44194-1.c
Re: The nvptx port [10/11+] Target files
On Nov 10, 2014, at 12:37 PM, H.J. Lu hjl.to...@gmail.com wrote: I also checked in this patch to add missing braces in gcc.dg/pr44194-1.c. Thanks.
the nvptx port
Hi Bernd, reading the patches, it seems like there is no mention of sm_35, only sm_30. So, I'm wondering what 'sub'targets will initially be supported, and if/how/when various processors will be selected. Thanks, Joost
Re: The nvptx port [8/11+] Write undefined decls.
On 10/22/2014 08:11 PM, Jeff Law wrote: I'm not going to insist you do this in the same way as the PA. That was a different era -- we had significant motivation to make things work in such a way that everything could be buried in the pa specific files. That sometimes led to less than optimal approaches to fix certain problems. So... is this patch approved? Bernd
Re: The nvptx port [10/11+] Target files
On 11/04/2014 05:51 PM, Bernd Schmidt wrote: On 11/04/2014 05:48 PM, Richard Henderson wrote: On 10/28/2014 03:56 PM, Bernd Schmidt wrote: +nvptx_ptx_type_from_mode (enum machine_mode mode, bool promote) +{ + switch (mode) +{ +case BLKmode: + return .b8; +case BImode: + return .pred; +case QImode: + if (promote) +return .u32; + else +return .u8; +case HImode: + return .u16; Promote here too? Or does this have nothing to do with +static enum machine_mode +arg_promotion (enum machine_mode mode) +{ + if (mode == QImode || mode == HImode) +return SImode; + return mode; +} No, these are different problems - the one in arg promotion is purely about KR C and trying to match untyped function decls with calls, while the type_from_mode bit was about some ptx ideosyncracy. Although I forget what the problem was, that code is more than a year old - I'll see if I can get rid of this. Err, no, it's quite necessary. From the manual The .u8, .s8 and .b8 instruction types are restricted to ld, st and cvt instructions. This means that if the compiler generates reasonable-looking code along the lines of .reg .u8 %r70; mov.u8 %r70,48; you get ptxas 2211-1.o, line 191; error : Arguments mismatch for instruction 'mov' Now, one _could_ write .cvt.u8.u32 for the load immediate, but then one would also have to write .cvt.u8.u8 for register-register moves, and that's starting to look iffy. I don't really want to rely on the ptx assembler to do the right thing for conversions from one type to itself. Bernd
Re: The nvptx port [8/11+] Write undefined decls.
On 11/05/14 05:01, Bernd Schmidt wrote: On 10/22/2014 08:11 PM, Jeff Law wrote: I'm not going to insist you do this in the same way as the PA. That was a different era -- we had significant motivation to make things work in such a way that everything could be buried in the pa specific files. That sometimes led to less than optimal approaches to fix certain problems. So... is this patch approved? Yes, sorry for not being explicit. Jeff
Re: The nvptx port [1/11+] indirect jumps
On 10/20/2014 04:19 PM, Bernd Schmidt wrote: ptx doesn't have indirect jumps, so CODE_FOR_indirect_jump may not be defined. Add a sorry. Looking back through all the mails it turns out this one wasn't approved yet. Ping? Bernd
Re: The nvptx port [1/11+] indirect jumps
On 11/04/2014 04:32 PM, Bernd Schmidt wrote: On 10/20/2014 04:19 PM, Bernd Schmidt wrote: ptx doesn't have indirect jumps, so CODE_FOR_indirect_jump may not be defined. Add a sorry. Looking back through all the mails it turns out this one wasn't approved yet. Ping? Ok. r~
Re: The nvptx port [10/11+] Target files
On 10/28/2014 03:56 PM, Bernd Schmidt wrote: +nvptx_ptx_type_from_mode (enum machine_mode mode, bool promote) +{ + switch (mode) +{ +case BLKmode: + return .b8; +case BImode: + return .pred; +case QImode: + if (promote) + return .u32; + else + return .u8; +case HImode: + return .u16; Promote here too? Or does this have nothing to do with +static enum machine_mode +arg_promotion (enum machine_mode mode) +{ + if (mode == QImode || mode == HImode) +return SImode; + return mode; +} r~
Re: The nvptx port [10/11+] Target files
On 11/04/2014 05:48 PM, Richard Henderson wrote: On 10/28/2014 03:56 PM, Bernd Schmidt wrote: +nvptx_ptx_type_from_mode (enum machine_mode mode, bool promote) +{ + switch (mode) +{ +case BLKmode: + return .b8; +case BImode: + return .pred; +case QImode: + if (promote) + return .u32; + else + return .u8; +case HImode: + return .u16; Promote here too? Or does this have nothing to do with +static enum machine_mode +arg_promotion (enum machine_mode mode) +{ + if (mode == QImode || mode == HImode) +return SImode; + return mode; +} No, these are different problems - the one in arg promotion is purely about KR C and trying to match untyped function decls with calls, while the type_from_mode bit was about some ptx ideosyncracy. Although I forget what the problem was, that code is more than a year old - I'll see if I can get rid of this. Bernd
Re: The nvptx port [11/11] More tools.
On 10/31/14 17:50, Bernd Schmidt wrote: On 10/31/2014 09:56 PM, Jeff Law wrote: Pondering this a bit more, I think this is fine in concept. As you note, removing the GNU extensions or at least making them conditional would be good since these are going to be built with the host tools. I'm not going to dig into the implementations... I'm going to assume the nvptx maintainer (that's highly likely to be you :-) will own their care and feeding. I was beginning to think I'd just make a separate package. That could then also include a nvptx-run which would have to link against CUDA libraries. Your call. jeff
Re: The nvptx port [11/11] More tools.
On 10/20/14 08:48, Bernd Schmidt wrote: This is a bonus optional patch which adds ar, ranlib, as and ld to the ptx port. This is not proper binutils; ar and ranlib are just linked to the host versions, and the other two tools have the following functions: * nvptx-as is required to convert the compiler output to actual valid ptx assembly, primarily by reordering declarations and definitions. Believe me when I say that I've tried to make that work in the compiler itself and it's pretty much impossible without some really invasive changes. * nvptx-ld is just a pseudo linker that works by concatenating ptx input files and separating them with nul characters. Actual linking is something that happens later, when calling CUDA library functions, but existing build system make it useful to have something called ld which is able to bundle everything that's needed into a single file, and this seemed to be the simplest way of achieving this. There's a toplevel configure.ac change necessary to make ar/ranlib useable by the libgcc build. Having some tools built like this has some precedent in t-vmsnative, but as Thomas noted it does make feature tests in gcc's configure somewhat ugly (but everything works well enough to build the compiler). The alternative here is to bundle all these files into a separate nvptx-tools package which users would have to download - something that would be nice to avoid. These tools currently require GNU extensions - something I probably ought to fix if we decide to add them to the gcc build itself. Pondering this a bit more, I think this is fine in concept. As you note, removing the GNU extensions or at least making them conditional would be good since these are going to be built with the host tools. I'm not going to dig into the implementations... I'm going to assume the nvptx maintainer (that's highly likely to be you :-) will own their care and feeding. jeff
Re: The nvptx port [11/11] More tools.
On 10/31/2014 09:56 PM, Jeff Law wrote: Pondering this a bit more, I think this is fine in concept. As you note, removing the GNU extensions or at least making them conditional would be good since these are going to be built with the host tools. I'm not going to dig into the implementations... I'm going to assume the nvptx maintainer (that's highly likely to be you :-) will own their care and feeding. I was beginning to think I'd just make a separate package. That could then also include a nvptx-run which would have to link against CUDA libraries. Bernd
Re: The nvptx port [7/11+] Inform the port about call arguments
On 10/28/14 08:49, Bernd Schmidt wrote: On 10/22/2014 08:12 PM, Jeff Law wrote: Yea, let's keep your approach. Just wanted to explore a bit since the PA seems to have a variety of similar characteristics. Here's an updated version of the patch. I experimented a little with ptx calling conventions and ran into an arg that had to be moved with memcpy, which exposed an ordering problem - all call_args were added to the memcpy call. So the invocation of the hook had to be moved downwards a bit, and the calculation of the return value needs to happen after it (since nvptx_function_value needs to know whether we are actually trying to construct a call at the moment). Bootstrapped and tested on x86_64-linux, ok? OK. Jeff
Re: The nvptx port [10/11+] Target files
On 10/28/14 08:56, Bernd Schmidt wrote: I have patches that expose all the address spaces to the middle-end through a lower-as pass that runs early. The preliminary patches for that ran into some resistance and into general brokenness of our address space support, so I decided to rip all that out for the moment to get the basic port into the next version. This new version also implements a way of providing realloc that was suggested in another thread. Calls to malloc and free are redirected to libgcc variants. I'm not a big fan of wasting extra space on every allocation (which is why I didn't originally consider this approach viable), but it seems we'll have to do it that way. There's a change to the libgcc build system: on ptx we need comments in the assembly to survive, so we can't use -xassembler-with-cpp. I've not found any files named *.asm, so I've changed that suffix to mean plain assembler. Bernd 010-target.diff * configure.ac: Allow configuring lto for nvptx. * configure: Regenerate. gcc/ * config/nvptx/nvptx.c: New file. * config/nvptx/nvptx.h: New file. * config/nvptx/nvptx-protos.h: New file. * config/nvptx/nvptx.md: New file. * config/nvptx/t-nvptx: New file. * config/nvptx/nvptx.opt: New file. * common/config/nvptx/nvptx-common.c: New file. * config.gcc: Handle nvptx-*-*. libgcc/ * config.host: Handle nvptx-*-*. * shared-object.mk (as-flags-$o): Define. ($(base)$(objext), $(base)_s$(objext)): Use it instead of -xassembler-with-cpp. * static-object.mk: Identical changes. * config/nvptx/t-nvptx: New file. * config/nvptx/crt0.s: New file. * config/nvptx/free.asm: New file. * config/nvptx/malloc.asm: New file. * config/nvptx/realloc.c: New file. A nit -- Richard S. recently removed the need to include the enum for enum machine_mode. I believe he had a script to handle the mundane parts of that change. Please make sure to update the nvptx port to conform to that new convention, obviously feel free to use the script if you want. You may need to update with James Greenhalgh's changes to MOVE_BY_PIECES_P and friends. With those two issues addressed as needed, this is OK for the trunk. FWIW, I'm amazed at how many similarities there are between what needs to be done for the PTX tools and what needed to be done to interface with the native HPPA tools way-back-when. Simply amazing. I notice that you've got some OpenMP bits (write_as_kernel). Are y'all doing any testing with OpenMP or is that an artifact of layering OpenACC on top of the OpenMP infrastructure? Also, I've asked the steering committee to appoint you as the maintainer for the nvptx port as well. jeff
Re: The nvptx port [10/11+] Target files
On 10/30/2014 12:35 AM, Jeff Law wrote: A nit -- Richard S. recently removed the need to include the enum for enum machine_mode. I believe he had a script to handle the mundane parts of that change. Please make sure to update the nvptx port to conform to that new convention, obviously feel free to use the script if you want. You may need to update with James Greenhalgh's changes to MOVE_BY_PIECES_P and friends. Ok, I'll look into those. With those two issues addressed as needed, this is OK for the trunk. Thanks! I've pinged some of the preliminary patches that went unapproved up to this point. One leftover issue, discussed in the [0/11] mail - what amount of documentation is appropriate for this, given that we don't want to support using this as anything other than an offload compiler? Should I still add all the standard invoke.texi/gccint.texi pieces? I notice that you've got some OpenMP bits (write_as_kernel). Are y'all doing any testing with OpenMP or is that an artifact of layering OpenACC on top of the OpenMP infrastructure? The distinction between .kernel and .func is really not to do with either - only .kernels are callable from the host, and only .funcs are callable from within ptx code. Bernd
Re: The nvptx port [10/11+] Target files
On 10/29/14 17:55, Bernd Schmidt wrote: Thanks! I've pinged some of the preliminary patches that went unapproved up to this point. Thanks. One leftover issue, discussed in the [0/11] mail - what amount of documentation is appropriate for this, given that we don't want to support using this as anything other than an offload compiler? Should I still add all the standard invoke.texi/gccint.texi pieces? I'm still not sure here. nvptx is quite a bit different than anything we've done in the past and I'm not sure how much of the traditional stuff we want to document vs on the other end how much of the special stuff we want to document. I simply don't know. I notice that you've got some OpenMP bits (write_as_kernel). Are y'all doing any testing with OpenMP or is that an artifact of layering OpenACC on top of the OpenMP infrastructure? The distinction between .kernel and .func is really not to do with either - only .kernels are callable from the host, and only .funcs are callable from within ptx code. Ok. Thanks for clarifying. jeff
Re: The nvptx port [7/11+] Inform the port about call arguments
On 10/22/2014 08:12 PM, Jeff Law wrote: Yea, let's keep your approach. Just wanted to explore a bit since the PA seems to have a variety of similar characteristics. Here's an updated version of the patch. I experimented a little with ptx calling conventions and ran into an arg that had to be moved with memcpy, which exposed an ordering problem - all call_args were added to the memcpy call. So the invocation of the hook had to be moved downwards a bit, and the calculation of the return value needs to happen after it (since nvptx_function_value needs to know whether we are actually trying to construct a call at the moment). Bootstrapped and tested on x86_64-linux, ok? Bernd gcc/ * target.def (call_args, end_call_args): New hooks. * hooks.c (hook_void_rtx_tree): New empty function. * hooks.h (hook_void_rtx_tree): Declare. * doc/tm.texi.in (TARGET_CALL_ARGS, TARGET_END_CALL_ARGS): Add. * doc/tm.texi: Regenerate. * calls.c (expand_call): Slightly rearrange the code. Use the two new hooks. (expand_library_call_value_1): Use the two new hooks. Index: gcc/doc/tm.texi === --- gcc/doc/tm.texi.orig +++ gcc/doc/tm.texi @@ -4960,6 +4960,29 @@ except the last are treated as named. You need not define this hook if it always returns @code{false}. @end deftypefn +@deftypefn {Target Hook} void TARGET_CALL_ARGS (rtx, @var{tree}) +While generating RTL for a function call, this target hook is invoked once +for each argument passed to the function, either a register returned by +@code{TARGET_FUNCTION_ARG} or a memory location. It is called just +before the point where argument registers are stored. The type of the +function to be called is also passed as the second argument; it is +@code{NULL_TREE} for libcalls. The @code{TARGET_END_CALL_ARGS} hook is +invoked just after the code to copy the return reg has been emitted. +This functionality can be used to perform special setup of call argument +registers if a target needs it. +For functions without arguments, the hook is called once with @code{pc_rtx} +passed instead of an argument register. +Most ports do not need to implement anything for this hook. +@end deftypefn + +@deftypefn {Target Hook} void TARGET_END_CALL_ARGS (void) +This target hook is invoked while generating RTL for a function call, +just after the point where the return reg is copied into a pseudo. It +signals that all the call argument and return registers for the just +emitted call are now no longer in use. +Most ports do not need to implement anything for this hook. +@end deftypefn + @deftypefn {Target Hook} bool TARGET_PRETEND_OUTGOING_VARARGS_NAMED (cumulative_args_t @var{ca}) If you need to conditionally change ABIs so that one works with @code{TARGET_SETUP_INCOMING_VARARGS}, but the other works like neither Index: gcc/doc/tm.texi.in === --- gcc/doc/tm.texi.in.orig +++ gcc/doc/tm.texi.in @@ -3856,6 +3856,10 @@ These machine description macros help im @hook TARGET_STRICT_ARGUMENT_NAMING +@hook TARGET_CALL_ARGS + +@hook TARGET_END_CALL_ARGS + @hook TARGET_PRETEND_OUTGOING_VARARGS_NAMED @node Trampolines Index: gcc/hooks.c === --- gcc/hooks.c.orig +++ gcc/hooks.c @@ -245,6 +245,11 @@ hook_void_tree (tree a ATTRIBUTE_UNUSED) } void +hook_void_rtx_tree (rtx, tree) +{ +} + +void hook_void_constcharptr (const char *a ATTRIBUTE_UNUSED) { } Index: gcc/hooks.h === --- gcc/hooks.h.orig +++ gcc/hooks.h @@ -71,6 +71,7 @@ extern void hook_void_constcharptr (cons extern void hook_void_rtx_insn_int (rtx_insn *, int); extern void hook_void_FILEptr_constcharptr (FILE *, const char *); extern bool hook_bool_FILEptr_rtx_false (FILE *, rtx); +extern void hook_void_rtx_tree (rtx, tree); extern void hook_void_tree (tree); extern void hook_void_tree_treeptr (tree, tree *); extern void hook_void_int_int (int, int); Index: gcc/target.def === --- gcc/target.def.orig +++ gcc/target.def @@ -3816,6 +3816,33 @@ not generate any instructions in this ca default_setup_incoming_varargs) DEFHOOK +(call_args, + While generating RTL for a function call, this target hook is invoked once\n\ +for each argument passed to the function, either a register returned by\n\ +@code{TARGET_FUNCTION_ARG} or a memory location. It is called just\n\ +before the point where argument registers are stored. The type of the\n\ +function to be called is also passed as the second argument; it is\n\ +@code{NULL_TREE} for libcalls. The @code{TARGET_END_CALL_ARGS} hook is\n\ +invoked just after the code to copy the return reg has been emitted.\n\ +This functionality can be used to perform special setup of call
Re: The nvptx port [10/11+] Target files
On 10/22/2014 08:01 PM, Jeff Law wrote: Please make sure all the functions in nvptx.c have function comments. Done, and replaced regno 4 with NVPTX_RETURN_REGNUM. +const char * +nvptx_output_call_insn (rtx insn, rtx result, rtx callee) If possible, promote first argument to rtx_insn *. Also done. +/* Clean up subreg operands. */ Which means what? A little more descriptive here would be helpful. Expanded. I'm surprised there's not more hair around the address space issues. I expected more problems there. I have patches that expose all the address spaces to the middle-end through a lower-as pass that runs early. The preliminary patches for that ran into some resistance and into general brokenness of our address space support, so I decided to rip all that out for the moment to get the basic port into the next version. This new version also implements a way of providing realloc that was suggested in another thread. Calls to malloc and free are redirected to libgcc variants. I'm not a big fan of wasting extra space on every allocation (which is why I didn't originally consider this approach viable), but it seems we'll have to do it that way. There's a change to the libgcc build system: on ptx we need comments in the assembly to survive, so we can't use -xassembler-with-cpp. I've not found any files named *.asm, so I've changed that suffix to mean plain assembler. Bernd * configure.ac: Allow configuring lto for nvptx. * configure: Regenerate. gcc/ * config/nvptx/nvptx.c: New file. * config/nvptx/nvptx.h: New file. * config/nvptx/nvptx-protos.h: New file. * config/nvptx/nvptx.md: New file. * config/nvptx/t-nvptx: New file. * config/nvptx/nvptx.opt: New file. * common/config/nvptx/nvptx-common.c: New file. * config.gcc: Handle nvptx-*-*. libgcc/ * config.host: Handle nvptx-*-*. * shared-object.mk (as-flags-$o): Define. ($(base)$(objext), $(base)_s$(objext)): Use it instead of -xassembler-with-cpp. * static-object.mk: Identical changes. * config/nvptx/t-nvptx: New file. * config/nvptx/crt0.s: New file. * config/nvptx/free.asm: New file. * config/nvptx/malloc.asm: New file. * config/nvptx/realloc.c: New file. Index: gcc/common/config/nvptx/nvptx-common.c === --- /dev/null +++ gcc/common/config/nvptx/nvptx-common.c @@ -0,0 +1,38 @@ +/* NVPTX common hooks. + Copyright (C) 2014 Free Software Foundation, Inc. + Contributed by Bernd Schmidt ber...@codesourcery.com + +This file is part of GCC. + +GCC is free software; you can redistribute it and/or modify +it under the terms of the GNU General Public License as published by +the Free Software Foundation; either version 3, or (at your option) +any later version. + +GCC is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + +You should have received a copy of the GNU General Public License +along with GCC; see the file COPYING3. If not see +http://www.gnu.org/licenses/. */ + +#include config.h +#include system.h +#include coretypes.h +#include diagnostic-core.h +#include tm.h +#include tm_p.h +#include common/common-target.h +#include common/common-target-def.h +#include opts.h +#include flags.h + +#undef TARGET_HAVE_NAMED_SECTIONS +#define TARGET_HAVE_NAMED_SECTIONS false + +#undef TARGET_DEFAULT_TARGET_FLAGS +#define TARGET_DEFAULT_TARGET_FLAGS MASK_ABI64 + +struct gcc_targetm_common targetm_common = TARGETM_COMMON_INITIALIZER; Index: gcc/config.gcc === --- gcc/config.gcc.orig +++ gcc/config.gcc @@ -420,6 +420,9 @@ nios2-*-*) cpu_type=nios2 extra_options=${extra_options} g.opt ;; +nvptx-*-*) + cpu_type=nvptx + ;; powerpc*-*-*) cpu_type=rs6000 extra_headers=ppc-asm.h altivec.h spe.h ppu_intrinsics.h paired.h spu2vmx.h vec_types.h si2vmx.h htmintrin.h htmxlintrin.h @@ -2148,6 +2151,10 @@ nios2-*-*) ;; esac ;; +nvptx-*) + tm_file=${tm_file} newlib-stdint.h + tmake_file=nvptx/t-nvptx + ;; pdp11-*-*) tm_file=${tm_file} newlib-stdint.h use_gcc_stdint=wrap Index: gcc/config/nvptx/nvptx.c === --- /dev/null +++ gcc/config/nvptx/nvptx.c @@ -0,0 +1,2118 @@ +/* Target code for NVPTX. + Copyright (C) 2014 Free Software Foundation, Inc. + Contributed by Bernd Schmidt ber...@codesourcery.com + + This file is part of GCC. + + GCC is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published + by the Free Software Foundation; either version 3, or (at your + option) any later version. + + GCC is distributed in the hope that it will be useful, but WITHOUT + ANY WARRANTY; without even
Re: The nvptx port [11/11] More tools.
On 10/22/14 15:11, Bernd Schmidt wrote: On 10/22/2014 10:31 PM, Jeff Law wrote: These tools currently require GNU extensions - something I probably ought to fix if we decide to add them to the gcc build itself. Would these be more appropriate in binutils? I don't think so, given that we don't need any piece of regular binutils. There's no meaningful way to build libbfd. It would be strange to build binutils and have everything that's normally part of it disabled at configure time. Fair enough, but I'm having trouble seeing these in GCC. Makes me wonder if they ought to be a package unto themselves, nvptxtools or somesuch. Note that as a separate package, you don't have to remove the GNU extensions :-) jeff
Re: The nvptx port [1/11+] indirect jumps
On Tue, Oct 21, 2014 at 11:32 PM, Bernd Schmidt ber...@codesourcery.com wrote: On 10/21/2014 11:30 PM, Jakub Jelinek wrote: At least for OpenMP, the best would be if the #pragma omp target regions and/or #pragma omp declare target functions contain anything a particular offloading accelerator can't handle, instead of failing the whole compilation perhaps just emit some at least by default non-fatal warning and not emit anything for the particular offloading target, which would mean either host fallback, or, if some other offloading target succeeded, just that target. I guess a test could be added to mkoffload if gcc were to return a different value for a sorry vs. any other compilation failure. The tool could then choose not to produce offloading support for that target. But that would be for the whole file instead of for the specific region? So maybe we should produce one LTO offload object for each offload function and make the symbols they are supposed to provide weak so a fail doesn't end up failing to link the main program? Looks like this gets somewhat awkward with the LTO setup. Richard. Bernd
Re: The nvptx port [1/11+] indirect jumps
On Wed, Oct 22, 2014 at 10:18:49AM +0200, Richard Biener wrote: On Tue, Oct 21, 2014 at 11:32 PM, Bernd Schmidt ber...@codesourcery.com wrote: On 10/21/2014 11:30 PM, Jakub Jelinek wrote: At least for OpenMP, the best would be if the #pragma omp target regions and/or #pragma omp declare target functions contain anything a particular offloading accelerator can't handle, instead of failing the whole compilation perhaps just emit some at least by default non-fatal warning and not emit anything for the particular offloading target, which would mean either host fallback, or, if some other offloading target succeeded, just that target. I guess a test could be added to mkoffload if gcc were to return a different value for a sorry vs. any other compilation failure. The tool could then choose not to produce offloading support for that target. But that would be for the whole file instead of for the specific region? So maybe we should produce one LTO offload object for each offload function and make the symbols they are supposed to provide weak so a fail doesn't end up failing to link the main program? Looks like this gets somewhat awkward with the LTO setup. I don't think we want to do a fine-grained granularity here, it will only lead to significant nightmares. E.g. a target region can call other target functions, if a target function it calls (perhaps directly through a series of other target functions, perhaps indirectly through function pointers etc.) can't be supported by the host, you'd need to give up on offloading all target regions that do or could invoke that. That can be in another TU within the same shared library etc. And, if some regions are emitted and others are not, #pragma omp target data will behave less predictably and more confusingly, right now it can test, does this library have usable offloading for everything it provides (i.e. libgomp would ask the plugin to initialize offloading from the current shared library if not already done, and if successful, say it supports offloading for the particular device and map variables to that device as requested, otherwise it would just assume only host fallback is possible and not really map anything). When a target region is hit, from either within the target data region or elsewhere, it is already figured out if it has to fallback to host or not. Now, if you have fine-grained offloading, 33.2% of target regions being offloadable, the rest not, what would you actually do in target data region? It doesn't generically know what target regions will be encountered. So act as if offloading perhaps was possible? But then at each target region find out if it is really possible? IMHO people that care about performance will use target regions with care, with the offloading targets that they care about in mind, for those that don't care about that, either they will be lucky and things will work out all, or they will just end up with host fallback. Jakub
Re: The nvptx port [1/11+] indirect jumps
Hi! On Wed, 22 Oct 2014 10:18:49 +0200, Richard Biener richard.guent...@gmail.com wrote: On Tue, Oct 21, 2014 at 11:32 PM, Bernd Schmidt ber...@codesourcery.com wrote: On 10/21/2014 11:30 PM, Jakub Jelinek wrote: At least for OpenMP, the best would be if the #pragma omp target regions and/or #pragma omp declare target functions contain anything a particular offloading accelerator can't handle, instead of failing the whole compilation perhaps just emit some at least by default non-fatal warning and not emit anything for the particular offloading target, which would mean either host fallback, or, if some other offloading target succeeded, just that target. I guess a test could be added to mkoffload if gcc were to return a different value for a sorry vs. any other compilation failure. The tool could then choose not to produce offloading support for that target. But that would be for the whole file instead of for the specific region? I'm not sure that's what you're suggesting, but at least on non-shared memory offloading devices, you can't switch arbitrarily between offloading device(s) and host-fallback, for you have to do data management between the non-shared memories. So maybe we should produce one LTO offload object for each offload function and make the symbols they are supposed to provide weak so a fail doesn't end up failing to link the main program? Looks like this gets somewhat awkward with the LTO setup. Grüße, Thomas pgp6yaImJYJpu.pgp Description: PGP signature
Re: The nvptx port [1/11+] indirect jumps
On Wed, Oct 22, 2014 at 10:34 AM, Thomas Schwinge tho...@codesourcery.com wrote: Hi! On Wed, 22 Oct 2014 10:18:49 +0200, Richard Biener richard.guent...@gmail.com wrote: On Tue, Oct 21, 2014 at 11:32 PM, Bernd Schmidt ber...@codesourcery.com wrote: On 10/21/2014 11:30 PM, Jakub Jelinek wrote: At least for OpenMP, the best would be if the #pragma omp target regions and/or #pragma omp declare target functions contain anything a particular offloading accelerator can't handle, instead of failing the whole compilation perhaps just emit some at least by default non-fatal warning and not emit anything for the particular offloading target, which would mean either host fallback, or, if some other offloading target succeeded, just that target. I guess a test could be added to mkoffload if gcc were to return a different value for a sorry vs. any other compilation failure. The tool could then choose not to produce offloading support for that target. But that would be for the whole file instead of for the specific region? I'm not sure that's what you're suggesting, but at least on non-shared memory offloading devices, you can't switch arbitrarily between offloading device(s) and host-fallback, for you have to do data management between the non-shared memories. Oh, I see. For HSA we simply don't emit an offload variant for code we cannot handle. But only for those parts. So it's only offload or fallback for other devices? Thus also never share work between both for example (run N threads on the CPU and M threads on the offload target)? Richard. So maybe we should produce one LTO offload object for each offload function and make the symbols they are supposed to provide weak so a fail doesn't end up failing to link the main program? Looks like this gets somewhat awkward with the LTO setup. Grüße, Thomas
Re: The nvptx port [1/11+] indirect jumps
On Wed, Oct 22, 2014 at 12:02:16PM +0200, Richard Biener wrote: I'm not sure that's what you're suggesting, but at least on non-shared memory offloading devices, you can't switch arbitrarily between offloading device(s) and host-fallback, for you have to do data management between the non-shared memories. Oh, I see. For HSA we simply don't emit an offload variant for code we cannot handle. But only for those parts. So it's only offload or fallback for other devices? Thus also never Yeah. share work between both for example (run N threads on the CPU and M threads on the offload target)? I believe at least for the non-shared memory the OpenMP model wouldn't allow that. Of course, user can do the sharing explicitly (though OpenMP 4.0 doesn't have asynchronous target regions): one could e.g. run a couple of host tasks on the offloading region with if (0) - forced host fallback, ensure e.g. one team and one parallel thread in that case, and then in one host task with if (1) and use as many teams and parallel threads as available on the offloading device. Jakub
Re: The nvptx port [10/11+] Target files
On 10/20/14 08:33, Bernd Schmidt wrote: These are the main target files for the ptx port. t-nvptx is empty for now but will grow some content with follow up patches. Bernd 010-target.diff * configure.ac: Allow configuring lto for nvptx. * configure: Regenerate. gcc/ * config/nvptx/nvptx.c: New file. * config/nvptx/nvptx.h: New file. * config/nvptx/nvptx-protos.h: New file. * config/nvptx/nvptx.md: New file. * config/nvptx/t-nvptx: New file. * config/nvptx/nvptx.opt: New file. * common/config/nvptx/nvptx-common.c: New file. * config.gcc: Handle nvptx-*-*. libgcc/ * config.host: Handle nvptx-*-*. * config/nvptx/t-nvptx: New file. * config/nvptx/crt0.s: New file. Please make sure all the functions in nvptx.c have function comments. nvptx_split_reg_p, write_as_kernel, nvptx_write_function_decl, write_function_decl_only, nvptx_function_incoming_arg, nvptx_promote_function_mode, nvptx_maybe_convert_symbolic_operand, etc. There are many others.. A scan over that entire file would be appreciated. + +/* TARGET_FUNCTION_VALUE implementation. Returns an RTX representing the place + where function FUNC returns or receives a value of data type TYPE. */ + +static rtx +nvptx_function_value (const_tree type, const_tree func ATTRIBUTE_UNUSED, + bool outgoing) +{ + int unsignedp = TYPE_UNSIGNED (type); + enum machine_mode orig_mode = TYPE_MODE (type); + enum machine_mode mode = promote_function_mode (type, orig_mode, + unsignedp, NULL_TREE, 1); + if (outgoing) +return gen_rtx_REG (mode, 4); + if (cfun-machine-start_call == NULL_RTX) +/* Pretend to return in a hard reg for early uses before pseudos can be + generated. */ +return gen_rtx_REG (mode, 4); + return gen_reg_rtx (mode); Rather than magic register numbers, can you use something symbolic? +} + +/* Implement TARGET_LIBCALL_VALUE. */ + +static rtx +nvptx_libcall_value (enum machine_mode mode, const_rtx) +{ + if (cfun-machine-start_call == NULL_RTX) +/* Pretend to return in a hard reg for early uses before pseudos can be + generated. */ +return gen_rtx_REG (mode, 4); + return gen_reg_rtx (mode); +} Similarly. + +/* Implement TARGET_FUNCTION_VALUE_REGNO_P. */ + +static bool +nvptx_function_value_regno_p (const unsigned int regno) +{ + return regno == 4; +} Here too. + +bool +nvptx_hard_regno_mode_ok (int regno, enum machine_mode mode) +{ + if (regno != 4 || cfun == NULL || cfun-machine-ret_reg_mode == VOIDmode) +return true; + return mode == cfun-machine-ret_reg_mode; +} Function comment. Magic register #. + +const char * +nvptx_output_call_insn (rtx insn, rtx result, rtx callee) If possible, promote first argument to rtx_insn *. + +/* Clean up subreg operands. */ Which means what? A little more descriptive here would be helpful. I have a guess what you need to do here, but more commentary would be helpful for someone that hasn't read through the virtual PTX ISA. The machine description is about what I would expect, in fact, it shows how nice a virtual ISA can be. Overall it seems pretty reasonable. Most of the difficulty appears to be interfacing with the 3rd party tools, but that's largely expected. I'm surprised there's not more hair around the address space issues. I expected more problems there. I'm going to trust that all the ABI related stuff is correct. I'm not going to second guess any of that stuff. I think we've got a couple things to iterate on from yesterday and you've got some minor stuff to address as noted above, but this looks pretty close to being ready. jeff
Re: The nvptx port [7/11+] Inform the port about call arguments
On 10/21/14 16:06, Bernd Schmidt wrote: On 10/21/2014 11:53 PM, Jeff Law wrote: So, in the end I'm torn. I don't like adding new hooks when they're not needed, but I have some reservations about relying on the order of stuff in CALL_INSN_FUNCTION_USAGE and I worry a bit that you might end up with stuff other than arguments on that list -- the PA port could filter on the hard registers used for passing arguments, so other stuff appearing isn't a big deal. This is another worry. Also, at the moment we don't actually add the pseudos to CALL_INSN_FUNCTION_USAGE (that's patch 6/11), we use the regs saved by the call_args hook to make proper USEs in a PARALLEL. I'm not convinced the rest of the compiler would be too happy to see pseudos there. So, in all I'd say it's probably possible to do it that way, but it feels a lot iffier than I'd be happy with. I for one didn't know about the PA requirement, so I could easily have broken it unknowingly if I'd made some random change modifying call expansion. Yea, let's keep your approach. Just wanted to explore a bit since the PA seems to have a variety of similar characteristics. jeff
Re: The nvptx port [8/11+] Write undefined decls.
On 10/21/14 16:15, Bernd Schmidt wrote: On 10/22/2014 12:05 AM, Jeff Law wrote: On 10/20/14 14:30, Bernd Schmidt wrote: ptx assembly requires that declarations are written for undefined variables. This adds that functionality. Does this need to happen at the use site, or can it be deferred? This is independent of use sites. The patch just adds another walk over the varpool to emit not just the defined vars. Ideally we'd maintain an order that declares or defines every variable before it is referenced by an initializer, but the attempt to do that in the compiler totally failed due to references between constant pools and regular variables. The nvptx-as tool we have fixes up the order of declarations after the first compilation stage. THe PA had to do something similar. We built up a vector of every external object in ASM_OUTPUT_EXTERNAL, but did not emit anything. Then in ASM_FILE_END, we walked that vector and anything that was actually referenced (as opposed to just just declared) we would emit the magic .IMPORT lines. Sounds like the PA could use this hook to simplify its code quite a bit. The PA stuff is a trivial amount of code :-) But it is a bit awkward in that we're using a per-variable hook to stash, then the end-file hook to walk the stashed stuff. IIRC, the problem is tentative definitions. Otherwise we'd just emit the .import statements as we saw the declarations. I believe that was to properly interface with the HP assembler/linker. We also have to defer emitting plabels, but I can't recall the braindamage behind that. I'm not going to insist you do this in the same way as the PA. That was a different era -- we had significant motivation to make things work in such a way that everything could be buried in the pa specific files. That sometimes led to less than optimal approaches to fix certain problems. Jeff
Re: The nvptx port [11/11] More tools.
On 10/20/14 08:48, Bernd Schmidt wrote: This is a bonus optional patch which adds ar, ranlib, as and ld to the ptx port. This is not proper binutils; ar and ranlib are just linked to the host versions, and the other two tools have the following functions: * nvptx-as is required to convert the compiler output to actual valid ptx assembly, primarily by reordering declarations and definitions. Believe me when I say that I've tried to make that work in the compiler itself and it's pretty much impossible without some really invasive changes. * nvptx-ld is just a pseudo linker that works by concatenating ptx input files and separating them with nul characters. Actual linking is something that happens later, when calling CUDA library functions, but existing build system make it useful to have something called ld which is able to bundle everything that's needed into a single file, and this seemed to be the simplest way of achieving this. There's a toplevel configure.ac change necessary to make ar/ranlib useable by the libgcc build. Having some tools built like this has some precedent in t-vmsnative, but as Thomas noted it does make feature tests in gcc's configure somewhat ugly (but everything works well enough to build the compiler). The alternative here is to bundle all these files into a separate nvptx-tools package which users would have to download - something that would be nice to avoid. These tools currently require GNU extensions - something I probably ought to fix if we decide to add them to the gcc build itself. Would these be more appropriate in binutils? Jeff
Re: The nvptx port [11/11] More tools.
On 10/22/2014 10:31 PM, Jeff Law wrote: These tools currently require GNU extensions - something I probably ought to fix if we decide to add them to the gcc build itself. Would these be more appropriate in binutils? I don't think so, given that we don't need any piece of regular binutils. There's no meaningful way to build libbfd. It would be strange to build binutils and have everything that's normally part of it disabled at configure time. Bernd
Re: The nvptx port [0/11+]
On Mon, Oct 20, 2014 at 4:17 PM, Bernd Schmidt ber...@codesourcery.com wrote: This is a patch kit that adds the nvptx port to gcc. It contains preliminary patches to add needed functionality, the target files, and one somewhat optional patch with additional target tools. There'll be more patch series, one for the testsuite, and one to make the offload functionality work with this port. Also required are the previous four rtl patches, two of which weren't entirely approved yet. For the moment, I've stripped out all the address space support that got bogged down in review by brokenness in our representation of address spaces. The ptx address spaces are of course still defined and used inside the backend. Ptx really isn't a usual target - it is a virtual target which is then translated by another compiler (ptxas) to the final code that runs on the GPU. There are many restrictions, some imposed by the GPU hardware, and some by the fact that not everything you'd want can be represented in ptx. Here are some of the highlights: * Everything is typed - variables, functions, registers. This can cause problems with KR style C or anything else that doesn't have a proper type internally. * Declarations are needed, even for undefined variables. * Can't emit initializers referring to their variable's address since you can't write forward declarations for variables. * Variables can be declared only as scalars or arrays, not structures. Initializers must be in the variable's declared type, which requires some code in the backend, and it means that packed pointer values are not representable. * Since it's a virtual target, we skip register allocation - no good can probably come from doing that twice. This means asm statements aren't fixed up and will fail if they use matching constraints. So with this restriction I wonder why it didn't make sense to go the HSA backend route emitting PTX from a GIMPLE SSA pass. This would have avoided the LTO dance as well ... That is, what is the advantage of expanding to RTL here - what main benefits do you get from that which you thought would be different to handle if doing code generation from GIMPLE SSA? For HSA we even do register allocation (to a fixed virtual register set), sth simple enough on SSA. We of course also have to do instruction selection but luckily virtual ISAs are easy to target. So were you worried about duplicating instruction selection and or doing it manually instead of with well-known machine descriptions? I'm just curious - I am not asking you to rewrite the beast ;) Thanks, Richard. * No support for indirect jumps, label values, nonlocal gotos. * No alloca - ptx defines it, but it's not implemented. * No trampolines. * No debugging (at all, for now - we may add line number directives). * Limited C library support - I have a hacked up copy of newlib that provides a reasonable subset. * malloc and free are defined by ptx (these appear to be undocumented), but there isn't a realloc. I have one patch for Fortran to use a malloc/memcpy helper function in cases where we know the old size. All in all, this is not intended to be used as a C (or any other source language) compiler. I've gone through a lot of effort to make it work reasonably well, but only in order to get sufficient test coverage from the testsuites. The intended use for this is only to build it as an offload compiler, and use it through OpenACC by way of lto1. That leaves the question of how we should document it - does it need the usual constraint and option documentation, given that user's aren't expected to use any of it? A slightly earlier version of the entire patch kit was bootstrapped and tested on x86_64-linux. Ok for trunk? Bernd
Re: The nvptx port [0/11+]
On Mon, Oct 20, 2014 at 04:17:56PM +0200, Bernd Schmidt wrote: * Can't emit initializers referring to their variable's address since you can't write forward declarations for variables. Can't that be handled by emitting the initializer without the address and some constructor that fixes up the initializer at runtime? * Variables can be declared only as scalars or arrays, not structures. Initializers must be in the variable's declared type, which requires some code in the backend, and it means that packed pointer values are not representable. Can't you represent structures and unions as arrays of chars? For constant initializers that don't need relocations the compiler can surely turn them into arrays of char initializers (e.g. fold-const.c native_encode_expr/native_interpret_expr could be used for that). Supposedly it would mean slower than perhaps necessary loads/stores of aligned larger fields from the structure, but if it is an alternative to not supporting structures/unions at all, that sounds like so severe limitation that it can be pretty fatal for the target. * No support for indirect jumps, label values, nonlocal gotos. Not even indirect calls? How do you implement C++ or Fortran vtables? Jakub
Re: The nvptx port [0/11+]
On 10/21/2014 10:18 AM, Richard Biener wrote: So with this restriction I wonder why it didn't make sense to go the HSA backend route emitting PTX from a GIMPLE SSA pass. This would have avoided the LTO dance as well ... Quite simple - there isn't an established way to do this. If I'd known you were doing something like this when I started the work I might have looked into that approach. Bernd
Re: The nvptx port [0/11+]
On 10/21/2014 10:42 AM, Jakub Jelinek wrote: On Mon, Oct 20, 2014 at 04:17:56PM +0200, Bernd Schmidt wrote: * Can't emit initializers referring to their variable's address since you can't write forward declarations for variables. Can't that be handled by emitting the initializer without the address and some constructor that fixes up the initializer at runtime? That reminds me that constructors are something I forgot to add to the list. I'm thinking about making these work with some trickery in the linker, but at the moment they are unsupported. Can't you represent structures and unions as arrays of chars? For constant initializers that don't need relocations the compiler can surely turn them into arrays of char initializers (e.g. fold-const.c native_encode_expr/native_interpret_expr could be used for that). Supposedly it would mean slower than perhaps necessary loads/stores of aligned larger fields from the structure, but if it is an alternative to not supporting structures/unions at all, that sounds like so severe limitation that it can be pretty fatal for the target. Oh, structs and unions are supported, and essentially that's what I'm doing - I choose a base integer type to represent them. That happens to be the size of a pointer, so properly aligned symbol refs can be emitted. It's just the packed ones that can't be done. * No support for indirect jumps, label values, nonlocal gotos. Not even indirect calls? How do you implement C++ or Fortran vtables? Indirect calls do exist. Bernd
Re: The nvptx port [0/11+]
On Tue, Oct 21, 2014 at 12:53 PM, Bernd Schmidt ber...@codesourcery.com wrote: On 10/21/2014 10:18 AM, Richard Biener wrote: So with this restriction I wonder why it didn't make sense to go the HSA backend route emitting PTX from a GIMPLE SSA pass. This would have avoided the LTO dance as well ... Quite simple - there isn't an established way to do this. If I'd known you were doing something like this when I started the work I might have looked into that approach. Ah, I see. I think having both ways now is good so we can compare pros and cons in practice (and make further targets follow the better approach if there is one). Richard. Bernd
Re: The nvptx port [1/11+] indirect jumps
On 10/20/14 14:19, Bernd Schmidt wrote: ptx doesn't have indirect jumps, so CODE_FOR_indirect_jump may not be defined. Add a sorry. Bernd 001-indjumps.diff gcc/ * optabs.c (emit_indirect_jump): Test HAVE_indirect_jump and emit a sorry if necessary. So doesn't this imply no hot-cold partitioning since we use indirect jumps to get across the partition? Similarly doesn't this imply other missing features (setjmp/longjmp, nonlocal gotos, computed jumps? Do you need some mechanism to ensure that hot/cold partitioning isn't enabled? Do you need some kind of message specific to the other features, or are we going to assume that the user will map from the indirect jump message back to the use of setjmp/longjmp or something similar? How are switches implemented (if at all)? Jeff
Re: The nvptx port [2/11+] No register allocation
On 10/20/14 14:20, Bernd Schmidt wrote: Since it's a virtual target, I've chosen not to run register allocation. This is one of the patches necessary to make that work, it primarily adds a target hook to disable it and fixes some of the fallout. Bernd 002-noregalloc.diff gcc/ * target.def (no_register_allocation): New data hook. * doc/tm.texi.in: Add @hook TARGET_NO_REGISTER_ALLOCATION. * doc/tm.texi: Regenerate. * ira.c (gate_ira): New function. (pass_data_ira): Set has_gate. (pass_ira): Add a gate function. (pass_data_reload): Likewise. (pass_reload): Add a gate function. (pass_ira): Use it. * reload1.c (eliminate_regs): If reg_eliminte_is NULL, assert that no register allocation happens on the target and return. * final.c (alter_subreg): Ensure register is not a pseudo before calling simplify_subreg. (output_operand): Assert that x isn't a pseudo only if doing register allocation.\ s/reg_eliminte/reg_eliminate/ Otherwise this looks fine. Note potential for rethinking this change at some point in the future as we get more experience with these kinds of targets. Jeff
Re: The nvptx port [3/11+] Struct returns
On 10/20/14 14:22, Bernd Schmidt wrote: Even when returning a structure by passing an invisible reference, gcc still likes to set the return register to the address of the struct. This is undesirable on ptx where things like the return register have to be declared, and the function really returns void at ptx level. I've added a target hook to avoid this. I figure other targets might find it beneficial to omit this unnecessary set as well. Bernd 003-sretreg.diff gcc/ * target.def (omit_struct_return_reg): New data hook. * doc/tm.texi.in: Add @hook TARGET_OMIT_STRUCT_RETURN_REG. * doc/tm.texi: Regenerate. * function.c (expand_function_end): Use it. My first thought when reading this surprise that we actually return a value here and a desire to just zap that code completely since there's virtually no chance the optimizer will be able to delete it. But then I remembered how much I hate dealing with this kind of ABI issue. I suspect nobody actually specifies behavior here other than to indicate when pass by invisible reference is used and what register holds that incoming value. So, OK for the trunk. jeff
Re: The nvptx port [4/11+] Post-RA pipeline
On 10/20/14 14:24, Bernd Schmidt wrote: This stops most of the post-regalloc passes to be run if the target doesn't want register allocation. I'd previously moved them all out of postreload to the toplevel, but Jakub (I think) pointed out that the idea is not to run them to avoid crashes if reload fails e.g. for an invalid asm. So I've made a new container pass. A later patch will make thread_prologue_and_epilogue_insns callable from the backend. Bernd 004-postra.diff gcc/ * passes.def (pass_compute_alignments, pass_duplicate_computed_gotos, pass_variable_tracking, pass_free_cfg, pass_machine_reorg, pass_cleanup_barriers, pass_delay_slots, pass_split_for_shorten_branches, pass_convert_to_eh_region_ranges, pass_shorten_branches, pass_est_nothrow_function_flags, pass_dwarf2_frame, pass_final): Move outside of pass_postreload and into pass_late_compilation. (pass_late_compilation): Add. * passes.c (pass_data_late_compilation, pass_late_compilation, make_pass_late_compilation): New. * timevar.def (TV_LATE_COMPILATION): New. OK. jeff
Re: The nvptx port [5/11+] Variable declarations
On 10/20/14 14:25, Bernd Schmidt wrote: ptx assembly follows rather different rules than what's typical elsewhere. We need a new hook to add a }; string when we are finished outputting a variable with an initializer. Bernd 005-declend.diff gcc/ * target.def (decl_end): New hook. * varasm.c (assemble_variable_contents, assemble_constant_contents): Use it. * doc/tm.texi.in (TARGET_ASM_DECL_END): Add. * doc/tm.texi: Regenerate. Ok. jeff
Re: The nvptx port [6/11+] Pseudo call args
On 10/20/14 14:26, Bernd Schmidt wrote: On ptx, we'll be using pseudos to pass function args as well, and there's one assert that needs to be toned town to make that work. Bernd 006-usereg.diff gcc/ * expr.c (use_reg_mode): Just return for pseudo registers. OK. I pondered asking for this to be conditional on the no-register-allocation conditional, but then decided it wasn't worth the effort. jeff
Re: The nvptx port [1/11+] indirect jumps
On 10/21/2014 08:26 PM, Jeff Law wrote: * optabs.c (emit_indirect_jump): Test HAVE_indirect_jump and emit a sorry if necessary. So doesn't this imply no hot-cold partitioning since we use indirect jumps to get across the partition? Similarly doesn't this imply other missing features (setjmp/longjmp, nonlocal gotos, computed jumps? Pretty much yes to all. Do you need some mechanism to ensure that hot/cold partitioning isn't enabled? I guess I could clear flag_reorder_blocks_and_partition in nvptx_option_override. The problem hasn't come up so far. Do you need some kind of message specific to the other features, or are we going to assume that the user will map from the indirect jump message back to the use of setjmp/longjmp or something similar? I have some sorry calls in things like a dummy nonlocal_goto pattern. It doesn't quite manage to catch everything without an ICE yet though. How are switches implemented (if at all)? Comparison tree as you'd generate for small switches on all other targets. Bernd
Re: The nvptx port [7/11+] Inform the port about call arguments
On 10/20/14 14:29, Bernd Schmidt wrote: In ptx assembly we need to decorate call insns with the arguments that are being passed. We also need to know the exact function type. This is kind of hard to do with the existing infrastructure since things like function_arg are called at other times rather than just when emitting a call, so this patch adds two more hooks, one called just before argument registers are loaded (once for each arg), and the other just after the call is complete. Bernd 007-callargs.diff gcc/ * target.def (call_args, end_call_args): New hooks. * hooks.c (hook_void_rtx_tree): New empty function. * hooks.h (hook_void_rtx_tree): Declare. * doc/tm.texi.in (TARGET_CALL_ARGS, TARGET_END_CALL_ARGS): Add. * doc/tm.texi: Regenerate. * calls.c (expand_call): Slightly rearrange the code. Use the two new hooks. (expand_library_call_value_1): Use the two new hooks. How exactly do you need to decorate? Just mention the register, size information or do you need full type information? We've had targets where we had to indicate register banks for each argument. Those would walk CALL_INSN_FUNCTION_USAGE to find the argument registers, then from the register # we would know which register bank to use. Would that work for you? Jeff
Re: The nvptx port [1/11+] indirect jumps
On Tue, Oct 21, 2014 at 11:00:35PM +0200, Bernd Schmidt wrote: On 10/21/2014 08:26 PM, Jeff Law wrote: * optabs.c (emit_indirect_jump): Test HAVE_indirect_jump and emit a sorry if necessary. So doesn't this imply no hot-cold partitioning since we use indirect jumps to get across the partition? Similarly doesn't this imply other missing features (setjmp/longjmp, nonlocal gotos, computed jumps? Pretty much yes to all. Do you need some mechanism to ensure that hot/cold partitioning isn't enabled? I guess I could clear flag_reorder_blocks_and_partition in nvptx_option_override. The problem hasn't come up so far. Do you need some kind of message specific to the other features, or are we going to assume that the user will map from the indirect jump message back to the use of setjmp/longjmp or something similar? I have some sorry calls in things like a dummy nonlocal_goto pattern. It doesn't quite manage to catch everything without an ICE yet though. With all the sorry additions, what is actually the plan for OpenMP (dunno how OpenACC is different in this regard)? At least for OpenMP, the best would be if the #pragma omp target regions and/or #pragma omp declare target functions contain anything a particular offloading accelerator can't handle, instead of failing the whole compilation perhaps just emit some at least by default non-fatal warning and not emit anything for the particular offloading target, which would mean either host fallback, or, if some other offloading target succeeded, just that target. The unsupported stuff can be machine dependent builtins that can't be transformed, or e.g. the various things you've listed as unsupportable by the PTX backend right now. Jakub
Re: The nvptx port [7/11+] Inform the port about call arguments
On 10/21/2014 11:11 PM, Jeff Law wrote: On 10/20/14 14:29, Bernd Schmidt wrote: In ptx assembly we need to decorate call insns with the arguments that are being passed. We also need to know the exact function type. This is kind of hard to do with the existing infrastructure since things like function_arg are called at other times rather than just when emitting a call, so this patch adds two more hooks, one called just before argument registers are loaded (once for each arg), and the other just after the call is complete. How exactly do you need to decorate? Just mention the register, size information or do you need full type information? A normal call looks like { .param.u32 %retval_in; .param.u64 %out_arg0; st.param.u64 [%out_arg0], %r1400; call (%retval_in), PopCnt, (%out_arg0); ld.param.u32%r1403, [%retval_in]; } which declares local variables for the args and return value, stores the pseudos in the outgoing args, calls the function with explicitly named args and return values, and loads the incoming return value. All this is produced by nvptx_output_call_insn for a single CALL rtx insn. Indirect calls additionally need to produce a .callprototype pseudo-op which looks like a function declaration; for normal calls the called function must already be declared elsewhere. The machinery to produce such .callprototypes is also used to produce a ptx decl from the call insn for an external KR declaration with no argument types. We've had targets where we had to indicate register banks for each argument. Those would walk CALL_INSN_FUNCTION_USAGE to find the argument registers, then from the register # we would know which register bank to use. Would that work for you? Couple of problems with this - the fusage isn't available to gen_call, it gets added to the call insn after it is emitted, but the backend would like to have this information when emitting the insn. Also, I'd need the order to be reliable and I don't think CALL_INSN_FUNCTION_USAGE is really designed to guarantee that (I suspect the order of register args and things like the struct return reg is wrong). I also need the exact function type and the call_args hook seems like the easiest way to communicate it to the backend. Bernd
Re: The nvptx port [1/11+] indirect jumps
On 10/21/2014 11:30 PM, Jakub Jelinek wrote: At least for OpenMP, the best would be if the #pragma omp target regions and/or #pragma omp declare target functions contain anything a particular offloading accelerator can't handle, instead of failing the whole compilation perhaps just emit some at least by default non-fatal warning and not emit anything for the particular offloading target, which would mean either host fallback, or, if some other offloading target succeeded, just that target. I guess a test could be added to mkoffload if gcc were to return a different value for a sorry vs. any other compilation failure. The tool could then choose not to produce offloading support for that target. Bernd
Re: The nvptx port [7/11+] Inform the port about call arguments
On 10/21/14 21:29, Bernd Schmidt wrote: A normal call looks like { .param.u32 %retval_in; .param.u64 %out_arg0; st.param.u64 [%out_arg0], %r1400; call (%retval_in), PopCnt, (%out_arg0); ld.param.u32%r1403, [%retval_in]; } which declares local variables for the args and return value, stores the pseudos in the outgoing args, calls the function with explicitly named args and return values, and loads the incoming return value. All this is produced by nvptx_output_call_insn for a single CALL rtx insn. So far, so good. Indirect calls additionally need to produce a .callprototype pseudo-op which looks like a function declaration; for normal calls the called function must already be declared elsewhere. The machinery to produce such .callprototypes is also used to produce a ptx decl from the call insn for an external KR declaration with no argument types. Yea, no surprise here. Couple of problems with this - the fusage isn't available to gen_call, it gets added to the call insn after it is emitted, but the backend would like to have this information when emitting the insn. Right. Targets which have needed this emit the decorations at insn-output time so the fusage has been attached. Also, I'd need the order to be reliable and I don't think CALL_INSN_FUNCTION_USAGE is really designed to guarantee that (I suspect the order of register args and things like the struct return reg is wrong). I also need the exact function type and the call_args hook seems like the easiest way to communicate it to the backend. We've depended on the ordering in the PA, well, forever. However, I doubt ordering of regs in the fusage is documented at all! We could change that. So, in the end I'm torn. I don't like adding new hooks when they're not needed, but I have some reservations about relying on the order of stuff in CALL_INSN_FUNCTION_USAGE and I worry a bit that you might end up with stuff other than arguments on that list -- the PA port could filter on the hard registers used for passing arguments, so other stuff appearing isn't a big deal. Let me sleep on this one :-) Jeff
Re: The nvptx port [8/11+] Write undefined decls.
On 10/20/14 14:30, Bernd Schmidt wrote: ptx assembly requires that declarations are written for undefined variables. This adds that functionality. Bernd 008-undefdecl.diff gcc/ * target.def (assemble_undefined_decl): New hooks. * hooks.c (hook_void_FILEptr_constcharptr_const_tree): New function. * hooks.h (hook_void_FILEptr_constcharptr_const_tree): Declare. * doc/tm.texi.in (TARGET_ASM_ASSEMBLE_UNDEFINED_DECL): Add. * doc/tm.texi: Regenerate. * output.h (assemble_undefined_decl): Declare. (get_fnname_from_decl): Declare. * varasm.c (assemble_undefined_decl): New function. (get_fnname_from_decl): New function. * final.c (rest_of_handle_final): Use it. * varpool.c (varpool_output_variables): Call assemble_undefined_decl for nodes without a definition. Does this need to happen at the use site, or can it be deferred? THe PA had to do something similar. We built up a vector of every external object in ASM_OUTPUT_EXTERNAL, but did not emit anything. Then in ASM_FILE_END, we walked that vector and anything that was actually referenced (as opposed to just just declared) we would emit the magic .IMPORT lines. Jeff
Re: The nvptx port [9/11+] Epilogues
On 10/20/14 14:32, Bernd Schmidt wrote: We skip the late compilation passes on ptx, but there's one piece we do need - fixing up the function so that we get return insns in the right places. This patch just makes thread_prologue_and_epilogue_insns callable from the reorg pass. Bernd 009-proep.diff gcc/ * function.c (thread_prologue_and_epilogue_insns): No longer static. * function.h (thread_prologue_and_epilogue_insns): Declare. OK. Jeff
Re: The nvptx port [7/11+] Inform the port about call arguments
On 10/21/2014 11:53 PM, Jeff Law wrote: So, in the end I'm torn. I don't like adding new hooks when they're not needed, but I have some reservations about relying on the order of stuff in CALL_INSN_FUNCTION_USAGE and I worry a bit that you might end up with stuff other than arguments on that list -- the PA port could filter on the hard registers used for passing arguments, so other stuff appearing isn't a big deal. This is another worry. Also, at the moment we don't actually add the pseudos to CALL_INSN_FUNCTION_USAGE (that's patch 6/11), we use the regs saved by the call_args hook to make proper USEs in a PARALLEL. I'm not convinced the rest of the compiler would be too happy to see pseudos there. So, in all I'd say it's probably possible to do it that way, but it feels a lot iffier than I'd be happy with. I for one didn't know about the PA requirement, so I could easily have broken it unknowingly if I'd made some random change modifying call expansion. Bernd
Re: The nvptx port [8/11+] Write undefined decls.
On 10/22/2014 12:05 AM, Jeff Law wrote: On 10/20/14 14:30, Bernd Schmidt wrote: ptx assembly requires that declarations are written for undefined variables. This adds that functionality. Does this need to happen at the use site, or can it be deferred? This is independent of use sites. The patch just adds another walk over the varpool to emit not just the defined vars. Ideally we'd maintain an order that declares or defines every variable before it is referenced by an initializer, but the attempt to do that in the compiler totally failed due to references between constant pools and regular variables. The nvptx-as tool we have fixes up the order of declarations after the first compilation stage. THe PA had to do something similar. We built up a vector of every external object in ASM_OUTPUT_EXTERNAL, but did not emit anything. Then in ASM_FILE_END, we walked that vector and anything that was actually referenced (as opposed to just just declared) we would emit the magic .IMPORT lines. Sounds like the PA could use this hook to simplify its code quite a bit. Looking at the patch again I noticed there's still some unrelated code in here - the patch used to be quite a lot larger and got shrunk due to the failure mentioned above. get_fnname_for_decl is just a new function broken out of rest_of_handle_final, it is used by the nvptx.c code. Bernd
The nvptx port [0/11+]
This is a patch kit that adds the nvptx port to gcc. It contains preliminary patches to add needed functionality, the target files, and one somewhat optional patch with additional target tools. There'll be more patch series, one for the testsuite, and one to make the offload functionality work with this port. Also required are the previous four rtl patches, two of which weren't entirely approved yet. For the moment, I've stripped out all the address space support that got bogged down in review by brokenness in our representation of address spaces. The ptx address spaces are of course still defined and used inside the backend. Ptx really isn't a usual target - it is a virtual target which is then translated by another compiler (ptxas) to the final code that runs on the GPU. There are many restrictions, some imposed by the GPU hardware, and some by the fact that not everything you'd want can be represented in ptx. Here are some of the highlights: * Everything is typed - variables, functions, registers. This can cause problems with KR style C or anything else that doesn't have a proper type internally. * Declarations are needed, even for undefined variables. * Can't emit initializers referring to their variable's address since you can't write forward declarations for variables. * Variables can be declared only as scalars or arrays, not structures. Initializers must be in the variable's declared type, which requires some code in the backend, and it means that packed pointer values are not representable. * Since it's a virtual target, we skip register allocation - no good can probably come from doing that twice. This means asm statements aren't fixed up and will fail if they use matching constraints. * No support for indirect jumps, label values, nonlocal gotos. * No alloca - ptx defines it, but it's not implemented. * No trampolines. * No debugging (at all, for now - we may add line number directives). * Limited C library support - I have a hacked up copy of newlib that provides a reasonable subset. * malloc and free are defined by ptx (these appear to be undocumented), but there isn't a realloc. I have one patch for Fortran to use a malloc/memcpy helper function in cases where we know the old size. All in all, this is not intended to be used as a C (or any other source language) compiler. I've gone through a lot of effort to make it work reasonably well, but only in order to get sufficient test coverage from the testsuites. The intended use for this is only to build it as an offload compiler, and use it through OpenACC by way of lto1. That leaves the question of how we should document it - does it need the usual constraint and option documentation, given that user's aren't expected to use any of it? A slightly earlier version of the entire patch kit was bootstrapped and tested on x86_64-linux. Ok for trunk? Bernd
The nvptx port [2/11+] No register allocation
Since it's a virtual target, I've chosen not to run register allocation. This is one of the patches necessary to make that work, it primarily adds a target hook to disable it and fixes some of the fallout. Bernd
The nvptx port [1/11+] indirect jumps
ptx doesn't have indirect jumps, so CODE_FOR_indirect_jump may not be defined. Add a sorry. Bernd gcc/ * optabs.c (emit_indirect_jump): Test HAVE_indirect_jump and emit a sorry if necessary. Index: gcc/optabs.c === --- gcc/optabs.c (revision 422345) +++ gcc/optabs.c (revision 422346) @@ -4477,13 +4477,16 @@ prepare_float_lib_cmp (rtx x, rtx y, enu /* Generate code to indirectly jump to a location given in the rtx LOC. */ void -emit_indirect_jump (rtx loc) +emit_indirect_jump (rtx loc ATTRIBUTE_UNUSED) { +#ifndef HAVE_indirect_jump + sorry (indirect jumps are not available on this target); +#else struct expand_operand ops[1]; - create_address_operand (ops[0], loc); expand_jump_insn (CODE_FOR_indirect_jump, 1, ops); emit_barrier (); +#endif } #ifdef HAVE_conditional_move
Re: The nvptx port [3/11+] Struct returns
Even when returning a structure by passing an invisible reference, gcc still likes to set the return register to the address of the struct. This is undesirable on ptx where things like the return register have to be declared, and the function really returns void at ptx level. I've added a target hook to avoid this. I figure other targets might find it beneficial to omit this unnecessary set as well. Bernd gcc/ * target.def (omit_struct_return_reg): New data hook. * doc/tm.texi.in: Add @hook TARGET_OMIT_STRUCT_RETURN_REG. * doc/tm.texi: Regenerate. * function.c (expand_function_end): Use it. Index: gcc/doc/tm.texi === --- gcc/doc/tm.texi (revision 422355) +++ gcc/doc/tm.texi (revision 422356) @@ -4560,6 +4560,14 @@ need more space than is implied by @code saving and restoring an arbitrary return value. @end defmac +@deftypevr {Target Hook} bool TARGET_OMIT_STRUCT_RETURN_REG +Normally, when a function returns a structure by memory, the address +is passed as an invisible pointer argument, but the compiler also +arranges to return the address from the function like it would a normal +pointer return value. Define this to true if that behaviour is +undesirable on your target. +@end deftypevr + @deftypefn {Target Hook} bool TARGET_RETURN_IN_MSB (const_tree @var{type}) This hook should return true if values of type @var{type} are returned at the most significant end of a register (in other words, if they are Index: gcc/doc/tm.texi.in === --- gcc/doc/tm.texi.in (revision 422355) +++ gcc/doc/tm.texi.in (revision 422356) @@ -3769,6 +3769,8 @@ need more space than is implied by @code saving and restoring an arbitrary return value. @end defmac +@hook TARGET_OMIT_STRUCT_RETURN_REG + @hook TARGET_RETURN_IN_MSB @node Aggregate Return Index: gcc/target.def === --- gcc/target.def (revision 422355) +++ gcc/target.def (revision 422356) @@ -3601,6 +3601,16 @@ structure value address at the beginning to emit adjusting code, you should do it at this point., rtx, (tree fndecl, int incoming), hook_rtx_tree_int_null) + +DEFHOOKPOD +(omit_struct_return_reg, + Normally, when a function returns a structure by memory, the address\n\ +is passed as an invisible pointer argument, but the compiler also\n\ +arranges to return the address from the function like it would a normal\n\ +pointer return value. Define this to true if that behaviour is\n\ +undesirable on your target., + bool, false) + DEFHOOK (return_in_memory, This target hook should return a nonzero value to say to return the\n\ Index: gcc/function.c === --- gcc/function.c (revision 422355) +++ gcc/function.c (revision 422356) @@ -5179,8 +5179,8 @@ expand_function_end (void) If returning a structure PCC style, the caller also depends on this value. And cfun-returns_pcc_struct is not necessarily set. */ - if (cfun-returns_struct - || cfun-returns_pcc_struct) + if ((cfun-returns_struct || cfun-returns_pcc_struct) + !targetm.calls.omit_struct_return_reg) { rtx value_address = DECL_RTL (DECL_RESULT (current_function_decl)); tree type = TREE_TYPE (DECL_RESULT (current_function_decl));
The nvptx port [4/11+] Post-RA pipeline
This stops most of the post-regalloc passes to be run if the target doesn't want register allocation. I'd previously moved them all out of postreload to the toplevel, but Jakub (I think) pointed out that the idea is not to run them to avoid crashes if reload fails e.g. for an invalid asm. So I've made a new container pass. A later patch will make thread_prologue_and_epilogue_insns callable from the backend. Bernd gcc/ * passes.def (pass_compute_alignments, pass_duplicate_computed_gotos, pass_variable_tracking, pass_free_cfg, pass_machine_reorg, pass_cleanup_barriers, pass_delay_slots, pass_split_for_shorten_branches, pass_convert_to_eh_region_ranges, pass_shorten_branches, pass_est_nothrow_function_flags, pass_dwarf2_frame, pass_final): Move outside of pass_postreload and into pass_late_compilation. (pass_late_compilation): Add. * passes.c (pass_data_late_compilation, pass_late_compilation, make_pass_late_compilation): New. * timevar.def (TV_LATE_COMPILATION): New. Index: gcc/passes.def === --- gcc/passes.def.orig +++ gcc/passes.def @@ -415,6 +415,9 @@ along with GCC; see the file COPYING3. NEXT_PASS (pass_split_before_regstack); NEXT_PASS (pass_stack_regs_run); POP_INSERT_PASSES () + POP_INSERT_PASSES () + NEXT_PASS (pass_late_compilation); + PUSH_INSERT_PASSES_WITHIN (pass_late_compilation) NEXT_PASS (pass_compute_alignments); NEXT_PASS (pass_variable_tracking); NEXT_PASS (pass_free_cfg); Index: gcc/passes.c === --- gcc/passes.c.orig +++ gcc/passes.c @@ -569,6 +569,44 @@ make_pass_postreload (gcc::context *ctxt return new pass_postreload (ctxt); } +namespace { + +const pass_data pass_data_late_compilation = +{ + RTL_PASS, /* type */ + *all-late_compilation, /* name */ + OPTGROUP_NONE, /* optinfo_flags */ + TV_LATE_COMPILATION, /* tv_id */ + PROP_rtl, /* properties_required */ + 0, /* properties_provided */ + 0, /* properties_destroyed */ + 0, /* todo_flags_start */ + 0, /* todo_flags_finish */ +}; + +class pass_late_compilation : public rtl_opt_pass +{ +public: + pass_late_compilation (gcc::context *ctxt) +: rtl_opt_pass (pass_data_late_compilation, ctxt) + {} + + /* opt_pass methods: */ + virtual bool gate (function *) + { +return reload_completed || targetm.no_register_allocation; + } + +}; // class pass_late_compilation + +} // anon namespace + +static rtl_opt_pass * +make_pass_late_compilation (gcc::context *ctxt) +{ + return new pass_late_compilation (ctxt); +} + /* Set the static pass number of pass PASS to ID and record that Index: gcc/timevar.def === --- gcc/timevar.def.orig +++ gcc/timevar.def @@ -270,6 +270,7 @@ DEFTIMEVAR (TV_EARLY_LOCAL , early DEFTIMEVAR (TV_OPTIMIZE , unaccounted optimizations) DEFTIMEVAR (TV_REST_OF_COMPILATION , rest of compilation) DEFTIMEVAR (TV_POSTRELOAD , unaccounted post reload) +DEFTIMEVAR (TV_LATE_COMPILATION , unaccounted late compilation) DEFTIMEVAR (TV_REMOVE_UNUSED , remove unused locals) DEFTIMEVAR (TV_ADDRESS_TAKEN , address taken) DEFTIMEVAR (TV_TODO , unaccounted todo)
The nvptx port [5/11+] Variable declarations
ptx assembly follows rather different rules than what's typical elsewhere. We need a new hook to add a }; string when we are finished outputting a variable with an initializer. Bernd gcc/ * target.def (decl_end): New hook. * varasm.c (assemble_variable_contents, assemble_constant_contents): Use it. * doc/tm.texi.in (TARGET_ASM_DECL_END): Add. * doc/tm.texi: Regenerate. Index: gcc/doc/tm.texi === --- gcc/doc/tm.texi.orig +++ gcc/doc/tm.texi @@ -7575,6 +7575,11 @@ The default implementation of this hook when the relevant string is @code{NULL}. @end deftypefn +@deftypefn {Target Hook} void TARGET_ASM_DECL_END (void) +Define this hook if the target assembler requires a special marker to +terminate an initialized variable declaration. +@end deftypefn + @deftypefn {Target Hook} bool TARGET_ASM_OUTPUT_ADDR_CONST_EXTRA (FILE *@var{file}, rtx @var{x}) A target hook to recognize @var{rtx} patterns that @code{output_addr_const} can't deal with, and output assembly code to @var{file} corresponding to Index: gcc/doc/tm.texi.in === --- gcc/doc/tm.texi.in.orig +++ gcc/doc/tm.texi.in @@ -5412,6 +5412,8 @@ It must not be modified by command-line @hook TARGET_ASM_INTEGER +@hook TARGET_ASM_DECL_END + @hook TARGET_ASM_OUTPUT_ADDR_CONST_EXTRA @defmac ASM_OUTPUT_ASCII (@var{stream}, @var{ptr}, @var{len}) Index: gcc/target.def === --- gcc/target.def.orig +++ gcc/target.def @@ -127,6 +127,15 @@ when the relevant string is @code{NULL}. bool, (rtx x, unsigned int size, int aligned_p), default_assemble_integer) +/* Notify the backend that we have completed emitting the data for a + decl. */ +DEFHOOK +(decl_end, + Define this hook if the target assembler requires a special marker to\n\ +terminate an initialized variable declaration., + void, (void), + hook_void_void) + /* Output code that will globalize a label. */ DEFHOOK (globalize_label, Index: gcc/varasm.c === --- gcc/varasm.c.orig +++ gcc/varasm.c @@ -1945,6 +1945,7 @@ assemble_variable_contents (tree decl, c else /* Leave space for it. */ assemble_zeros (tree_to_uhwi (DECL_SIZE_UNIT (decl))); + targetm.asm_out.decl_end (); } } @@ -3349,6 +3350,8 @@ assemble_constant_contents (tree exp, co /* Output the value of EXP. */ output_constant (exp, size, align); + + targetm.asm_out.decl_end (); } /* We must output the constant data referred to by SYMBOL; do so. */
The nvptx port [6/11+] Pseudo call args
On ptx, we'll be using pseudos to pass function args as well, and there's one assert that needs to be toned town to make that work. Bernd gcc/ * expr.c (use_reg_mode): Just return for pseudo registers. Index: gcc/expr.c === --- gcc/expr.c (revision 422421) +++ gcc/expr.c (revision 422422) @@ -2321,7 +2321,10 @@ copy_blkmode_to_reg (enum machine_mode m void use_reg_mode (rtx *call_fusage, rtx reg, enum machine_mode mode) { - gcc_assert (REG_P (reg) REGNO (reg) FIRST_PSEUDO_REGISTER); + gcc_assert (REG_P (reg)); + + if (!HARD_REGISTER_P (reg)) +return; *call_fusage = gen_rtx_EXPR_LIST (mode, gen_rtx_USE (VOIDmode, reg), *call_fusage);
The nvptx port [7/11+] Inform the port about call arguments
In ptx assembly we need to decorate call insns with the arguments that are being passed. We also need to know the exact function type. This is kind of hard to do with the existing infrastructure since things like function_arg are called at other times rather than just when emitting a call, so this patch adds two more hooks, one called just before argument registers are loaded (once for each arg), and the other just after the call is complete. Bernd gcc/ * target.def (call_args, end_call_args): New hooks. * hooks.c (hook_void_rtx_tree): New empty function. * hooks.h (hook_void_rtx_tree): Declare. * doc/tm.texi.in (TARGET_CALL_ARGS, TARGET_END_CALL_ARGS): Add. * doc/tm.texi: Regenerate. * calls.c (expand_call): Slightly rearrange the code. Use the two new hooks. (expand_library_call_value_1): Use the two new hooks. Index: gcc/doc/tm.texi === --- gcc/doc/tm.texi.orig +++ gcc/doc/tm.texi @@ -5027,6 +5027,29 @@ except the last are treated as named. You need not define this hook if it always returns @code{false}. @end deftypefn +@deftypefn {Target Hook} void TARGET_CALL_ARGS (rtx, @var{tree}) +While generating RTL for a function call, this target hook is invoked once +for each argument passed to the function, either a register returned by +@code{TARGET_FUNCTION_ARG} or a memory location. It is called just +before the point where argument registers are stored. The type of the +function to be called is also passed as the second argument; it is +@code{NULL_TREE} for libcalls. The @code{TARGET_END_CALL_ARGS} hook is +invoked just after the code to copy the return reg has been emitted. +This functionality can be used to perform special setup of call argument +registers if a target needs it. +For functions without arguments, the hook is called once with @code{pc_rtx} +passed instead of an argument register. +Most ports do not need to implement anything for this hook. +@end deftypefn + +@deftypefn {Target Hook} void TARGET_END_CALL_ARGS (void) +This target hook is invoked while generating RTL for a function call, +just after the point where the return reg is copied into a pseudo. It +signals that all the call argument and return registers for the just +emitted call are now no longer in use. +Most ports do not need to implement anything for this hook. +@end deftypefn + @deftypefn {Target Hook} bool TARGET_PRETEND_OUTGOING_VARARGS_NAMED (cumulative_args_t @var{ca}) If you need to conditionally change ABIs so that one works with @code{TARGET_SETUP_INCOMING_VARARGS}, but the other works like neither Index: gcc/doc/tm.texi.in === --- gcc/doc/tm.texi.in.orig +++ gcc/doc/tm.texi.in @@ -3929,6 +3929,10 @@ These machine description macros help im @hook TARGET_STRICT_ARGUMENT_NAMING +@hook TARGET_CALL_ARGS + +@hook TARGET_END_CALL_ARGS + @hook TARGET_PRETEND_OUTGOING_VARARGS_NAMED @node Trampolines Index: gcc/hooks.c === --- gcc/hooks.c.orig +++ gcc/hooks.c @@ -245,6 +245,11 @@ hook_void_tree (tree a ATTRIBUTE_UNUSED) } void +hook_void_rtx_tree (rtx, tree) +{ +} + +void hook_void_constcharptr (const char *a ATTRIBUTE_UNUSED) { } Index: gcc/hooks.h === --- gcc/hooks.h.orig +++ gcc/hooks.h @@ -70,6 +70,7 @@ extern void hook_void_constcharptr (cons extern void hook_void_rtx_int (rtx, int); extern void hook_void_FILEptr_constcharptr (FILE *, const char *); extern bool hook_bool_FILEptr_rtx_false (FILE *, rtx); +extern void hook_void_rtx_tree (rtx, tree); extern void hook_void_tree (tree); extern void hook_void_tree_treeptr (tree, tree *); extern void hook_void_int_int (int, int); Index: gcc/target.def === --- gcc/target.def.orig +++ gcc/target.def @@ -3825,6 +3825,33 @@ not generate any instructions in this ca default_setup_incoming_varargs) DEFHOOK +(call_args, + While generating RTL for a function call, this target hook is invoked once\n\ +for each argument passed to the function, either a register returned by\n\ +@code{TARGET_FUNCTION_ARG} or a memory location. It is called just\n\ +before the point where argument registers are stored. The type of the\n\ +function to be called is also passed as the second argument; it is\n\ +@code{NULL_TREE} for libcalls. The @code{TARGET_END_CALL_ARGS} hook is\n\ +invoked just after the code to copy the return reg has been emitted.\n\ +This functionality can be used to perform special setup of call argument\n\ +registers if a target needs it.\n\ +For functions without arguments, the hook is called once with @code{pc_rtx}\n\ +passed instead of an argument register.\n\ +Most ports do not need to implement anything for this hook., + void, (rtx, tree),
The nvptx port [8/11+] Write undefined decls.
ptx assembly requires that declarations are written for undefined variables. This adds that functionality. Bernd gcc/ * target.def (assemble_undefined_decl): New hooks. * hooks.c (hook_void_FILEptr_constcharptr_const_tree): New function. * hooks.h (hook_void_FILEptr_constcharptr_const_tree): Declare. * doc/tm.texi.in (TARGET_ASM_ASSEMBLE_UNDEFINED_DECL): Add. * doc/tm.texi: Regenerate. * output.h (assemble_undefined_decl): Declare. (get_fnname_from_decl): Declare. * varasm.c (assemble_undefined_decl): New function. (get_fnname_from_decl): New function. * final.c (rest_of_handle_final): Use it. * varpool.c (varpool_output_variables): Call assemble_undefined_decl for nodes without a definition. Index: gcc/doc/tm.texi === --- gcc/doc/tm.texi.orig +++ gcc/doc/tm.texi @@ -7899,6 +7902,13 @@ global; that is, available for reference The default implementation uses the TARGET_ASM_GLOBALIZE_LABEL target hook. @end deftypefn +@deftypefn {Target Hook} void TARGET_ASM_ASSEMBLE_UNDEFINED_DECL (FILE *@var{stream}, const char *@var{name}, const_tree @var{decl}) +This target hook is a function to output to the stdio stream +@var{stream} some commands that will declare the name associated with +@var{decl} which is not defined in the current translation unit. Most +assemblers do not require anything to be output in this case. +@end deftypefn + @defmac ASM_WEAKEN_LABEL (@var{stream}, @var{name}) A C statement (sans semicolon) to output to the stdio stream @var{stream} some commands that will make the label @var{name} weak; Index: gcc/doc/tm.texi.in === --- gcc/doc/tm.texi.in.orig +++ gcc/doc/tm.texi.in @@ -5693,6 +5693,8 @@ You may wish to use @code{ASM_OUTPUT_SIZ @hook TARGET_ASM_GLOBALIZE_DECL_NAME +@hook TARGET_ASM_ASSEMBLE_UNDEFINED_DECL + @defmac ASM_WEAKEN_LABEL (@var{stream}, @var{name}) A C statement (sans semicolon) to output to the stdio stream @var{stream} some commands that will make the label @var{name} weak; Index: gcc/hooks.c === --- gcc/hooks.c.orig +++ gcc/hooks.c @@ -139,6 +139,13 @@ hook_void_FILEptr_constcharptr (FILE *a { } +/* Generic hook that takes (FILE *, const char *, constr_tree *) and does + nothing. */ +void +hook_void_FILEptr_constcharptr_const_tree (FILE *, const char *, const_tree) +{ +} + /* Generic hook that takes (FILE *, rtx) and returns false. */ bool hook_bool_FILEptr_rtx_false (FILE *a ATTRIBUTE_UNUSED, Index: gcc/hooks.h === --- gcc/hooks.h.orig +++ gcc/hooks.h @@ -69,6 +69,8 @@ extern void hook_void_void (void); extern void hook_void_constcharptr (const char *); extern void hook_void_rtx_int (rtx, int); extern void hook_void_FILEptr_constcharptr (FILE *, const char *); +extern void hook_void_FILEptr_constcharptr_const_tree (FILE *, const char *, + const_tree); extern bool hook_bool_FILEptr_rtx_false (FILE *, rtx); extern void hook_void_rtx (rtx); extern void hook_void_tree (tree); Index: gcc/target.def === --- gcc/target.def.orig +++ gcc/target.def @@ -158,6 +158,16 @@ global; that is, available for reference The default implementation uses the TARGET_ASM_GLOBALIZE_LABEL target hook., void, (FILE *stream, tree decl), default_globalize_decl_name) +/* Output code that will declare an external variable. */ +DEFHOOK +(assemble_undefined_decl, + This target hook is a function to output to the stdio stream\n\ +@var{stream} some commands that will declare the name associated with\n\ +@var{decl} which is not defined in the current translation unit. Most\n\ +assemblers do not require anything to be output in this case., + void, (FILE *stream, const char *name, const_tree decl), + hook_void_FILEptr_constcharptr_const_tree) + /* Output code that will emit a label for unwind info, if this target requires such labels. Second argument is the decl the unwind info is associated with, third is a boolean: true if Index: gcc/final.c === --- gcc/final.c.orig +++ gcc/final.c @@ -4434,17 +4434,7 @@ leaf_renumber_regs_insn (rtx in_rtx) static unsigned int rest_of_handle_final (void) { - rtx x; - const char *fnname; - - /* Get the function's name, as described by its RTL. This may be - different from the DECL_NAME name used in the source file. */ - - x = DECL_RTL (current_function_decl); - gcc_assert (MEM_P (x)); - x = XEXP (x, 0); - gcc_assert (GET_CODE (x) == SYMBOL_REF); - fnname = XSTR (x, 0); + const char *fnname = get_fnname_from_decl (current_function_decl); assemble_start_function (current_function_decl, fnname); final_start_function (get_insns
The nvptx port [9/11+] Epilogues
We skip the late compilation passes on ptx, but there's one piece we do need - fixing up the function so that we get return insns in the right places. This patch just makes thread_prologue_and_epilogue_insns callable from the reorg pass. Bernd gcc/ * function.c (thread_prologue_and_epilogue_insns): No longer static. * function.h (thread_prologue_and_epilogue_insns): Declare. Index: gcc/function.c === --- gcc/function.c (revision 422424) +++ gcc/function.c (revision 422425) @@ -5945,7 +5945,7 @@ emit_return_for_exit (edge exit_fallthru in a sibcall omit the sibcall_epilogue if the block is not in ANTIC. */ -static void +void thread_prologue_and_epilogue_insns (void) { bool inserted; Index: gcc/function.h === --- gcc/function.h (revision 422424) +++ gcc/function.h (revision 422425) @@ -773,6 +773,8 @@ extern void free_after_compilation (stru extern void init_varasm_status (void); +extern void thread_prologue_and_epilogue_insns (void); + #ifdef RTX_CODE extern void diddle_return_value (void (*)(rtx, void*), void*); extern void clobber_return_register (void);
The nvptx port [10/11+] Target files
These are the main target files for the ptx port. t-nvptx is empty for now but will grow some content with follow up patches. Bernd * configure.ac: Allow configuring lto for nvptx. * configure: Regenerate. gcc/ * config/nvptx/nvptx.c: New file. * config/nvptx/nvptx.h: New file. * config/nvptx/nvptx-protos.h: New file. * config/nvptx/nvptx.md: New file. * config/nvptx/t-nvptx: New file. * config/nvptx/nvptx.opt: New file. * common/config/nvptx/nvptx-common.c: New file. * config.gcc: Handle nvptx-*-*. libgcc/ * config.host: Handle nvptx-*-*. * config/nvptx/t-nvptx: New file. * config/nvptx/crt0.s: New file. Index: gcc/common/config/nvptx/nvptx-common.c === --- /dev/null +++ gcc/common/config/nvptx/nvptx-common.c @@ -0,0 +1,38 @@ +/* NVPTX common hooks. + Copyright (C) 2014 Free Software Foundation, Inc. + Contributed by Bernd Schmidt ber...@codesourcery.com + +This file is part of GCC. + +GCC is free software; you can redistribute it and/or modify +it under the terms of the GNU General Public License as published by +the Free Software Foundation; either version 3, or (at your option) +any later version. + +GCC is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + +You should have received a copy of the GNU General Public License +along with GCC; see the file COPYING3. If not see +http://www.gnu.org/licenses/. */ + +#include config.h +#include system.h +#include coretypes.h +#include diagnostic-core.h +#include tm.h +#include tm_p.h +#include common/common-target.h +#include common/common-target-def.h +#include opts.h +#include flags.h + +#undef TARGET_HAVE_NAMED_SECTIONS +#define TARGET_HAVE_NAMED_SECTIONS false + +#undef TARGET_DEFAULT_TARGET_FLAGS +#define TARGET_DEFAULT_TARGET_FLAGS MASK_ABI64 + +struct gcc_targetm_common targetm_common = TARGETM_COMMON_INITIALIZER; Index: gcc/config.gcc === --- gcc/config.gcc.orig +++ gcc/config.gcc @@ -420,6 +420,9 @@ nios2-*-*) cpu_type=nios2 extra_options=${extra_options} g.opt ;; +nvptx-*-*) + cpu_type=nvptx + ;; powerpc*-*-*) cpu_type=rs6000 extra_headers=ppc-asm.h altivec.h spe.h ppu_intrinsics.h paired.h spu2vmx.h vec_types.h si2vmx.h htmintrin.h htmxlintrin.h @@ -2148,6 +2151,10 @@ nios2-*-*) ;; esac ;; +nvptx-*) + tm_file=${tm_file} newlib-stdint.h + tmake_file=nvptx/t-nvptx + ;; pdp11-*-*) tm_file=${tm_file} newlib-stdint.h use_gcc_stdint=wrap Index: gcc/config/nvptx/nvptx.c === --- /dev/null +++ gcc/config/nvptx/nvptx.c @@ -0,0 +1,2024 @@ +/* Target code for NVPTX. + Copyright (C) 2014 Free Software Foundation, Inc. + Contributed by Bernd Schmidt ber...@codesourcery.com + + This file is part of GCC. + + GCC is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published + by the Free Software Foundation; either version 3, or (at your + option) any later version. + + GCC is distributed in the hope that it will be useful, but WITHOUT + ANY WARRANTY; without even the implied warranty of MERCHANTABILITY + or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public + License for more details. + + You should have received a copy of the GNU General Public License + along with GCC; see the file COPYING3. If not see + http://www.gnu.org/licenses/. */ + +#include config.h +#include system.h +#include coretypes.h +#include tm.h +#include rtl.h +#include tree.h +#include insn-flags.h +#include output.h +#include insn-attr.h +#include insn-codes.h +#include expr.h +#include regs.h +#include optabs.h +#include recog.h +#include ggc.h +#include timevar.h +#include tm_p.h +#include tm-preds.h +#include tm-constrs.h +#include function.h +#include langhooks.h +#include dbxout.h +#include target.h +#include target-def.h +#include diagnostic.h +#include basic-block.h +#include stor-layout.h +#include calls.h +#include df.h +#include builtins.h +#include hashtab.h +#include sstream + +/* Record the function decls we've written, and the libfuncs and function + decls corresponding to them. */ +static std::stringstream func_decls; +static GTY((if_marked (ggc_marked_p), param_is (struct rtx_def))) + htab_t declared_libfuncs_htab; +static GTY((if_marked (ggc_marked_p), param_is (union tree_node))) + htab_t declared_fndecls_htab; +static GTY((if_marked (ggc_marked_p), param_is (union tree_node))) + htab_t needed_fndecls_htab; + +/* Allocate a new, cleared machine_function structure. */ + +static struct machine_function * +nvptx_init_machine_status (void) +{ + struct machine_function *p =
The nvptx port [11/11] More tools.
This is a bonus optional patch which adds ar, ranlib, as and ld to the ptx port. This is not proper binutils; ar and ranlib are just linked to the host versions, and the other two tools have the following functions: * nvptx-as is required to convert the compiler output to actual valid ptx assembly, primarily by reordering declarations and definitions. Believe me when I say that I've tried to make that work in the compiler itself and it's pretty much impossible without some really invasive changes. * nvptx-ld is just a pseudo linker that works by concatenating ptx input files and separating them with nul characters. Actual linking is something that happens later, when calling CUDA library functions, but existing build system make it useful to have something called ld which is able to bundle everything that's needed into a single file, and this seemed to be the simplest way of achieving this. There's a toplevel configure.ac change necessary to make ar/ranlib useable by the libgcc build. Having some tools built like this has some precedent in t-vmsnative, but as Thomas noted it does make feature tests in gcc's configure somewhat ugly (but everything works well enough to build the compiler). The alternative here is to bundle all these files into a separate nvptx-tools package which users would have to download - something that would be nice to avoid. These tools currently require GNU extensions - something I probably ought to fix if we decide to add them to the gcc build itself. Bernd * configure.ac (AR_FOR_TARGET, RANLIB_FOR_TARGET): If nvptx-*, look for them in the gcc build directory. * configure: Regenerate. gcc/ * config.gcc (nvptx-*): Define extra_programs. * config/nvptx/nvptx-as.c: New file. * config/nvptx/nvptx-ld.c: New file. * config/nvptx/t-nvptx (nvptx-ld.o, nvptx-as.o, collect-ld$(exeext), as$(exeext), ar$(exeext), ranlib$(exeext): New rules. Index: git/gcc/config.gcc === --- git.orig/gcc/config.gcc +++ git/gcc/config.gcc @@ -2154,6 +2154,7 @@ nios2-*-*) nvptx-*) tm_file=${tm_file} newlib-stdint.h tmake_file=nvptx/t-nvptx + extra_programs=collect-ld\$(exeext) as\$(exeext) ar\$(exeext) ranlib\$(exeext) ;; pdp11-*-*) tm_file=${tm_file} newlib-stdint.h Index: git/gcc/config/nvptx/nvptx-as.c === --- /dev/null +++ git/gcc/config/nvptx/nvptx-as.c @@ -0,0 +1,961 @@ +/* An assembler for ptx. + Copyright (C) 2014 Free Software Foundation, Inc. + Contributed by Nathan Sidwell nat...@codesourcery.com + Contributed by Bernd Schmidt ber...@codesourcery.com + + This file is part of GCC. + + GCC is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published + by the Free Software Foundation; either version 3, or (at your + option) any later version. + + GCC is distributed in the hope that it will be useful, but WITHOUT + ANY WARRANTY; without even the implied warranty of MERCHANTABILITY + or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public + License for more details. + + You should have received a copy of the GNU General Public License + along with GCC; see the file COPYING3. If not see + http://www.gnu.org/licenses/. */ + +/* Munges gcc-generated PTX assembly so that it becomes acceptable for ptxas. + + This is not a complete assembler. We presume the source is well + formed from the compiler and can die horribly if it is not. */ + +#include getopt.h +#include stdlib.h +#include stdio.h +#include stdarg.h +#include string.h +#include wait.h +#include unistd.h +#include errno.h +#define obstack_chunk_alloc malloc +#define obstack_chunk_free free +#include obstack.h +#define HAVE_DECL_BASENAME 1 +#include libiberty.h +#include hashtab.h + +#include list + +static const char *outname = NULL; + +static void __attribute__ ((format (printf, 1, 2))) +fatal_error (const char * cmsgid, ...) +{ + va_list ap; + + va_start (ap, cmsgid); + fprintf (stderr, nvptx-as: ); + vfprintf (stderr, cmsgid, ap); + fprintf (stderr, \n); + va_end (ap); + + unlink (outname); + exit (1); +} + +struct Stmt; + +class symbol +{ + public: + symbol (const char *k) : key (k), stmts (0), pending (0), emitted (0) +{ } + + /* The name of the symbol. */ + const char *key; + /* A linked list of dependencies for the initializer. */ + std::listsymbol * deps; + /* The statement in which it is defined. */ + struct Stmt *stmts; + bool pending; + bool emitted; +}; + +/* Hash and comparison functions for these hash tables. */ + +static int hash_string_eq (const void *, const void *); +static hashval_t hash_string_hash (const void *); + +static int +hash_string_eq (const void *s1_p, const void *s2_p) +{ + const char *const *s1 = (const char *const *) s1_p; + const char *s2 = (const char *) s2_p; + return strcmp (*s1, s2) == 0; +} +
Re: The nvptx port [11/11] More tools.
On Mon, 20 Oct 2014, Bernd Schmidt wrote: These tools currently require GNU extensions - something I probably ought to fix if we decide to add them to the gcc build itself. And as regards library use, I'd expect the sources to start with #includes of config.h and system.h (and so not include system headers directly if they are included by system.h) even if no other GCC headers are useful in any way. -- Joseph S. Myers jos...@codesourcery.com