subject:"The nvptx port"

List myself as "nvptx port" maintainer (was: Thomas Schwinge appointed co-maintainer of the nvptx backend)

2023-07-25 Thread Thomas Schwinge

Hi!

On 2023-07-19T23:41:47+0200, Gerald Pfeifer  wrote:
> It's my pleasure to announce Thomas Schwinge as co-maintainer of the
> nvptx backend.
>
> Congratulations and Happy Hacking, Thomas! Please go ahead and update
> MAINTAINERS accordingly.
>
> Gerald (on behalf of the steering committee)

Thanks!  I've pushed commit 28e3d361ba0cfa7ea2f90706159a144eaf4b650e
'List myself as "nvptx port" maintainer', see attached.


Grüße
 Thomas


-
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955
>From 28e3d361ba0cfa7ea2f90706159a144eaf4b650e Mon Sep 17 00:00:00 2001
From: Thomas Schwinge 
Date: Tue, 25 Jul 2023 21:17:52 +0200
Subject: [PATCH] List myself as "nvptx port" maintainer

	* MAINTAINERS: List myself as "nvptx port" maintainer.
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index b626d89fe34..e9b11b43a0f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -102,6 +102,7 @@ nds32 port		Shiva Chen		
 nios2 port		Chung-Lin Tang		
 nios2 port		Sandra Loosemore	
 nvptx port		Tom de Vries		
+nvptx port		Thomas Schwinge		
 or1k port		Stafford Horne		
 pdp11 port		Paul Koning		
 powerpcspe port		Andrew Jenner		
-- 
2.34.1

Flip the nvptx port to LRA (was: [PATCH] Turn on LRA on all targets)

2023-06-30 Thread Thomas Schwinge

Hi!

On 2023-04-29T09:06:54-0600, Jeff Law via Gcc-patches  
wrote:
> On 4/29/23 07:37, Roger Sayle wrote:
>>
>> Segher Boessenkool wrote:
>>> I send this patch now so that people can start testing.
>>>
>>> --- a/gcc/config/nvptx/nvptx.cc
>>> +++ b/gcc/config/nvptx/nvptx.cc
>>> @@ -7601,9 +7601,6 @@ nvptx_asm_output_def_from_decls (FILE *stream, tree 
>>> name, tree value)
>>> #undef TARGET_ATTRIBUTE_TABLE
>>> #define TARGET_ATTRIBUTE_TABLE nvptx_attribute_table
>>>
>>> -#undef TARGET_LRA_P
>>> -#define TARGET_LRA_P hook_bool_void_false
>>> -
>>> #undef TARGET_LEGITIMATE_ADDRESS_P
>>> #define TARGET_LEGITIMATE_ADDRESS_P nvptx_legitimate_address_p
>>
>> I've tested Segher's patch on nvptx-none with make and make -k check and
>> can confirm there are no new regressions.

Confirmed.  Also, no change in nvptx target libraries built.  As
expected.

>> Nvptx is unique in that it
>> doesn't
>> use register allocation, i.e. GCC's only TARGET_NO_REGISTER_ALLOCATION
>> target,
>> so it's a little odd that it specifies which register allocator it doesn't
>> use.
>>
>> I hope this helps,
>
> It does.  Consider a patch which flips the nvptx port to LRA as
> pre-approved.

Pushed to master branch commit f7e3123638712773e8c01e17aae9dc64d9342016
"Flip the nvptx port to LRA", see attached.


Grüße
 Thomas


-
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955
>From f7e3123638712773e8c01e17aae9dc64d9342016 Mon Sep 17 00:00:00 2001
From: Segher Boessenkool 
Date: Sun, 23 Apr 2023 16:47:52 +
Subject: [PATCH] Flip the nvptx port to LRA

... understanding that "turn on LRA" is an exaggeration here, given that nvptx
isn't actually doing register allocation ('TARGET_NO_REGISTER_ALLOCATION').

	gcc/
	* config/nvptx/nvptx.cc (TARGET_LRA_P): Remove.

Co-authored-by: Thomas Schwinge 
---
 gcc/config/nvptx/nvptx.cc | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc
index e3b0304d5376..16ed78030d73 100644
--- a/gcc/config/nvptx/nvptx.cc
+++ b/gcc/config/nvptx/nvptx.cc
@@ -7633,9 +7633,6 @@ nvptx_asm_output_def_from_decls (FILE *stream, tree name, tree value)
 #undef TARGET_ATTRIBUTE_TABLE
 #define TARGET_ATTRIBUTE_TABLE nvptx_attribute_table
 
-#undef TARGET_LRA_P
-#define TARGET_LRA_P hook_bool_void_false
-
 #undef TARGET_LEGITIMATE_ADDRESS_P
 #define TARGET_LEGITIMATE_ADDRESS_P nvptx_legitimate_address_p
 
-- 
2.39.2

Re: The nvptx port [0/11+]

2015-02-18 Thread Thomas Schwinge

Hi!

On Mon, 20 Oct 2014 16:17:56 +0200, Bernd Schmidt ber...@codesourcery.com 
wrote:
 This is a patch kit that adds the nvptx port to gcc.

Committed to trunk in r220781:

commit 0f7695734890f93fe58179e36ac2f41bf4147d78
Author: tschwinge tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4
Date:   Wed Feb 18 08:01:03 2015 +

nvptx-none: Disable the lto-plugin.

config/
* elf.m4 (ACX_ELF_TARGET_IFELSE): nvptx-*-none isn't ELF.
/
* configure: Regenerate.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@220781 
138bc75d-0d04-0410-961f-82ee72b054a4
---
 ChangeLog|4 
 config/ChangeLog |4 
 config/elf.m4|7 +--
 configure|3 ++-
 4 files changed, 15 insertions(+), 3 deletions(-)

diff --git ChangeLog ChangeLog
index 0969af5..a9e4437 100644
--- ChangeLog
+++ ChangeLog
@@ -1,3 +1,7 @@
+2015-02-18  Thomas Schwinge  tho...@codesourcery.com
+
+   * configure: Regenerate.
+
 2015-02-06  Diego Novillo  dnovi...@google.com
 
* MAINTAINERS (Global Reviewers, Plugin, LTO, tree-ssa,
diff --git config/ChangeLog config/ChangeLog
index 2cbc885..c9ed121 100644
--- config/ChangeLog
+++ config/ChangeLog
@@ -1,3 +1,7 @@
+2015-02-18  Thomas Schwinge  tho...@codesourcery.com
+
+   * elf.m4 (ACX_ELF_TARGET_IFELSE): nvptx-*-none isn't ELF.
+
 2014-11-17  Bob Dunlop  bob.dun...@xyzzy.org.uk
 
* mt-ospace (CFLAGS_FOR_TARGET): Append -g -Os rather than
diff --git config/elf.m4 config/elf.m4
index da051cb..1772a44 100644
--- config/elf.m4
+++ config/elf.m4
@@ -1,4 +1,4 @@
-dnl Copyright (C) 2010, 2011 Free Software Foundation, Inc.
+dnl Copyright (C) 2010, 2011, 2015 Free Software Foundation, Inc.
 dnl This file is free software, distributed under the terms of the GNU
 dnl General Public License.  As a special exception to the GNU General
 dnl Public License, this file may be distributed as part of a program
@@ -7,6 +7,8 @@ dnl the same distribution terms as the rest of that program.
 
 dnl From Paolo Bonzini.
 
+dnl Is this an ELF target supporting the LTO plugin?
+
 dnl usage: ACX_ELF_TARGET_IFELSE([if-elf], [if-not-elf])
 AC_DEFUN([ACX_ELF_TARGET_IFELSE], [
 AC_REQUIRE([AC_CANONICAL_TARGET])
@@ -15,7 +17,8 @@ target_elf=no
 case $target in
   *-darwin* | *-aix* | *-cygwin* | *-mingw* | *-aout* | *-*coff* | \
   *-msdosdjgpp* | *-vms* | *-wince* | *-*-pe* | \
-  alpha*-dec-osf* | *-interix* | hppa[[12]]*-*-hpux*)
+  alpha*-dec-osf* | *-interix* | hppa[[12]]*-*-hpux* | \
+  nvptx-*-none)
 target_elf=no
 ;;
   *)
diff --git configure configure
index dd794db..f20a6ab 100755
--- configure
+++ configure
@@ -6047,7 +6047,8 @@ target_elf=no
 case $target in
   *-darwin* | *-aix* | *-cygwin* | *-mingw* | *-aout* | *-*coff* | \
   *-msdosdjgpp* | *-vms* | *-wince* | *-*-pe* | \
-  alpha*-dec-osf* | *-interix* | hppa[12]*-*-hpux*)
+  alpha*-dec-osf* | *-interix* | hppa[12]*-*-hpux* | \
+  nvptx-*-none)
 target_elf=no
 ;;
   *)


Grüße,
 Thomas


signature.asc
Description: PGP signature

Re: nvptx-tools and nvptx-newlib (was: The nvptx port [10/11+] Target files)

2015-02-18 Thread Thomas Schwinge

Hi!

On Wed, 4 Feb 2015 10:43:14 +0100, Jakub Jelinek ja...@redhat.com wrote:
 On Mon, Feb 02, 2015 at 04:32:34PM +0100, Thomas Schwinge wrote:
  Hi!
  
  On Tue, 23 Dec 2014 19:49:35 +0100, I wrote:
   On Mon, 10 Nov 2014 17:19:57 +0100, Bernd Schmidt 
   ber...@codesourcery.com wrote:
The scripts (11/11) I've put up on github, along with a hacked up 
newlib. These are at [...]
  
They are likely to migrate to MentorEmbedded from bernds, but that had 
some permissions problems last week.
   
   That has recently been done:
   https://github.com/MentorEmbedded/nvptx-tools and
   https://github.com/MentorEmbedded/nvptx-newlib are now available.
   
   (I'm aware that we still are to write up how to actually build and test
   all this.)
  
  I just updated
  https://gcc.gnu.org/wiki/Offloading?action=diffrev2=26rev1=25.
 
 Can you please update the gmane URLs to corresponding
 https://gcc.gnu.org/ml/gcc-patches/ URLs?  We have our own mailing list
 archives, no need to use third party ones.

It's convenient for me (Message-IDs falls out of my mailer automatically,
and Gmane happens to support retrieving message by Message-ID), and the
sourceware mailing list archives software doesn't interlink articles
between different -MM, which I find rather limiting.


  OK to check in the following to trunk?

Committed to trunk in r220783.


  --- gcc/config/nvptx/nvptx.opt
  +++ gcc/config/nvptx/nvptx.opt
  @@ -17,13 +17,13 @@
   ; along with GCC; see the file COPYING3.  If not see
   ; http://www.gnu.org/licenses/.
   
  -m64
  -Target Report RejectNegative Mask(ABI64)
  -Generate code for a 64 bit ABI
  -
   m32
   Target Report RejectNegative InverseMask(ABI64)
  -Generate code for a 32 bit ABI
  +Generate code for a 32-bit ABI
  +
  +m64
  +Target Report RejectNegative Mask(ABI64)
  +Generate code for a 64-bit ABI
 
 I'd expect you want also Negative(m64) on the m32 option and
 Negative(m32) on the m64 option.
 
  +@table @gcctabopt
  +
  +@item -m32
  +@itemx -m64
  +@opindex m32
  +@opindex m64
  +Generate code for 32-bit or 64-bit ABI.
 
 I guess you should mention which one of those is the default (if it isn't
 configure time configurable).

Have taken a note to look into these, later.


 What about multilibs, is newlib built for both -m32 and -m64, or just the
 default option?

So far, we have concentrated only on the 64-bit x86_64 configuration;
32-bit has several known issues to be resolved.
https://gcc.gnu.org/PR65099 filed.


Grüße,
 Thomas


signature.asc
Description: PGP signature

Re: nvptx-tools and nvptx-newlib (was: The nvptx port [10/11+] Target files)

2015-02-18 Thread Jakub Jelinek

On Wed, Feb 18, 2015 at 09:50:15AM +0100, Thomas Schwinge wrote:
  What about multilibs, is newlib built for both -m32 and -m64, or just the
  default option?
 
 So far, we have concentrated only on the 64-bit x86_64 configuration;
 32-bit has several known issues to be resolved.
 https://gcc.gnu.org/PR65099 filed.

I meant 64-bit and 32-bit PTX.

Jakub

nvptx-none: Define empty GOMP_SELF_SPECS (was: The nvptx port [0/11+])

2015-02-17 Thread Thomas Schwinge

Hi!

On Mon, 20 Oct 2014 16:17:56 +0200, Bernd Schmidt ber...@codesourcery.com 
wrote:
 This is a patch kit that adds the nvptx port to gcc.

I wonder why we haven't been seeing this in our internal development
branch -- maybe because on that branch we're still discarding more
compiler options in the offloading path?

Committed to trunk in r220780:

commit 2fdc66a9fcfbc5b77c1c03d7c34893a0a086e8f8
Author: tschwinge tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4
Date:   Wed Feb 18 07:45:42 2015 +

nvptx-none: Define empty GOMP_SELF_SPECS.

Otherwise, offloading with -fopenacc or -fopenmp active will run into:

x86_64-unknown-linux-gnu-accel-nvptx-none-gcc: error: unrecognized 
command line option '-pthread'

gcc/
* config/nvptx/nvptx.h (GOMP_SELF_SPECS): Define macro.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@220780 
138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog|4 
 gcc/config/nvptx/nvptx.h |4 
 2 files changed, 8 insertions(+)

diff --git gcc/ChangeLog gcc/ChangeLog
index 2c75df6..180a605 100644
--- gcc/ChangeLog
+++ gcc/ChangeLog
@@ -1,3 +1,7 @@
+2015-02-18  Thomas Schwinge  tho...@codesourcery.com
+
+   * config/nvptx/nvptx.h (GOMP_SELF_SPECS): Define macro.
+
 2015-02-18  Andrew Pinski  apin...@cavium.com
Naveen H.S  naveen.hurugalaw...@caviumnetworks.com
 
diff --git gcc/config/nvptx/nvptx.h gcc/config/nvptx/nvptx.h
index 9a9954b..e74d16f 100644
--- gcc/config/nvptx/nvptx.h
+++ gcc/config/nvptx/nvptx.h
@@ -33,6 +33,10 @@
   builtin_define (__nvptx__);\
 } while (0)
 
+/* Avoid the default in ../../gcc.c, which adds -pthread, which is not
+   supported for nvptx.  */
+#define GOMP_SELF_SPECS 
+
 /* Storage Layout.  */
 
 #define BITS_BIG_ENDIAN 0


Grüße,
 Thomas


signature.asc
Description: PGP signature

Re: nvptx-tools and nvptx-newlib (was: The nvptx port [10/11+] Target files)

2015-02-04 Thread Jakub Jelinek

On Mon, Feb 02, 2015 at 04:32:34PM +0100, Thomas Schwinge wrote:
 Hi!
 
 On Tue, 23 Dec 2014 19:49:35 +0100, I wrote:
  On Mon, 10 Nov 2014 17:19:57 +0100, Bernd Schmidt ber...@codesourcery.com 
  wrote:
   The scripts (11/11) I've put up on github, along with a hacked up 
   newlib. These are at [...]
 
   They are likely to migrate to MentorEmbedded from bernds, but that had 
   some permissions problems last week.
  
  That has recently been done:
  https://github.com/MentorEmbedded/nvptx-tools and
  https://github.com/MentorEmbedded/nvptx-newlib are now available.
  
  (I'm aware that we still are to write up how to actually build and test
  all this.)
 
 I just updated
 https://gcc.gnu.org/wiki/Offloading?action=diffrev2=26rev1=25.

Can you please update the gmane URLs to corresponding
https://gcc.gnu.org/ml/gcc-patches/ URLs?  We have our own mailing list
archives, no need to use third party ones.
 
 OK to check in the following to trunk?

 --- gcc/config/nvptx/nvptx.opt
 +++ gcc/config/nvptx/nvptx.opt
 @@ -17,13 +17,13 @@
  ; along with GCC; see the file COPYING3.  If not see
  ; http://www.gnu.org/licenses/.
  
 -m64
 -Target Report RejectNegative Mask(ABI64)
 -Generate code for a 64 bit ABI
 -
  m32
  Target Report RejectNegative InverseMask(ABI64)
 -Generate code for a 32 bit ABI
 +Generate code for a 32-bit ABI
 +
 +m64
 +Target Report RejectNegative Mask(ABI64)
 +Generate code for a 64-bit ABI

I'd expect you want also Negative(m64) on the m32 option and
Negative(m32) on the m64 option.

 +@table @gcctabopt
 +
 +@item -m32
 +@itemx -m64
 +@opindex m32
 +@opindex m64
 +Generate code for 32-bit or 64-bit ABI.

I guess you should mention which one of those is the default (if it isn't
configure time configurable).

What about multilibs, is newlib built for both -m32 and -m64, or just the
default option?

Jakub

Re: nvptx-tools and nvptx-newlib (was: The nvptx port [10/11+] Target files)

2015-02-02 Thread Thomas Schwinge

Hi!

On Tue, 23 Dec 2014 19:49:35 +0100, I wrote:
 On Mon, 10 Nov 2014 17:19:57 +0100, Bernd Schmidt ber...@codesourcery.com 
 wrote:
  The scripts (11/11) I've put up on github, along with a hacked up 
  newlib. These are at [...]

  They are likely to migrate to MentorEmbedded from bernds, but that had 
  some permissions problems last week.
 
 That has recently been done:
 https://github.com/MentorEmbedded/nvptx-tools and
 https://github.com/MentorEmbedded/nvptx-newlib are now available.
 
 (I'm aware that we still are to write up how to actually build and test
 all this.)

I just updated
https://gcc.gnu.org/wiki/Offloading?action=diffrev2=26rev1=25.

OK to check in the following to trunk?

commit a0c73cb76d1f13642df7725d64bc618ee0909abc
Author: Thomas Schwinge tho...@codesourcery.com
Date:   Mon Feb 2 16:29:36 2015 +0100

Begin documenting the nvptx backend.

gcc/
* doc/install.texi (nvptx-*-none): New section.
* doc/invoke.texi (Nvidia PTX Options): Likewise.
* config/nvptx/nvptx.opt: Update.
---
 gcc/config/nvptx/nvptx.opt | 10 +-
 gcc/doc/install.texi   | 23 +++
 gcc/doc/invoke.texi| 26 ++
 3 files changed, 54 insertions(+), 5 deletions(-)

diff --git gcc/config/nvptx/nvptx.opt gcc/config/nvptx/nvptx.opt
index 1448dfc..249a61d 100644
--- gcc/config/nvptx/nvptx.opt
+++ gcc/config/nvptx/nvptx.opt
@@ -17,13 +17,13 @@
 ; along with GCC; see the file COPYING3.  If not see
 ; http://www.gnu.org/licenses/.
 
-m64
-Target Report RejectNegative Mask(ABI64)
-Generate code for a 64 bit ABI
-
 m32
 Target Report RejectNegative InverseMask(ABI64)
-Generate code for a 32 bit ABI
+Generate code for a 32-bit ABI
+
+m64
+Target Report RejectNegative Mask(ABI64)
+Generate code for a 64-bit ABI
 
 mmainkernel
 Target Report RejectNegative
diff --git gcc/doc/install.texi gcc/doc/install.texi
index c9e3bf1..b31f9b6 100644
--- gcc/doc/install.texi
+++ gcc/doc/install.texi
@@ -3302,6 +3302,8 @@ information have to.
 @item
 @uref{#nds32be-x-elf,,nds32be-*-elf}
 @item
+@uref{#nvptx-x-none,,nvptx-*-none}
+@item
 @uref{#powerpc-x-x,,powerpc*-*-*}
 @item
 @uref{#powerpc-x-darwin,,powerpc-*-darwin*}
@@ -4269,6 +4271,27 @@ Andes NDS32 target in big endian mode.
 @html
 hr /
 @end html
+@anchor{nvptx-x-none}
+@heading nvptx-*-none
+Nvidia PTX target.
+
+Instead of GNU binutils, you will need to install
+@uref{https://github.com/MentorEmbedded/nvptx-tools/,,nvptx-tools}.
+Tell GCC where to find it:
+@option{--with-build-time-tools=[install-nvptx-tools]/nvptx-none/bin}.
+
+A nvptx port of newlib is available at
+@uref{https://github.com/MentorEmbedded/nvptx-newlib/,,nvptx-newlib}.
+It can be automatically built together with GCC@.  For this, add a
+symbolic link to nvptx-newlib's @file{newlib} directory to the
+directory containing the GCC sources.
+
+Use the @option{--disable-sjlj-exceptions} and
+@option{--enable-newlib-io-long-long} options when configuring.
+
+@html
+hr /
+@end html
 @anchor{powerpc-x-x}
 @heading powerpc-*-*
 You can specify a default version for the @option{-mcpu=@var{cpu_type}}
diff --git gcc/doc/invoke.texi gcc/doc/invoke.texi
index ba81ec7..1fb329e 100644
--- gcc/doc/invoke.texi
+++ gcc/doc/invoke.texi
@@ -840,6 +840,9 @@ Objective-C and Objective-C++ Dialects}.
 -mcustom-fpu-cfg=@var{name} @gol
 -mhal -msmallc -msys-crt0=@var{name} -msys-lib=@var{name}}
 
+@emph{Nvidia PTX Options}
+@gccoptlist{-m32 -m64 -mmainkernel}
+
 @emph{PDP-11 Options}
 @gccoptlist{-mfpu  -msoft-float  -mac0  -mno-ac0  -m40  -m45  -m10 @gol
 -mbcopy  -mbcopy-builtin  -mint32  -mno-int16 @gol
@@ -11967,6 +11970,7 @@ platform.
 * MSP430 Options::
 * NDS32 Options::
 * Nios II Options::
+* Nvidia PTX Options::
 * PDP-11 Options::
 * picoChip Options::
 * PowerPC Options::
@@ -18277,6 +18281,28 @@ This option is typically used to link with a library 
provided by a HAL BSP.
 
 @end table
 
+@node Nvidia PTX Options
+@subsection Nvidia PTX Options
+@cindex Nvidia PTX options
+@cindex nvptx options
+
+These options are defined for Nvidia PTX:
+
+@table @gcctabopt
+
+@item -m32
+@itemx -m64
+@opindex m32
+@opindex m64
+Generate code for 32-bit or 64-bit ABI.
+
+@item -mmainkernel
+@opindex mmainkernel
+Link in code for a __main kernel.  This is for stand-alone instead of
+offloading execution.
+
+@end table
+
 @node PDP-11 Options
 @subsection PDP-11 Options
 @cindex PDP-11 Options


Grüße,
 Thomas


pgp0CHeeOXpKu.pgp
Description: PGP signature

nvptx-tools and nvptx-newlib (was: The nvptx port [10/11+] Target files)

2014-12-23 Thread Thomas Schwinge

Hi!

On Mon, 10 Nov 2014 17:19:57 +0100, Bernd Schmidt ber...@codesourcery.com 
wrote:
 The scripts (11/11) I've put up on github, along with a hacked up 
 newlib. These are at
 
 https://github.com/bernds/nvptx-tools
 https://github.com/bernds/nvptx-newlib
 
 They are likely to migrate to MentorEmbedded from bernds, but that had 
 some permissions problems last week.

That has recently been done:
https://github.com/MentorEmbedded/nvptx-tools and
https://github.com/MentorEmbedded/nvptx-newlib are now available.

(I'm aware that we still are to write up how to actually build and test
all this.)


Grüße,
 Thomas


signature.asc
Description: PGP signature

Re: The nvptx port [10/11+] Target files

2014-12-12 Thread Thomas Schwinge

Hi!

On Mon, 10 Nov 2014 17:19:57 +0100, Bernd Schmidt ber...@codesourcery.com 
wrote:
 I've now committed it, in the following form.

 --- /dev/null
 +++ b/gcc/config/nvptx/nvptx.h
 @@ -0,0 +1,356 @@

 +#define ASM_OUTPUT_ALIGN(FILE, POWER)

Committed to trunk in r218689:

commit 61f8a1bd770ded96fcff88f3cbc426a23c413992
Author: tschwinge tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4
Date:   Fri Dec 12 20:14:10 2014 +

nvptx: Define valid ASM_OUTPUT_ALIGN.

gcc/
* config/nvptx/nvptx.h (ASM_OUTPUT_ALIGN): Define as a C statment.

gcc/doc/tm.texi:@defmac ASM_OUTPUT_ALIGN (@var{stream}, @var{power})
gcc/doc/tm.texi-A C statement to output to the stdio stream 
@var{stream} an assembler
gcc/doc/tm.texi-command to advance the location counter to a multiple 
of 2 to the
gcc/doc/tm.texi-@var{power} bytes.  @var{power} will be a C expression 
of type @code{int}.
gcc/doc/tm.texi-@end defmac

gcc/config/nvptx/nvptx.h:#define ASM_OUTPUT_ALIGN(FILE, POWER)

Empty is not a C statement, and so in code such as:

gcc/dwarf2out.c-  if (lsda_encoding == DW_EH_PE_aligned)
gcc/dwarf2out.c:ASM_OUTPUT_ALIGN (asm_out_file, 
floor_log2 (PTR_SIZE));
gcc/dwarf2out.c-  dw2_asm_output_data 
(size_of_encoded_value (lsda_encoding), 0,
gcc/dwarf2out.c-   Language Specific 
Data Area (none));

gcc/varasm.c-  if (align  BITS_PER_UNIT)
gcc/varasm.c:ASM_OUTPUT_ALIGN (asm_out_file, floor_log2 (align 
/ BITS_PER_UNIT));
gcc/varasm.c-  assemble_variable_contents (decl, name, 
dont_output_data);

gcc/varasm.c-  if (align  0)
gcc/varasm.c:ASM_OUTPUT_ALIGN (asm_out_file, align);
gcc/varasm.c-
gcc/varasm.c-  targetm.asm_out.internal_label (asm_out_file, LTRAMP, 
0);

gcc/varasm.c-  if (align  BITS_PER_UNIT)
gcc/varasm.c:ASM_OUTPUT_ALIGN (asm_out_file, floor_log2 (align 
/ BITS_PER_UNIT));
gcc/varasm.c-  assemble_constant_contents (exp, XSTR (symbol, 0), 
align);

..., GCC warns:

[...]/source-gcc/gcc/dwarf2out.c: In function 'void 
output_fde(dw_fde_ref, bool, bool, char*, int, char*, bool, int)':
[...]/source-gcc/gcc/dwarf2out.c:665:3: warning: suggest braces around 
empty body in an 'if' statement [-Wempty-body]
   ASM_OUTPUT_ALIGN (asm_out_file, floor_log2 (PTR_SIZE));
   ^

[...]/source-gcc/gcc/varasm.c: In function 'void 
assemble_variable(tree, int, int, int)':
[...]/source-gcc/gcc/varasm.c:2217:2: warning: suggest braces around 
empty body in an 'if' statement [-Wempty-body]
  ASM_OUTPUT_ALIGN (asm_out_file, floor_log2 (align / BITS_PER_UNIT));
  ^
[...]/source-gcc/gcc/varasm.c: In function 'rtx_def* 
assemble_trampoline_template()':
[...]/source-gcc/gcc/varasm.c:2603:5: warning: suggest braces around 
empty body in an 'if' statement [-Wempty-body]
 ASM_OUTPUT_ALIGN (asm_out_file, align);
 ^
[...]/source-gcc/gcc/varasm.c: In function 'void 
output_constant_def_contents(rtx)':
[...]/source-gcc/gcc/varasm.c:3413:2: warning: suggest braces around 
empty body in an 'if' statement [-Wempty-body]
  ASM_OUTPUT_ALIGN (asm_out_file, floor_log2 (align / BITS_PER_UNIT));
  ^

Also, use the values, to get rid of that one:

[...]/source-gcc/gcc/final.c: In function 'rtx_insn* 
final_scan_insn(rtx_insn*, FILE*, int, int, int*)':
[...]/source-gcc/gcc/final.c:2450:12: warning: variable 'log_align' set 
but not used [-Wunused-but-set-variable]
int log_align;
^

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@218689 
138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog|  4 
 gcc/config/nvptx/nvptx.h | 10 +-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git gcc/ChangeLog gcc/ChangeLog
index 689c4fd..e5de2c6 100644
--- gcc/ChangeLog
+++ gcc/ChangeLog
@@ -1,3 +1,7 @@
+2014-12-12  Thomas Schwinge  tho...@codesourcery.com
+
+   * config/nvptx/nvptx.h (ASM_OUTPUT_ALIGN): Define as a C statment.
+
 2014-12-12  Vladimir Makarov  vmaka...@redhat.com
 
PR target/64110
diff --git gcc/config/nvptx/nvptx.h gcc/config/nvptx/nvptx.h
index c222375..5f08ba7 100644
--- gcc/config/nvptx/nvptx.h
+++ gcc/config/nvptx/nvptx.h
@@ -281,9 +281,17 @@ struct GTY(()) machine_function
 }  \
   while (0)
 
-#define ASM_OUTPUT_ALIGN(FILE, POWER)
+#define ASM_OUTPUT_ALIGN(FILE, POWER)  \
+  do   \
+{  \
+  (void) (FILE);   \
+  (void) (POWER);  \
+}

Re: The nvptx port

2014-11-17 Thread Nathan Sidwell


On 11/14/14 11:04, Jeff Law wrote:

On 11/14/14 05:36, Jakub Jelinek wrote:


So, for a warp, if some threads perform one branch of an if and other
threads another one, all threads perform the first one first (with some
maybe not doing anything), then all the threads the others (again, other
threads not doing anything)?



Nobody ever specified exactly what happens in this case to me, but I gathered
from reading the docs that once you have some threads in one path and others in
a different path, things slow down to a horrid crawl.  So you try to avoid that 
:-)


this is correct.  Don't do that.

Re: The nvptx port

2014-11-17 Thread Nathan Sidwell


On 11/14/14 10:43, Jeff Law wrote:

On 11/14/14 04:09, Bernd Schmidt wrote:

Hi Jakub,


I have some questions about nvptx:
1) you've said that alloca isn't supported, but it seems



Yes, it's unimplemented. There's an internal declaration for it but that
seems to be as far as it goes, and that declaration is 32-bit only anyway.

Right.  My recollection is it's defined in the vISA, but unimplemented.


yup, all PTX docs I've seen (which is up to 3.2) say:
'Note: The current version of PTX does not support alloca.'

and as Bernd says, the associated text only talks about a declaration for 32-bit 
land.


nathan

The nvptx port

2014-11-14 Thread Jakub Jelinek

Hi!

I have some questions about nvptx:
1) you've said that alloca isn't supported, but it seems
   to be wired up and uses the %alloca documented in the PTX
   manual, what is the issue with that?  %alloca not being actually
   implemented by the current PTX assembler or translator?  Or
   some local vs. global address space issues?  If the latter,
   could at least VLAs be supported?
2) what is the reason why TLS isn't supported by the port (well,
   __emutls is emitted, but I doubt pthread_[gs]etspecific is
   implementable and thus it will not really do anything.
   Can't the port just emit all DECL_THREAD_LOCAL_P variables
   into .local instead of .global address space?  Would one
   need to convert those pointers to generic any way?
   I'm asking because e.g. libgomp uses __thread heavily and
   it would be nice to be able to use that.
3) in assembly emitted by the nvptx port, I've noticed:
.visible .func (.param.u32 %out_retval)foo(.param.u64 %in_ar1, .param.u32 
%in_ar2)
{
.reg.u64 %ar1;
.reg.u32 %ar2;
.reg.u32 %retval;
.reg.u64 %hr10;
.reg.u32 %r22;
.reg.u64 %r25;
   is the missing \t before the %retval line intentional?
4) I had a brief look at what it would take to port libgomp to PTX,
   which is needed for OpenMP offloading.  OpenMP offloaded kernels
   should start with 1 team and 1 thread in it, if we ignore
   GOMP_teams for now, I think the major things are:
   - right now libgomp is heavily pthread_* based, which is a no-go
 for nvptx I assume, I think we'll need some ifdefs in the sources
   - the main thing is that I believe we just have to replace
 gomp_team_start for nvptx; seems there are
 cudaLaunchDevice (and cudaGetParameterBuffer) functions one can use
 to spawn selected kernel in selected number of threads (and teams),
 from the docs it isn't exactly clear what the calling thread will do,
 if it is suspended and the HW core given to it is reused by something
 else (e.g. one of the newly spawned threads), then I think it should
 be usable.  Not sure what happens with .local memory of the parent
 task, if the children all have different .local memory, then
 perhaps one could just copy over what is needed from the
 invoking to the first invoked thread at start.  The question is
 how to figure out what to pass to cudeLaunchDevice (e.g. how to
 get handle of the current stream), and how to query how many
 teams and/or threads it is reasonable to ask for if the program
 wants defaults (and how many teams/threads are hard limits beyond which
 one can't go)
   - is it worth to reuse cudaLaunchDevice threads or are they cheap
 enough to start that any thread pooling should be removed for nvptx?
   - we'll need some synchronization primitives, I see atomic support is
 there, we need mutexes and semaphores I think, is that implementable
 using bar instruction?
   - the library uses __attribute__((constructor)) in 3 places or so,
 initialize_team is pthread specific and can be probably ifdefed out,
 we won't support dlclose in nvptx anyway, but at least we need some
 way to initialize the nvptx libgomp; if the initialization is done
 in global memory, would it persist in between different kernels,
 so can the initialization as separate kernel be run once, something
 else?
   - is there any way to do any affinity management, or shall we just
 ignore affinity strategies?
   - the target/offloading stuff should be most likely stubbed in the
 library for nvptx, target data/target regions inside of target
 regions are undefined behavior in OpenMP, no need to bloat things
   - any way how to query time?
   Other thoughts?

Jakub

Re: The nvptx port

2014-11-14 Thread Jakub Jelinek

On Fri, Nov 14, 2014 at 09:29:48AM +0100, Jakub Jelinek wrote:
 I have some questions about nvptx:

Oh, and
5) I have noticed gcc doesn't generate the .uni suffixes anywhere,
   while llvm generates them; are those appropriate only when a function
   is guaranteed to be run unconditionally from the toplevel kernel,
   or even in spots in arbitrary functions which might not be run
   unconditionally by all threads in thread block, but all threads
   that encounter the particular function will run the specific spot
   unconditionally?  I mean, if we have arbitrary function:
void foo (void) { something; bar (); something; }
   then the call is unconditional in there, but there is no guarantee
   somebody will not do
void baz (int x) { if (x  20) foo (); }
   and run foo only in a subset of the threads.

Jakub

Re: The nvptx port

2014-11-14 Thread Bernd Schmidt


Hi Jakub,


I have some questions about nvptx:
1) you've said that alloca isn't supported, but it seems
to be wired up and uses the %alloca documented in the PTX
manual, what is the issue with that?  %alloca not being actually
implemented by the current PTX assembler or translator?


Yes, it's unimplemented. There's an internal declaration for it but that 
seems to be as far as it goes, and that declaration is 32-bit only anyway.



2) what is the reason why TLS isn't supported by the port (well,
__emutls is emitted, but I doubt pthread_[gs]etspecific is
implementable and thus it will not really do anything.
Can't the port just emit all DECL_THREAD_LOCAL_P variables
into .local instead of .global address space?


.local is stack frame memory, not TLS. The ptx docs mention the use of 
.local at file-scope as occurring only in legacy ptx code and I get 
the impression it's discouraged.


(As an aside, there's a question of how to represent a different 
concept, gang-local memory, in gcc. That would be .shared memory. We're 
currently going with just using an internal attribute)



3) in assembly emitted by the nvptx port, I've noticed:
.visible .func (.param.u32 %out_retval)foo(.param.u64 %in_ar1, .param.u32 
%in_ar2)
{
.reg.u64 %ar1;
.reg.u32 %ar2;
.reg.u32 %retval;
.reg.u64 %hr10;
.reg.u32 %r22;
.reg.u64 %r25;
is the missing \t before the %retval line intentional?


No, I can fix that up.


4) I had a brief look at what it would take to port libgomp to PTX,
which is needed for OpenMP offloading.  OpenMP offloaded kernels
should start with 1 team and 1 thread in it, if we ignore
GOMP_teams for now, I think the major things are:
- right now libgomp is heavily pthread_* based, which is a no-go
  for nvptx I assume, I think we'll need some ifdefs in the sources


I haven't looked into whether libpthread is doable. I suspect it's a 
poor match. I also haven't really looked into OpenMP, so I'm feeling a 
bit uncertain about answering your further questions.



- the main thing is that I believe we just have to replace
  gomp_team_start for nvptx; seems there are
  cudaLaunchDevice (and cudaGetParameterBuffer) functions one can use
  to spawn selected kernel in selected number of threads (and teams),
  from the docs it isn't exactly clear what the calling thread will do,
  if it is suspended and the HW core given to it is reused by something
  else (e.g. one of the newly spawned threads), then I think it should
  be usable.  Not sure what happens with .local memory of the parent
  task, if the children all have different .local memory, then
  perhaps one could just copy over what is needed from the
  invoking to the first invoked thread at start.


I'm a bit confused here, it sounds as if you want to call 
cudaLaunchDevice from ptx code? These are called from the host. As 
mentioned above, .local is probably not useful for what you want.



- is it worth to reuse cudaLaunchDevice threads or are they cheap
  enough to start that any thread pooling should be removed for nvptx?


Sorry, I don't understand the question.


- we'll need some synchronization primitives, I see atomic support is
  there, we need mutexes and semaphores I think, is that implementable
  using bar instruction?


It's probably membar you need.


- the library uses __attribute__((constructor)) in 3 places or so,
  initialize_team is pthread specific and can be probably ifdefed out,
  we won't support dlclose in nvptx anyway, but at least we need some
  way to initialize the nvptx libgomp; if the initialization is done
  in global memory, would it persist in between different kernels,
  so can the initialization as separate kernel be run once, something
  else?


I think that it would persist, and this would be my scheme for 
implementing constructors, but I haven't actually tried.



- is there any way to do any affinity management, or shall we just
  ignore affinity strategies?


Not sure what they do in libgomp. It's probably not a match for GPU 
architectures.



- any way how to query time?


There are %clock and %clock64 cycle counters.


Bernd

Re: The nvptx port

2014-11-14 Thread Bernd Schmidt


On 11/14/2014 11:01 AM, Jakub Jelinek wrote:

On Fri, Nov 14, 2014 at 09:29:48AM +0100, Jakub Jelinek wrote:

I have some questions about nvptx:


Oh, and
5) I have noticed gcc doesn't generate the .uni suffixes anywhere,
while llvm generates them; are those appropriate only when a function
is guaranteed to be run unconditionally from the toplevel kernel,
or even in spots in arbitrary functions which might not be run
unconditionally by all threads in thread block, but all threads
that encounter the particular function will run the specific spot
unconditionally?  I mean, if we have arbitrary function:
void foo (void) { something; bar (); something; }
then the call is unconditional in there, but there is no guarantee
somebody will not do
void baz (int x) { if (x  20) foo (); }
and run foo only in a subset of the threads.


It's unclear to me what the .uni suffix even does on calls. Google finds 
this:


  http://divmap.wordpress.com/home/divopt/

which suggests that it says that the call's predicate will evaluate to 
the same value on all threads. So I think for an unconditional call 
instruction it's just meaningless.



Bernd

Re: The nvptx port

2014-11-14 Thread Jakub Jelinek

On Fri, Nov 14, 2014 at 12:09:03PM +0100, Bernd Schmidt wrote:
 I have some questions about nvptx:
 1) you've said that alloca isn't supported, but it seems
 to be wired up and uses the %alloca documented in the PTX
 manual, what is the issue with that?  %alloca not being actually
 implemented by the current PTX assembler or translator?
 
 Yes, it's unimplemented. There's an internal declaration for it but that
 seems to be as far as it goes, and that declaration is 32-bit only anyway.

:(.  Does NVidia plan to fix that in next version?

 2) what is the reason why TLS isn't supported by the port (well,
 __emutls is emitted, but I doubt pthread_[gs]etspecific is
 implementable and thus it will not really do anything.
 Can't the port just emit all DECL_THREAD_LOCAL_P variables
 into .local instead of .global address space?
 
 .local is stack frame memory, not TLS. The ptx docs mention the use of
 .local at file-scope as occurring only in legacy ptx code and I get the
 impression it's discouraged.

:(.  So what other option one has to implement something like TLS, even
using inline asm or similar?  There is %tid, so perhaps indexing some array
with %tid?  The trouble with that is that some thread can do
#pragma omp parallel again, and I bet the %tid afterwards would be
again 0-(n-1), and if it is an index into a global array, it wouldn't work
well then.  Maybe without anything like TLS we can't really support nested
parallelism, only one level of #pragma omp parallel inside of nvptx regions.
But, if we add support for #pragma omp team, we'd either need the array
in gang-local memory, or some other special register to give us gang id.

BTW, one can still invoke OpenMP target regions (even OpenACC regions) from
multiple host threads, so the question is how without local TLS we can
actually do anything at all.  Sure, we can pass parameters to the kernel,
but we'd need to propagate it through all functions.  Or can
cudaGetParameterBuffer be used for that?

 4) I had a brief look at what it would take to port libgomp to PTX,
 which is needed for OpenMP offloading.  OpenMP offloaded kernels
 should start with 1 team and 1 thread in it, if we ignore
 GOMP_teams for now, I think the major things are:
 - right now libgomp is heavily pthread_* based, which is a no-go
   for nvptx I assume, I think we'll need some ifdefs in the sources
 
 I haven't looked into whether libpthread is doable. I suspect it's a poor
 match. I also haven't really looked into OpenMP, so I'm feeling a bit
 uncertain about answering your further questions.

What OpenMP needs is essentially:
- some way to spawn multiple threads (fork-join model), where the parent
  thread is the first one among those other threads, or, if that isn't
  possible, the first thread pretends to be the same as the first thread
  and the parent thread sleeps
- something like pthread_mutex_lock/unlock (only basic; or say atomic ops + 
futex
  we use for Linux)
- something like sem_* semaphore
- and some TLS or something similar (pthread_[gs]etspecific etc.)

 - the main thing is that I believe we just have to replace
   gomp_team_start for nvptx; seems there are
   cudaLaunchDevice (and cudaGetParameterBuffer) functions one can use
   to spawn selected kernel in selected number of threads (and teams),
   from the docs it isn't exactly clear what the calling thread will do,
   if it is suspended and the HW core given to it is reused by something
   else (e.g. one of the newly spawned threads), then I think it should
   be usable.  Not sure what happens with .local memory of the parent
   task, if the children all have different .local memory, then
   perhaps one could just copy over what is needed from the
   invoking to the first invoked thread at start.
 
 I'm a bit confused here, it sounds as if you want to call cudaLaunchDevice
 from ptx code? These are called from the host. As mentioned above, .local is
 probably not useful for what you want.

In CUDA_Dynamic_Parallelism_Programming_Guide.pdf in C.3.2 it is mentioned
it should be possible, there is:
.extern .func(.param .b32 func_retval0) cudaLaunchDevice
(
.param .b64 func,
.param .b64 parameterBuffer,
.param .align 4 .b8 gridDimension[12],
.param .align 4 .b8 blockDimension[12],
.param .b32 sharedMemSize,
.param .b64 stream
)
;
(or s/.b64/.b32/ for -m32) that should be usable from within PTX.
The Liao-OpenMP-Accelerator-Model-2013.pdf paper also mentions using dynamic
parallelism (because all other variants are just bad for OpenMP, you'd need
to preallocate all the gangs/threads (without knowing how many you'll need),
and perhaps let them sleep on some barrier until you have work for them.

 - is it worth to reuse cudaLaunchDevice threads or are they cheap
   enough to start that any thread pooling should be removed for nvptx?
 
 Sorry, I don't understand the question.

I meant what is the cost of cudaLaunchDevice

Re: The nvptx port

2014-11-14 Thread Bernd Schmidt

I'm adding Thomas and Cesar to the Cc list, they may have more insight 
into CUDA library questions as I haven't really looked into that part 
all that much.


On 11/14/2014 12:39 PM, Jakub Jelinek wrote:

On Fri, Nov 14, 2014 at 12:09:03PM +0100, Bernd Schmidt wrote:

I have some questions about nvptx:
1) you've said that alloca isn't supported, but it seems
to be wired up and uses the %alloca documented in the PTX
manual, what is the issue with that?  %alloca not being actually
implemented by the current PTX assembler or translator?


Yes, it's unimplemented. There's an internal declaration for it but that
seems to be as far as it goes, and that declaration is 32-bit only anyway.


:(.  Does NVidia plan to fix that in next version?


I very much doubt it. It was like this in CUDA 5.0 when we started 
working on it, and it's still like this in CUDA 6.5.



2) what is the reason why TLS isn't supported by the port (well,
__emutls is emitted, but I doubt pthread_[gs]etspecific is
implementable and thus it will not really do anything.
Can't the port just emit all DECL_THREAD_LOCAL_P variables
into .local instead of .global address space?


.local is stack frame memory, not TLS. The ptx docs mention the use of
.local at file-scope as occurring only in legacy ptx code and I get the
impression it's discouraged.


:(.  So what other option one has to implement something like TLS, even
using inline asm or similar?  There is %tid, so perhaps indexing some array
with %tid?


That ought to work. For performance you'd want that array in .shared 
memory but I believe that's limited in size.



BTW, one can still invoke OpenMP target regions (even OpenACC regions) from
multiple host threads, so the question is how without local TLS we can
actually do anything at all.  Sure, we can pass parameters to the kernel,
but we'd need to propagate it through all functions.  Or can
cudaGetParameterBuffer be used for that?


Presumably a kernel could copy its arguments out to memory somewhere 
when it's called?



4) I had a brief look at what it would take to port libgomp to PTX,
which is needed for OpenMP offloading.  OpenMP offloaded kernels
should start with 1 team and 1 thread in it, if we ignore
GOMP_teams for now, I think the major things are:
- right now libgomp is heavily pthread_* based, which is a no-go
  for nvptx I assume, I think we'll need some ifdefs in the sources


I haven't looked into whether libpthread is doable. I suspect it's a poor
match. I also haven't really looked into OpenMP, so I'm feeling a bit
uncertain about answering your further questions.


What OpenMP needs is essentially:
- some way to spawn multiple threads (fork-join model), where the parent
   thread is the first one among those other threads, or, if that isn't
   possible, the first thread pretends to be the same as the first thread
   and the parent thread sleeps
- something like pthread_mutex_lock/unlock (only basic; or say atomic ops + 
futex
   we use for Linux)
- something like sem_* semaphore
- and some TLS or something similar (pthread_[gs]etspecific etc.)


- the main thing is that I believe we just have to replace
  gomp_team_start for nvptx; seems there are
  cudaLaunchDevice (and cudaGetParameterBuffer) functions one can use
  to spawn selected kernel in selected number of threads (and teams),
  from the docs it isn't exactly clear what the calling thread will do,
  if it is suspended and the HW core given to it is reused by something
  else (e.g. one of the newly spawned threads), then I think it should
  be usable.  Not sure what happens with .local memory of the parent
  task, if the children all have different .local memory, then
  perhaps one could just copy over what is needed from the
  invoking to the first invoked thread at start.


I'm a bit confused here, it sounds as if you want to call cudaLaunchDevice
from ptx code? These are called from the host. As mentioned above, .local is
probably not useful for what you want.


In CUDA_Dynamic_Parallelism_Programming_Guide.pdf in C.3.2 it is mentioned
it should be possible, there is:
.extern .func(.param .b32 func_retval0) cudaLaunchDevice
(
.param .b64 func,
.param .b64 parameterBuffer,
.param .align 4 .b8 gridDimension[12],
.param .align 4 .b8 blockDimension[12],
.param .b32 sharedMemSize,
.param .b64 stream
)
;
(or s/.b64/.b32/ for -m32) that should be usable from within PTX.
The Liao-OpenMP-Accelerator-Model-2013.pdf paper also mentions using dynamic
parallelism (because all other variants are just bad for OpenMP, you'd need
to preallocate all the gangs/threads (without knowing how many you'll need),
and perhaps let them sleep on some barrier until you have work for them.


The latter would have been essentially the model I'd have tried to use 
(instead of sleeping, conditionalize on %tid==0). I didn't know there 
was a way to launch kernels from ptx code and haven't thought about

Re: The nvptx port

2014-11-14 Thread Jakub Jelinek

On Fri, Nov 14, 2014 at 01:12:40PM +0100, Bernd Schmidt wrote:
 :(.  So what other option one has to implement something like TLS, even
 using inline asm or similar?  There is %tid, so perhaps indexing some array
 with %tid?
 
 That ought to work. For performance you'd want that array in .shared memory
 but I believe that's limited in size.

Any way to query those limits?  Size of .shared memory, number of threads in
warp, number of warps, etc.?  In OpenACC, are all workers in a single gang
the same warp?

 BTW, one can still invoke OpenMP target regions (even OpenACC regions) from
 multiple host threads, so the question is how without local TLS we can
 actually do anything at all.  Sure, we can pass parameters to the kernel,
 but we'd need to propagate it through all functions.  Or can
 cudaGetParameterBuffer be used for that?
 
 Presumably a kernel could copy its arguments out to memory somewhere when
 it's called?

The question is where.  If it is global memory, then how would you find out
what value is for your team and what value is for some other team?

 - we'll need some synchronization primitives, I see atomic support is
   there, we need mutexes and semaphores I think, is that implementable
   using bar instruction?
 
 It's probably membar you need.
 
 That is a memory barrier, I need threads to wait on each other, wake up one
 another etc.
 
 Hmm. It's worthwhile to keep in mind that GPU threads really behave somewhat
 differently from CPUs (they don't really execute independently); the OMP
 model may just be a poor match for the architecture in general.
 One could busywait on a spinlock, but AFAIK there isn't really a way to put
 a thread to sleep. By not executing independently, I mean this: I believe if
 one thread in a warp is waiting on the spinlock, all the other ones are also
 busywaiting. There may be other effects that seem odd if one approaches it
 from a CPU perspective - for example you probably want only one thread in a
 warp to try to take the spinlock.

So, for a warp, if some threads perform one branch of an if and other
threads another one, all threads perform the first one first (with some
maybe not doing anything), then all the threads the others (again, other
threads not doing anything)?

As for the match, OpenMP isn't written for a particular accelerator, though
supposedly the addition of #pragma omp teams construct was done for NVidia.
So, some OpenMP code may be efficient on PTX, while other code might not be
that much (e.g. if all threads in a warp need to execute the same thing,
supposedly #pragma omp task isn't very good idea for the devices).

Jakub

Re: The nvptx port

2014-11-14 Thread Bernd Schmidt


On 11/14/2014 01:36 PM, Jakub Jelinek wrote:

Any way to query those limits?  Size of .shared memory, number of threads in
warp, number of warps, etc.?


I'd have to google most of that. There seems to be a WARP_SZ constant 
available in ptx to get the size of the warp.



In OpenACC, are all workers in a single gang
the same warp?


No, warps are a relatively small size (32 threads).


So, for a warp, if some threads perform one branch of an if and other
threads another one, all threads perform the first one first (with some
maybe not doing anything), then all the threads the others (again, other
threads not doing anything)?


I believe that's what happens.


Bernd

Re: The nvptx port

2014-11-14 Thread Cesar Philippidis

On 11/14/2014 04:12 AM, Bernd Schmidt wrote:

 - we'll need some synchronization primitives, I see atomic
 support is
   there, we need mutexes and semaphores I think, is that
 implementable
   using bar instruction?

 It's probably membar you need.

 That is a memory barrier, I need threads to wait on each other, wake
 up one
 another etc.
 
 Hmm. It's worthwhile to keep in mind that GPU threads really behave
 somewhat differently from CPUs (they don't really execute
 independently); the OMP model may just be a poor match for the
 architecture in general.
 One could busywait on a spinlock, but AFAIK there isn't really a way to
 put a thread to sleep. By not executing independently, I mean this: I
 believe if one thread in a warp is waiting on the spinlock, all the
 other ones are also busywaiting. There may be other effects that seem
 odd if one approaches it from a CPU perspective - for example you
 probably want only one thread in a warp to try to take the spinlock.

Thread synchronization in CUDA is different from conventional CPUs.
Using the gang/thread terminology, there's no way to synchronize two
threads in two different gangs in PTX without invoking separate kernels.
Basically, after a kernel is invoked, the host/accelerator (the later
using dynamic parallelism) waits for the kernel to finish, and that
effectively creates a barrier.

PTX does have an intra-gang synchronization primitive, which is helpful
if the control flow diverges within a gang. Also, unless I'm mistaken,
the PTX atomic operations only work within a gang.

Also, keep in mind that PTX doesn't have a global TID. The user needs to
calculate it using ctaid/tid and friends.

Cesar

Re: The nvptx port

2014-11-14 Thread Jakub Jelinek

On Fri, Nov 14, 2014 at 07:37:49AM -0800, Cesar Philippidis wrote:
  Hmm. It's worthwhile to keep in mind that GPU threads really behave
  somewhat differently from CPUs (they don't really execute
  independently); the OMP model may just be a poor match for the
  architecture in general.
  One could busywait on a spinlock, but AFAIK there isn't really a way to
  put a thread to sleep. By not executing independently, I mean this: I
  believe if one thread in a warp is waiting on the spinlock, all the
  other ones are also busywaiting. There may be other effects that seem
  odd if one approaches it from a CPU perspective - for example you
  probably want only one thread in a warp to try to take the spinlock.
 
 Thread synchronization in CUDA is different from conventional CPUs.
 Using the gang/thread terminology, there's no way to synchronize two
 threads in two different gangs in PTX without invoking separate kernels.
 Basically, after a kernel is invoked, the host/accelerator (the later
 using dynamic parallelism) waits for the kernel to finish, and that
 effectively creates a barrier.

I believe in OpenMP terminology a gang is a team, and inter-teams barriers
are not supposed to work etc. (though, I think locks and atomic instructions
still are, so is critical region, so I really hope atomics are atomic even
inter-gang).  So for synchronization (mutexes and semaphores, from which
barriers are implemented; but perhaps could also use bar.arrive and bar.sync)
we mainly need synchronization within the gang.

 Also, keep in mind that PTX doesn't have a global TID. The user needs to
 calculate it using ctaid/tid and friends.

Ok.  Is %gridid needed for that combo too?

Jakub

Re: The nvptx port

2014-11-14 Thread Cesar Philippidis

On 11/14/2014 08:18 AM, Jakub Jelinek wrote:

 Also, keep in mind that PTX doesn't have a global TID. The user needs to
 calculate it using ctaid/tid and friends.
 
 Ok.  Is %gridid needed for that combo too?

Eventually, probably. Currently, we're launching all of our kernels with
cuLaunchKernel, and that function doesn't take grids into account.

Nvidia's documentation is kind of confusing. They use different
terminology for their high level CUDA stuff and the low level PTX. E.g.,
what CUDA refers to blocks/warps, PTX calls CTAs. I'm not sure what
grids corresponds to, but I think it might be devices. If that's the
case, the runtime does have the capability to select which device to run
a kernel on. But, it can't run a single kernel on multiple devices
unless you use asynchronous kernel invocations.

Cesar

Re: The nvptx port

2014-11-14 Thread Jeff Law


On 11/14/14 04:09, Bernd Schmidt wrote:

Hi Jakub,


I have some questions about nvptx:
1) you've said that alloca isn't supported, but it seems
to be wired up and uses the %alloca documented in the PTX
manual, what is the issue with that?  %alloca not being actually
implemented by the current PTX assembler or translator?


Yes, it's unimplemented. There's an internal declaration for it but that
seems to be as far as it goes, and that declaration is 32-bit only anyway.

Right.  My recollection is it's defined in the vISA, but unimplemented.

Jeff

Re: The nvptx port

2014-11-14 Thread Jeff Law


On 11/14/14 04:39, Jakub Jelinek wrote:

On Fri, Nov 14, 2014 at 12:09:03PM +0100, Bernd Schmidt wrote:

I have some questions about nvptx:
1) you've said that alloca isn't supported, but it seems
to be wired up and uses the %alloca documented in the PTX
manual, what is the issue with that?  %alloca not being actually
implemented by the current PTX assembler or translator?


Yes, it's unimplemented. There's an internal declaration for it but that
seems to be as far as it goes, and that declaration is 32-bit only anyway.


:(.  Does NVidia plan to fix that in next version?
They haven't indicated any such plans to me directly.  However, there's 
a clear direction to support arbitrary C/C++ over time.


jeff

Re: The nvptx port

2014-11-14 Thread Jakub Jelinek

On Fri, Nov 14, 2014 at 08:37:52AM -0800, Cesar Philippidis wrote:
 On 11/14/2014 08:18 AM, Jakub Jelinek wrote:
 
  Also, keep in mind that PTX doesn't have a global TID. The user needs to
  calculate it using ctaid/tid and friends.
  
  Ok.  Is %gridid needed for that combo too?
 
 Eventually, probably. Currently, we're launching all of our kernels with
 cuLaunchKernel, and that function doesn't take grids into account.

I wonder if cudaLaunchDevice called from PTX will result in a different
%gridid or not, will see next week if I manage to get the HW and SW stack

Jakub

Re: The nvptx port

2014-11-14 Thread Jeff Law


On 11/14/14 04:39, Jakub Jelinek wrote:


:(.  So what other option one has to implement something like TLS, even
using inline asm or similar?  There is %tid, so perhaps indexing some array
with %tid?  The trouble with that is that some thread can do
#pragma omp parallel again, and I bet the %tid afterwards would be
again 0-(n-1), and if it is an index into a global array, it wouldn't work
well then.  Maybe without anything like TLS we can't really support nested
parallelism, only one level of #pragma omp parallel inside of nvptx regions.
But, if we add support for #pragma omp team, we'd either need the array
in gang-local memory, or some other special register to give us gang id.
Does the interface to the hardware even allow a model where we can 
launch another offload task while one is in progress?


Jeff

Re: The nvptx port

2014-11-14 Thread Jeff Law


On 11/14/14 05:36, Jakub Jelinek wrote:


So, for a warp, if some threads perform one branch of an if and other
threads another one, all threads perform the first one first (with some
maybe not doing anything), then all the threads the others (again, other
threads not doing anything)?
Nobody ever specified exactly what happens in this case to me, but I 
gathered from reading the docs that once you have some threads in one 
path and others in a different path, things slow down to a horrid crawl. 
 So you try to avoid that :-)



jeff

Re: The nvptx port [0/11+]

2014-11-12 Thread Richard Biener

On Mon, Oct 20, 2014 at 4:17 PM, Bernd Schmidt ber...@codesourcery.com wrote:
 This is a patch kit that adds the nvptx port to gcc. It contains preliminary
 patches to add needed functionality, the target files, and one somewhat
 optional patch with additional target tools. There'll be more patch series,
 one for the testsuite, and one to make the offload functionality work with
 this port. Also required are the previous four rtl patches, two of which
 weren't entirely approved yet.

 For the moment, I've stripped out all the address space support that got
 bogged down in review by brokenness in our representation of address spaces.
 The ptx address spaces are of course still defined and used inside the
 backend.

 Ptx really isn't a usual target - it is a virtual target which is then
 translated by another compiler (ptxas) to the final code that runs on the
 GPU. There are many restrictions, some imposed by the GPU hardware, and some
 by the fact that not everything you'd want can be represented in ptx. Here
 are some of the highlights:
  * Everything is typed - variables, functions, registers. This can
cause problems with KR style C or anything else that doesn't
have a proper type internally.
  * Declarations are needed, even for undefined variables.
  * Can't emit initializers referring to their variable's address since
you can't write forward declarations for variables.
  * Variables can be declared only as scalars or arrays, not
structures. Initializers must be in the variable's declared type,
which requires some code in the backend, and it means that packed
pointer values are not representable.
  * Since it's a virtual target, we skip register allocation - no good
can probably come from doing that twice. This means asm statements
aren't fixed up and will fail if they use matching constraints.
  * No support for indirect jumps, label values, nonlocal gotos.
  * No alloca - ptx defines it, but it's not implemented.
  * No trampolines.
  * No debugging (at all, for now - we may add line number directives).
  * Limited C library support - I have a hacked up copy of newlib
that provides a reasonable subset.
  * malloc and free are defined by ptx (these appear to be
undocumented), but there isn't a realloc. I have one patch for
Fortran to use a malloc/memcpy helper function in cases where we
know the old size.

 All in all, this is not intended to be used as a C (or any other source
 language) compiler. I've gone through a lot of effort to make it work
 reasonably well, but only in order to get sufficient test coverage from the
 testsuites. The intended use for this is only to build it as an offload
 compiler, and use it through OpenACC by way of lto1. That leaves the
 question of how we should document it - does it need the usual constraint
 and option documentation, given that user's aren't expected to use any of
 it?

 A slightly earlier version of the entire patch kit was bootstrapped and
 tested on x86_64-linux. Ok for trunk?

Now that this has been committed - I notice that there is no entry
in MAINTAINERS for the port.  I propose Bernd.

Thanks,
Richard.


 Bernd

Re: The nvptx port [0/11+]

2014-11-12 Thread Jeff Law


On 11/12/14 05:34, Richard Biener wrote:



Now that this has been committed - I notice that there is no entry
in MAINTAINERS for the port.  I propose Bernd.
Well, ahead of you there.   I proposed Bernd to the steering committee 
as the maintainer a little while ago.  I need to go back and count votes :-)


jeff

Re: The nvptx port [10/11+] Target files

2014-11-10 Thread Bernd Schmidt


On 10/30/2014 12:35 AM, Jeff Law wrote:

A nit -- Richard S. recently removed the need to include the enum
for enum machine_mode.  I believe he had a script to handle the
mundane parts of that change.  Please make sure to update the nvptx port
to conform to that new convention, obviously feel free to use the script
if you want.

You may need to update with James Greenhalgh's changes to
MOVE_BY_PIECES_P and friends.

With those two issues addressed as needed, this is OK for the trunk.


I've now committed it, in the following form. Other than the enum thing, 
this also adds some atomic instructions.


The scripts (11/11) I've put up on github, along with a hacked up 
newlib. These are at


https://github.com/bernds/nvptx-tools
https://github.com/bernds/nvptx-newlib

They are likely to migrate to MentorEmbedded from bernds, but that had 
some permissions problems last week.



Bernd

commit 659744a99d815b168716b4460e32f6a21593e494
Author: Bernd Schmidt ber...@codesourcery.com
Date:   Thu Nov 6 19:03:57 2014 +0100

Add the nvptx port.

	* configure.ac: Handle nvptx-*-*.
	* configure: Regenerate.

	gcc/
	* config/nvptx/nvptx.c: New file.
	* config/nvptx/nvptx.h: New file.
	* config/nvptx/nvptx-protos.h: New file.
	* config/nvptx/nvptx.md: New file.
	* config/nvptx/t-nvptx: New file.
	* config/nvptx/nvptx.opt: New file.
	* common/config/nvptx/nvptx-common.c: New file.
	* config.gcc: Handle nvptx-*-*.

	libgcc/
	* config.host: Handle nvptx-*-*.
	* shared-object.mk (as-flags-$o): Define.
	($(base)$(objext), $(base)_s$(objext)): Use it instead of
	-xassembler-with-cpp.
	* static-object.mk: Identical changes.
	* config/nvptx/t-nvptx: New file.
	* config/nvptx/crt0.s: New file.
	* config/nvptx/free.asm: New file.
	* config/nvptx/malloc.asm: New file.
	* config/nvptx/realloc.c: New file.

diff --git a/ChangeLog b/ChangeLog
index fd6172a..e83d1e6 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,8 @@
+2014-11-06  Bernd Schmidt  ber...@codesourcery.com
+
+	* configure.ac: Handle nvptx-*-*.
+	* configure: Regenerate.
+
 2014-11-06  Prachi Godbole  prachi.godb...@imgtec.com
 
 	* MAINTAINERS (Write After Approval): Add myself.
diff --git a/configure b/configure
index d0c760b..0e014a3 100755
--- a/configure
+++ b/configure
@@ -3779,6 +3779,10 @@ case ${target} in
   mips*-*-*)
 noconfigdirs=$noconfigdirs gprof
 ;;
+  nvptx*-*-*)
+# nvptx is just a compiler
+noconfigdirs=$noconfigdirs target-libssp target-libstdc++-v3 target-libobjc
+;;
   sh-*-* | sh64-*-*)
 case ${target} in
   sh*-*-elf)
diff --git a/configure.ac b/configure.ac
index 2f0af4a..b1ef069 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1138,6 +1138,10 @@ case ${target} in
   mips*-*-*)
 noconfigdirs=$noconfigdirs gprof
 ;;
+  nvptx*-*-*)
+# nvptx is just a compiler
+noconfigdirs=$noconfigdirs target-libssp target-libstdc++-v3 target-libobjc
+;;
   sh-*-* | sh64-*-*)
 case ${target} in
   sh*-*-elf)
diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 731a7bc8b..c170e69 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,14 @@
+2014-11-10  Bernd Schmidt  ber...@codesourcery.com
+
+	* config/nvptx/nvptx.c: New file.
+	* config/nvptx/nvptx.h: New file.
+	* config/nvptx/nvptx-protos.h: New file.
+	* config/nvptx/nvptx.md: New file.
+	* config/nvptx/t-nvptx: New file.
+	* config/nvptx/nvptx.opt: New file.
+	* common/config/nvptx/nvptx-common.c: New file.
+	* config.gcc: Handle nvptx-*-*.
+
 2014-11-10  Richard Biener  rguent...@suse.de
 
 	* tree-ssa-operands.c (finalize_ssa_uses): Properly put
diff --git a/gcc/common/config/nvptx/nvptx-common.c b/gcc/common/config/nvptx/nvptx-common.c
new file mode 100644
index 000..80ab076
--- /dev/null
+++ b/gcc/common/config/nvptx/nvptx-common.c
@@ -0,0 +1,38 @@
+/* NVPTX common hooks.
+   Copyright (C) 2014 Free Software Foundation, Inc.
+   Contributed by Bernd Schmidt ber...@codesourcery.com
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 3, or (at your option)
+any later version.
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+http://www.gnu.org/licenses/.  */
+
+#include config.h
+#include system.h
+#include coretypes.h
+#include diagnostic-core.h
+#include tm.h
+#include tm_p.h
+#include common/common-target.h
+#include common/common-target-def.h
+#include opts.h
+#include flags.h
+
+#undef TARGET_HAVE_NAMED_SECTIONS
+#define TARGET_HAVE_NAMED_SECTIONS false
+
+#undef TARGET_DEFAULT_TARGET_FLAGS

Re: The nvptx port [10/11+] Target files

2014-11-10 Thread Jakub Jelinek

On Mon, Nov 10, 2014 at 05:19:57PM +0100, Bernd Schmidt wrote:
 commit 659744a99d815b168716b4460e32f6a21593e494
 Author: Bernd Schmidt ber...@codesourcery.com
 Date:   Thu Nov 6 19:03:57 2014 +0100

Note, in r217301 you've committed a change to pr35468.c, not mentioned in
the ChangeLog, that uses no_const_addr_space effective target that is never
defined.  Can you please revert or commit a patch that adds support for that
to gcc/testsuite/lib/ ?

+ERROR: gcc.c-torture/compile/pr35468.c   -O0 : unknown effective target 
keyword \`no_const_addr_space' for  dg-require-effective-target 2 
no_const_addr_space 
+UNRESOLVED: gcc.c-torture/compile/pr35468.c   -O0 : unknown effective target 
keyword \`no_const_addr_space' for  dg-require-effective-target 2 
no_const_addr_space 
+ERROR: gcc.c-torture/compile/pr35468.c   -O1 : unknown effective target 
keyword \`no_const_addr_space' for  dg-require-effective-target 2 
no_const_addr_space 
+UNRESOLVED: gcc.c-torture/compile/pr35468.c   -O1 : unknown effective target 
keyword \`no_const_addr_space' for  dg-require-effective-target 2 
no_const_addr_space 
+ERROR: gcc.c-torture/compile/pr35468.c   -O2 -flto -flto-partition=none : 
unknown effective target keyword \`no_const_addr_space' for  
dg-require-effective-target 2 no_const_addr_space 
+UNRESOLVED: gcc.c-torture/compile/pr35468.c   -O2 -flto -flto-partition=none : 
unknown effective target keyword \`no_const_addr_space' for  
dg-require-effective-target 2 no_const_addr_space 
+ERROR: gcc.c-torture/compile/pr35468.c   -O2 -flto : unknown effective target 
keyword \`no_const_addr_space' for  dg-require-effective-target 2 
no_const_addr_space 
+UNRESOLVED: gcc.c-torture/compile/pr35468.c   -O2 -flto : unknown effective 
target keyword \`no_const_addr_space' for  dg-require-effective-target 2 
no_const_addr_space 
+ERROR: gcc.c-torture/compile/pr35468.c   -O2 : unknown effective target 
keyword \`no_const_addr_space' for  dg-require-effective-target 2 
no_const_addr_space 
+UNRESOLVED: gcc.c-torture/compile/pr35468.c   -O2 : unknown effective target 
keyword \`no_const_addr_space' for  dg-require-effective-target 2 
no_const_addr_space 
+ERROR: gcc.c-torture/compile/pr35468.c   -O3 -fomit-frame-pointer : unknown 
effective target keyword \`no_const_addr_space' for  
dg-require-effective-target 2 no_const_addr_space 
+UNRESOLVED: gcc.c-torture/compile/pr35468.c   -O3 -fomit-frame-pointer : 
unknown effective target keyword \`no_const_addr_space' for  
dg-require-effective-target 2 no_const_addr_space 
+ERROR: gcc.c-torture/compile/pr35468.c   -O3 -g : unknown effective target 
keyword \`no_const_addr_space' for  dg-require-effective-target 2 
no_const_addr_space 
+UNRESOLVED: gcc.c-torture/compile/pr35468.c   -O3 -g : unknown effective 
target keyword \`no_const_addr_space' for  dg-require-effective-target 2 
no_const_addr_space 
+ERROR: gcc.c-torture/compile/pr35468.c   -Os : unknown effective target 
keyword \`no_const_addr_space' for  dg-require-effective-target 2 
no_const_addr_space 
+UNRESOLVED: gcc.c-torture/compile/pr35468.c   -Os : unknown effective target 
keyword \`no_const_addr_space' for  dg-require-effective-target 2 
no_const_addr_space 
+ERROR: gcc.dg/pr44194-1.c: syntax error in target selector target  { { { { { 
i?86-*-* x86_64-*-* }  x32 } || lp64 }  { ! s390*-*-* } }  { ! 
hppa*64*-*-* } }  { ! alpha*-*-* }{ ! powerpc*-*-linux* } || 
powerpc_elfv2! nvptx-*-* for  dg-do 1 compile { target { { { { { { { 
i?86-*-* x86_64-*-* }  x32 } || lp64 }  { ! s390*-*-* } }  { ! 
hppa*64*-*-* } }  { ! alpha*-*-* } }  { { ! powerpc*-*-linux* } || 
powerpc_elfv2 }  { ! nvptx-*-* } } } 
+UNRESOLVED: gcc.dg/pr44194-1.c: syntax error in target selector target  { { { 
{ { i?86-*-* x86_64-*-* }  x32 } || lp64 }  { ! s390*-*-* } }  { ! 
hppa*64*-*-* } }  { ! alpha*-*-* }{ ! powerpc*-*-linux* } || 
powerpc_elfv2! nvptx-*-* for  dg-do 1 compile { target { { { { { { { 
i?86-*-* x86_64-*-* }  x32 } || lp64 }  { ! s390*-*-* } }  { ! 
hppa*64*-*-* } }  { ! alpha*-*-* } }  { { ! powerpc*-*-linux* } || 
powerpc_elfv2 }  { ! nvptx-*-* } } } 
+FAIL: gcc.dg/pr45352-1.c (test for excess errors)

Jakub

Re: The nvptx port [10/11+] Target files

2014-11-10 Thread H.J. Lu

On Mon, Nov 10, 2014 at 12:04 PM, Jakub Jelinek ja...@redhat.com wrote:
 On Mon, Nov 10, 2014 at 05:19:57PM +0100, Bernd Schmidt wrote:
 commit 659744a99d815b168716b4460e32f6a21593e494
 Author: Bernd Schmidt ber...@codesourcery.com
 Date:   Thu Nov 6 19:03:57 2014 +0100

 Note, in r217301 you've committed a change to pr35468.c, not mentioned in
 the ChangeLog, that uses no_const_addr_space effective target that is never
 defined.  Can you please revert or commit a patch that adds support for that
 to gcc/testsuite/lib/ ?

 +ERROR: gcc.c-torture/compile/pr35468.c   -O0 : unknown effective target 
 keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +UNRESOLVED: gcc.c-torture/compile/pr35468.c   -O0 : unknown effective target 
 keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +ERROR: gcc.c-torture/compile/pr35468.c   -O1 : unknown effective target 
 keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +UNRESOLVED: gcc.c-torture/compile/pr35468.c   -O1 : unknown effective target 
 keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +ERROR: gcc.c-torture/compile/pr35468.c   -O2 -flto -flto-partition=none : 
 unknown effective target keyword \`no_const_addr_space' for  
 dg-require-effective-target 2 no_const_addr_space 
 +UNRESOLVED: gcc.c-torture/compile/pr35468.c   -O2 -flto -flto-partition=none 
 : unknown effective target keyword \`no_const_addr_space' for  
 dg-require-effective-target 2 no_const_addr_space 
 +ERROR: gcc.c-torture/compile/pr35468.c   -O2 -flto : unknown effective 
 target keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +UNRESOLVED: gcc.c-torture/compile/pr35468.c   -O2 -flto : unknown effective 
 target keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +ERROR: gcc.c-torture/compile/pr35468.c   -O2 : unknown effective target 
 keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +UNRESOLVED: gcc.c-torture/compile/pr35468.c   -O2 : unknown effective target 
 keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +ERROR: gcc.c-torture/compile/pr35468.c   -O3 -fomit-frame-pointer : unknown 
 effective target keyword \`no_const_addr_space' for  
 dg-require-effective-target 2 no_const_addr_space 
 +UNRESOLVED: gcc.c-torture/compile/pr35468.c   -O3 -fomit-frame-pointer : 
 unknown effective target keyword \`no_const_addr_space' for  
 dg-require-effective-target 2 no_const_addr_space 
 +ERROR: gcc.c-torture/compile/pr35468.c   -O3 -g : unknown effective target 
 keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +UNRESOLVED: gcc.c-torture/compile/pr35468.c   -O3 -g : unknown effective 
 target keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +ERROR: gcc.c-torture/compile/pr35468.c   -Os : unknown effective target 
 keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +UNRESOLVED: gcc.c-torture/compile/pr35468.c   -Os : unknown effective target 
 keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +ERROR: gcc.dg/pr44194-1.c: syntax error in target selector target  { { { { 
 { i?86-*-* x86_64-*-* }  x32 } || lp64 }  { ! s390*-*-* } }  { ! 
 hppa*64*-*-* } }  { ! alpha*-*-* }{ ! powerpc*-*-linux* } || 
 powerpc_elfv2! nvptx-*-* for  dg-do 1 compile { target { { { { { { { 
 i?86-*-* x86_64-*-* }  x32 } || lp64 }  { ! s390*-*-* } }  { ! 
 hppa*64*-*-* } }  { ! alpha*-*-* } }  { { ! powerpc*-*-linux* } || 
 powerpc_elfv2 }  { ! nvptx-*-* } } } 
 +UNRESOLVED: gcc.dg/pr44194-1.c: syntax error in target selector target  { { 
 { { { i?86-*-* x86_64-*-* }  x32 } || lp64 }  { ! s390*-*-* } }  { ! 
 hppa*64*-*-* } }  { ! alpha*-*-* }{ ! powerpc*-*-linux* } || 
 powerpc_elfv2! nvptx-*-* for  dg-do 1 compile { target { { { { { { { 
 i?86-*-* x86_64-*-* }  x32 } || lp64 }  { ! s390*-*-* } }  { ! 
 hppa*64*-*-* } }  { ! alpha*-*-* } }  { { ! powerpc*-*-linux* } || 
 powerpc_elfv2 }  { ! nvptx-*-* } } } 
 +FAIL: gcc.dg/pr45352-1.c (test for excess errors)

 Jakub



-- 
H.J.

Re: The nvptx port [10/11+] Target files

2014-11-10 Thread H.J. Lu

On Mon, Nov 10, 2014 at 12:04 PM, Jakub Jelinek ja...@redhat.com wrote:
 On Mon, Nov 10, 2014 at 05:19:57PM +0100, Bernd Schmidt wrote:
 commit 659744a99d815b168716b4460e32f6a21593e494
 Author: Bernd Schmidt ber...@codesourcery.com
 Date:   Thu Nov 6 19:03:57 2014 +0100

 Note, in r217301 you've committed a change to pr35468.c, not mentioned in
 the ChangeLog, that uses no_const_addr_space effective target that is never
 defined.  Can you please revert or commit a patch that adds support for that
 to gcc/testsuite/lib/ ?

 +ERROR: gcc.c-torture/compile/pr35468.c   -O0 : unknown effective target 
 keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +UNRESOLVED: gcc.c-torture/compile/pr35468.c   -O0 : unknown effective target 
 keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +ERROR: gcc.c-torture/compile/pr35468.c   -O1 : unknown effective target 
 keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +UNRESOLVED: gcc.c-torture/compile/pr35468.c   -O1 : unknown effective target 
 keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +ERROR: gcc.c-torture/compile/pr35468.c   -O2 -flto -flto-partition=none : 
 unknown effective target keyword \`no_const_addr_space' for  
 dg-require-effective-target 2 no_const_addr_space 
 +UNRESOLVED: gcc.c-torture/compile/pr35468.c   -O2 -flto -flto-partition=none 
 : unknown effective target keyword \`no_const_addr_space' for  
 dg-require-effective-target 2 no_const_addr_space 
 +ERROR: gcc.c-torture/compile/pr35468.c   -O2 -flto : unknown effective 
 target keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +UNRESOLVED: gcc.c-torture/compile/pr35468.c   -O2 -flto : unknown effective 
 target keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +ERROR: gcc.c-torture/compile/pr35468.c   -O2 : unknown effective target 
 keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +UNRESOLVED: gcc.c-torture/compile/pr35468.c   -O2 : unknown effective target 
 keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +ERROR: gcc.c-torture/compile/pr35468.c   -O3 -fomit-frame-pointer : unknown 
 effective target keyword \`no_const_addr_space' for  
 dg-require-effective-target 2 no_const_addr_space 
 +UNRESOLVED: gcc.c-torture/compile/pr35468.c   -O3 -fomit-frame-pointer : 
 unknown effective target keyword \`no_const_addr_space' for  
 dg-require-effective-target 2 no_const_addr_space 
 +ERROR: gcc.c-torture/compile/pr35468.c   -O3 -g : unknown effective target 
 keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +UNRESOLVED: gcc.c-torture/compile/pr35468.c   -O3 -g : unknown effective 
 target keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +ERROR: gcc.c-torture/compile/pr35468.c   -Os : unknown effective target 
 keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +UNRESOLVED: gcc.c-torture/compile/pr35468.c   -Os : unknown effective target 
 keyword \`no_const_addr_space' for  dg-require-effective-target 2 
 no_const_addr_space 
 +ERROR: gcc.dg/pr44194-1.c: syntax error in target selector target  { { { { 
 { i?86-*-* x86_64-*-* }  x32 } || lp64 }  { ! s390*-*-* } }  { ! 
 hppa*64*-*-* } }  { ! alpha*-*-* }{ ! powerpc*-*-linux* } || 
 powerpc_elfv2! nvptx-*-* for  dg-do 1 compile { target { { { { { { { 
 i?86-*-* x86_64-*-* }  x32 } || lp64 }  { ! s390*-*-* } }  { ! 
 hppa*64*-*-* } }  { ! alpha*-*-* } }  { { ! powerpc*-*-linux* } || 
 powerpc_elfv2 }  { ! nvptx-*-* } } } 
 +UNRESOLVED: gcc.dg/pr44194-1.c: syntax error in target selector target  { { 
 { { { i?86-*-* x86_64-*-* }  x32 } || lp64 }  { ! s390*-*-* } }  { ! 
 hppa*64*-*-* } }  { ! alpha*-*-* }{ ! powerpc*-*-linux* } || 
 powerpc_elfv2! nvptx-*-* for  dg-do 1 compile { target { { { { { { { 
 i?86-*-* x86_64-*-* }  x32 } || lp64 }  { ! s390*-*-* } }  { ! 
 hppa*64*-*-* } }  { ! alpha*-*-* } }  { { ! powerpc*-*-linux* } || 
 powerpc_elfv2 }  { ! nvptx-*-* } } } 
 +FAIL: gcc.dg/pr45352-1.c (test for excess errors)

 Jakub

I reverted the change in gcc.c-torture/compile/pr35468.c.
I also checked in this patch to add missing braces in
gcc.dg/pr44194-1.c.



-- 
H.J.
-
Index: ChangeLog
===
--- ChangeLog (revision 217315)
+++ ChangeLog (working copy)
@@ -1,3 +1,7 @@
+2014-11-10  H.J. Lu  hongjiu...@intel.com
+
+ * gcc.dg/pr44194-1.c (dg-do): Add missing braces.
+
 2014-11-10 Roman Gareev  gareevro...@gmail.com

  * gcc.dg/graphite/isl-ast-gen-blocks-1.c: Remove using of
Index: gcc.dg/pr44194-1.c
===
--- gcc.dg/pr44194-1.c (revision 217315)
+++ gcc.dg/pr44194-1.c

Re: The nvptx port [10/11+] Target files

2014-11-10 Thread Mike Stump

On Nov 10, 2014, at 12:37 PM, H.J. Lu hjl.to...@gmail.com wrote:
 I also checked in this patch to add missing braces in
 gcc.dg/pr44194-1.c.

Thanks.

the nvptx port

2014-11-07 Thread VandeVondele Joost

Hi Bernd,

reading the patches, it seems like there is no mention of sm_35, only sm_30. 
So, I'm wondering what 'sub'targets will initially be supported, and 
if/how/when various processors will be selected.

Thanks,

Joost

Re: The nvptx port [8/11+] Write undefined decls.

2014-11-05 Thread Bernd Schmidt


On 10/22/2014 08:11 PM, Jeff Law wrote:

I'm not going to insist you do this in the same way as the PA.  That was
a different era -- we had significant motivation to make things work in
such a way that everything could be buried in the pa specific files.
That sometimes led to less than optimal approaches to fix certain problems.


So... is this patch approved?


Bernd

Re: The nvptx port [10/11+] Target files

2014-11-05 Thread Bernd Schmidt


On 11/04/2014 05:51 PM, Bernd Schmidt wrote:

On 11/04/2014 05:48 PM, Richard Henderson wrote:

On 10/28/2014 03:56 PM, Bernd Schmidt wrote:

+nvptx_ptx_type_from_mode (enum machine_mode mode, bool promote)
+{
+  switch (mode)
+{
+case BLKmode:
+  return .b8;
+case BImode:
+  return .pred;
+case QImode:
+  if (promote)
+return .u32;
+  else
+return .u8;
+case HImode:
+  return .u16;


Promote here too?  Or does this have nothing to do with


+static enum machine_mode
+arg_promotion (enum machine_mode mode)
+{
+  if (mode == QImode || mode == HImode)
+return SImode;
+  return mode;
+}


No, these are different problems - the one in arg promotion is purely
about KR C and trying to match untyped function decls with calls, while
the type_from_mode bit was about some ptx ideosyncracy. Although I
forget what the problem was, that code is more than a year old - I'll
see if I can get rid of this.


Err, no, it's quite necessary. From the manual The .u8, .s8 and .b8 
instruction types are restricted to ld, st and cvt instructions. This 
means that if the compiler generates reasonable-looking code along the 
lines of


.reg .u8 %r70;
mov.u8 %r70,48;

you get

ptxas 2211-1.o, line 191; error   : Arguments mismatch for 
instruction 'mov'


Now, one _could_ write .cvt.u8.u32 for the load immediate, but then one 
would also have to write .cvt.u8.u8 for register-register moves, and 
that's starting to look iffy. I don't really want to rely on the ptx 
assembler to do the right thing for conversions from one type to itself.



Bernd

Re: The nvptx port [8/11+] Write undefined decls.

2014-11-05 Thread Jeff Law


On 11/05/14 05:01, Bernd Schmidt wrote:

On 10/22/2014 08:11 PM, Jeff Law wrote:

I'm not going to insist you do this in the same way as the PA.  That was
a different era -- we had significant motivation to make things work in
such a way that everything could be buried in the pa specific files.
That sometimes led to less than optimal approaches to fix certain
problems.


So... is this patch approved?

Yes, sorry for not being explicit.

Jeff

Re: The nvptx port [1/11+] indirect jumps

2014-11-04 Thread Bernd Schmidt


On 10/20/2014 04:19 PM, Bernd Schmidt wrote:

ptx doesn't have indirect jumps, so CODE_FOR_indirect_jump may not be
defined.  Add a sorry.


Looking back through all the mails it turns out this one wasn't approved 
yet. Ping?



Bernd

Re: The nvptx port [1/11+] indirect jumps

2014-11-04 Thread Richard Henderson

On 11/04/2014 04:32 PM, Bernd Schmidt wrote:
 On 10/20/2014 04:19 PM, Bernd Schmidt wrote:
 ptx doesn't have indirect jumps, so CODE_FOR_indirect_jump may not be
 defined.  Add a sorry.
 
 Looking back through all the mails it turns out this one wasn't approved yet.
 Ping?

Ok.


r~

Re: The nvptx port [10/11+] Target files

2014-11-04 Thread Richard Henderson

On 10/28/2014 03:56 PM, Bernd Schmidt wrote:
 +nvptx_ptx_type_from_mode (enum machine_mode mode, bool promote)
 +{
 +  switch (mode)
 +{
 +case BLKmode:
 +  return .b8;
 +case BImode:
 +  return .pred;
 +case QImode:
 +  if (promote)
 + return .u32;
 +  else
 + return .u8;
 +case HImode:
 +  return .u16;

Promote here too?  Or does this have nothing to do with

 +static enum machine_mode
 +arg_promotion (enum machine_mode mode)
 +{
 +  if (mode == QImode || mode == HImode)
 +return SImode;
 +  return mode;
 +}


r~

Re: The nvptx port [10/11+] Target files

2014-11-04 Thread Bernd Schmidt


On 11/04/2014 05:48 PM, Richard Henderson wrote:

On 10/28/2014 03:56 PM, Bernd Schmidt wrote:

+nvptx_ptx_type_from_mode (enum machine_mode mode, bool promote)
+{
+  switch (mode)
+{
+case BLKmode:
+  return .b8;
+case BImode:
+  return .pred;
+case QImode:
+  if (promote)
+   return .u32;
+  else
+   return .u8;
+case HImode:
+  return .u16;


Promote here too?  Or does this have nothing to do with


+static enum machine_mode
+arg_promotion (enum machine_mode mode)
+{
+  if (mode == QImode || mode == HImode)
+return SImode;
+  return mode;
+}


No, these are different problems - the one in arg promotion is purely 
about KR C and trying to match untyped function decls with calls, while 
the type_from_mode bit was about some ptx ideosyncracy. Although I 
forget what the problem was, that code is more than a year old - I'll 
see if I can get rid of this.



Bernd

Re: The nvptx port [11/11] More tools.

2014-11-03 Thread Jeff Law


On 10/31/14 17:50, Bernd Schmidt wrote:

On 10/31/2014 09:56 PM, Jeff Law wrote:

Pondering this a bit more, I think this is fine in concept.  As you
note, removing the GNU extensions or at least making them conditional
would be good since these are going to be built with the host tools.

I'm not going to dig into the implementations...  I'm going to assume
the nvptx maintainer (that's highly likely to be you :-) will own their
care and feeding.


I was beginning to think I'd just make a separate package. That could
then also include a nvptx-run which would have to link against CUDA
libraries.

Your call.

jeff

Re: The nvptx port [11/11] More tools.

2014-10-31 Thread Jeff Law


On 10/20/14 08:48, Bernd Schmidt wrote:

This is a bonus optional patch which adds ar, ranlib, as and ld to the
ptx port. This is not proper binutils; ar and ranlib are just linked to
the host versions, and the other two tools have the following functions:

* nvptx-as is required to convert the compiler output to actual valid
   ptx assembly, primarily by reordering declarations and definitions.
   Believe me when I say that I've tried to make that work in the
   compiler itself and it's pretty much impossible without some really
   invasive changes.
* nvptx-ld is just a pseudo linker that works by concatenating ptx
   input files and separating them with nul characters. Actual linking
   is something that happens later, when calling CUDA library functions,
   but existing build system make it useful to have something called
   ld which is able to bundle everything that's needed into a single
   file, and this seemed to be the simplest way of achieving this.

There's a toplevel configure.ac change necessary to make ar/ranlib
useable by the libgcc build. Having some tools built like this has some
precedent in t-vmsnative, but as Thomas noted it does make feature tests
in gcc's configure somewhat ugly (but everything works well enough to
build the compiler). The alternative here is to bundle all these files
into a separate nvptx-tools package which users would have to download -
something that would be nice to avoid.

These tools currently require GNU extensions - something I probably
ought to fix if we decide to add them to the gcc build itself.
Pondering this a bit more, I think this is fine in concept.  As you 
note, removing the GNU extensions or at least making them conditional 
would be good since these are going to be built with the host tools.


I'm not going to dig into the implementations...  I'm going to assume 
the nvptx maintainer (that's highly likely to be you :-) will own their 
care and feeding.


jeff

Re: The nvptx port [11/11] More tools.

2014-10-31 Thread Bernd Schmidt


On 10/31/2014 09:56 PM, Jeff Law wrote:

Pondering this a bit more, I think this is fine in concept.  As you
note, removing the GNU extensions or at least making them conditional
would be good since these are going to be built with the host tools.

I'm not going to dig into the implementations...  I'm going to assume
the nvptx maintainer (that's highly likely to be you :-) will own their
care and feeding.


I was beginning to think I'd just make a separate package. That could 
then also include a nvptx-run which would have to link against CUDA 
libraries.



Bernd

Re: The nvptx port [7/11+] Inform the port about call arguments

2014-10-29 Thread Jeff Law


On 10/28/14 08:49, Bernd Schmidt wrote:

On 10/22/2014 08:12 PM, Jeff Law wrote:

Yea, let's keep your approach.  Just wanted to explore a bit since the
PA seems to have a variety of similar characteristics.


Here's an updated version of the patch. I experimented a little with ptx
calling conventions and ran into an arg that had to be moved with
memcpy, which exposed an ordering problem - all call_args were added to
the memcpy call. So the invocation of the hook had to be moved downwards
a bit, and the calculation of the return value needs to happen after it
(since nvptx_function_value needs to know whether we are actually trying
to construct a call at the moment).

Bootstrapped and tested on x86_64-linux, ok?

OK.

Jeff

Re: The nvptx port [10/11+] Target files

2014-10-29 Thread Jeff Law


On 10/28/14 08:56, Bernd Schmidt wrote:


I have patches that expose all the address spaces to the middle-end
through a lower-as pass that runs early. The preliminary patches for
that ran into some resistance and into general brokenness of our address
space support, so I decided to rip all that out for the moment to get
the basic port into the next version.

This new version also implements a way of providing realloc that was
suggested in another thread. Calls to malloc and free are redirected to
libgcc variants. I'm not a big fan of wasting extra space on every
allocation (which is why I didn't originally consider this approach
viable), but it seems we'll have to do it that way. There's a change to
the libgcc build system: on ptx we need comments in the assembly to
survive, so we can't use -xassembler-with-cpp. I've not found any files
named *.asm, so I've changed that suffix to mean plain assembler.


Bernd


010-target.diff


* configure.ac: Allow configuring lto for nvptx.
* configure: Regenerate.

gcc/
* config/nvptx/nvptx.c: New file.
* config/nvptx/nvptx.h: New file.
* config/nvptx/nvptx-protos.h: New file.
* config/nvptx/nvptx.md: New file.
* config/nvptx/t-nvptx: New file.
* config/nvptx/nvptx.opt: New file.
* common/config/nvptx/nvptx-common.c: New file.
* config.gcc: Handle nvptx-*-*.

libgcc/
* config.host: Handle nvptx-*-*.
* shared-object.mk (as-flags-$o): Define.
($(base)$(objext), $(base)_s$(objext)): Use it instead of
-xassembler-with-cpp.
* static-object.mk: Identical changes.
* config/nvptx/t-nvptx: New file.
* config/nvptx/crt0.s: New file.
* config/nvptx/free.asm: New file.
* config/nvptx/malloc.asm: New file.
* config/nvptx/realloc.c: New file.
A nit -- Richard S. recently removed the need to include the enum 
for enum machine_mode.  I believe he had a script to handle the 
mundane parts of that change.  Please make sure to update the nvptx port 
to conform to that new convention, obviously feel free to use the script 
if you want.


You may need to update with James Greenhalgh's changes to 
MOVE_BY_PIECES_P and friends.


With those two issues addressed as needed, this is OK for the trunk.


FWIW, I'm amazed at how many similarities there are between what needs 
to be done for the PTX tools and what needed to be done to interface 
with the native HPPA tools way-back-when.  Simply amazing.


I notice that you've got some OpenMP bits (write_as_kernel).  Are y'all 
doing any testing with OpenMP or is that an artifact of layering OpenACC 
on top of the OpenMP infrastructure?


Also, I've asked the steering committee to appoint you as the maintainer 
for the nvptx port as well.


jeff

Re: The nvptx port [10/11+] Target files

2014-10-29 Thread Bernd Schmidt


On 10/30/2014 12:35 AM, Jeff Law wrote:

A nit -- Richard S. recently removed the need to include the enum
for enum machine_mode.  I believe he had a script to handle the
mundane parts of that change.  Please make sure to update the nvptx port
to conform to that new convention, obviously feel free to use the script
if you want.

You may need to update with James Greenhalgh's changes to
MOVE_BY_PIECES_P and friends.


Ok, I'll look into those.


With those two issues addressed as needed, this is OK for the trunk.


Thanks! I've pinged some of the preliminary patches that went unapproved 
up to this point.


One leftover issue, discussed in the [0/11] mail - what amount of 
documentation is appropriate for this, given that we don't want to 
support using this as anything other than an offload compiler? Should I 
still add all the standard invoke.texi/gccint.texi pieces?



I notice that you've got some OpenMP bits (write_as_kernel).  Are y'all
doing any testing with OpenMP or is that an artifact of layering OpenACC
on top of the OpenMP infrastructure?


The distinction between .kernel and .func is really not to do with 
either - only .kernels are callable from the host, and only .funcs are 
callable from within ptx code.



Bernd

Re: The nvptx port [10/11+] Target files

2014-10-29 Thread Jeff Law


On 10/29/14 17:55, Bernd Schmidt wrote:

Thanks! I've pinged some of the preliminary patches that went unapproved
up to this point.

Thanks.




One leftover issue, discussed in the [0/11] mail - what amount of
documentation is appropriate for this, given that we don't want to
support using this as anything other than an offload compiler? Should I
still add all the standard invoke.texi/gccint.texi pieces?
I'm still not sure here.  nvptx is quite a bit different than anything 
we've done in the past and I'm not sure how much of the traditional 
stuff we want to document vs on the other end how much of the special 
stuff we want to document.  I simply don't know.



I notice that you've got some OpenMP bits (write_as_kernel).  Are y'all
doing any testing with OpenMP or is that an artifact of layering OpenACC
on top of the OpenMP infrastructure?


The distinction between .kernel and .func is really not to do with
either - only .kernels are callable from the host, and only .funcs are
callable from within ptx code.

Ok.  Thanks for clarifying.

jeff

Re: The nvptx port [7/11+] Inform the port about call arguments

2014-10-28 Thread Bernd Schmidt


On 10/22/2014 08:12 PM, Jeff Law wrote:

Yea, let's keep your approach.  Just wanted to explore a bit since the
PA seems to have a variety of similar characteristics.


Here's an updated version of the patch. I experimented a little with ptx 
calling conventions and ran into an arg that had to be moved with 
memcpy, which exposed an ordering problem - all call_args were added to 
the memcpy call. So the invocation of the hook had to be moved downwards 
a bit, and the calculation of the return value needs to happen after it 
(since nvptx_function_value needs to know whether we are actually trying 
to construct a call at the moment).


Bootstrapped and tested on x86_64-linux, ok?


Bernd

	gcc/
	* target.def (call_args, end_call_args): New hooks.
	* hooks.c (hook_void_rtx_tree): New empty function.
	* hooks.h (hook_void_rtx_tree): Declare.
	* doc/tm.texi.in (TARGET_CALL_ARGS, TARGET_END_CALL_ARGS): Add.
	* doc/tm.texi: Regenerate.
	* calls.c (expand_call): Slightly rearrange the code.  Use the two new
	hooks.
	(expand_library_call_value_1): Use the two new hooks.


Index: gcc/doc/tm.texi
===
--- gcc/doc/tm.texi.orig
+++ gcc/doc/tm.texi
@@ -4960,6 +4960,29 @@ except the last are treated as named.
 You need not define this hook if it always returns @code{false}.
 @end deftypefn
 
+@deftypefn {Target Hook} void TARGET_CALL_ARGS (rtx, @var{tree})
+While generating RTL for a function call, this target hook is invoked once
+for each argument passed to the function, either a register returned by
+@code{TARGET_FUNCTION_ARG} or a memory location.  It is called just
+before the point where argument registers are stored.  The type of the
+function to be called is also passed as the second argument; it is
+@code{NULL_TREE} for libcalls.  The @code{TARGET_END_CALL_ARGS} hook is
+invoked just after the code to copy the return reg has been emitted.
+This functionality can be used to perform special setup of call argument
+registers if a target needs it.
+For functions without arguments, the hook is called once with @code{pc_rtx}
+passed instead of an argument register.
+Most ports do not need to implement anything for this hook.
+@end deftypefn
+
+@deftypefn {Target Hook} void TARGET_END_CALL_ARGS (void)
+This target hook is invoked while generating RTL for a function call,
+just after the point where the return reg is copied into a pseudo.  It
+signals that all the call argument and return registers for the just
+emitted call are now no longer in use.
+Most ports do not need to implement anything for this hook.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_PRETEND_OUTGOING_VARARGS_NAMED (cumulative_args_t @var{ca})
 If you need to conditionally change ABIs so that one works with
 @code{TARGET_SETUP_INCOMING_VARARGS}, but the other works like neither
Index: gcc/doc/tm.texi.in
===
--- gcc/doc/tm.texi.in.orig
+++ gcc/doc/tm.texi.in
@@ -3856,6 +3856,10 @@ These machine description macros help im
 
 @hook TARGET_STRICT_ARGUMENT_NAMING
 
+@hook TARGET_CALL_ARGS
+
+@hook TARGET_END_CALL_ARGS
+
 @hook TARGET_PRETEND_OUTGOING_VARARGS_NAMED
 
 @node Trampolines
Index: gcc/hooks.c
===
--- gcc/hooks.c.orig
+++ gcc/hooks.c
@@ -245,6 +245,11 @@ hook_void_tree (tree a ATTRIBUTE_UNUSED)
 }
 
 void
+hook_void_rtx_tree (rtx, tree)
+{
+}
+
+void
 hook_void_constcharptr (const char *a ATTRIBUTE_UNUSED)
 {
 }
Index: gcc/hooks.h
===
--- gcc/hooks.h.orig
+++ gcc/hooks.h
@@ -71,6 +71,7 @@ extern void hook_void_constcharptr (cons
 extern void hook_void_rtx_insn_int (rtx_insn *, int);
 extern void hook_void_FILEptr_constcharptr (FILE *, const char *);
 extern bool hook_bool_FILEptr_rtx_false (FILE *, rtx);
+extern void hook_void_rtx_tree (rtx, tree);
 extern void hook_void_tree (tree);
 extern void hook_void_tree_treeptr (tree, tree *);
 extern void hook_void_int_int (int, int);
Index: gcc/target.def
===
--- gcc/target.def.orig
+++ gcc/target.def
@@ -3816,6 +3816,33 @@ not generate any instructions in this ca
  default_setup_incoming_varargs)
 
 DEFHOOK
+(call_args,
+ While generating RTL for a function call, this target hook is invoked once\n\
+for each argument passed to the function, either a register returned by\n\
+@code{TARGET_FUNCTION_ARG} or a memory location.  It is called just\n\
+before the point where argument registers are stored.  The type of the\n\
+function to be called is also passed as the second argument; it is\n\
+@code{NULL_TREE} for libcalls.  The @code{TARGET_END_CALL_ARGS} hook is\n\
+invoked just after the code to copy the return reg has been emitted.\n\
+This functionality can be used to perform special setup of call

Re: The nvptx port [10/11+] Target files

2014-10-28 Thread Bernd Schmidt


On 10/22/2014 08:01 PM, Jeff Law wrote:

Please make sure all the functions in nvptx.c have function comments.


Done, and replaced regno 4 with NVPTX_RETURN_REGNUM.


+const char *
+nvptx_output_call_insn (rtx insn, rtx result, rtx callee)

If possible, promote first argument to rtx_insn *.


Also done.


+/* Clean up subreg operands.  */

Which means what?  A little more descriptive here would be helpful.


Expanded.


I'm surprised there's not more hair around the address space issues.  I
expected more problems there.


I have patches that expose all the address spaces to the middle-end 
through a lower-as pass that runs early. The preliminary patches for 
that ran into some resistance and into general brokenness of our address 
space support, so I decided to rip all that out for the moment to get 
the basic port into the next version.


This new version also implements a way of providing realloc that was 
suggested in another thread. Calls to malloc and free are redirected to 
libgcc variants. I'm not a big fan of wasting extra space on every 
allocation (which is why I didn't originally consider this approach 
viable), but it seems we'll have to do it that way. There's a change to 
the libgcc build system: on ptx we need comments in the assembly to 
survive, so we can't use -xassembler-with-cpp. I've not found any files 
named *.asm, so I've changed that suffix to mean plain assembler.



Bernd

	* configure.ac: Allow configuring lto for nvptx.
	* configure: Regenerate.

	gcc/
	* config/nvptx/nvptx.c: New file.
	* config/nvptx/nvptx.h: New file.
	* config/nvptx/nvptx-protos.h: New file.
	* config/nvptx/nvptx.md: New file.
	* config/nvptx/t-nvptx: New file.
	* config/nvptx/nvptx.opt: New file.
	* common/config/nvptx/nvptx-common.c: New file.
	* config.gcc: Handle nvptx-*-*.

	libgcc/
	* config.host: Handle nvptx-*-*.
	* shared-object.mk (as-flags-$o): Define.
	($(base)$(objext), $(base)_s$(objext)): Use it instead of
	-xassembler-with-cpp.
	* static-object.mk: Identical changes.
	* config/nvptx/t-nvptx: New file.
	* config/nvptx/crt0.s: New file.
	* config/nvptx/free.asm: New file.
	* config/nvptx/malloc.asm: New file.
	* config/nvptx/realloc.c: New file.


Index: gcc/common/config/nvptx/nvptx-common.c
===
--- /dev/null
+++ gcc/common/config/nvptx/nvptx-common.c
@@ -0,0 +1,38 @@
+/* NVPTX common hooks.
+   Copyright (C) 2014 Free Software Foundation, Inc.
+   Contributed by Bernd Schmidt ber...@codesourcery.com
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 3, or (at your option)
+any later version.
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+http://www.gnu.org/licenses/.  */
+
+#include config.h
+#include system.h
+#include coretypes.h
+#include diagnostic-core.h
+#include tm.h
+#include tm_p.h
+#include common/common-target.h
+#include common/common-target-def.h
+#include opts.h
+#include flags.h
+
+#undef TARGET_HAVE_NAMED_SECTIONS
+#define TARGET_HAVE_NAMED_SECTIONS false
+
+#undef TARGET_DEFAULT_TARGET_FLAGS
+#define TARGET_DEFAULT_TARGET_FLAGS MASK_ABI64
+
+struct gcc_targetm_common targetm_common = TARGETM_COMMON_INITIALIZER;
Index: gcc/config.gcc
===
--- gcc/config.gcc.orig
+++ gcc/config.gcc
@@ -420,6 +420,9 @@ nios2-*-*)
 	cpu_type=nios2
 	extra_options=${extra_options} g.opt
 	;;
+nvptx-*-*)
+	cpu_type=nvptx
+	;;
 powerpc*-*-*)
 	cpu_type=rs6000
 	extra_headers=ppc-asm.h altivec.h spe.h ppu_intrinsics.h paired.h spu2vmx.h vec_types.h si2vmx.h htmintrin.h htmxlintrin.h
@@ -2148,6 +2151,10 @@ nios2-*-*)
 		;;
 esac
 	;;
+nvptx-*)
+	tm_file=${tm_file} newlib-stdint.h
+	tmake_file=nvptx/t-nvptx
+	;;
 pdp11-*-*)
 	tm_file=${tm_file} newlib-stdint.h
 	use_gcc_stdint=wrap
Index: gcc/config/nvptx/nvptx.c
===
--- /dev/null
+++ gcc/config/nvptx/nvptx.c
@@ -0,0 +1,2118 @@
+/* Target code for NVPTX.
+   Copyright (C) 2014 Free Software Foundation, Inc.
+   Contributed by Bernd Schmidt ber...@codesourcery.com
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even

Re: The nvptx port [11/11] More tools.

2014-10-24 Thread Jeff Law


On 10/22/14 15:11, Bernd Schmidt wrote:

On 10/22/2014 10:31 PM, Jeff Law wrote:

These tools currently require GNU extensions - something I probably
ought to fix if we decide to add them to the gcc build itself.

Would these be more appropriate in binutils?


I don't think so, given that we don't need any piece of regular
binutils. There's no meaningful way to build libbfd. It would be strange
to build binutils and have everything that's normally part of it
disabled at configure time.
Fair enough, but I'm having trouble seeing these in GCC.  Makes me 
wonder if they ought to be a package unto themselves, nvptxtools or 
somesuch.


Note that as a separate package, you don't have to remove the GNU 
extensions :-)


jeff

Re: The nvptx port [1/11+] indirect jumps

2014-10-22 Thread Richard Biener

On Tue, Oct 21, 2014 at 11:32 PM, Bernd Schmidt ber...@codesourcery.com wrote:
 On 10/21/2014 11:30 PM, Jakub Jelinek wrote:

 At least for OpenMP, the best would be if the #pragma omp target regions
 and/or #pragma omp declare target functions contain anything a particular
 offloading accelerator can't handle, instead of failing the whole
 compilation perhaps just emit some at least by default non-fatal warning
 and not emit anything for the particular offloading target, which would
 mean
 either host fallback, or, if some other offloading target succeeded, just
 that target.


 I guess a test could be added to mkoffload if gcc were to return a different
 value for a sorry vs. any other compilation failure. The tool could then
 choose not to produce offloading support for that target.

But that would be for the whole file instead of for the specific region?

So maybe we should produce one LTO offload object for each offload
function and make the symbols they are supposed to provide weak
so a fail doesn't end up failing to link the main program?

Looks like this gets somewhat awkward with the LTO setup.

Richard.


 Bernd

Re: The nvptx port [1/11+] indirect jumps

2014-10-22 Thread Jakub Jelinek

On Wed, Oct 22, 2014 at 10:18:49AM +0200, Richard Biener wrote:
 On Tue, Oct 21, 2014 at 11:32 PM, Bernd Schmidt ber...@codesourcery.com 
 wrote:
  On 10/21/2014 11:30 PM, Jakub Jelinek wrote:
 
  At least for OpenMP, the best would be if the #pragma omp target regions
  and/or #pragma omp declare target functions contain anything a particular
  offloading accelerator can't handle, instead of failing the whole
  compilation perhaps just emit some at least by default non-fatal warning
  and not emit anything for the particular offloading target, which would
  mean
  either host fallback, or, if some other offloading target succeeded, just
  that target.
 
 
  I guess a test could be added to mkoffload if gcc were to return a different
  value for a sorry vs. any other compilation failure. The tool could then
  choose not to produce offloading support for that target.
 
 But that would be for the whole file instead of for the specific region?
 
 So maybe we should produce one LTO offload object for each offload
 function and make the symbols they are supposed to provide weak
 so a fail doesn't end up failing to link the main program?
 
 Looks like this gets somewhat awkward with the LTO setup.

I don't think we want to do a fine-grained granularity here, it will only
lead to significant nightmares.  E.g. a target region can call other target
functions, if a target function it calls (perhaps directly through a series
of other target functions, perhaps indirectly through function pointers
etc.) can't be supported by the host, you'd need to give up on offloading
all target regions that do or could invoke that.  That can be in another TU
within the same shared library etc.  And, if some regions are emitted and
others are not, #pragma omp target data will behave less predictably and
more confusingly, right now it can test, does this library have usable
offloading for everything it provides (i.e. libgomp would ask the plugin to
initialize offloading from the current shared library if not already done,
and if successful, say it supports offloading for the particular device and
map variables to that device as requested, otherwise it would just assume
only host fallback is possible and not really map anything).  When a target
region is hit, from either within the target data region or elsewhere, it is
already figured out if it has to fallback to host or not.

Now, if you have fine-grained offloading, 33.2% of target regions being
offloadable, the rest not, what would you actually do in target data region?
It doesn't generically know what target regions will be encountered.
So act as if offloading perhaps was possible?  But then at each target
region find out if it is really possible?

IMHO people that care about performance will use target regions with care,
with the offloading targets that they care about in mind, for those that
don't care about that, either they will be lucky and things will work out
all, or they will just end up with host fallback.

Jakub

Re: The nvptx port [1/11+] indirect jumps

2014-10-22 Thread Thomas Schwinge

Hi!

On Wed, 22 Oct 2014 10:18:49 +0200, Richard Biener richard.guent...@gmail.com 
wrote:
 On Tue, Oct 21, 2014 at 11:32 PM, Bernd Schmidt ber...@codesourcery.com 
 wrote:
  On 10/21/2014 11:30 PM, Jakub Jelinek wrote:
 
  At least for OpenMP, the best would be if the #pragma omp target regions
  and/or #pragma omp declare target functions contain anything a particular
  offloading accelerator can't handle, instead of failing the whole
  compilation perhaps just emit some at least by default non-fatal warning
  and not emit anything for the particular offloading target, which would
  mean
  either host fallback, or, if some other offloading target succeeded, just
  that target.
 
 
  I guess a test could be added to mkoffload if gcc were to return a different
  value for a sorry vs. any other compilation failure. The tool could then
  choose not to produce offloading support for that target.
 
 But that would be for the whole file instead of for the specific region?

I'm not sure that's what you're suggesting, but at least on non-shared
memory offloading devices, you can't switch arbitrarily between
offloading device(s) and host-fallback, for you have to do data
management between the non-shared memories.

 So maybe we should produce one LTO offload object for each offload
 function and make the symbols they are supposed to provide weak
 so a fail doesn't end up failing to link the main program?
 
 Looks like this gets somewhat awkward with the LTO setup.


Grüße,
 Thomas


pgp6yaImJYJpu.pgp
Description: PGP signature

Re: The nvptx port [1/11+] indirect jumps

2014-10-22 Thread Richard Biener

On Wed, Oct 22, 2014 at 10:34 AM, Thomas Schwinge
tho...@codesourcery.com wrote:
 Hi!

 On Wed, 22 Oct 2014 10:18:49 +0200, Richard Biener 
 richard.guent...@gmail.com wrote:
 On Tue, Oct 21, 2014 at 11:32 PM, Bernd Schmidt ber...@codesourcery.com 
 wrote:
  On 10/21/2014 11:30 PM, Jakub Jelinek wrote:
 
  At least for OpenMP, the best would be if the #pragma omp target regions
  and/or #pragma omp declare target functions contain anything a particular
  offloading accelerator can't handle, instead of failing the whole
  compilation perhaps just emit some at least by default non-fatal warning
  and not emit anything for the particular offloading target, which would
  mean
  either host fallback, or, if some other offloading target succeeded, just
  that target.
 
 
  I guess a test could be added to mkoffload if gcc were to return a 
  different
  value for a sorry vs. any other compilation failure. The tool could then
  choose not to produce offloading support for that target.

 But that would be for the whole file instead of for the specific region?

 I'm not sure that's what you're suggesting, but at least on non-shared
 memory offloading devices, you can't switch arbitrarily between
 offloading device(s) and host-fallback, for you have to do data
 management between the non-shared memories.

Oh, I see.  For HSA we simply don't emit an offload variant for code
we cannot handle.  But only for those parts.

So it's only offload or fallback for other devices?  Thus also never
share work between both for example (run N threads on the CPU
and M threads on the offload target)?

Richard.

 So maybe we should produce one LTO offload object for each offload
 function and make the symbols they are supposed to provide weak
 so a fail doesn't end up failing to link the main program?

 Looks like this gets somewhat awkward with the LTO setup.


 Grüße,
  Thomas

Re: The nvptx port [1/11+] indirect jumps

2014-10-22 Thread Jakub Jelinek

On Wed, Oct 22, 2014 at 12:02:16PM +0200, Richard Biener wrote:
  I'm not sure that's what you're suggesting, but at least on non-shared
  memory offloading devices, you can't switch arbitrarily between
  offloading device(s) and host-fallback, for you have to do data
  management between the non-shared memories.
 
 Oh, I see.  For HSA we simply don't emit an offload variant for code
 we cannot handle.  But only for those parts.
 
 So it's only offload or fallback for other devices?  Thus also never

Yeah.

 share work between both for example (run N threads on the CPU
 and M threads on the offload target)?

I believe at least for the non-shared memory the OpenMP model wouldn't allow
that.  Of course, user can do the sharing explicitly (though OpenMP 4.0
doesn't have asynchronous target regions): one could e.g. run a couple of
host tasks on the offloading region with if (0) - forced host fallback,
ensure e.g. one team and one parallel thread in that case,
and then in one host task with if (1) and use as many teams and parallel
threads as available on the offloading device.

Jakub

Re: The nvptx port [10/11+] Target files

2014-10-22 Thread Jeff Law


On 10/20/14 08:33, Bernd Schmidt wrote:

These are the main target files for the ptx port. t-nvptx is empty for
now but will grow some content with follow up patches.


Bernd


010-target.diff


* configure.ac: Allow configuring lto for nvptx.
* configure: Regenerate.

gcc/
* config/nvptx/nvptx.c: New file.
* config/nvptx/nvptx.h: New file.
* config/nvptx/nvptx-protos.h: New file.
* config/nvptx/nvptx.md: New file.
* config/nvptx/t-nvptx: New file.
* config/nvptx/nvptx.opt: New file.
* common/config/nvptx/nvptx-common.c: New file.
* config.gcc: Handle nvptx-*-*.

libgcc/
* config.host: Handle nvptx-*-*.
* config/nvptx/t-nvptx: New file.
* config/nvptx/crt0.s: New file.
Please make sure all the functions in nvptx.c have function comments. 
nvptx_split_reg_p, write_as_kernel, nvptx_write_function_decl, 
write_function_decl_only, nvptx_function_incoming_arg, 
nvptx_promote_function_mode, nvptx_maybe_convert_symbolic_operand, etc.


There are many others..  A scan over that entire file would be appreciated.





+
+/* TARGET_FUNCTION_VALUE implementation.  Returns an RTX representing the place
+   where function FUNC returns or receives a value of data type TYPE.  */
+
+static rtx
+nvptx_function_value (const_tree type, const_tree func ATTRIBUTE_UNUSED,
+ bool outgoing)
+{
+  int unsignedp = TYPE_UNSIGNED (type);
+  enum machine_mode orig_mode = TYPE_MODE (type);
+  enum machine_mode mode = promote_function_mode (type, orig_mode,
+ unsignedp, NULL_TREE, 1);
+  if (outgoing)
+return gen_rtx_REG (mode, 4);
+  if (cfun-machine-start_call == NULL_RTX)
+/* Pretend to return in a hard reg for early uses before pseudos can be
+   generated.  */
+return gen_rtx_REG (mode, 4);
+  return gen_reg_rtx (mode);

Rather than magic register numbers, can you use something symbolic?


+}
+
+/* Implement TARGET_LIBCALL_VALUE.  */
+
+static rtx
+nvptx_libcall_value (enum machine_mode mode, const_rtx)
+{
+  if (cfun-machine-start_call == NULL_RTX)
+/* Pretend to return in a hard reg for early uses before pseudos can be
+   generated.  */
+return gen_rtx_REG (mode, 4);
+  return gen_reg_rtx (mode);
+}

Similarly.



+
+/* Implement TARGET_FUNCTION_VALUE_REGNO_P.  */
+
+static bool
+nvptx_function_value_regno_p (const unsigned int regno)
+{
+  return regno == 4;
+}

Here too.



+
+bool
+nvptx_hard_regno_mode_ok (int regno, enum machine_mode mode)
+{
+  if (regno != 4 || cfun == NULL || cfun-machine-ret_reg_mode == VOIDmode)
+return true;
+  return mode == cfun-machine-ret_reg_mode;
+}

Function comment.  Magic register #.



+
+const char *
+nvptx_output_call_insn (rtx insn, rtx result, rtx callee)

If possible, promote first argument to rtx_insn *.


+
+/* Clean up subreg operands.  */
Which means what?  A little more descriptive here would be helpful.  I 
have a guess what you need to do here, but more commentary would be 
helpful for someone that hasn't read through the virtual PTX ISA.


The machine description is about what I would expect, in fact, it shows 
how nice a virtual ISA can be.


Overall it seems pretty reasonable.  Most of the difficulty appears to 
be interfacing with the 3rd party tools, but that's largely expected.


I'm surprised there's not more hair around the address space issues.  I 
expected more problems there.


I'm going to trust that all the ABI related stuff is correct.  I'm not 
going to second guess any of that stuff.


I think we've got a couple things to iterate on from yesterday and 
you've got some minor stuff to address as noted above, but this looks 
pretty close to being ready.



jeff

Re: The nvptx port [7/11+] Inform the port about call arguments

2014-10-22 Thread Jeff Law


On 10/21/14 16:06, Bernd Schmidt wrote:

On 10/21/2014 11:53 PM, Jeff Law wrote:


So, in the end I'm torn.  I don't like adding new hooks when they're not
needed, but I have some reservations about relying on the order of stuff
in CALL_INSN_FUNCTION_USAGE and I worry a bit that you might end up with
stuff other than arguments on that list -- the PA port could filter on
the hard registers used for passing arguments, so other stuff appearing
isn't a big deal.


This is another worry. Also, at the moment we don't actually add the
pseudos to CALL_INSN_FUNCTION_USAGE (that's patch 6/11), we use the regs
saved by the call_args hook to make proper USEs in a PARALLEL. I'm not
convinced the rest of the compiler would be too happy to see pseudos there.

So, in all I'd say it's probably possible to do it that way, but it
feels a lot iffier than I'd be happy with. I for one didn't know about
the PA requirement, so I could easily have broken it unknowingly if I'd
made some random change modifying call expansion.
Yea, let's keep your approach.  Just wanted to explore a bit since the 
PA seems to have a variety of similar characteristics.


jeff

Re: The nvptx port [8/11+] Write undefined decls.

2014-10-22 Thread Jeff Law


On 10/21/14 16:15, Bernd Schmidt wrote:

On 10/22/2014 12:05 AM, Jeff Law wrote:

On 10/20/14 14:30, Bernd Schmidt wrote:

ptx assembly requires that declarations are written for undefined
variables. This adds that functionality.

Does this need to happen at the use site, or can it be deferred?


This is independent of use sites. The patch just adds another walk over
the varpool to emit not just the defined vars.

Ideally we'd maintain an order that declares or defines every variable
before it is referenced by an initializer, but the attempt to do that in
the compiler totally failed due to references between constant pools and
regular variables. The nvptx-as tool we have fixes up the order of
declarations after the first compilation stage.


THe PA had to do something similar.  We built up a vector of every
external object in ASM_OUTPUT_EXTERNAL, but did not emit anything.

Then in ASM_FILE_END, we walked that vector and anything that was
actually referenced (as opposed to just just declared) we would emit the
magic .IMPORT lines.


Sounds like the PA could use this hook to simplify its code quite a bit.
The PA stuff is a trivial amount of code :-)  But it is a bit awkward in 
that we're using a per-variable hook to stash, then the end-file hook to 
walk the stashed stuff.


IIRC, the problem is tentative definitions.  Otherwise we'd just emit 
the .import statements as we saw the declarations.  I believe that was 
to properly interface with the HP assembler/linker.


We also have to defer emitting plabels, but I can't recall the 
braindamage behind that.



I'm not going to insist you do this in the same way as the PA.  That was 
a different era -- we had significant motivation to make things work in 
such a way that everything could be buried in the pa specific files. 
That sometimes led to less than optimal approaches to fix certain problems.



Jeff

Re: The nvptx port [11/11] More tools.

2014-10-22 Thread Jeff Law


On 10/20/14 08:48, Bernd Schmidt wrote:

This is a bonus optional patch which adds ar, ranlib, as and ld to the
ptx port. This is not proper binutils; ar and ranlib are just linked to
the host versions, and the other two tools have the following functions:

* nvptx-as is required to convert the compiler output to actual valid
   ptx assembly, primarily by reordering declarations and definitions.
   Believe me when I say that I've tried to make that work in the
   compiler itself and it's pretty much impossible without some really
   invasive changes.
* nvptx-ld is just a pseudo linker that works by concatenating ptx
   input files and separating them with nul characters. Actual linking
   is something that happens later, when calling CUDA library functions,
   but existing build system make it useful to have something called
   ld which is able to bundle everything that's needed into a single
   file, and this seemed to be the simplest way of achieving this.

There's a toplevel configure.ac change necessary to make ar/ranlib
useable by the libgcc build. Having some tools built like this has some
precedent in t-vmsnative, but as Thomas noted it does make feature tests
in gcc's configure somewhat ugly (but everything works well enough to
build the compiler). The alternative here is to bundle all these files
into a separate nvptx-tools package which users would have to download -
something that would be nice to avoid.

These tools currently require GNU extensions - something I probably
ought to fix if we decide to add them to the gcc build itself.

Would these be more appropriate in binutils?

Jeff

Re: The nvptx port [11/11] More tools.

2014-10-22 Thread Bernd Schmidt


On 10/22/2014 10:31 PM, Jeff Law wrote:

These tools currently require GNU extensions - something I probably
ought to fix if we decide to add them to the gcc build itself.

Would these be more appropriate in binutils?


I don't think so, given that we don't need any piece of regular 
binutils. There's no meaningful way to build libbfd. It would be strange 
to build binutils and have everything that's normally part of it 
disabled at configure time.



Bernd

Re: The nvptx port [0/11+]

2014-10-21 Thread Richard Biener

On Mon, Oct 20, 2014 at 4:17 PM, Bernd Schmidt ber...@codesourcery.com wrote:
 This is a patch kit that adds the nvptx port to gcc. It contains preliminary
 patches to add needed functionality, the target files, and one somewhat
 optional patch with additional target tools. There'll be more patch series,
 one for the testsuite, and one to make the offload functionality work with
 this port. Also required are the previous four rtl patches, two of which
 weren't entirely approved yet.

 For the moment, I've stripped out all the address space support that got
 bogged down in review by brokenness in our representation of address spaces.
 The ptx address spaces are of course still defined and used inside the
 backend.

 Ptx really isn't a usual target - it is a virtual target which is then
 translated by another compiler (ptxas) to the final code that runs on the
 GPU. There are many restrictions, some imposed by the GPU hardware, and some
 by the fact that not everything you'd want can be represented in ptx. Here
 are some of the highlights:
  * Everything is typed - variables, functions, registers. This can
cause problems with KR style C or anything else that doesn't
have a proper type internally.
  * Declarations are needed, even for undefined variables.
  * Can't emit initializers referring to their variable's address since
you can't write forward declarations for variables.
  * Variables can be declared only as scalars or arrays, not
structures. Initializers must be in the variable's declared type,
which requires some code in the backend, and it means that packed
pointer values are not representable.
  * Since it's a virtual target, we skip register allocation - no good
can probably come from doing that twice. This means asm statements
aren't fixed up and will fail if they use matching constraints.

So with this restriction I wonder why it didn't make sense to go the
HSA backend route emitting PTX from a GIMPLE SSA pass.  This
would have avoided the LTO dance as well ...

That is, what is the advantage of expanding to RTL here - what
main benefits do you get from that which you thought would be
different to handle if doing code generation from GIMPLE SSA?

For HSA we even do register allocation (to a fixed virtual register
set), sth simple enough on SSA.  We of course also have to do
instruction selection but luckily virtual ISAs are easy to target.

So were you worried about duplicating instruction selection
and or doing it manually instead of with well-known machine
descriptions?

I'm just curious - I am not asking you to rewrite the beast ;)

Thanks,
Richard.

  * No support for indirect jumps, label values, nonlocal gotos.
  * No alloca - ptx defines it, but it's not implemented.
  * No trampolines.
  * No debugging (at all, for now - we may add line number directives).
  * Limited C library support - I have a hacked up copy of newlib
that provides a reasonable subset.
  * malloc and free are defined by ptx (these appear to be
undocumented), but there isn't a realloc. I have one patch for
Fortran to use a malloc/memcpy helper function in cases where we
know the old size.

 All in all, this is not intended to be used as a C (or any other source
 language) compiler. I've gone through a lot of effort to make it work
 reasonably well, but only in order to get sufficient test coverage from the
 testsuites. The intended use for this is only to build it as an offload
 compiler, and use it through OpenACC by way of lto1. That leaves the
 question of how we should document it - does it need the usual constraint
 and option documentation, given that user's aren't expected to use any of
 it?

 A slightly earlier version of the entire patch kit was bootstrapped and
 tested on x86_64-linux. Ok for trunk?


 Bernd

Re: The nvptx port [0/11+]

2014-10-21 Thread Jakub Jelinek

On Mon, Oct 20, 2014 at 04:17:56PM +0200, Bernd Schmidt wrote:
  * Can't emit initializers referring to their variable's address since
you can't write forward declarations for variables.

Can't that be handled by emitting the initializer without the address and
some constructor that fixes up the initializer at runtime?

  * Variables can be declared only as scalars or arrays, not
structures. Initializers must be in the variable's declared type,
which requires some code in the backend, and it means that packed
pointer values are not representable.

Can't you represent structures and unions as arrays of chars?
For constant initializers that don't need relocations the compiler can
surely turn them into arrays of char initializers (e.g. fold-const.c
native_encode_expr/native_interpret_expr could be used for that).
Supposedly it would mean slower than perhaps necessary loads/stores of
aligned larger fields from the structure, but if it is an alternative to
not supporting structures/unions at all, that sounds like so severe
limitation that it can be pretty fatal for the target.

  * No support for indirect jumps, label values, nonlocal gotos.

Not even indirect calls?  How do you implement C++ or Fortran vtables?

Jakub

Re: The nvptx port [0/11+]

2014-10-21 Thread Bernd Schmidt


On 10/21/2014 10:18 AM, Richard Biener wrote:

So with this restriction I wonder why it didn't make sense to go the
HSA backend route emitting PTX from a GIMPLE SSA pass.  This
would have avoided the LTO dance as well ...


Quite simple - there isn't an established way to do this. If I'd known 
you were doing something like this when I started the work I might have 
looked into that approach.



Bernd

Re: The nvptx port [0/11+]

2014-10-21 Thread Bernd Schmidt


On 10/21/2014 10:42 AM, Jakub Jelinek wrote:

On Mon, Oct 20, 2014 at 04:17:56PM +0200, Bernd Schmidt wrote:

  * Can't emit initializers referring to their variable's address since
you can't write forward declarations for variables.


Can't that be handled by emitting the initializer without the address and
some constructor that fixes up the initializer at runtime?


That reminds me that constructors are something I forgot to add to the 
list. I'm thinking about making these work with some trickery in the 
linker, but at the moment they are unsupported.



Can't you represent structures and unions as arrays of chars?
For constant initializers that don't need relocations the compiler can
surely turn them into arrays of char initializers (e.g. fold-const.c
native_encode_expr/native_interpret_expr could be used for that).
Supposedly it would mean slower than perhaps necessary loads/stores of
aligned larger fields from the structure, but if it is an alternative to
not supporting structures/unions at all, that sounds like so severe
limitation that it can be pretty fatal for the target.


Oh, structs and unions are supported, and essentially that's what I'm 
doing - I choose a base integer type to represent them. That happens to 
be the size of a pointer, so properly aligned symbol refs can be 
emitted. It's just the packed ones that can't be done.



  * No support for indirect jumps, label values, nonlocal gotos.


Not even indirect calls?  How do you implement C++ or Fortran vtables?


Indirect calls do exist.


Bernd

Re: The nvptx port [0/11+]

2014-10-21 Thread Richard Biener

On Tue, Oct 21, 2014 at 12:53 PM, Bernd Schmidt ber...@codesourcery.com wrote:
 On 10/21/2014 10:18 AM, Richard Biener wrote:

 So with this restriction I wonder why it didn't make sense to go the
 HSA backend route emitting PTX from a GIMPLE SSA pass.  This
 would have avoided the LTO dance as well ...


 Quite simple - there isn't an established way to do this. If I'd known you
 were doing something like this when I started the work I might have looked
 into that approach.

Ah, I see.  I think having both ways now is good so we can compare
pros and cons in practice (and make further targets follow the better
approach if there is one).

Richard.


 Bernd

Re: The nvptx port [1/11+] indirect jumps

2014-10-21 Thread Jeff Law


On 10/20/14 14:19, Bernd Schmidt wrote:

ptx doesn't have indirect jumps, so CODE_FOR_indirect_jump may not be
defined.  Add a sorry.


Bernd

001-indjumps.diff


gcc/
* optabs.c (emit_indirect_jump): Test HAVE_indirect_jump and emit a
sorry if necessary.
So doesn't this imply no hot-cold partitioning since we use indirect 
jumps to get across the partition?  Similarly doesn't this imply other 
missing features (setjmp/longjmp, nonlocal gotos, computed jumps?


Do you need some mechanism to ensure that hot/cold partitioning isn't 
enabled?  Do you need some kind of message specific to the other 
features, or are we going to assume that the user will map from the 
indirect jump message back to the use of setjmp/longjmp or something 
similar?


How are switches implemented (if at all)?

Jeff

Re: The nvptx port [2/11+] No register allocation

2014-10-21 Thread Jeff Law


On 10/20/14 14:20, Bernd Schmidt wrote:

Since it's a virtual target, I've chosen not to run register allocation.
This is one of the patches necessary to make that work, it primarily
adds a target hook to disable it and fixes some of the fallout.


Bernd


002-noregalloc.diff


gcc/
* target.def (no_register_allocation): New data hook.
* doc/tm.texi.in: Add @hook TARGET_NO_REGISTER_ALLOCATION.
* doc/tm.texi: Regenerate.
* ira.c (gate_ira): New function.
(pass_data_ira): Set has_gate.
(pass_ira): Add a gate function.
(pass_data_reload): Likewise.
(pass_reload): Add a gate function.
(pass_ira): Use it.
* reload1.c (eliminate_regs): If reg_eliminte_is NULL, assert that
no register allocation happens on the target and return.
* final.c (alter_subreg): Ensure register is not a pseudo before
calling simplify_subreg.
(output_operand): Assert that x isn't a pseudo only if doing
register allocation.\

s/reg_eliminte/reg_eliminate/

Otherwise this looks fine. Note potential for rethinking this change at 
some point in the future as we get more experience with these kinds of 
targets.


Jeff

Re: The nvptx port [3/11+] Struct returns

2014-10-21 Thread Jeff Law


On 10/20/14 14:22, Bernd Schmidt wrote:

Even when returning a structure by passing an invisible reference, gcc
still likes to set the return register to the address of the struct.
This is undesirable on ptx where things like the return register have to
be declared, and the function really returns void at ptx level. I've
added a target hook to avoid this. I figure other targets might find it
beneficial to omit this unnecessary set as well.


Bernd


003-sretreg.diff


gcc/
* target.def (omit_struct_return_reg): New data hook.
* doc/tm.texi.in: Add @hook TARGET_OMIT_STRUCT_RETURN_REG.
* doc/tm.texi: Regenerate.
* function.c (expand_function_end): Use it.
My first thought when reading this surprise that we actually return a 
value here and a desire to just zap that code completely since there's 
virtually no chance the optimizer will be able to delete it.


But then I remembered how much I hate dealing with this kind of ABI 
issue.  I suspect nobody actually specifies behavior here other than to 
indicate when pass by invisible reference is used and what register 
holds that incoming value.


So, OK for the trunk.

jeff

Re: The nvptx port [4/11+] Post-RA pipeline

2014-10-21 Thread Jeff Law


On 10/20/14 14:24, Bernd Schmidt wrote:

This stops most of the post-regalloc passes to be run if the target
doesn't want register allocation. I'd previously moved them all out of
postreload to the toplevel, but Jakub (I think) pointed out that the
idea is not to run them to avoid crashes if reload fails e.g. for an
invalid asm. So I've made a new container pass.

A later patch will make thread_prologue_and_epilogue_insns callable from
the backend.


Bernd


004-postra.diff


gcc/
* passes.def (pass_compute_alignments, pass_duplicate_computed_gotos,
pass_variable_tracking, pass_free_cfg, pass_machine_reorg,
pass_cleanup_barriers, pass_delay_slots,
pass_split_for_shorten_branches, pass_convert_to_eh_region_ranges,
pass_shorten_branches, pass_est_nothrow_function_flags,
pass_dwarf2_frame, pass_final): Move outside of pass_postreload and
into pass_late_compilation.
(pass_late_compilation): Add.
* passes.c (pass_data_late_compilation, pass_late_compilation,
make_pass_late_compilation): New.
* timevar.def (TV_LATE_COMPILATION): New.

OK.
jeff

Re: The nvptx port [5/11+] Variable declarations

2014-10-21 Thread Jeff Law


On 10/20/14 14:25, Bernd Schmidt wrote:

ptx assembly follows rather different rules than what's typical
elsewhere. We need a new hook to add a  }; string when we are finished
outputting a variable with an initializer.


Bernd


005-declend.diff


gcc/
* target.def (decl_end): New hook.
* varasm.c (assemble_variable_contents, assemble_constant_contents):
Use it.
* doc/tm.texi.in (TARGET_ASM_DECL_END): Add.
* doc/tm.texi: Regenerate.

Ok.
jeff

Re: The nvptx port [6/11+] Pseudo call args

2014-10-21 Thread Jeff Law


On 10/20/14 14:26, Bernd Schmidt wrote:

On ptx, we'll be using pseudos to pass function args as well, and
there's one assert that needs to be toned town to make that work.


Bernd


006-usereg.diff


gcc/
* expr.c (use_reg_mode): Just return for pseudo registers.

OK.

I pondered asking for this to be conditional on the 
no-register-allocation conditional, but then  decided it wasn't worth 
the effort.


jeff

Re: The nvptx port [1/11+] indirect jumps

2014-10-21 Thread Bernd Schmidt


On 10/21/2014 08:26 PM, Jeff Law wrote:

* optabs.c (emit_indirect_jump): Test HAVE_indirect_jump and emit a
sorry if necessary.

So doesn't this imply no hot-cold partitioning since we use indirect
jumps to get across the partition?  Similarly doesn't this imply other
missing features (setjmp/longjmp, nonlocal gotos, computed jumps?


Pretty much yes to all.


Do you need some mechanism to ensure that hot/cold partitioning isn't
enabled?


I guess I could clear flag_reorder_blocks_and_partition in 
nvptx_option_override. The problem hasn't come up so far.



Do you need some kind of message specific to the other
features, or are we going to assume that the user will map from the
indirect jump message back to the use of setjmp/longjmp or something
similar?


I have some sorry calls in things like a dummy nonlocal_goto pattern. It 
doesn't quite manage to catch everything without an ICE yet though.



How are switches implemented (if at all)?


Comparison tree as you'd generate for small switches on all other targets.


Bernd

Re: The nvptx port [7/11+] Inform the port about call arguments

2014-10-21 Thread Jeff Law


On 10/20/14 14:29, Bernd Schmidt wrote:

In ptx assembly we need to decorate call insns with the arguments that
are being passed. We also need to know the exact function type. This is
kind of hard to do with the existing infrastructure since things like
function_arg are called at other times rather than just when emitting a
call, so this patch adds two more hooks, one called just before argument
registers are loaded (once for each arg), and the other just after the
call is complete.


Bernd


007-callargs.diff


gcc/
* target.def (call_args, end_call_args): New hooks.
* hooks.c (hook_void_rtx_tree): New empty function.
* hooks.h (hook_void_rtx_tree): Declare.
* doc/tm.texi.in (TARGET_CALL_ARGS, TARGET_END_CALL_ARGS): Add.
* doc/tm.texi: Regenerate.
* calls.c (expand_call): Slightly rearrange the code.  Use the two new
hooks.
(expand_library_call_value_1): Use the two new hooks.
How exactly do you need to decorate?  Just mention the register, size 
information or do you need full type information?


We've had targets where we had to indicate register banks for each 
argument.  Those would walk CALL_INSN_FUNCTION_USAGE to find the 
argument registers, then from the register # we would know which 
register bank to use.   Would that work for you?


Jeff

Re: The nvptx port [1/11+] indirect jumps

2014-10-21 Thread Jakub Jelinek

On Tue, Oct 21, 2014 at 11:00:35PM +0200, Bernd Schmidt wrote:
 On 10/21/2014 08:26 PM, Jeff Law wrote:
 * optabs.c (emit_indirect_jump): Test HAVE_indirect_jump and emit a
 sorry if necessary.
 So doesn't this imply no hot-cold partitioning since we use indirect
 jumps to get across the partition?  Similarly doesn't this imply other
 missing features (setjmp/longjmp, nonlocal gotos, computed jumps?
 
 Pretty much yes to all.
 
 Do you need some mechanism to ensure that hot/cold partitioning isn't
 enabled?
 
 I guess I could clear flag_reorder_blocks_and_partition in
 nvptx_option_override. The problem hasn't come up so far.
 
 Do you need some kind of message specific to the other
 features, or are we going to assume that the user will map from the
 indirect jump message back to the use of setjmp/longjmp or something
 similar?
 
 I have some sorry calls in things like a dummy nonlocal_goto pattern. It
 doesn't quite manage to catch everything without an ICE yet though.

With all the sorry additions, what is actually the plan for OpenMP (dunno how
OpenACC is different in this regard)?
At least for OpenMP, the best would be if the #pragma omp target regions
and/or #pragma omp declare target functions contain anything a particular
offloading accelerator can't handle, instead of failing the whole
compilation perhaps just emit some at least by default non-fatal warning
and not emit anything for the particular offloading target, which would mean
either host fallback, or, if some other offloading target succeeded, just
that target.
The unsupported stuff can be machine dependent builtins that can't be
transformed, or e.g. the various things you've listed as unsupportable by
the PTX backend right now.

Jakub

Re: The nvptx port [7/11+] Inform the port about call arguments

2014-10-21 Thread Bernd Schmidt


On 10/21/2014 11:11 PM, Jeff Law wrote:

On 10/20/14 14:29, Bernd Schmidt wrote:

In ptx assembly we need to decorate call insns with the arguments that
are being passed. We also need to know the exact function type. This is
kind of hard to do with the existing infrastructure since things like
function_arg are called at other times rather than just when emitting a
call, so this patch adds two more hooks, one called just before argument
registers are loaded (once for each arg), and the other just after the
call is complete.


How exactly do you need to decorate?  Just mention the register, size
information or do you need full type information?


A normal call looks like

{
  .param.u32 %retval_in;
  .param.u64 %out_arg0;
  st.param.u64 [%out_arg0], %r1400;
  call (%retval_in), PopCnt, (%out_arg0);
  ld.param.u32%r1403, [%retval_in];
}

which declares local variables for the args and return value, stores the 
pseudos in the outgoing args, calls the function with explicitly named 
args and return values, and loads the incoming return value. All this is 
produced by nvptx_output_call_insn for a single CALL rtx insn.


Indirect calls additionally need to produce a .callprototype pseudo-op 
which looks like a function declaration; for normal calls the called 
function must already be declared elsewhere. The machinery to produce 
such .callprototypes is also used to produce a ptx decl from the call 
insn for an external KR declaration with no argument types.



We've had targets where we had to indicate register banks for each
argument.  Those would walk CALL_INSN_FUNCTION_USAGE to find the
argument registers, then from the register # we would know which
register bank to use.   Would that work for you?


Couple of problems with this - the fusage isn't available to gen_call, 
it gets added to the call insn after it is emitted, but the backend 
would like to have this information when emitting the insn. Also, I'd 
need the order to be reliable and I don't think CALL_INSN_FUNCTION_USAGE 
is really designed to guarantee that (I suspect the order of register 
args and things like the struct return reg is wrong). I also need the 
exact function type and the call_args hook seems like the easiest way to 
communicate it to the backend.



Bernd

Re: The nvptx port [1/11+] indirect jumps

2014-10-21 Thread Bernd Schmidt


On 10/21/2014 11:30 PM, Jakub Jelinek wrote:

At least for OpenMP, the best would be if the #pragma omp target regions
and/or #pragma omp declare target functions contain anything a particular
offloading accelerator can't handle, instead of failing the whole
compilation perhaps just emit some at least by default non-fatal warning
and not emit anything for the particular offloading target, which would mean
either host fallback, or, if some other offloading target succeeded, just
that target.


I guess a test could be added to mkoffload if gcc were to return a 
different value for a sorry vs. any other compilation failure. The tool 
could then choose not to produce offloading support for that target.



Bernd

Re: The nvptx port [7/11+] Inform the port about call arguments

2014-10-21 Thread Jeff Law


On 10/21/14 21:29, Bernd Schmidt wrote:


A normal call looks like

{
   .param.u32 %retval_in;
   .param.u64 %out_arg0;
   st.param.u64 [%out_arg0], %r1400;
   call (%retval_in), PopCnt, (%out_arg0);
   ld.param.u32%r1403, [%retval_in];
}

which declares local variables for the args and return value, stores the
pseudos in the outgoing args, calls the function with explicitly named
args and return values, and loads the incoming return value. All this is
produced by nvptx_output_call_insn for a single CALL rtx insn.

So far, so good.



Indirect calls additionally need to produce a .callprototype pseudo-op
which looks like a function declaration; for normal calls the called
function must already be declared elsewhere. The machinery to produce
such .callprototypes is also used to produce a ptx decl from the call
insn for an external KR declaration with no argument types.

Yea, no surprise here.



Couple of problems with this - the fusage isn't available to gen_call,
it gets added to the call insn after it is emitted, but the backend
would like to have this information when emitting the insn.
Right.  Targets which have needed this emit the decorations at 
insn-output time so the fusage has been attached.




Also, I'd
need the order to be reliable and I don't think CALL_INSN_FUNCTION_USAGE
is really designed to guarantee that (I suspect the order of register
args and things like the struct return reg is wrong). I also need the
exact function type and the call_args hook seems like the easiest way to
communicate it to the backend.
We've depended on the ordering in the PA, well, forever.  However, I 
doubt ordering of regs in the fusage is documented at all!  We could 
change that.


So, in the end I'm torn.  I don't like adding new hooks when they're not 
needed, but I have some reservations about relying on the order of stuff 
in CALL_INSN_FUNCTION_USAGE and I worry a bit that you might end up with 
stuff other than arguments on that list -- the PA port could filter on 
the hard registers used for passing arguments, so other stuff appearing 
isn't a big deal.


Let me sleep on this one :-)
Jeff

Re: The nvptx port [8/11+] Write undefined decls.

2014-10-21 Thread Jeff Law


On 10/20/14 14:30, Bernd Schmidt wrote:

ptx assembly requires that declarations are written for undefined
variables. This adds that functionality.


Bernd


008-undefdecl.diff


gcc/
* target.def (assemble_undefined_decl): New hooks.
* hooks.c (hook_void_FILEptr_constcharptr_const_tree): New function.
* hooks.h (hook_void_FILEptr_constcharptr_const_tree): Declare.
* doc/tm.texi.in (TARGET_ASM_ASSEMBLE_UNDEFINED_DECL): Add.
* doc/tm.texi: Regenerate.
* output.h (assemble_undefined_decl): Declare.
(get_fnname_from_decl): Declare.
* varasm.c (assemble_undefined_decl): New function.
(get_fnname_from_decl): New function.
* final.c (rest_of_handle_final): Use it.
* varpool.c (varpool_output_variables): Call assemble_undefined_decl
for nodes without a definition.

Does this need to happen at the use site, or can it be deferred?

THe PA had to do something similar.  We built up a vector of every 
external object in ASM_OUTPUT_EXTERNAL, but did not emit anything.


Then in ASM_FILE_END, we walked that vector and anything that was 
actually referenced (as opposed to just just declared) we would emit the 
magic .IMPORT lines.


Jeff

Re: The nvptx port [9/11+] Epilogues

2014-10-21 Thread Jeff Law


On 10/20/14 14:32, Bernd Schmidt wrote:

We skip the late compilation passes on ptx, but there's one piece we do
need - fixing up the function so that we get return insns in the right
places. This patch just makes thread_prologue_and_epilogue_insns
callable from the reorg pass.


Bernd

009-proep.diff


gcc/
* function.c (thread_prologue_and_epilogue_insns): No longer static.
* function.h (thread_prologue_and_epilogue_insns): Declare.

OK.
Jeff

Re: The nvptx port [7/11+] Inform the port about call arguments

2014-10-21 Thread Bernd Schmidt


On 10/21/2014 11:53 PM, Jeff Law wrote:


So, in the end I'm torn.  I don't like adding new hooks when they're not
needed, but I have some reservations about relying on the order of stuff
in CALL_INSN_FUNCTION_USAGE and I worry a bit that you might end up with
stuff other than arguments on that list -- the PA port could filter on
the hard registers used for passing arguments, so other stuff appearing
isn't a big deal.


This is another worry. Also, at the moment we don't actually add the 
pseudos to CALL_INSN_FUNCTION_USAGE (that's patch 6/11), we use the regs 
saved by the call_args hook to make proper USEs in a PARALLEL. I'm not 
convinced the rest of the compiler would be too happy to see pseudos there.


So, in all I'd say it's probably possible to do it that way, but it 
feels a lot iffier than I'd be happy with. I for one didn't know about 
the PA requirement, so I could easily have broken it unknowingly if I'd 
made some random change modifying call expansion.



Bernd

Re: The nvptx port [8/11+] Write undefined decls.

2014-10-21 Thread Bernd Schmidt


On 10/22/2014 12:05 AM, Jeff Law wrote:

On 10/20/14 14:30, Bernd Schmidt wrote:

ptx assembly requires that declarations are written for undefined
variables. This adds that functionality.

Does this need to happen at the use site, or can it be deferred?


This is independent of use sites. The patch just adds another walk over 
the varpool to emit not just the defined vars.


Ideally we'd maintain an order that declares or defines every variable 
before it is referenced by an initializer, but the attempt to do that in 
the compiler totally failed due to references between constant pools and 
regular variables. The nvptx-as tool we have fixes up the order of 
declarations after the first compilation stage.



THe PA had to do something similar.  We built up a vector of every
external object in ASM_OUTPUT_EXTERNAL, but did not emit anything.

Then in ASM_FILE_END, we walked that vector and anything that was
actually referenced (as opposed to just just declared) we would emit the
magic .IMPORT lines.


Sounds like the PA could use this hook to simplify its code quite a bit.

Looking at the patch again I noticed there's still some unrelated code 
in here - the patch used to be quite a lot larger and got shrunk due to 
the failure mentioned above. get_fnname_for_decl is just a new function 
broken out of rest_of_handle_final, it is used by the nvptx.c code.



Bernd

The nvptx port [0/11+]

2014-10-20 Thread Bernd Schmidt

This is a patch kit that adds the nvptx port to gcc. It contains 
preliminary patches to add needed functionality, the target files, and 
one somewhat optional patch with additional target tools. There'll be 
more patch series, one for the testsuite, and one to make the offload 
functionality work with this port. Also required are the previous four 
rtl patches, two of which weren't entirely approved yet.


For the moment, I've stripped out all the address space support that got 
bogged down in review by brokenness in our representation of address 
spaces. The ptx address spaces are of course still defined and used 
inside the backend.


Ptx really isn't a usual target - it is a virtual target which is then 
translated by another compiler (ptxas) to the final code that runs on 
the GPU. There are many restrictions, some imposed by the GPU hardware, 
and some by the fact that not everything you'd want can be represented 
in ptx. Here are some of the highlights:

 * Everything is typed - variables, functions, registers. This can
   cause problems with KR style C or anything else that doesn't
   have a proper type internally.
 * Declarations are needed, even for undefined variables.
 * Can't emit initializers referring to their variable's address since
   you can't write forward declarations for variables.
 * Variables can be declared only as scalars or arrays, not
   structures. Initializers must be in the variable's declared type,
   which requires some code in the backend, and it means that packed
   pointer values are not representable.
 * Since it's a virtual target, we skip register allocation - no good
   can probably come from doing that twice. This means asm statements
   aren't fixed up and will fail if they use matching constraints.
 * No support for indirect jumps, label values, nonlocal gotos.
 * No alloca - ptx defines it, but it's not implemented.
 * No trampolines.
 * No debugging (at all, for now - we may add line number directives).
 * Limited C library support - I have a hacked up copy of newlib
   that provides a reasonable subset.
 * malloc and free are defined by ptx (these appear to be
   undocumented), but there isn't a realloc. I have one patch for
   Fortran to use a malloc/memcpy helper function in cases where we
   know the old size.

All in all, this is not intended to be used as a C (or any other source 
language) compiler. I've gone through a lot of effort to make it work 
reasonably well, but only in order to get sufficient test coverage from 
the testsuites. The intended use for this is only to build it as an 
offload compiler, and use it through OpenACC by way of lto1. That leaves 
the question of how we should document it - does it need the usual 
constraint and option documentation, given that user's aren't expected 
to use any of it?


A slightly earlier version of the entire patch kit was bootstrapped and 
tested on x86_64-linux. Ok for trunk?



Bernd

The nvptx port [2/11+] No register allocation

2014-10-20 Thread Bernd Schmidt

Since it's a virtual target, I've chosen not to run register allocation. 
This is one of the patches necessary to make that work, it primarily 
adds a target hook to disable it and fixes some of the fallout.



Bernd

The nvptx port [1/11+] indirect jumps

2014-10-20 Thread Bernd Schmidt


ptx doesn't have indirect jumps, so CODE_FOR_indirect_jump may not be
defined.  Add a sorry.


Bernd
	gcc/
	* optabs.c (emit_indirect_jump): Test HAVE_indirect_jump and emit a
	sorry if necessary.


Index: gcc/optabs.c
===
--- gcc/optabs.c	(revision 422345)
+++ gcc/optabs.c	(revision 422346)
@@ -4477,13 +4477,16 @@ prepare_float_lib_cmp (rtx x, rtx y, enu
 /* Generate code to indirectly jump to a location given in the rtx LOC.  */
 
 void
-emit_indirect_jump (rtx loc)
+emit_indirect_jump (rtx loc ATTRIBUTE_UNUSED)
 {
+#ifndef HAVE_indirect_jump
+  sorry (indirect jumps are not available on this target);
+#else
   struct expand_operand ops[1];
-
   create_address_operand (ops[0], loc);
   expand_jump_insn (CODE_FOR_indirect_jump, 1, ops);
   emit_barrier ();
+#endif
 }
 
 #ifdef HAVE_conditional_move

Re: The nvptx port [3/11+] Struct returns

2014-10-20 Thread Bernd Schmidt

Even when returning a structure by passing an invisible reference, gcc 
still likes to set the return register to the address of the struct. 
This is undesirable on ptx where things like the return register have to 
be declared, and the function really returns void at ptx level. I've 
added a target hook to avoid this. I figure other targets might find it 
beneficial to omit this unnecessary set as well.



Bernd

	gcc/
	* target.def (omit_struct_return_reg): New data hook.
	* doc/tm.texi.in: Add @hook TARGET_OMIT_STRUCT_RETURN_REG.
	* doc/tm.texi: Regenerate.
	* function.c (expand_function_end): Use it.


Index: gcc/doc/tm.texi
===
--- gcc/doc/tm.texi	(revision 422355)
+++ gcc/doc/tm.texi	(revision 422356)
@@ -4560,6 +4560,14 @@ need more space than is implied by @code
 saving and restoring an arbitrary return value.
 @end defmac
 
+@deftypevr {Target Hook} bool TARGET_OMIT_STRUCT_RETURN_REG
+Normally, when a function returns a structure by memory, the address
+is passed as an invisible pointer argument, but the compiler also
+arranges to return the address from the function like it would a normal
+pointer return value.  Define this to true if that behaviour is
+undesirable on your target.
+@end deftypevr
+
 @deftypefn {Target Hook} bool TARGET_RETURN_IN_MSB (const_tree @var{type})
 This hook should return true if values of type @var{type} are returned
 at the most significant end of a register (in other words, if they are
Index: gcc/doc/tm.texi.in
===
--- gcc/doc/tm.texi.in	(revision 422355)
+++ gcc/doc/tm.texi.in	(revision 422356)
@@ -3769,6 +3769,8 @@ need more space than is implied by @code
 saving and restoring an arbitrary return value.
 @end defmac
 
+@hook TARGET_OMIT_STRUCT_RETURN_REG
+
 @hook TARGET_RETURN_IN_MSB
 
 @node Aggregate Return
Index: gcc/target.def
===
--- gcc/target.def	(revision 422355)
+++ gcc/target.def	(revision 422356)
@@ -3601,6 +3601,16 @@ structure value address at the beginning
 to emit adjusting code, you should do it at this point.,
  rtx, (tree fndecl, int incoming),
  hook_rtx_tree_int_null)
+
+DEFHOOKPOD
+(omit_struct_return_reg,
+ Normally, when a function returns a structure by memory, the address\n\
+is passed as an invisible pointer argument, but the compiler also\n\
+arranges to return the address from the function like it would a normal\n\
+pointer return value.  Define this to true if that behaviour is\n\
+undesirable on your target.,
+ bool, false)
+
 DEFHOOK
 (return_in_memory,
  This target hook should return a nonzero value to say to return the\n\
Index: gcc/function.c
===
--- gcc/function.c	(revision 422355)
+++ gcc/function.c	(revision 422356)
@@ -5179,8 +5179,8 @@ expand_function_end (void)
  If returning a structure PCC style,
  the caller also depends on this value.
  And cfun-returns_pcc_struct is not necessarily set.  */
-  if (cfun-returns_struct
-  || cfun-returns_pcc_struct)
+  if ((cfun-returns_struct || cfun-returns_pcc_struct)
+   !targetm.calls.omit_struct_return_reg)
 {
   rtx value_address = DECL_RTL (DECL_RESULT (current_function_decl));
   tree type = TREE_TYPE (DECL_RESULT (current_function_decl));

The nvptx port [4/11+] Post-RA pipeline

2014-10-20 Thread Bernd Schmidt

This stops most of the post-regalloc passes to be run if the target 
doesn't want register allocation. I'd previously moved them all out of 
postreload to the toplevel, but Jakub (I think) pointed out that the 
idea is not to run them to avoid crashes if reload fails e.g. for an 
invalid asm. So I've made a new container pass.


A later patch will make thread_prologue_and_epilogue_insns callable from 
the backend.



Bernd

	gcc/
	* passes.def (pass_compute_alignments, pass_duplicate_computed_gotos,
	pass_variable_tracking, pass_free_cfg, pass_machine_reorg,
	pass_cleanup_barriers, pass_delay_slots,
	pass_split_for_shorten_branches, pass_convert_to_eh_region_ranges,
	pass_shorten_branches, pass_est_nothrow_function_flags,
	pass_dwarf2_frame, pass_final): Move outside of pass_postreload and
	into pass_late_compilation.
	(pass_late_compilation): Add.
	* passes.c (pass_data_late_compilation, pass_late_compilation,
	make_pass_late_compilation): New.
	* timevar.def (TV_LATE_COMPILATION): New.


Index: gcc/passes.def
===
--- gcc/passes.def.orig
+++ gcc/passes.def
@@ -415,6 +415,9 @@ along with GCC; see the file COPYING3.
 	  NEXT_PASS (pass_split_before_regstack);
 	  NEXT_PASS (pass_stack_regs_run);
 	  POP_INSERT_PASSES ()
+  POP_INSERT_PASSES ()
+  NEXT_PASS (pass_late_compilation);
+  PUSH_INSERT_PASSES_WITHIN (pass_late_compilation)
 	  NEXT_PASS (pass_compute_alignments);
 	  NEXT_PASS (pass_variable_tracking);
 	  NEXT_PASS (pass_free_cfg);
Index: gcc/passes.c
===
--- gcc/passes.c.orig
+++ gcc/passes.c
@@ -569,6 +569,44 @@ make_pass_postreload (gcc::context *ctxt
   return new pass_postreload (ctxt);
 }
 
+namespace {
+
+const pass_data pass_data_late_compilation =
+{
+  RTL_PASS, /* type */
+  *all-late_compilation, /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_LATE_COMPILATION, /* tv_id */
+  PROP_rtl, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_late_compilation : public rtl_opt_pass
+{
+public:
+  pass_late_compilation (gcc::context *ctxt)
+: rtl_opt_pass (pass_data_late_compilation, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+  {
+return reload_completed || targetm.no_register_allocation;
+  }
+
+}; // class pass_late_compilation
+
+} // anon namespace
+
+static rtl_opt_pass *
+make_pass_late_compilation (gcc::context *ctxt)
+{
+  return new pass_late_compilation (ctxt);
+}
+
 
 
 /* Set the static pass number of pass PASS to ID and record that
Index: gcc/timevar.def
===
--- gcc/timevar.def.orig
+++ gcc/timevar.def
@@ -270,6 +270,7 @@ DEFTIMEVAR (TV_EARLY_LOCAL	 , early
 DEFTIMEVAR (TV_OPTIMIZE		 , unaccounted optimizations)
 DEFTIMEVAR (TV_REST_OF_COMPILATION   , rest of compilation)
 DEFTIMEVAR (TV_POSTRELOAD	 , unaccounted post reload)
+DEFTIMEVAR (TV_LATE_COMPILATION	 , unaccounted late compilation)
 DEFTIMEVAR (TV_REMOVE_UNUSED	 , remove unused locals)
 DEFTIMEVAR (TV_ADDRESS_TAKEN	 , address taken)
 DEFTIMEVAR (TV_TODO		 , unaccounted todo)

The nvptx port [5/11+] Variable declarations

2014-10-20 Thread Bernd Schmidt

ptx assembly follows rather different rules than what's typical 
elsewhere. We need a new hook to add a  }; string when we are finished 
outputting a variable with an initializer.



Bernd

	gcc/
	* target.def (decl_end): New hook.
	* varasm.c (assemble_variable_contents, assemble_constant_contents):
	Use it.
	* doc/tm.texi.in (TARGET_ASM_DECL_END): Add.
	* doc/tm.texi: Regenerate.


Index: gcc/doc/tm.texi
===
--- gcc/doc/tm.texi.orig
+++ gcc/doc/tm.texi
@@ -7575,6 +7575,11 @@ The default implementation of this hook
 when the relevant string is @code{NULL}.
 @end deftypefn
 
+@deftypefn {Target Hook} void TARGET_ASM_DECL_END (void)
+Define this hook if the target assembler requires a special marker to
+terminate an initialized variable declaration.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_ASM_OUTPUT_ADDR_CONST_EXTRA (FILE *@var{file}, rtx @var{x})
 A target hook to recognize @var{rtx} patterns that @code{output_addr_const}
 can't deal with, and output assembly code to @var{file} corresponding to
Index: gcc/doc/tm.texi.in
===
--- gcc/doc/tm.texi.in.orig
+++ gcc/doc/tm.texi.in
@@ -5412,6 +5412,8 @@ It must not be modified by command-line
 
 @hook TARGET_ASM_INTEGER
 
+@hook TARGET_ASM_DECL_END
+
 @hook TARGET_ASM_OUTPUT_ADDR_CONST_EXTRA
 
 @defmac ASM_OUTPUT_ASCII (@var{stream}, @var{ptr}, @var{len})
Index: gcc/target.def
===
--- gcc/target.def.orig
+++ gcc/target.def
@@ -127,6 +127,15 @@ when the relevant string is @code{NULL}.
  bool, (rtx x, unsigned int size, int aligned_p),
  default_assemble_integer)
 
+/* Notify the backend that we have completed emitting the data for a
+   decl.  */
+DEFHOOK
+(decl_end,
+ Define this hook if the target assembler requires a special marker to\n\
+terminate an initialized variable declaration.,
+ void, (void),
+ hook_void_void)
+
 /* Output code that will globalize a label.  */
 DEFHOOK
 (globalize_label,
Index: gcc/varasm.c
===
--- gcc/varasm.c.orig
+++ gcc/varasm.c
@@ -1945,6 +1945,7 @@ assemble_variable_contents (tree decl, c
   else
 	/* Leave space for it.  */
 	assemble_zeros (tree_to_uhwi (DECL_SIZE_UNIT (decl)));
+  targetm.asm_out.decl_end ();
 }
 }
 
@@ -3349,6 +3350,8 @@ assemble_constant_contents (tree exp, co
 
   /* Output the value of EXP.  */
   output_constant (exp, size, align);
+
+  targetm.asm_out.decl_end ();
 }
 
 /* We must output the constant data referred to by SYMBOL; do so.  */

The nvptx port [6/11+] Pseudo call args

2014-10-20 Thread Bernd Schmidt

On ptx, we'll be using pseudos to pass function args as well, and 
there's one assert that needs to be toned town to make that work.



Bernd

	gcc/
	* expr.c (use_reg_mode): Just return for pseudo registers.


Index: gcc/expr.c
===
--- gcc/expr.c	(revision 422421)
+++ gcc/expr.c	(revision 422422)
@@ -2321,7 +2321,10 @@ copy_blkmode_to_reg (enum machine_mode m
 void
 use_reg_mode (rtx *call_fusage, rtx reg, enum machine_mode mode)
 {
-  gcc_assert (REG_P (reg)  REGNO (reg)  FIRST_PSEUDO_REGISTER);
+  gcc_assert (REG_P (reg));
+
+  if (!HARD_REGISTER_P (reg))
+return;
 
   *call_fusage
 = gen_rtx_EXPR_LIST (mode, gen_rtx_USE (VOIDmode, reg), *call_fusage);

The nvptx port [7/11+] Inform the port about call arguments

2014-10-20 Thread Bernd Schmidt

In ptx assembly we need to decorate call insns with the arguments that 
are being passed. We also need to know the exact function type. This is 
kind of hard to do with the existing infrastructure since things like 
function_arg are called at other times rather than just when emitting a 
call, so this patch adds two more hooks, one called just before argument 
registers are loaded (once for each arg), and the other just after the 
call is complete.



Bernd

	gcc/
	* target.def (call_args, end_call_args): New hooks.
	* hooks.c (hook_void_rtx_tree): New empty function.
	* hooks.h (hook_void_rtx_tree): Declare.
	* doc/tm.texi.in (TARGET_CALL_ARGS, TARGET_END_CALL_ARGS): Add.
	* doc/tm.texi: Regenerate.
	* calls.c (expand_call): Slightly rearrange the code.  Use the two new
	hooks.
	(expand_library_call_value_1): Use the two new hooks.


Index: gcc/doc/tm.texi
===
--- gcc/doc/tm.texi.orig
+++ gcc/doc/tm.texi
@@ -5027,6 +5027,29 @@ except the last are treated as named.
 You need not define this hook if it always returns @code{false}.
 @end deftypefn
 
+@deftypefn {Target Hook} void TARGET_CALL_ARGS (rtx, @var{tree})
+While generating RTL for a function call, this target hook is invoked once
+for each argument passed to the function, either a register returned by
+@code{TARGET_FUNCTION_ARG} or a memory location.  It is called just
+before the point where argument registers are stored.  The type of the
+function to be called is also passed as the second argument; it is
+@code{NULL_TREE} for libcalls.  The @code{TARGET_END_CALL_ARGS} hook is
+invoked just after the code to copy the return reg has been emitted.
+This functionality can be used to perform special setup of call argument
+registers if a target needs it.
+For functions without arguments, the hook is called once with @code{pc_rtx}
+passed instead of an argument register.
+Most ports do not need to implement anything for this hook.
+@end deftypefn
+
+@deftypefn {Target Hook} void TARGET_END_CALL_ARGS (void)
+This target hook is invoked while generating RTL for a function call,
+just after the point where the return reg is copied into a pseudo.  It
+signals that all the call argument and return registers for the just
+emitted call are now no longer in use.
+Most ports do not need to implement anything for this hook.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_PRETEND_OUTGOING_VARARGS_NAMED (cumulative_args_t @var{ca})
 If you need to conditionally change ABIs so that one works with
 @code{TARGET_SETUP_INCOMING_VARARGS}, but the other works like neither
Index: gcc/doc/tm.texi.in
===
--- gcc/doc/tm.texi.in.orig
+++ gcc/doc/tm.texi.in
@@ -3929,6 +3929,10 @@ These machine description macros help im
 
 @hook TARGET_STRICT_ARGUMENT_NAMING
 
+@hook TARGET_CALL_ARGS
+
+@hook TARGET_END_CALL_ARGS
+
 @hook TARGET_PRETEND_OUTGOING_VARARGS_NAMED
 
 @node Trampolines
Index: gcc/hooks.c
===
--- gcc/hooks.c.orig
+++ gcc/hooks.c
@@ -245,6 +245,11 @@ hook_void_tree (tree a ATTRIBUTE_UNUSED)
 }
 
 void
+hook_void_rtx_tree (rtx, tree)
+{
+}
+
+void
 hook_void_constcharptr (const char *a ATTRIBUTE_UNUSED)
 {
 }
Index: gcc/hooks.h
===
--- gcc/hooks.h.orig
+++ gcc/hooks.h
@@ -70,6 +70,7 @@ extern void hook_void_constcharptr (cons
 extern void hook_void_rtx_int (rtx, int);
 extern void hook_void_FILEptr_constcharptr (FILE *, const char *);
 extern bool hook_bool_FILEptr_rtx_false (FILE *, rtx);
+extern void hook_void_rtx_tree (rtx, tree);
 extern void hook_void_tree (tree);
 extern void hook_void_tree_treeptr (tree, tree *);
 extern void hook_void_int_int (int, int);
Index: gcc/target.def
===
--- gcc/target.def.orig
+++ gcc/target.def
@@ -3825,6 +3825,33 @@ not generate any instructions in this ca
  default_setup_incoming_varargs)
 
 DEFHOOK
+(call_args,
+ While generating RTL for a function call, this target hook is invoked once\n\
+for each argument passed to the function, either a register returned by\n\
+@code{TARGET_FUNCTION_ARG} or a memory location.  It is called just\n\
+before the point where argument registers are stored.  The type of the\n\
+function to be called is also passed as the second argument; it is\n\
+@code{NULL_TREE} for libcalls.  The @code{TARGET_END_CALL_ARGS} hook is\n\
+invoked just after the code to copy the return reg has been emitted.\n\
+This functionality can be used to perform special setup of call argument\n\
+registers if a target needs it.\n\
+For functions without arguments, the hook is called once with @code{pc_rtx}\n\
+passed instead of an argument register.\n\
+Most ports do not need to implement anything for this hook.,
+ void, (rtx, tree),

The nvptx port [8/11+] Write undefined decls.

2014-10-20 Thread Bernd Schmidt

ptx assembly requires that declarations are written for undefined 
variables. This adds that functionality.



Bernd

	gcc/
	* target.def (assemble_undefined_decl): New hooks.
	* hooks.c (hook_void_FILEptr_constcharptr_const_tree): New function.
	* hooks.h (hook_void_FILEptr_constcharptr_const_tree): Declare.
	* doc/tm.texi.in (TARGET_ASM_ASSEMBLE_UNDEFINED_DECL): Add.
	* doc/tm.texi: Regenerate.
	* output.h (assemble_undefined_decl): Declare.
	(get_fnname_from_decl): Declare.
	* varasm.c (assemble_undefined_decl): New function.
	(get_fnname_from_decl): New function.
	* final.c (rest_of_handle_final): Use it.
	* varpool.c (varpool_output_variables): Call assemble_undefined_decl
	for nodes without a definition.


Index: gcc/doc/tm.texi
===
--- gcc/doc/tm.texi.orig
+++ gcc/doc/tm.texi
@@ -7899,6 +7902,13 @@ global; that is, available for reference
 The default implementation uses the TARGET_ASM_GLOBALIZE_LABEL target hook.
 @end deftypefn
 
+@deftypefn {Target Hook} void TARGET_ASM_ASSEMBLE_UNDEFINED_DECL (FILE *@var{stream}, const char *@var{name}, const_tree @var{decl})
+This target hook is a function to output to the stdio stream
+@var{stream} some commands that will declare the name associated with
+@var{decl} which is not defined in the current translation unit.  Most
+assemblers do not require anything to be output in this case.
+@end deftypefn
+
 @defmac ASM_WEAKEN_LABEL (@var{stream}, @var{name})
 A C statement (sans semicolon) to output to the stdio stream
 @var{stream} some commands that will make the label @var{name} weak;
Index: gcc/doc/tm.texi.in
===
--- gcc/doc/tm.texi.in.orig
+++ gcc/doc/tm.texi.in
@@ -5693,6 +5693,8 @@ You may wish to use @code{ASM_OUTPUT_SIZ
 
 @hook TARGET_ASM_GLOBALIZE_DECL_NAME
 
+@hook TARGET_ASM_ASSEMBLE_UNDEFINED_DECL
+
 @defmac ASM_WEAKEN_LABEL (@var{stream}, @var{name})
 A C statement (sans semicolon) to output to the stdio stream
 @var{stream} some commands that will make the label @var{name} weak;
Index: gcc/hooks.c
===
--- gcc/hooks.c.orig
+++ gcc/hooks.c
@@ -139,6 +139,13 @@ hook_void_FILEptr_constcharptr (FILE *a
 {
 }
 
+/* Generic hook that takes (FILE *, const char *, constr_tree *) and does
+   nothing.  */
+void
+hook_void_FILEptr_constcharptr_const_tree (FILE *, const char *, const_tree)
+{
+}
+
 /* Generic hook that takes (FILE *, rtx) and returns false.  */
 bool
 hook_bool_FILEptr_rtx_false (FILE *a ATTRIBUTE_UNUSED,
Index: gcc/hooks.h
===
--- gcc/hooks.h.orig
+++ gcc/hooks.h
@@ -69,6 +69,8 @@ extern void hook_void_void (void);
 extern void hook_void_constcharptr (const char *);
 extern void hook_void_rtx_int (rtx, int);
 extern void hook_void_FILEptr_constcharptr (FILE *, const char *);
+extern void hook_void_FILEptr_constcharptr_const_tree (FILE *, const char *,
+		   const_tree);
 extern bool hook_bool_FILEptr_rtx_false (FILE *, rtx);
 extern void hook_void_rtx (rtx);
 extern void hook_void_tree (tree);
Index: gcc/target.def
===
--- gcc/target.def.orig
+++ gcc/target.def
@@ -158,6 +158,16 @@ global; that is, available for reference
 The default implementation uses the TARGET_ASM_GLOBALIZE_LABEL target hook.,
  void, (FILE *stream, tree decl), default_globalize_decl_name)
 
+/* Output code that will declare an external variable.  */
+DEFHOOK
+(assemble_undefined_decl,
+ This target hook is a function to output to the stdio stream\n\
+@var{stream} some commands that will declare the name associated with\n\
+@var{decl} which is not defined in the current translation unit.  Most\n\
+assemblers do not require anything to be output in this case.,
+ void, (FILE *stream, const char *name, const_tree decl),
+ hook_void_FILEptr_constcharptr_const_tree)
+
 /* Output code that will emit a label for unwind info, if this
target requires such labels.  Second argument is the decl the
unwind info is associated with, third is a boolean: true if
Index: gcc/final.c
===
--- gcc/final.c.orig
+++ gcc/final.c
@@ -4434,17 +4434,7 @@ leaf_renumber_regs_insn (rtx in_rtx)
 static unsigned int
 rest_of_handle_final (void)
 {
-  rtx x;
-  const char *fnname;
-
-  /* Get the function's name, as described by its RTL.  This may be
- different from the DECL_NAME name used in the source file.  */
-
-  x = DECL_RTL (current_function_decl);
-  gcc_assert (MEM_P (x));
-  x = XEXP (x, 0);
-  gcc_assert (GET_CODE (x) == SYMBOL_REF);
-  fnname = XSTR (x, 0);
+  const char *fnname = get_fnname_from_decl (current_function_decl);
 
   assemble_start_function (current_function_decl, fnname);
   final_start_function (get_insns

The nvptx port [9/11+] Epilogues

2014-10-20 Thread Bernd Schmidt

We skip the late compilation passes on ptx, but there's one piece we do 
need - fixing up the function so that we get return insns in the right 
places. This patch just makes thread_prologue_and_epilogue_insns 
callable from the reorg pass.



Bernd
	gcc/
	* function.c (thread_prologue_and_epilogue_insns): No longer static.
	* function.h (thread_prologue_and_epilogue_insns): Declare.


Index: gcc/function.c
===
--- gcc/function.c	(revision 422424)
+++ gcc/function.c	(revision 422425)
@@ -5945,7 +5945,7 @@ emit_return_for_exit (edge exit_fallthru
in a sibcall omit the sibcall_epilogue if the block is not in
ANTIC.  */
 
-static void
+void
 thread_prologue_and_epilogue_insns (void)
 {
   bool inserted;
Index: gcc/function.h
===
--- gcc/function.h	(revision 422424)
+++ gcc/function.h	(revision 422425)
@@ -773,6 +773,8 @@ extern void free_after_compilation (stru
 
 extern void init_varasm_status (void);
 
+extern void thread_prologue_and_epilogue_insns (void);
+
 #ifdef RTX_CODE
 extern void diddle_return_value (void (*)(rtx, void*), void*);
 extern void clobber_return_register (void);

The nvptx port [10/11+] Target files

2014-10-20 Thread Bernd Schmidt

These are the main target files for the ptx port. t-nvptx is empty for 
now but will grow some content with follow up patches.



Bernd


	* configure.ac: Allow configuring lto for nvptx.
	* configure: Regenerate.

	gcc/
	* config/nvptx/nvptx.c: New file.
	* config/nvptx/nvptx.h: New file.
	* config/nvptx/nvptx-protos.h: New file.
	* config/nvptx/nvptx.md: New file.
	* config/nvptx/t-nvptx: New file.
	* config/nvptx/nvptx.opt: New file.
	* common/config/nvptx/nvptx-common.c: New file.
	* config.gcc: Handle nvptx-*-*.

	libgcc/
	* config.host: Handle nvptx-*-*.
	* config/nvptx/t-nvptx: New file.
	* config/nvptx/crt0.s: New file.


Index: gcc/common/config/nvptx/nvptx-common.c
===
--- /dev/null
+++ gcc/common/config/nvptx/nvptx-common.c
@@ -0,0 +1,38 @@
+/* NVPTX common hooks.
+   Copyright (C) 2014 Free Software Foundation, Inc.
+   Contributed by Bernd Schmidt ber...@codesourcery.com
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 3, or (at your option)
+any later version.
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+http://www.gnu.org/licenses/.  */
+
+#include config.h
+#include system.h
+#include coretypes.h
+#include diagnostic-core.h
+#include tm.h
+#include tm_p.h
+#include common/common-target.h
+#include common/common-target-def.h
+#include opts.h
+#include flags.h
+
+#undef TARGET_HAVE_NAMED_SECTIONS
+#define TARGET_HAVE_NAMED_SECTIONS false
+
+#undef TARGET_DEFAULT_TARGET_FLAGS
+#define TARGET_DEFAULT_TARGET_FLAGS MASK_ABI64
+
+struct gcc_targetm_common targetm_common = TARGETM_COMMON_INITIALIZER;
Index: gcc/config.gcc
===
--- gcc/config.gcc.orig
+++ gcc/config.gcc
@@ -420,6 +420,9 @@ nios2-*-*)
 	cpu_type=nios2
 	extra_options=${extra_options} g.opt
 	;;
+nvptx-*-*)
+	cpu_type=nvptx
+	;;
 powerpc*-*-*)
 	cpu_type=rs6000
 	extra_headers=ppc-asm.h altivec.h spe.h ppu_intrinsics.h paired.h spu2vmx.h vec_types.h si2vmx.h htmintrin.h htmxlintrin.h
@@ -2148,6 +2151,10 @@ nios2-*-*)
 		;;
 esac
 	;;
+nvptx-*)
+	tm_file=${tm_file} newlib-stdint.h
+	tmake_file=nvptx/t-nvptx
+	;;
 pdp11-*-*)
 	tm_file=${tm_file} newlib-stdint.h
 	use_gcc_stdint=wrap
Index: gcc/config/nvptx/nvptx.c
===
--- /dev/null
+++ gcc/config/nvptx/nvptx.c
@@ -0,0 +1,2024 @@
+/* Target code for NVPTX.
+   Copyright (C) 2014 Free Software Foundation, Inc.
+   Contributed by Bernd Schmidt ber...@codesourcery.com
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+   License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   http://www.gnu.org/licenses/.  */
+
+#include config.h
+#include system.h
+#include coretypes.h
+#include tm.h
+#include rtl.h
+#include tree.h
+#include insn-flags.h
+#include output.h
+#include insn-attr.h
+#include insn-codes.h
+#include expr.h
+#include regs.h
+#include optabs.h
+#include recog.h
+#include ggc.h
+#include timevar.h
+#include tm_p.h
+#include tm-preds.h
+#include tm-constrs.h
+#include function.h
+#include langhooks.h
+#include dbxout.h
+#include target.h
+#include target-def.h
+#include diagnostic.h
+#include basic-block.h
+#include stor-layout.h
+#include calls.h
+#include df.h
+#include builtins.h
+#include hashtab.h
+#include sstream
+
+/* Record the function decls we've written, and the libfuncs and function
+   decls corresponding to them.  */
+static std::stringstream func_decls;
+static GTY((if_marked (ggc_marked_p), param_is (struct rtx_def)))
+  htab_t declared_libfuncs_htab;
+static GTY((if_marked (ggc_marked_p), param_is (union tree_node)))
+  htab_t declared_fndecls_htab;
+static GTY((if_marked (ggc_marked_p), param_is (union tree_node)))
+  htab_t needed_fndecls_htab;
+
+/* Allocate a new, cleared machine_function structure.  */
+
+static struct machine_function *
+nvptx_init_machine_status (void)
+{
+  struct machine_function *p =

The nvptx port [11/11] More tools.

2014-10-20 Thread Bernd Schmidt

This is a bonus optional patch which adds ar, ranlib, as and ld to the 
ptx port. This is not proper binutils; ar and ranlib are just linked to 
the host versions, and the other two tools have the following functions:


* nvptx-as is required to convert the compiler output to actual valid
  ptx assembly, primarily by reordering declarations and definitions.
  Believe me when I say that I've tried to make that work in the
  compiler itself and it's pretty much impossible without some really
  invasive changes.
* nvptx-ld is just a pseudo linker that works by concatenating ptx
  input files and separating them with nul characters. Actual linking
  is something that happens later, when calling CUDA library functions,
  but existing build system make it useful to have something called
  ld which is able to bundle everything that's needed into a single
  file, and this seemed to be the simplest way of achieving this.

There's a toplevel configure.ac change necessary to make ar/ranlib 
useable by the libgcc build. Having some tools built like this has some 
precedent in t-vmsnative, but as Thomas noted it does make feature tests 
in gcc's configure somewhat ugly (but everything works well enough to 
build the compiler). The alternative here is to bundle all these files 
into a separate nvptx-tools package which users would have to download - 
something that would be nice to avoid.


These tools currently require GNU extensions - something I probably 
ought to fix if we decide to add them to the gcc build itself.



Bernd

	* configure.ac (AR_FOR_TARGET, RANLIB_FOR_TARGET): If nvptx-*,
	look for them in the gcc build directory.
	* configure: Regenerate.

	gcc/
	* config.gcc (nvptx-*): Define extra_programs.
	* config/nvptx/nvptx-as.c: New file.
	* config/nvptx/nvptx-ld.c: New file.
	* config/nvptx/t-nvptx (nvptx-ld.o, nvptx-as.o, collect-ld$(exeext),
	as$(exeext), ar$(exeext), ranlib$(exeext): New rules.

Index: git/gcc/config.gcc
===
--- git.orig/gcc/config.gcc
+++ git/gcc/config.gcc
@@ -2154,6 +2154,7 @@ nios2-*-*)
 nvptx-*)
 	tm_file=${tm_file} newlib-stdint.h
 	tmake_file=nvptx/t-nvptx
+	extra_programs=collect-ld\$(exeext) as\$(exeext) ar\$(exeext) ranlib\$(exeext)
 	;;
 pdp11-*-*)
 	tm_file=${tm_file} newlib-stdint.h
Index: git/gcc/config/nvptx/nvptx-as.c
===
--- /dev/null
+++ git/gcc/config/nvptx/nvptx-as.c
@@ -0,0 +1,961 @@
+/* An assembler for ptx.
+   Copyright (C) 2014 Free Software Foundation, Inc.
+   Contributed by Nathan Sidwell nat...@codesourcery.com
+   Contributed by Bernd Schmidt ber...@codesourcery.com
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+   License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   http://www.gnu.org/licenses/.  */
+
+/* Munges gcc-generated PTX assembly so that it becomes acceptable for ptxas.
+
+   This is not a complete assembler.  We presume the source is well
+   formed from the compiler and can die horribly if it is not.  */
+
+#include getopt.h
+#include stdlib.h
+#include stdio.h
+#include stdarg.h
+#include string.h
+#include wait.h
+#include unistd.h
+#include errno.h
+#define obstack_chunk_alloc malloc
+#define obstack_chunk_free free
+#include obstack.h
+#define HAVE_DECL_BASENAME 1
+#include libiberty.h
+#include hashtab.h
+
+#include list
+
+static const char *outname = NULL;
+
+static void __attribute__ ((format (printf, 1, 2)))
+fatal_error (const char * cmsgid, ...)
+{
+  va_list ap;
+
+  va_start (ap, cmsgid);
+  fprintf (stderr, nvptx-as: );
+  vfprintf (stderr, cmsgid, ap);
+  fprintf (stderr, \n);
+  va_end (ap);
+
+  unlink (outname);
+  exit (1);
+}
+
+struct Stmt;
+
+class symbol
+{
+ public:
+  symbol (const char *k) : key (k), stmts (0), pending (0), emitted (0)
+{ }
+
+  /* The name of the symbol.  */
+  const char *key;
+  /* A linked list of dependencies for the initializer.  */
+  std::listsymbol * deps;
+  /* The statement in which it is defined.  */
+  struct Stmt *stmts;
+  bool pending;
+  bool emitted;
+};
+
+/* Hash and comparison functions for these hash tables.  */
+
+static int hash_string_eq (const void *, const void *);
+static hashval_t hash_string_hash (const void *);
+
+static int
+hash_string_eq (const void *s1_p, const void *s2_p)
+{
+  const char *const *s1 = (const char *const *) s1_p;
+  const char *s2 = (const char *) s2_p;
+  return strcmp (*s1, s2) == 0;
+}
+

Re: The nvptx port [11/11] More tools.

2014-10-20 Thread Joseph S. Myers

On Mon, 20 Oct 2014, Bernd Schmidt wrote:

 These tools currently require GNU extensions - something I probably ought to
 fix if we decide to add them to the gcc build itself.

And as regards library use, I'd expect the sources to start with #includes 
of config.h and system.h (and so not include system headers directly if 
they are included by system.h) even if no other GCC headers are useful in 
any way.

-- 
Joseph S. Myers
jos...@codesourcery.com

97 matches

Mail list logo