Re: GDC for ARM MacOS / OSX

2023-06-29 Thread Cecil Ward via D.gnu

On Thursday, 29 June 2023 at 09:18:19 UTC, Iain Buclaw wrote:

On Thursday, 29 June 2023 at 05:27:57 UTC, Cecil Ward wrote:
I tried getting GCC on my ARM M2 Mac using the homebrew 
package manager, which is how I got LDC. It gave me C++ and C 
and FORTRAN, but no sign of any GDC.


How is GCC available when it has no support for 
aarch64-darwin2x?


(Experimental support is available in [iains fork on 
github](https://github.com/iains/gcc-darwin-arm64) targeting 
the 14.x development branch)


Iain and I have been going through arm64-darwin support in D 
run-time; mostly it's just getting cross bootstrap from gcc-11 
done cleanly.  Nothing that I expect to be made concretely 
available just yet, though I am hoping that GCC will finally 
get support for M1/M2 by the time 14.1 is released in May.


Thank you for your good work, Iain. I don’t have a working GDC, as 
the one on my Raspberry Pi AArch64 Debian Buster dies with an 
error message every time you try to compile. I have run GDC on 
32-bit ARM on the Raspberry Pi and in godbolt.org on x86-64.


I will get hold of an x86-64 box with Linux on it and get round 
the problems that way.


GDC for ARM MacOS / OSX

2023-06-28 Thread Cecil Ward via D.gnu
I tried getting GCC on my ARM M2 Mac using the homebrew package 
manager, which is how I got LDC. It gave me C++ and C and 
FORTRAN, but no sign of any GDC.


Re: bug report x86-64 code: je / jbe

2023-06-17 Thread Cecil Ward via D.gnu

On Friday, 16 June 2023 at 13:12:06 UTC, Iain Buclaw wrote:

On Wednesday, 14 June 2023 at 12:35:43 UTC, Cecil Ward wrote:
I have just noticed a bug in the latest release of GDC that 
targets x86-64. For example, GDC 12.3 and later versions too, 
running on x86-64 and targeting the host. This was built with:

 -O3 -frelease -march=alderlake



What leads you to believe that it is buggy?


Generated code is:

   je L1
   jbe L1



What I see is the first instruction is going to relate to this 
condition.



if ( unlikely( p == 1 ) ) return x;


Then the next instruction is the condition in the following 
for-loop.



for ( exp_ui_t i = p; i > 1; i >>= 1 )


Redundant jump? Yes, arguably. Leads to wrong runtime? Doesn't 
look that way.


Completely agree with Iain, it’s not incorrect code, I wasn’t 
intending to suggest that. I’d just say suboptimal, and not the 
very best code generation possible.
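
To make the redundancy concrete, here is a minimal sketch, not taken from the original module. `powTwoTests` mirrors the shape of the posted routine, where the `p == 1` test is immediately followed by the loop entry test `i > 1`. `powMergedTest` is a hypothetical variant that folds both fast paths into one `p <= 1` comparison. Both names are invented for illustration, and whether any GDC release emits a single branch for the merged form is an assumption that would need checking in Compiler Explorer.

```d
// Sketch only: both function names are hypothetical, not from the original module.

// Same shape as the posted routine: two fast-path tests, then the loop entry test.
ulong powTwoTests(ulong x, uint p) pure nothrow @safe @nogc
{
    if (p == 0) return 1;
    if (p == 1) return x;               // the test Iain ties to the `je`
    ulong s = x, v = 1;
    for (uint i = p; i > 1; i >>= 1)    // the loop condition he ties to the `jbe`
    {
        if (i & 1) v *= s;
        s *= s;
    }
    return v * s;
}

// Variant: one unsigned comparison covers both p == 0 and p == 1.
ulong powMergedTest(ulong x, uint p) pure nothrow @safe @nogc
{
    if (p <= 1) return p == 0 ? 1 : x;
    ulong s = x, v = 1;
    for (uint i = p; i > 1; i >>= 1)
    {
        if (i & 1) v *= s;
        s *= s;
    }
    return v * s;
}

unittest
{
    // The two forms must agree for every exponent.
    foreach (p; 0u .. 20u)
        assert(powTwoTests(3, p) == powMergedTest(3, p));
}
```

The two functions return the same values, so any difference would show up only in the generated branches, which is the suboptimality under discussion rather than a correctness issue.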


bug report x86-64 code: je / jbe

2023-06-14 Thread Cecil Ward via D.gnu
I have just noticed a bug in the latest release of GDC that 
targets x86-64. For example, GDC 12.3 and later versions too, 
running on x86-64 and targeting the host. This was built with:

 -O3 -frelease -march=alderlake

Generated code is:

   je L1
   jbe L1

…
…
L1:  ret

The probable reason for this is my use of the GDC built-in for 
indicating whether conditional jumps are likely or unlikely to be 
taken. I wrote trivial routines likely() and unlikely() and used 
them as follows:


public
T pow(T, exp_ui_t)( in T x, in exp_ui_t p ) pure @safe nothrow @nogc
if ( is ( exp_ui_t == ulong ) || is ( exp_ui_t == uint ) )
in  {
    static assert( is ( typeof( x * x ) ) );
    assert( p >= 0 );
}
out ( ret ) {
    assert( ( p == 0 && ret == 1 ) || !( p == 0 ) );
}
do
{
    if ( unlikely( p == 0 ) ) return 1;
    if ( unlikely( p == 1 ) ) return x;

    /*
    if ( unlikely( x == 0 ) )   // fast-path opt, unnecessary
        return x;
    if ( unlikely( x == 1 ) )   // fast-path opt, unnecessary
        return x;
    */

    T s = x;
    T v = 1;

    for ( exp_ui_t i = p; i > 1; i >>= 1 )
    {
        v = ( i & 0x1 ) ? s * v : v;
        s = s * s;
    }
    //assert( p > 1 && pow( x, p ) == ( p > 1 ? x * pow( x, p-1) : 1) );

    return v * s;
}

pragma( inline, true )
private
bool builtin_expect()( in bool test_cond, in bool expected_cond ) pure nothrow @safe @nogc
{
    version ( LDC )
    {   // ldc.intrinsics.llvm_expect - did not seem to work when tested in LDC 1.22
        import ldc.intrinsics : llvm_expect;
        return cast(bool) llvm_expect( test_cond, expected_cond );
    }
    version ( GDC )
    {
        import gcc.builtins : __builtin_expect;
        return cast(bool) __builtin_expect( test_cond, expected_cond );
    }
    return test_cond;
}


pragma( inline, true )
public
bool likely()( in bool test_cond ) pure nothrow @safe @nogc
/* Returns test_cond which makes it convenient to do assert( unlikely() )
 * Also emulates builtin_expect's return behaviour, by returning the argument
 */
{
    return builtin_expect( test_cond, true );
}


pragma( inline, true )
public
bool unlikely()( in bool test_cond ) pure nothrow @safe @nogc
/* Returns test_cond which makes it convenient to do assert( unlikely() )
 * Also emulates builtin_expect's return behaviour, by returning the argument
 */
{
    return builtin_expect( test_cond, false );
}
// ~~~ module likely - end.



This is not the whole of this .d file, I can of course give you 
the whole lot if you desire. I inspected the result in Matt 
Godbolt’s compiler explorer website godbolt.org.


An aside on LDC: I need to look at LDC’s llvm_expect to see if it 
is controlling the branches the way I wish. Does anyone know if 
llvm_expect has any problems?
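
One thing worth checking before blaming the hint itself: the D language specification lists `GNU` as the predefined version identifier for GDC (and `LDC` for LDC), so a `version ( GDC )` block is compiled only if that identifier is set by hand. The sketch below shows the same wrapper written as an `else version` chain, which also avoids leaving a trailing `return` after a branch that has already returned. The names `expectHint`, `likelyHint` and `unlikelyHint` are hypothetical, and the attributes are deliberately reduced to `nothrow @nogc @trusted` because the purity and safety of the two intrinsics are assumptions not verified here.

```d
// Sketch only: expectHint, likelyHint and unlikelyHint are hypothetical names.
// Purity and @safe-ty of the compiler intrinsics are not confirmed, so this
// sketch uses nothrow @nogc @trusted rather than pure @safe.
bool expectHint(bool cond, bool expected) nothrow @nogc @trusted
{
    version (LDC)
    {
        import ldc.intrinsics : llvm_expect;
        return llvm_expect(cond, expected);
    }
    else version (GNU)   // identifier predefined by GDC
    {
        import gcc.builtins : __builtin_expect;
        return __builtin_expect(cond, expected) != 0;
    }
    else
    {
        return cond;     // other compilers: no hint, same value
    }
}

// Both wrappers return their argument unchanged; only the hint differs.
bool likelyHint(bool cond) nothrow @nogc @trusted
{
    return expectHint(cond, true);
}

bool unlikelyHint(bool cond) nothrow @nogc @trusted
{
    return expectHint(cond, false);
}

unittest
{
    assert(likelyHint(true) && !likelyHint(false));
    assert(unlikelyHint(true) && !unlikelyHint(false));
}
```

If the hint was never active under GDC in the posted build, the `je` / `jbe` pair would be plain optimiser output rather than a side effect of `__builtin_expect`; that is a hypothesis to test, not a confirmed diagnosis.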


Regression - quality of generated x86-64 code between GDC v12.3 and v13.1

2023-06-07 Thread Cecil Ward via D.gnu
I wrote a very small procedure in D and the x86-64 asm code 
generated in GDC 12.3 was excellent whereas that from 13.1 was 
insanely bloated, totally different. Note: the badness is 
independent of the -On optimisation level (-O3 used initially.)


Here’s the D code and following it, two asm code snippets:





public
pragma( inline, true )
cpuid_abcd_t
cpuid_insn( in uint32_t eax ) pure nothrow @nogc @trusted
{   /* ecx arg omitted; absolutely minimal variant wrapper */
    assert( ! is_ecx_needed( eax ) );   // since we are not providing an ecx, we had better not be needing to supply one

    static assert( eax.sizeof * 8 == 32 );  // optional, exact
    static assert( eax.sizeof * 8 >= 32 );  // essential min

    const uint32_t in_eax = eax;    // really just for type-checking, and constness-assertion
    static assert( in_eax.sizeof * 8 == 32 );

    cpuid_abcd_t ret = void;    /* undefined until the cpuid insn writes it */
    static assert( ret.eax.sizeof * 8 == 32 && ret.ebx.sizeof * 8 == 32
                && ret.ecx.sizeof * 8 == 32 && ret.edx.sizeof * 8 == 32 );

    asm pure nothrow @nogc
    {
        ".intel_syntax   " ~ "\n\t" ~
        "cpuid"  ~ "\n\t" ~
        ".att_syntax \n"
        :   /* outputs : it is guaranteed that all bits 63…32 of rax/rbx/rcx/rdx etc are zeroed in output. */
            "=a" ( ret.eax ),   // an lhs ref, write-only; and only bits 31…0 are significant
            "=b" ( ret.ebx ),   // ..  ..
            "=c" ( ret.ecx ),
            "=d" ( ret.edx )
        :   /* inputs : */
            "a"  ( in_eax )     // read.
            // /* no ecx input - this is the variant with input ecx omitted */
        :   /* no clobbers apart from the outputs already listed */
            /* does cpuid set flags? - think not, so no "cc" clobber reqd */
        ;
    }
    return ret;
}

/*  */

GDC 12.3::  -O3 -frelease -march=native

push rbx
mov eax, edi
cpuid
mov rsi, rdx
sal rbx, 32
mov eax, eax
mov edx, ecx
sal rsi, 32
or  rax, rbx
pop rbx
or  rdx, rsi
ret


GDC 13.1 = v. bad, same switches:  -O3 -frelease -march=native

push rbp
mov eax, edi
mov rbp, rsp
push rbx
and rsp, -32
cpuid
vmovd   xmm3, eax
vmovd   xmm2, ecx
vpinsrd xmm1, xmm2, edx, 1
vpinsrd xmm0, xmm3, rbx, 1
vpunpcklqdq   xmm4, xmm0, xmm1
vmovdqa xmmword ptr [rsp-80], xmm4
mov rax, qword ptr [rsp-80]
mov rdx, qword ptr [rsp-72]
mov rbx, qword ptr [rbp-8]  
leave
ret
/*  */
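
For readers comparing the two listings: the GDC 12.3 code is consistent with a 16-byte result being returned in the `rax`:`rdx` register pair, with each 64-bit half assembled by `sal` and `or`. The sketch below spells out that packing in D. The post does not show the definition of `cpuid_abcd_t`, so the struct layout here (four 32-bit fields in the order eax, ebx, ecx, edx) is an assumption inferred from the listing, and `CpuidRegs`, `lowHalf` and `highHalf` are hypothetical names.

```d
// Sketch only: the post does not show cpuid_abcd_t, so this layout is assumed.
struct CpuidRegs   // hypothetical stand-in for cpuid_abcd_t
{
    uint eax, ebx, ecx, edx;
}

static assert(CpuidRegs.sizeof == 16);

// Value the 12.3 listing leaves in rax: eax in bits 31..0, ebx in bits 63..32.
ulong lowHalf(in CpuidRegs r) pure nothrow @safe @nogc
{
    return r.eax | (cast(ulong) r.ebx << 32);
}

// Value the 12.3 listing leaves in rdx: ecx in bits 31..0, edx in bits 63..32.
ulong highHalf(in CpuidRegs r) pure nothrow @safe @nogc
{
    return r.ecx | (cast(ulong) r.edx << 32);
}

unittest
{
    const r = CpuidRegs(1, 2, 3, 4);
    assert(lowHalf(r)  == 0x0000_0002_0000_0001UL);
    assert(highHalf(r) == 0x0000_0004_0000_0003UL);
}
```

The 13.1 listing appears to reach the same two 64-bit values by building the struct in an xmm register, storing it to the stack and reloading it, which is where the extra instructions come from.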


Re: Hello world in AAarch64 Debian Buster

2019-08-27 Thread Cecil Ward via D.gnu

On Thursday, 8 August 2019 at 01:17:53 UTC, Cecil Ward wrote:

On Wednesday, 7 August 2019 at 05:48:49 UTC, Iain Buclaw wrote:


You could raise a Debian bug report, saying that aarch64 is in 
the libphobos supported list.


Thanks Iain, I went through the prompts in the Debian bug 
report program and I hope that that has emailed a report to 
them which makes some kind of sense.


I’m assuming that someone somewhere just needs to rerun the 
make properly, or whatever. Is it a lack of error checking in 
the makefile or associated tools, which does not ring a bell 
when things go seriously wrong and allows a bad object file to 
be created even though the build went wrong internally and 
became a twisted manky thing before birth?


Building it all myself from sources is certainly a nightmare, 
so much missing and things left unautomated wrt provision of 
(non-source) files associated with dependencies.


Any suggestions as to where I should go from here for a bit 
(actually a lot) of hand-holding, as I am totally out of my depth? 
I’m just trying to find a human who could build gdc arm64 here for me.


Re: Hello world in AAarch64 Debian Buster

2019-08-07 Thread Cecil Ward via D.gnu

On Wednesday, 7 August 2019 at 05:48:49 UTC, Iain Buclaw wrote:


You could raise a Debian bug report, saying that aarch64 is in 
the libphobos supported list.


Thanks Iain, I went through the prompts in the Debian bug report 
program and I hope that that has emailed a report to them which 
makes some kind of sense.


I’m assuming that someone somewhere just needs to rerun the make 
properly, or whatever. Is it a lack of error checking in the 
makefile or associated tools, which does not ring a bell when 
things go seriously wrong and allows a bad object file to be 
created even though the build went wrong internally and became a 
twisted manky thing before birth?


Building it all myself from sources is certainly a nightmare, so 
much missing and things left unautomated wrt provision of 
(non-source) files associated with dependencies.


Re: Hello world in AAarch64 Debian Buster

2019-08-06 Thread Cecil Ward via D.gnu

On Tuesday, 6 August 2019 at 16:35:21 UTC, Johannes Pfau wrote:

On Tue, 06 Aug 2019 05:13:11 +0000, Cecil Ward wrote:

I have a Raspberry Pi 3B+ running Raspbian Stretch 32-bit with 
a containerised guest o/s inside it using systemd-nspawn, the 
guest o/s being AArch64 Debian Buster.


Inside AArch64 Debian Buster, I run the following from the 
shell and get an error from the gdc compiler:


root@debian-buster-64:~#  gdc -O3 -frelease -S test.d
cc1d: error: cannot find source code for runtime library file 'object.d'
cc1d: note: dmd might not be correctly installed. Run 'dmd -man' for installation instructions.
(null):0: confused by earlier errors, bailing out
root@debian-buster-64:~#

Any clues as to where I should head from here?


You're probably missing libgphobos-dev. However, I think on 
debian buster there is no arm64 port of libgphobos-dev yet.


Testing seems to have libgphobos-9-dev with arm64 support. If 
you want to use buster though, you probably have to build gcc 
by yourself. Just get the gcc 9 sources and use ./configure 
--enable-languages=d when configuring gcc.


Thank you very much Johannes. I have never done this before but I 
thought here goes, so I started out trying to build the whole of 
GCC including the D language from the sources.


I made a bit of a mess of this, because having written a bash 
script to set things going, I ran it from the wrong shell, the 
shell in the host o/s, not the one in the guest o/s. So this 
started off building the wrong architecture variant. I then 
realised I don’t know how to get my few files into the guest o/s 
Debian Buster’s filesystem (inside its chroot jail), so that’s 
more fun to work out. But worse than that, the make came up with 
an error saying there are a few dependencies that are not part 
of the download, GMP for example. So I’m going to have to build 
all of those from the sources as well, and find out how to 
download them. I could easily get into a circular dependency 
thing here.


I am so far out of my depth here. Some person who has done this 
before will have had binaries / object files / targets for those 
dependencies. Unless the files are just in the wrong place and I 
need to tell it where they are - which is one suggestion from the 
error msg.


To recap: I was trying to simply get hold of gdc for aarch64; I 
didn’t really want to build anything from sources. I suspect this 
will create additional problems faster than it solves the 
original ones.


Perhaps I should look around for gdc AArch64 Debian prebuilt 
binaries or ask someone who knows what on earth they are doing to 
kindly bootstrap this problem for me.


Hello world in AAarch64 Debian Buster

2019-08-05 Thread Cecil Ward via D.gnu
I have a Raspberry Pi 3B+ running Raspbian Stretch 32-bit with a 
containerised guest o/s inside it using systemd-nspawn, the guest 
o/s being AArch64 Debian Buster.


Inside AArch64 Debian Buster, I run the following from the shell 
and get an error from the gdc compiler:


root@debian-buster-64:~#  gdc -O3 -frelease -S test.d
cc1d: error: cannot find source code for runtime library file 'object.d'
cc1d: note: dmd might not be correctly installed. Run 'dmd -man' for installation instructions.
(null):0: confused by earlier errors, bailing out
root@debian-buster-64:~#

Any clues as to where I should head from here?



Trying to build GDC from sources (New fool - please be kind)

2018-03-26 Thread Cecil Ward via D.gnu
Got as far as downloading a huge .tar.xz file and extracting it, 
but I just guessed at a version of gcc sources to ftp-fetch in 
the first place.


I have a gcc-7.3.0 folder now. Is that the correct version number?


Next, the script file  setup-gcc.sh comes up with the error 
message

"found gcc version 7
 This version of GCC (7) is not supported."

It's looking for a patch file with a different name, based on 
version number. I currently have patch files patch-*8.patch, 
which do not match the check for 7.


Not sure where to go from here.



Re: [Bug 288] strange nonsensical x86-64 code generation with -O3 - rats' nest of useless conditional jumps

2018-03-25 Thread Cecil Ward via D.gnu

On Friday, 23 March 2018 at 22:14:35 UTC, Iain Buclaw wrote:

On Friday, 23 March 2018 at 00:39:13 UTC, Cecil Ward wrote:

On Thursday, 22 March 2018 at 22:16:16 UTC, Iain Buclaw wrote:

https://bugzilla.gdcproject.org/show_bug.cgi?id=288

--- Comment #1 from Iain Buclaw  ---
See the long list of useless conditional jumps towards the 
end of the first function in the asm output (whose demangled 
name is test.t1(unit))


Well, you'd never use -O3 if you care about speed anyway. :-)

And they are not useless jumps, it's just the foreach loop 
unrolled in its entirety.  You can see that it's a feature of 
the gcc-7 series and later; regardless of the target, they 
all produce the same unrolled loop.


https://explore.dgnu.org/g/vD3N4Y

It might be a nice experiment to add pragma(ivdep) and 
pragma(unroll) support

to give you more control.

https://gcc.gnu.org/onlinedocs/gcc/Loop-Specific-Pragmas.html

I wouldn't hold my breath though (this is not strictly a bug).


Agreed. It is possibly not a bug, because I don't see that the 
code is dysfunctional, but I haven't looked through it. But 
since the backend is doing optimisation here with unrolling, 
the sub-optimal result in this weird code is IMHO a bug, in 
that the aim of _optimisation_ is not attained.


No I understand this is nothing to do with D, and I understand 
that this is unrolling.


But notice that the targets of the jumps are all the same 
location, and the sequence finishes off with an unconditional 
jump to that same location.




Not quite, if you look a little closer, some jump to other 
branches hidden in between.


I feel this is just a quirk of unrolling, in part, but that's 
not all, I feel, as the jumps don't make sense:

 cmp #n / jxx L3
 cmp #m / jxx L3
 jmp L3

is what we have, so it all basically does absolutely nothing, 
unless cmp 1 cmp 2 cmp 3 cmp 4 is an incredibly bad way of 
testing ( x>=1 && x<=4 ), but with 30-odd tests it isn't very 
funny.




If you compile with -fdump-tree-optimized=stdout you will see 
that it's the middle-end that has lowered the code to a series 
of if jumps.


The backend consumer doesn't really have any chance for 
improving it.


I know this is merely debug-only code, but am wondering what 
else might happen if you are misguided enough to use the crazy 
-O3 with unrolled loops that have conditionals in them.


My other complaint about the GCC back-end’s code generation is 
that it (sometimes) doesn't go for jump-less movcc-style 
operations when it can. For x86/x64, LDC sometimes generates 
jump-free code using conditional instructions where GCC does 
not.


I can fix the problem with GDC by using a single & instead of 
a &&, which happens to be legal here. Sometimes in the past I 
have needed to take steps to make sure that I can do such an 
operator substitution trick in order to get jump-free, far 
faster code, faster where the alternatives are extremely short 
(and side-effect-free) and branch prediction failure is a 
certainty.




You could also try compiling with -O2.  I couldn't really see 
this in your given example, but honestly, if you want to 
optimize really aggressively you must be willing to coax the 
compiler in strange ways anyway.


I don't know if there are ways in which the backend could try 
to ascertain whether the results of certain unrolling are 
really bad. In some cases they could be bad because the code 
is too long and generates problems with code cache size or 
won't fit into a loop buffer. A highly per-CPU sub-variant 
check would need to be carried out on the generated code size, 
at least in all cases where there is still a loop left (as 
opposed to full unrolling of known size), as every kind of AMD 
and Intel processor is different, as Agner Fog warns us. Here 
though I didn't even explicitly ask for unrolling, so you 
might harshly say that it is the compiler’s job to work out 
whether it is actually an anti-optimisation, regardless of the 
possible reasons why the result may be bad news, never mind 
just based on total generated code size not fitting into some 
per-CPU limit.




Well again, from past experience -O3 doesn't really care about 
code size or cache line so much.  All optimizations passes 
which lower the code this way do so during SSA transformations, 
so irrespective of what is being targeted.


My reason for reporting this was to inquire about how for-loop 
unrolling behaves in later versions of the back end, to ask about 
jump generation vs jump-free alternatives (LDC showing the 
correct way to do things) and to ask if there are any 
suboptimality nasties lurking in code that does not merely 
come down to driving an assert.


I would hope for an optimisation that handles the case of 
dense-packed cmp #1 | cmp #2 | cmp #3 | cmp #4 etc, especially 
with no holes, in the case where _all the jumps go to the same 
target_, so this can get reduced down into a two-test range 
check and huge optimisation. I would also hope that 
conditional 

Re: [Bug 288] strange nonsensical x86-64 code generation with -O3 - rats' nest of useless conditional jumps

2018-03-22 Thread Cecil Ward via D.gnu

On Thursday, 22 March 2018 at 22:16:16 UTC, Iain Buclaw wrote:

https://bugzilla.gdcproject.org/show_bug.cgi?id=288

--- Comment #1 from Iain Buclaw  ---
See the long list of useless conditional jumps towards the end 
of the first function in the asm output (whose demangled name 
is test.t1(unit))


Well, you'd never use -O3 if you care about speed anyway. :-)

And they are not useless jumps, it's just the foreach loop 
unrolled in its entirety.  You can see that it's a feature of 
the gcc-7 series and later; regardless of the target, they 
all produce the same unrolled loop.


https://explore.dgnu.org/g/vD3N4Y

It might be a nice experiment to add pragma(ivdep) and 
pragma(unroll) support

to give you more control.

https://gcc.gnu.org/onlinedocs/gcc/Loop-Specific-Pragmas.html

I wouldn't hold my breath though (this is not strictly a bug).


Agreed. It is possibly not a bug, because I don't see that the 
code is dysfunctional, but I haven't looked through it. But since 
the backend is doing optimisation here with unrolling, the 
sub-optimal result in this weird code is IMHO a bug, in that the 
aim of _optimisation_ is not attained.


No I understand this is nothing to do with D, and I understand 
that this is unrolling.


But notice that the targets of the jumps are all the same location, 
and the sequence finishes off with an unconditional jump to that 
same location.


I feel this is just a quirk of unrolling, in part, but that's not 
all, I feel, as the jumps don't make sense:

 cmp #n / jxx L3
 cmp #m / jxx L3
 jmp L3

is what we have, so it all basically does absolutely nothing, 
unless cmp 1 cmp 2 cmp 3 cmp 4 is an incredibly bad way of 
testing ( x>=1 && x<=4 ), but with 30-odd tests it isn't very 
funny.


I know this is merely debug-only code, but am wondering what else 
might happen if you are misguided enough to use the crazy -O3 
with unrolled loops that have conditionals in them.


My other complaint about the GCC back-end’s code generation is that 
it (sometimes) doesn't go for jump-less movcc-style operations 
when it can. For x86/x64, LDC sometimes generates jump-free code 
using conditional instructions where GCC does not.


I can fix the problem with GDC by using a single & instead of a 
&&, which happens to be legal here. Sometimes in the past I have 
needed to take steps to make sure that I can do such an operator 
substitution trick in order to get jump-free, far faster code, 
faster where the alternatives are extremely short (and 
side-effect-free) and branch prediction failure is a certainty.
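
A minimal sketch of the operator substitution described above, with hypothetical function names and illustrative bounds of 1 and 4. The short-circuit form may skip its second operand, which invites a conditional jump; the bitwise form always evaluates both comparisons, which gives the code generator the option of flag-setting instructions with no branch. Whether a given compiler actually produces jump-free code for either form is not guaranteed and should be checked in the asm output.

```d
// Sketch only: function names and the 1..4 bounds are illustrative.

// Short-circuit form: the second test is skipped when the first fails.
bool inRangeShortCircuit(int x) pure nothrow @safe @nogc
{
    return x >= 1 && x <= 4;
}

// Bitwise form: both tests are always evaluated, then combined.
// Only valid because neither operand has side effects.
bool inRangeBitwise(int x) pure nothrow @safe @nogc
{
    return ((x >= 1) & (x <= 4)) != 0;
}

unittest
{
    // The two forms must agree on every input.
    foreach (x; -3 .. 9)
        assert(inRangeShortCircuit(x) == inRangeBitwise(x));
}
```

The substitution is safe only when the right-hand operand is cheap and side-effect-free; with something like a null check followed by a dereference, `&&` has to stay.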


I don't know if there are ways in which the backend could try to 
ascertain whether the results of certain unrolling are really 
bad. In some cases they could be bad because the code is too long 
and generates problems with code cache size or won't fit into a 
loop buffer. A highly per-CPU sub-variant check would need to be 
carried out on the generated code size, at least in all cases 
where there is still a loop left (as opposed to full unrolling of 
known size), as every kind of AMD and Intel processor is 
different, as Agner Fog warns us. Here though I didn't even 
explicitly ask for unrolling, so you might harshly say that it is 
the compiler’s job to work out whether it is actually an 
anti-optimisation, regardless of the possible reasons why the 
result may be bad news, never mind just based on total generated 
code size not fitting into some per-CPU limit.


My reason for reporting this was to inquire about how for-loop 
unrolling behaves in later versions of the back end, to ask about 
jump generation vs jump-free alternatives (LDC showing the 
correct way to do things) and to ask if there are any 
suboptimality nasties lurking in code that does not merely come 
down to driving an assert.


I would hope for an optimisation that handles the case of 
dense-packed cmp #1 | cmp #2 | cmp #3 | cmp #4 etc, especially 
with no holes, in the case where _all the jumps go to the same 
target_, so this can get reduced down into a two-test range check 
and huge optimisation. I would also hope that conditional jumps 
followed by an unconditional jump could be spotted and handled 
too. (Peephole general low level optimisation then? ie jxx L1 / 
jmp L1 =  jmp L1)
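
The reduction hoped for here can be sketched at source level. A dense run of equality tests with no holes is equivalent to a two-test range check, and for unsigned operands the range check can be folded further into a single comparison by subtracting the lower bound first. The function names and the 1 to 4 bounds are illustrative; the single-comparison form is a well-known idiom added for comparison, not something the post asks for.

```d
// Sketch only: names and the 1..4 bounds are illustrative.

// What the unrolled code effectively tests, one compare per value.
bool chainOfCompares(uint x) pure nothrow @safe @nogc
{
    return x == 1 || x == 2 || x == 3 || x == 4;
}

// The two-test range check described in the post.
bool twoTestRange(uint x) pure nothrow @safe @nogc
{
    return x >= 1 && x <= 4;
}

// One comparison: for x == 0 the subtraction wraps to uint.max, which fails the test.
bool oneTestRange(uint x) pure nothrow @safe @nogc
{
    return x - 1 <= 3;
}

unittest
{
    foreach (x; 0u .. 10u)
    {
        assert(chainOfCompares(x) == twoTestRange(x));
        assert(chainOfCompares(x) == oneTestRange(x));
    }
    assert(!oneTestRange(uint.max));
}
```

With 30-odd dense values the same equivalence holds, so the chain could in principle be replaced by one or two compares when every jump shares a target.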


Perhaps this is all being generated too late, and optimisations 
have already happened and the opportunities those optimisers 
provide have been and gone. Would it be possible for the backend 
to include a number of repeat optimisation passes of certain 
kinds after unrolled code is generated, or doesn't it work like 
that?


Anyway, this is not for you but for that particular backend, I 
suspect. I was wondering if someone could have a word and pass it 
on to the relevant people. I think it's worth making -O3 more 
generally usable rather than crazy because it features good ideas 
gone bad.


How to report code generation weirdness

2018-03-20 Thread Cecil Ward via D.gnu
How do I report some extremely weird (useless) code generated by 
GDC when the -O3 option is used? (bizarre rats’ nest of 
conditional jumps). [I am an experienced professional asm 
programmer, now retired.]


The D source is short, fortunately. The asm output I am looking 
at is seen through the telescope that is Matt Godbolt’s Compiler 
Explorer at http://explore.dgnu.org so I possibly would want to 
pull the asm text from that site (somehow).