Re: [fpc-devel] Question on updating FPC packages

2019-10-31 Thread J. Gareth Moreton
To get back on track with uComplex, I didn't change any routines to make 
them inline - they were that way already.  All I did was change the 
parameters to 'const', align the complex type so that it is equivalent to 
__m128d and the System V ABI can pass it in a single register, and enable 
vectorcall on Win64 so the same thing can happen on that platform.  Is 
that really too much?
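
To sketch what that amounts to in code (the names and exact syntax here are illustrative, not the literal patch):

type
  complex = record
    re, im: Double;   { 16 bytes; needs 16-byte alignment to match __m128d }
  end;

{ before: function cadd(z1, z2: complex): complex; inline; }
{ after:  'const' parameters, plus vectorcall on Win64 only }
function cadd(const z1, z2: complex): complex; {$ifdef WIN64} vectorcall; {$endif} inline;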


Changing the Win64 build of FPC to default to vectorcall is an option, 
although the option to fall back to the fastcall-based convention needs 
to exist for the sake of interfacing with third-party libraries, and it 
doesn't change the fact that the complex type still needs to be 
aligned.  Either way, it might break assembler code that calls the 
uComplex functions, but my argument still stands that I don't think this 
is a realistic set-up in the grand scheme of things.


Gareth aka. Kit

On 31/10/2019 21:13, Florian Klämpfl wrote:

On 31.10.19 at 20:11, Marco van de Voort wrote:


On 2019-10-30 at 23:02, Florian Klämpfl wrote:


Yes. And manually adding inline is only as good as the knowledge of the user doing so. If somebody implements it right (I did not; I used the easiest approach and an existing function to estimate the complexity of a subroutine), the compiler can just count the number of generated instructions, or even calculate the length of the procedure, and then decide to keep the node tree for inlining.


Well, it depends of course on what happens when. Would you really count final instructions or cycles after all optimization and peephole passes?


This is not really an issue: for inlining, mainly instruction count/code length matters, and e.g. the ARM compiler already does this (actually something more complex), as it has to insert the constant tables at the right locations in the code because the relative offsets are limited.



Re: [fpc-devel] Question on updating FPC packages

2019-10-31 Thread Florian Klämpfl

On 31.10.19 at 20:11, Marco van de Voort wrote:


On 2019-10-30 at 23:02, Florian Klämpfl wrote:


Yes. And manually adding inline is only as good as the knowledge of the user doing so. If somebody implements it right (I did not; I used the easiest approach and an existing function to estimate the complexity of a subroutine), the compiler can just count the number of generated instructions, or even calculate the length of the procedure, and then decide to keep the node tree for inlining.


Well, it depends of course on what happens when. Would you really count final instructions or cycles after all optimization and peephole passes?


This is not really an issue: for inlining, mainly instruction count/code length matters, and e.g. the ARM compiler already does this (actually something more complex), as it has to insert the constant tables at the right locations in the code because the relative offsets are limited.



Re: [fpc-devel] Question on updating FPC packages

2019-10-31 Thread Marco van de Voort


On 2019-10-30 at 23:02, Florian Klämpfl wrote:


Yes. And manually adding inline is only as good as the knowledge of the user doing so. If somebody implements it right (I did not; I used the easiest approach and an existing function to estimate the complexity of a subroutine), the compiler can just count the number of generated instructions, or even calculate the length of the procedure, and then decide to keep the node tree for inlining.


Well, it depends of course on what happens when. Would you really count final instructions or cycles after all optimization and peephole passes?





Re: [fpc-devel] Question on updating FPC packages

2019-10-30 Thread J. Gareth Moreton
Well, when it comes to the specific changes I made to uComplex... the compiler might be able to detect a kind of 'auto-const' system, but actually inserting 'const' into the formal parameters helps with syntax checking as well as generating more efficient code - namely, it catches attempts to modify a parameter when you're perhaps not supposed to.
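
A trivial illustration of that syntax-checking benefit (cnegate is a made-up routine, not from uComplex):

function cnegate(const z: complex): complex;
begin
  { writing "z.re := -z.re;" here would be rejected at compile time -
    exactly the kind of accidental modification that 'const' catches }
  cnegate.re := -z.re;
  cnegate.im := -z.im;
end;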


For vectorcall, I don't think the compiler will correctly guess when and when not to use the calling convention, and there are times when you may not want to use vectorcall, usually when interfacing with third-party programs or libraries.  In that case, it's more likely that the programmer will stumble upon unintended behaviour if the compiler tries to enable vectorcall for something that is meant to use the default Microsoft ABI instead.


And I don't think using assembly language to directly call the uComplex routines is a realistic real-world example, considering that's a situation where you're more likely to be using the XMM registers directly to do such mathematics.  Besides, I think all bets are off when it comes to assembly language - in this instance I tried to make sure that Pascal code didn't have to change, though (other than perhaps a recompilation).


I could just say 'screw it' and write my own complex number library, but then that would just add to the growing collection of third-party libraries instead of improving the standard ones, which are antiquated and potentially sluggish on modern systems.


Gareth aka. Kit

On 30/10/2019 22:02, Florian Klämpfl wrote:

On 29.10.19 at 14:06, Marco van de Voort wrote:


On 2019-10-27 at 10:46, Florian Klämpfl wrote:

On 27.10.19 at 10:27, Michael Van Canneyt wrote:
If you genuinely believe that micro-optimization changes can make a 
difference:


Submit patches. 


As said: I am against applying them. Why? They clutter code and, after all, they make assumptions about the current target which might not always be valid. And the time spent testing them is much better spent improving the compiler, so that all code benefits. Another point: explicit inline, for example, normally increases code size (not always, but often), so it works against the use of -Os. Applying inline manually to umpteen subroutines makes no sense. Better to improve auto inlining.


Auto inlining is also no panacea. It only works with heuristics, and is thus only as good as the formula of the heuristic.


Yes. And manually adding inline is only as good as the knowledge of the user doing so. If somebody implements it right (I did not; I used the easiest approach and an existing function to estimate the complexity of a subroutine), the compiler can just count the number of generated instructions, or even calculate the length of the procedure, and then decide to keep the node tree for inlining.




Changing calling conventions, vectorizing, loops - all of that complicates things, and it will never be perfect; a change here will lead to a problem there, etc.


See above.



If you know a routine can evaluate to one instruction in most cases, 
I don't see anything wrong with just marking it as such.




The compiler knows this as well, as the compiler generated the code. Why should I guess if the compiler knows?



Re: [fpc-devel] Question on updating FPC packages

2019-10-30 Thread Florian Klämpfl

On 29.10.19 at 14:06, Marco van de Voort wrote:


On 2019-10-27 at 10:46, Florian Klämpfl wrote:

On 27.10.19 at 10:27, Michael Van Canneyt wrote:
If you genuinely believe that micro-optimization changes can make a 
difference:


Submit patches. 


As said: I am against applying them. Why? They clutter code and, after all, they make assumptions about the current target which might not always be valid. And the time spent testing them is much better spent improving the compiler, so that all code benefits. Another point: explicit inline, for example, normally increases code size (not always, but often), so it works against the use of -Os. Applying inline manually to umpteen subroutines makes no sense. Better to improve auto inlining.


Auto inlining is also no panacea. It only works with heuristics, and is thus only as good as the formula of the heuristic.


Yes. And manually adding inline is only as good as the knowledge of the user doing so. If somebody implements it right (I did not; I used the easiest approach and an existing function to estimate the complexity of a subroutine), the compiler can just count the number of generated instructions, or even calculate the length of the procedure, and then decide to keep the node tree for inlining.




Changing calling conventions, vectorizing, loops - all of that complicates things, and it will never be perfect; a change here will lead to a problem there, etc.


See above.



If you know a routine can evaluate to one instruction in most cases, I 
don't see anything wrong with just marking it as such.




The compiler knows this as well, as the compiler generated the code. Why should I guess if the compiler knows?



Re: [fpc-devel] Question on updating FPC packages

2019-10-29 Thread Michael Van Canneyt



On Tue, 29 Oct 2019, Ben Grasset wrote:


On Sun, Oct 27, 2019 at 5:27 AM Michael Van Canneyt 
wrote:


Saying that the code is 'almost unusably slow' is the kind of statement
that does
not help. I use the code almost daily in production, no complaints about
performance, so clearly it is usable.

Instead, demonstrate your claim with facts, for example by creating a
patch that
demonstrably increases performance.



I was perhaps slightly exaggerating there. I use it as well in real life, but in many cases have found myself altering the sources to perform more optimally (some of which I could submit as patches, I suppose).


Please do.

As said, I rarely refuse patches for optimization for code I maintain,
exactly because I know I pay little attention to it.

Michael.


Re: [fpc-devel] Question on updating FPC packages

2019-10-29 Thread Ben Grasset
On Sun, Oct 27, 2019 at 5:27 AM Michael Van Canneyt 
wrote:

> Saying that the code is 'almost unusably slow' is the kind of statement
> that does
> not help. I use the code almost daily in production, no complaints about
> performance, so clearly it is usable.
>
> Instead, demonstrate your claim with facts, for example by creating a
> patch that
> demonstrably increases performance.
>

I was perhaps slightly exaggerating there. I use it as well in real life, but in many cases have found myself altering the sources to perform more optimally (some of which I could submit as patches, I suppose).

On Sun, Oct 27, 2019 at 5:27 AM Michael Van Canneyt 
wrote:

> If you genuinely believe that micro-optimization changes can make a
> difference:
>
> Submit patches. When focused and well explained, I doubt they will be
> refused.
>

The stuff that I'm particularly concerned about is usually more along the lines of "small things that add up in significant ways in the context of long-running programs", so while they might be "micro" on their own, I wouldn't necessarily call them that in the context of larger overall situations.

On Sun, Oct 27, 2019 at 5:46 AM Florian Klämpfl 
wrote:

> Another point: for example
> explicit inline increases normally code size (not always but often)


I've had the opposite experience in most cases. The code FPC generates for
something like four un-inlined functions in a situation where each one
calls the next is generally significantly bigger due to the setup for the
parameters being passed in / etc. Whereas if it's inlining all of them it
seems to be able to do a much better job of combining "redundant" things
and optimizing based on that, which tends to give a much smaller result.
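
A minimal sketch of the kind of chain I mean (purely illustrative):

function Step1(X: Double): Double; inline;
begin
  Step1 := X + 1.0;
end;

function Step2(X: Double): Double; inline;
begin
  Step2 := Step1(X) * 2.0;
end;

function Step3(X: Double): Double; inline;
begin
  Step3 := Step2(X) - 3.0;
end;

function Step4(X: Double): Double; inline;
begin
  Step4 := Step3(X) / 4.0;
end;

{ Un-inlined, every call sets up its own parameters and stack frame; inlined,
  the whole chain collapses into one expression in the caller. }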

Again, in a world where robust autoinlining was the default I'd happily
rely on it exclusively, as it's not as though I specifically *want* to have
to add the "inline" modifier in particular places.


Re: [fpc-devel] Question on updating FPC packages

2019-10-29 Thread J. Gareth Moreton

On 29/10/2019 14:24, Michael Van Canneyt wrote:



On Tue, 29 Oct 2019, J. Gareth Moreton wrote:

Please note that only Marco's e-mails are making the list.  I don't 
see Michael's responses.


That's probably because I am not responding ;-)

Michael.


Yep, just noticed that Marco was responding to your messages from a few 
days ago!  Perception fail!


In regards to passing everything into XMM0, try running 
"tests/test/cg/tvectorcall1.pp" on Linux.  It's a bit of a weird test 
because there's a lot of Win64 stuff that's not compiled since it tests 
aggregates, something that only vectorcall takes advantage of.


Nevertheless, if you get an error such as 'FAIL: 
HorizontalAddSingle(HVA) has the vector in the wrong register.', then 
the System V ABI is not passing the __m128 type properly. The way it 
tests this is via a pair of functions, one in Pascal and one in assembler:


function HorizontalAddSingle(V: TM128): Single; vectorcall;
begin
  HorizontalAddSingle := V.M128_F32[0] + V.M128_F32[1] +
    V.M128_F32[2] + V.M128_F32[3];
end;

function HorizontalAddSingle_ASM(V: TM128): Single; vectorcall; assembler; nostackframe;
asm
  HADDPS XMM0, XMM0
  HADDPS XMM0, XMM0
end;

If the results are not equal, then the entire vector isn't in XMM0.  I 
haven't tested it on Linux as much as I would like because I have to 
boot into a virtual machine to do so, and I'm still a bit of a Linux 
novice.  I'm curious to know what the assembler dump is though.
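
For anyone who wants to poke at this outside the test suite, a stripped-down harness in the same spirit would look roughly like this (the TM128 layout and the alignment handling are assumptions, not the actual declarations from tvectorcall1.pp):

program tvecharness;

{$asmmode intel}

type
  TM128 = record
    M128_F32: array[0..3] of Single;  { assumed layout; the real test ensures 16-byte alignment }
  end;

function HorizontalAddSingle(V: TM128): Single; vectorcall;
begin
  HorizontalAddSingle := V.M128_F32[0] + V.M128_F32[1] +
    V.M128_F32[2] + V.M128_F32[3];
end;

function HorizontalAddSingle_ASM(V: TM128): Single; vectorcall; assembler; nostackframe;
asm
  HADDPS XMM0, XMM0
  HADDPS XMM0, XMM0
end;

var
  V: TM128;
  I: Integer;
begin
  for I := 0 to 3 do
    V.M128_F32[I] := I + 1.5;     { 1.5, 2.5, 3.5, 4.5 - all exact in Single }
  if HorizontalAddSingle(V) = HorizontalAddSingle_ASM(V) then
    WriteLn('ok: whole vector arrived in XMM0')
  else
    WriteLn('FAIL: vector not passed entirely in XMM0');
end.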


Gareth aka. Kit




Re: [fpc-devel] Question on updating FPC packages

2019-10-29 Thread J. Gareth Moreton

Oh, I just noticed you're replying to messages from a few days ago.  Oops!

There is no right way of going about optimisation.  I'm of the school 
that if you can give the compiler a helpful hint, without complicating 
the code, then do it.


In one way I compare it to the id Tech (Quake) and Unreal engines back 
in the 90s and early 2000s.  When making maps, the id Tech engines 
attempted to compile everything itself when it came to determining what 
was visible and what it should cull - as a result, the compilation 
process would take a long time and there were some situations where it 
could easily fall apart due to rounding errors or just some glitch in 
the tree.  The Unreal engine, on the other hand, had /you/, the map 
designer, decide what was visible and what wasn't, and had you decide 
where to place portals and other hints to the engine.  This was useful 
because it was much easier to subdivide areas if you were sensible about 
it and hence the Unreal engine could handle much more complex outdoor 
scenes, for example.  The cost though, especially with later versions of 
the Unreal engine that added more features, is that it was very hard for 
a novice to get started - for example, the 'terrain' feature didn't do 
any automatic visibility culling, so if you had a large hill, for 
example, you would have to insert an 'anti-portal' underneath it to give 
a hint to the engine that if it is within the viewport, any polygons 
behind it is invisible (which causes very weird artefacts if you place 
one in the middle of an open room).


I like to take a middle ground, especially as the Pascal compiler has a reputation for being fast.  A smart compiler is a good compiler, but expecting it to know which procedures should be auto-vectorised, especially with old source code and no rules on memory alignment, is either impossible or would take a disproportionately long time.  Other times it's an excuse for lazy programming!


As for the vectorcall tests, they should vectorise the entire argument 
on both x86_64-win64 and x86_64-linux.  If not, there's a bug 
somewhere.  I'll have a look.


Gareth aka. Kit





Re: [fpc-devel] Question on updating FPC packages

2019-10-29 Thread Michael Van Canneyt



On Tue, 29 Oct 2019, J. Gareth Moreton wrote:

Please note that only Marco's e-mails are making the list.  I don't see 
Michael's responses.


That's probably because I am not responding ;-)

Michael.


Re: [fpc-devel] Question on updating FPC packages

2019-10-29 Thread J. Gareth Moreton
Please note that only Marco's e-mails are making the list.  I don't see 
Michael's responses.


Gareth aka. Kit

On 29/10/2019 13:41, Marco van de Voort wrote:


On 2019-10-27 at 10:27, Michael Van Canneyt wrote:



Absolutely.

Personally, I don't have any concern for performance in this sense. 
Almost zero.
I invariably favour code simplicity over performance, for sake of 
maintenance.


But there is another kick-in-the-open-door statement about performance: that most of the performance is gained in a relatively small part of the code.


To tackle that you need tools to force the compiler to behave a 
certain way that might not (yet?) be doable on the compiler side. IMHO 
it is unfair to deem this all microoptimization just because it 
doesn't hurt you.




For good reason: for the kind of code which I create daily, the kind of micro-optimizations that you seem to refer to are utterly insignificant, and I expect the compiler to handle them. If it currently does not, then I think the compiler, rather than the code, must be improved.


Just the vectorizing will probably more than double the performance. 
Just look at the asm that I posted and imagine reducing it to one 
instruction.


And while the FFT unit is not yet a performance bottleneck for us now, it has been marked as a relatively large factor of the measurement time (iirc it is about 1ms for a 400-sample array on somewhat older hardware).


And what exactly is needed might change at any given moment. If a new camera comes out and processing can keep up, you can process more samples, which in turn reduces errors and improves the measurement nearly automatically.


Doing the same purely algorithmically usually means weeks to months of hard maths trying to improve signal quality, and after that validating it for umpteen products and customers, etc. Believe me, "microoptimization" then sounds very tempting.


If Gareth can get this running enough to show that the FFT reduces 
instructions, I can just stuff it in a DLL, and have it lying on a 
shelf to insert into the Delphi app when needed. Which would be great.


Code should not entirely disregard optimization, but then it should be on a higher level: don't use bubble sort when you can use a better sort. No amount of micro-optimization will make bubble sort outperform quicksort.


(

Interesting example, I'm not really a hardcore algorithms man, but I 
can think of some potential problems with that statement:


1. That only goes for N -> infinity, and computers don't have infinite resources. If quicksort uses more memory (e.g. to track state) it might not apply in certain circumstances.


2. If your swap() function is extremely expensive, sorting an already sorted array is more expensive with quicksort because it is a non-stable sort.


3. The non-recursive bubble sort might be easier to unroll and then optimize by the compiler when sorting a fixed number of items (e.g. ordering the elements of a short vector).


)

Anyway, besides the fun, the "algorithms" mantra is only a first order 
guideline, not an absolute truth.


Saying that the code is 'almost unusably slow' is the kind of statement that does not help. I use the code almost daily in production, no complaints about performance, so clearly it is usable.


True. Claims should be proven, and with code that does something (not simply with a loop around a single operation).


But that is why I brought up the FFT unit. It is possible that that is 
such a case.





Re: [fpc-devel] Question on updating FPC packages

2019-10-29 Thread Marco van de Voort


On 2019-10-27 at 10:27, Michael Van Canneyt wrote:



Absolutely.

Personally, I don't have any concern for performance in this sense. 
Almost zero.
I invariably favour code simplicity over performance, for sake of 
maintenance.


But there is another kick-in-the-open-door statement about performance: that most of the performance is gained in a relatively small part of the code.


To tackle that you need tools to force the compiler to behave a certain 
way that might not (yet?) be doable on the compiler side. IMHO it is 
unfair to deem this all microoptimization just because it doesn't hurt you.




For good reason: for the kind of code which I create daily, the kind of micro-optimizations that you seem to refer to are utterly insignificant, and I expect the compiler to handle them. If it currently does not, then I think the compiler, rather than the code, must be improved.


Just the vectorizing will probably more than double the performance. 
Just look at the asm that I posted and imagine reducing it to one 
instruction.


And while the FFT unit is not yet a performance bottleneck for us now, it has been marked as a relatively large factor of the measurement time (iirc it is about 1ms for a 400-sample array on somewhat older hardware).


And what exactly is needed might change at any given moment. If a new camera comes out and processing can keep up, you can process more samples, which in turn reduces errors and improves the measurement nearly automatically.


Doing the same purely algorithmically usually means weeks to months of hard maths trying to improve signal quality, and after that validating it for umpteen products and customers, etc. Believe me, "microoptimization" then sounds very tempting.


If Gareth can get this running enough to show that the FFT reduces 
instructions, I can just stuff it in a DLL, and have it lying on a shelf 
to insert into the Delphi app when needed. Which would be great.


Code should not entirely disregard optimization, but then it should be on a higher level: don't use bubble sort when you can use a better sort. No amount of micro-optimization will make bubble sort outperform quicksort.


(

Interesting example, I'm not really a hardcore algorithms man, but I can 
think of some potential problems with that statement:


1. That only goes for N -> infinity, and computers don't have infinite resources. If quicksort uses more memory (e.g. to track state) it might not apply in certain circumstances.


2. If your swap() function is extremely expensive, sorting an already sorted array is more expensive with quicksort because it is a non-stable sort.


3. The non-recursive bubble sort might be easier to unroll and then optimize by the compiler when sorting a fixed number of items (e.g. ordering the elements of a short vector).


)

Anyway, besides the fun, the "algorithms" mantra is only a first order 
guideline, not an absolute truth.


Saying that the code is 'almost unusably slow' is the kind of statement that does not help. I use the code almost daily in production, no complaints about performance, so clearly it is usable.


True. Claims should be proven, and with code that does something (not simply with a loop around a single operation).


But that is why I brought up the FFT unit. It is possible that that is 
such a case.





Re: [fpc-devel] Question on updating FPC packages

2019-10-29 Thread Marco van de Voort


On 2019-10-27 at 10:46, Florian Klämpfl wrote:

On 27.10.19 at 10:27, Michael Van Canneyt wrote:
If you genuinely believe that micro-optimization changes can make a 
difference:


Submit patches. 


As said: I am against applying them. Why? They clutter code and, after all, they make assumptions about the current target which might not always be valid. And the time spent testing them is much better spent improving the compiler, so that all code benefits. Another point: explicit inline, for example, normally increases code size (not always, but often), so it works against the use of -Os. Applying inline manually to umpteen subroutines makes no sense. Better to improve auto inlining.


Auto inlining is also no panacea. It only works with heuristics, and is thus only as good as the formula of the heuristic.


Changing calling conventions, vectorizing, loops - all of that complicates things, and it will never be perfect; a change here will lead to a problem there, etc.


If you know a routine can evaluate to one instruction in most cases, I 
don't see anything wrong with just marking it as such.





Re: [fpc-devel] Question on updating FPC packages

2019-10-29 Thread Marco van de Voort


On 2019-10-29 at 12:23, J. Gareth Moreton wrote:
When it comes to testing vectorcall, uComplex isn't the best example 
actually because most of the operators are inlined.  There are a 
number of tests under "tests/test/cg" that test vectorcall and the 
System V ABI using a Pascal implementation of the opaque __m128 type 
(the two ABIs should behave exactly the same when dealing with simple 
vectors).


The last time I checked it didn't vectorise anything at all, so just native vectorising of the record of two singles would already be nice.


Last time I checked in 2017, complexadd inlined looked something like this:

    leal    32(%eax),%edx
    leal    8(%eax),%ecx
    vmovss  (%ecx),%xmm0
    vaddss  (%edx),%xmm0,%xmm0
    vmovss  %xmm0,-8(%ebp)
    vmovss  4(%ecx),%xmm0
    vaddss  4(%edx),%xmm0,%xmm0
    vmovss  %xmm0,-4(%ebp)

And I realize quite some rearrangements must be done.
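
For comparison, the kind of code one would hope for with an aligned, double-precision complex passed whole in XMM registers would be roughly this (hand-written illustration in the same AT&T style, not actual compiler output):

    vaddpd  %xmm1,%xmm0,%xmm0    # real and imaginary parts added in one go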



If anything though, the example function you gave (I'll need to 
double-check what ComplexScl does though, if it isn't a simple 
multiplication) 


It is simple multiplication of both real and imaginary with a scalar (as 
opposed to complex*complex which has more terms).


would be a pretty solid and heavy-duty test of the compiler attempting 
to vectorise the code - in an ideal world, individual calls to 
ComplexAdd and ComplexSub (which are simple + and - operations in 
uComplex) will compile into a single line of assembly language (ADDPD 
and SUBPD respectively).  Nevertheless, one could disable the inlining 
to see how well the compiler handles the function chaining, since with 
aligned data, the result from XMM0 should be easily transposed in one 
go to another XMM register if not just left alone as parameter data 
for the next function.



Yes, it is just a somewhat real-world codebase to play with. It is MPL even.


Re: [fpc-devel] Question on updating FPC packages

2019-10-29 Thread J. Gareth Moreton
When it comes to testing vectorcall, uComplex isn't the best example 
actually because most of the operators are inlined.  There are a number 
of tests under "tests/test/cg" that test vectorcall and the System V ABI 
using a Pascal implementation of the opaque __m128 type (the two ABIs 
should behave exactly the same when dealing with simple vectors).


If anything though, the example function you gave (I'll need to double-check what ComplexScl does, if it isn't a simple multiplication) would be a pretty solid and heavy-duty test of the compiler attempting to vectorise the code - in an ideal world, individual calls to ComplexAdd and ComplexSub (which are simple + and - operations in uComplex) will compile into a single line of assembly language (ADDPD and SUBPD respectively).  Nevertheless, one could disable the inlining to see how well the compiler handles the function chaining, since with aligned data, the result in XMM0 should be easily transposed in one go to another XMM register, if not just left alone as parameter data for the next function.
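
To make that concrete, the ideal non-inlined form would be something like this (a hand-written sketch in Intel notation, assuming a 16-byte-aligned double-precision TComplex passed and returned in XMM registers - not the actual uComplex or FFT-unit code):

function ComplexAdd(A, B: TComplex): TComplex; vectorcall; assembler; nostackframe;
asm
  ADDPD XMM0, XMM1   { both components of A and B summed in a single instruction }
end;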


Gareth aka. Kit


On 29/10/2019 11:06, Marco van de Voort wrote:


On 2019-10-27 at 09:02, Florian Klämpfl wrote:
I guess you're right.  It just seems weird because the System V ABI 
was designed from the start to use the MM registers fully, so long as 
the data is aligned.  In effect, it had vectorcall wrapped into its 
design from the start. Granted, vectorcall has some advantages and 
can deal with relatively complex aggregates that the System V ABI 
cannot handle (for example, a record type that contains a normal 
vector and information relating to bump mapping).


I just hoped that making updates to uComplex, while ensuring 
existing Pascal code still compiles, would help take advantage of 
modern ABI designs.


Is there currently any example which shows that vectorcall has any 
advantage with FPC? Else I would propose first to make FPC able to 
take advantage of it and then talk about if we really add vectorcall. 
Currently I fear, FPC gets only into trouble when using vectorcall as 
it tries first to push everything into one xmm register and then 
splits this again in the callee.


Nils Haeck's FFT unit might be interesting. (same guy as nativejpg 
unit iirc, http://www.simdesign.nl)


It is a D7 language level unit that uses a complex record and simple 
procedures as options. It should be easy to transpose to ucomplex. It 
is quite hll and switchable between single and double. (I use it in 
single mode, but to test vectorcall, obviously double mode would be 
best?)


And it has routines that do a variety of complex operations.

procedure FFT_5(var Z: array of TComplex); // usage of open array is 
to make things generic. Could be solved differently.


var
  T1, T2, T3, T4, T5: TComplex;
  M1, M2, M3, M4, M5: TComplex;
  S1, S2, S3, S4, S5: TComplex;
begin
  T1 := ComplexAdd(Z[1], Z[4]);
  T2 := ComplexAdd(Z[2], Z[3]);
  T3 := ComplexSub(Z[1], Z[4]);
  T4 := ComplexSub(Z[3], Z[2]);

  T5   := ComplexAdd(T1, T2);
  Z[0] := ComplexAdd(Z[0], T5);
  M1   := ComplexScl(c51, T5);
  M2   := ComplexScl(c52, ComplexSub(T1, T2));

  M3.Re := -c53 * (T3.Im + T4.Im);  // replace by i*add(t3,t4).scale(c53-i*c53) ?
  M3.Im :=  c53 * (T3.Re + T4.Re);
  M4.Re := -c54 * T4.Im;
  M4.Im :=  c54 * T4.Re;
  M5.Re := -c55 * T3.Im;
  M5.Im :=  c55 * T3.Re;

  S3 := ComplexSub(M3, M4);
  S5 := ComplexAdd(M3, M5);
  S1 := ComplexAdd(Z[0], M1);
  S2 := ComplexAdd(S1, M2);
  S4 := ComplexSub(S1, M2);

  Z[1] := ComplexAdd(S2, S3);
  Z[2] := ComplexAdd(S4, S5);
  Z[3] := ComplexSub(S4, S5);
  Z[4] := ComplexSub(S2, S3);
end;



Re: [fpc-devel] Question on updating FPC packages

2019-10-29 Thread Marco van de Voort


On 2019-10-27 at 09:02, Florian Klämpfl wrote:
I guess you're right.  It just seems weird because the System V ABI 
was designed from the start to use the MM registers fully, so long as 
the data is aligned.  In effect, it had vectorcall wrapped into its 
design from the start.  Granted, vectorcall has some advantages and 
can deal with relatively complex aggregates that the System V ABI 
cannot handle (for example, a record type that contains a normal 
vector and information relating to bump mapping).


I just hoped that making updates to uComplex, while ensuring existing 
Pascal code still compiles, would help take advantage of modern ABI 
designs.


Is there currently any example which shows that vectorcall has any 
advantage with FPC? Else I would propose first to make FPC able to 
take advantage of it and then talk about if we really add vectorcall. 
Currently I fear, FPC gets only into trouble when using vectorcall as 
it tries first to push everything into one xmm register and then 
splits this again in the callee.


Nils Haeck's FFT unit might be interesting. (same guy as nativejpg unit 
iirc, http://www.simdesign.nl)


It is a D7 language level unit that uses a complex record and simple 
procedures as options. It should be easy to transpose to ucomplex. It is 
quite hll and switchable between single and double. (I use it in single 
mode, but to test vectorcall, obviously double mode would be best?)


And it has routines that do a variety of complex operations.

procedure FFT_5(var Z: array of TComplex); // usage of open array is to 
make things generic. Could be solved differently.


var
  T1, T2, T3, T4, T5: TComplex;
  M1, M2, M3, M4, M5: TComplex;
  S1, S2, S3, S4, S5: TComplex;
begin
  T1 := ComplexAdd(Z[1], Z[4]);
  T2 := ComplexAdd(Z[2], Z[3]);
  T3 := ComplexSub(Z[1], Z[4]);
  T4 := ComplexSub(Z[3], Z[2]);

  T5   := ComplexAdd(T1, T2);
  Z[0] := ComplexAdd(Z[0], T5);
  M1   := ComplexScl(c51, T5);
  M2   := ComplexScl(c52, ComplexSub(T1, T2));

  M3.Re := -c53 * (T3.Im + T4.Im);  // replace by i*add(t3,t4).scale(c53-i*c53) ?
  M3.Im :=  c53 * (T3.Re + T4.Re);
  M4.Re := -c54 * T4.Im;
  M4.Im :=  c54 * T4.Re;
  M5.Re := -c55 * T3.Im;
  M5.Im :=  c55 * T3.Re;

  S3 := ComplexSub(M3, M4);
  S5 := ComplexAdd(M3, M5);
  S1 := ComplexAdd(Z[0], M1);
  S2 := ComplexAdd(S1, M2);
  S4 := ComplexSub(S1, M2);

  Z[1] := ComplexAdd(S2, S3);
  Z[2] := ComplexAdd(S4, S5);
  Z[3] := ComplexSub(S4, S5);
  Z[4] := ComplexSub(S2, S3);
end;



Re: [fpc-devel] Question on updating FPC packages

2019-10-27 Thread J. Gareth Moreton

Another point to bring up...

I could easily write a cross-platform complex number library that is 
designed to take advantage of vector registers whenever possible for the 
absolute best performance, but there comes a problem of having multiple 
libraries that do the same thing and not really sticking to any 
standard.  People tend to stick to what they're familiar with as well, 
and if a tool already exists, no matter how inefficient it is, people 
will use that instead. That's why I opted to update an existing library 
while doing my best to ensure Pascal code isn't broken.  When it comes 
to assembly language, all bets tend to be off anyway, although once 
again, I argue that using assembly language to directly interface with 
the complex number routines is not a realistic situation, since if you're writing things in assembly language, complex arithmetic is one of those things you would write in assembler as well for the sake of speed and efficiency.


Long story short... why would people use or update their code to use a 
new complex number library when one that's been tried and tested (albeit 
out of date) already exists?


Gareth aka. Kit




Re: [fpc-devel] Question on updating FPC packages

2019-10-27 Thread Michael Van Canneyt



On Sun, 27 Oct 2019, Sven Barth via fpc-devel wrote:


Michael Van Canneyt  schrieb am So., 27. Okt. 2019,
10:58:


Best of all would IMHO be to abolish or even totally ignore 'inline'.
It is a hint, after all. The compiler is not forced to inline, even
when the modifier is there.



That would be a bit problematic: auto inlining needs to first parse the routine to determine whether it can be inlined at all, which would then change the checksum of the interface section, as the routine would now carry the node information required for inlining which it didn't before, thus requiring an additional compilation pass of dependent units.


How does $autoinline work, then? Doesn't it have to do the same?

Michael.


Re: [fpc-devel] Question on updating FPC packages

2019-10-27 Thread Sven Barth via fpc-devel
Michael Van Canneyt  schrieb am So., 27. Okt. 2019,
10:58:

> Best of all would IMHO be to abolish or even totally ignore 'inline'.
> It is a hint, after all. The compiler is not forced to inline, even
> when the modifier is there.
>

That would be a bit problematic: auto inlining needs to first parse the routine to determine whether it can be inlined at all, which would then change the checksum of the interface section, as the routine would now carry the node information required for inlining which it didn't before, thus requiring an additional compilation pass of dependent units.

Regards,
Sven

>


Re: [fpc-devel] Question on updating FPC packages

2019-10-27 Thread J. Gareth Moreton
Originally my patch just added the "vectorcall" calling convention to the functions; the "const" modifier was suggested by a third party and seemed sensible enough.  I weighed up the fact that it wouldn't change how you call the function in Pascal code and accepted it.  The patch should be easy enough to split though, especially as the vectorcall part is now just "{$calling vectorcall}" at the top of the file.
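
So the shape of it is now roughly this (illustrative, not the literal diff; uComplex itself declares operators rather than these exact names):

{$ifdef WIN64}
  {$calling vectorcall}   { every routine below now uses vectorcall by default }
{$endif}

function cadd(const z1, z2: complex): complex; inline;
function csub(const z1, z2: complex): complex; inline;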


Gareth aka. Kit

On 27/10/2019 11:15, Michael Van Canneyt wrote:



On Sun, 27 Oct 2019, J. Gareth Moreton wrote:



I was more referring to the use of correct types, use const when 
possible etc.
Change classes to advanced records where appropriate, that kind of 
thing.


Michael.


Which is why I hoped my patches for uComplex were permissible, since they add 'const' to make the compilation more efficient and set the calling convention to 'vectorcall' for Win64, something that the compiler won't think to do unless explicitly told to, and maybe a slight incentive to improve the compiler as far as vectorisation is concerned (and complex numbers are a good candidate since, for most basic operations, the components are modified in tandem).


Well, I can't comment on this, in such matters I trust Florian knows what he is talking about.



I guess adding 'vectorcall' and 'const' are micro-optimisations, but 
I see it more as refactoring and good coding practice in the case of 
'const', while 'vectorcall' is more about knowing what kind of data 
you're dealing with.


I would not argue in the case of const and apply where appropriate, I 
don't know enough about vectorcall to comment.


Maybe the patch can be split into parts so const can already be applied.

It's not the first time we've had to advise keeping patches small and focused.

Although I am also often a sinner when it comes to mixing things in a
patch. When you're in the flow of things, that's the last thing on 
your mind :/


Michael.


Re: [fpc-devel] Question on updating FPC packages

2019-10-27 Thread Michael Van Canneyt



On Sun, 27 Oct 2019, J. Gareth Moreton wrote:



I was more referring to the use of correct types, use const when 
possible etc.

Change classes to advanced records where appropriate, that kind of thing.

Michael.


Which is why I hoped my patches for uComplex were permissible, since they add 'const' to make the compilation more efficient and set the calling convention to 'vectorcall' for Win64, something that the compiler won't think to do unless explicitly told to, and maybe a slight incentive to improve the compiler as far as vectorisation is concerned (and complex numbers are a good candidate since, for most basic operations, the components are modified in tandem).


Well, I can't comment on this, in such matters I trust Florian knows what he is 
talking
about.



I guess adding 'vectorcall' and 'const' are micro-optimisations, but I 
see it more as refactoring and good coding practice in the case of 
'const', while 'vectorcall' is more about knowing what kind of data 
you're dealing with.


I would not argue in the case of const and apply where appropriate, 
I don't know enough about vectorcall to comment.


Maybe the patch can be split into parts so const can already be applied.

It's not the first time we've had to advise keeping patches small and focused.

Although I am also often a sinner when it comes to mixing things in a
patch. When you're in the flow of things, that's the last thing on your mind :/

Michael.


Re: [fpc-devel] Question on updating FPC packages

2019-10-27 Thread J. Gareth Moreton


I was more referring to the use of correct types, use const when 
possible etc.

Change classes to advanced records where appropriate, that kind of thing.

Michael.


Which is why I hoped my patches for uComplex were permissible, since they add 'const' to make the compilation more efficient and set the calling convention to 'vectorcall' for Win64, something that the compiler won't think to do unless explicitly told to, and maybe a slight incentive to improve the compiler as far as vectorisation is concerned (and complex numbers are a good candidate since, for most basic operations, the components are modified in tandem).


I guess adding 'vectorcall' and 'const' are micro-optimisations, but I 
see it more as refactoring and good coding practice in the case of 
'const', while 'vectorcall' is more about knowing what kind of data 
you're dealing with.


Gareth aka. Kit




Re: [fpc-devel] Question on updating FPC packages

2019-10-27 Thread J. Gareth Moreton
Ideally, you should specify 'vectorcall' either when interfacing with 
third-party libraries, when the code can be vectorised by the compiler, 
or when doing it yourself in assembly language.  For example, if I 
wanted to write the cmod function in x86_64 assembler (Intel notation):


function cmod(z: Complex): Double; vectorcall; assembler; nostackframe;
asm
  MULPD XMM0, XMM0
  HADDPD XMM0, XMM0
  SQRTSD XMM0, XMM0
end;

Without vectorcall (or an unaligned type), where each field would be in 
a separate register, the code would instead be:


function cmod(z: Complex): Double; assembler; nostackframe;
asm
  MULSD XMM0, XMM0
  MULSD XMM1, XMM1
  ADDSD XMM0, XMM1
  SQRTSD XMM0, XMM0
end;

Admittedly the advantages are more obvious when using arrays of 
Singles.  I guess a good example would be a 4-component dot product (I 
know there's a dot product instruction in SSE4, but I'm ignoring it for 
now):


type
  TVector4 = record
    x, y, z, w: Single;
  end align 16; { hey, I can dream! }

function DotProduct(V: TVector4): Single; vectorcall; assembler; nostackframe;
asm
  MULPS XMM0, XMM0
  HADDPS XMM0, XMM0
  HADDPS XMM0, XMM0
  { Only the first component of XMM0 is considered for the result }
end;

And without vectorcall (or an unaligned type):

function DotProduct(V: TVector4): Single; assembler; nostackframe;
asm
  MULSS XMM0, XMM0
  MULSS XMM1, XMM1
  MULSS XMM2, XMM2
  MULSS XMM3, XMM3
  ADDSS XMM0, XMM1
  ADDSS XMM0, XMM2
  ADDSS XMM0, XMM3
end;

It's hard to say which function is more efficient here due to the 
latency of HADDPS and the multiple logic ports available (usually you 
can do at least two independent vector multiplications simultaneously), 
but the overhead of moving each field to a separate register will 
definitely add up.  At the very least though, for the first dot product 
example, if the compiler was able to produce such assembler from Pascal 
source, it would be much more efficient to inline because it only uses a 
single register throughout.  I'm not sure how the compiler would know to 
inline a function when it's reached the assembler stage though, even if 
the registers are still virtual.


To get back to the subject at hand... the advantages of vectorcall.  
Microsoft Visual C++ does have a compiler option where it automatically 
sets the calling convention to "vectorcall" rather than the default 
Microsoft calling convention (which is based off "fastcall"), since in 
most cases with integers, pointers and individual floating-point 
parameters, vectorcall doesn't behave any differently.  FPC would only be able to take full advantage of vectorcall, and of aligned types under Linux, if the compiler were made better at vectorising code.


As a side-note, I would like to propose adding the "fastcall" calling convention for i386-win32 and x86_64-win64 (and maybe other i386 and x86_64 platforms).  Under Win32, fastcall uses ECX and EDX for its first two parameters and EAX for the result (it's a worse form of Pascal's default 'register' convention, but it was designed in the days when C++ functions pushed all their parameters onto the stack).  Under Win64 it would be equivalent to 'ms_abi_default' and force the default Microsoft calling convention regardless of whether there was a setting to default to vectorcall (I consider the default calling convention to be based off fastcall because it uses RCX and RDX for its first two parameters, then adds R8 and R9 for the next two, and the XMM registers for floating-point arguments).  More than anything it would just help to interface with third-party libraries again.


Gareth aka. Kit

On 27/10/2019 08:02, Florian Klämpfl wrote:


On 27.10.19 at 07:32, J. Gareth Moreton wrote:
I guess you're right.  It just seems weird because the System V ABI 
was designed from the start to use the MM registers fully, so long as 
the data is aligned.  In effect, it had vectorcall wrapped into its 
design from the start. Granted, vectorcall has some advantages and 
can deal with relatively complex aggregates that the System V ABI 
cannot handle (for example, a record type that contains a normal 
vector and information relating to bump mapping).


I just hoped that making updates to uComplex, while ensuring existing 
Pascal code still compiles, would help take advantage of modern ABI 
designs.


Is there currently any example which shows that vectorcall has any 
advantage with FPC? Else I would propose first to make FPC able to 
take advantage of it and then talk about if we really add vectorcall. 
Currently I fear, FPC gets only into trouble when using vectorcall as 
it tries first to push everything into one xmm register and then 
splits this again in the callee.





Re: [fpc-devel] Question on updating FPC packages

2019-10-27 Thread Michael Van Canneyt



On Sun, 27 Oct 2019, Florian Klämpfl wrote:


On 27.10.19 at 10:27, Michael Van Canneyt wrote:
If you genuinely believe that micro-optimization changes can make a 
difference:


Submit patches. 


As said: I am against applying them. Why? They clutter code and, after all, they make assumptions about the current target which might not always be valid. And the time spent testing them is much better spent improving the compiler, so that all code benefits. Another point: explicit inline, for example, normally increases code size (not always, but often), so it works against the use of -Os. Applying inline manually to umpteen subroutines makes no sense. Better to improve auto inlining.


I am aware of your point of view, and I agree. Because, as I wrote:

As a rule, the programmer should not have to care about such things. 
The compiler must handle that. It knows better (well, it should :)).


Best of all would IMHO be to abolish or even totally ignore 'inline'. 
It is a hint, after all. The compiler is not forced to inline, even 
when the modifier is there.


I was more referring to the use of correct types, use const when possible etc.
Change classes to advanced records where appropriate, that kind of thing.

Michael.


Re: [fpc-devel] Question on updating FPC packages

2019-10-27 Thread Florian Klämpfl

On 27.10.19 at 10:27, Michael Van Canneyt wrote:
If you genuinely believe that micro-optimization changes can make a 
difference:


Submit patches. 


As said: I am against applying them. Why? They clutter code and, after all, they make assumptions about the current target which might not always be valid. And the time spent testing them is much better spent improving the compiler, so that all code benefits. Another point: explicit inline, for example, normally increases code size (not always, but often), so it works against the use of -Os. Applying inline manually to umpteen subroutines makes no sense. Better to improve auto inlining.



Re: [fpc-devel] Question on updating FPC packages

2019-10-27 Thread Michael Van Canneyt



On Sat, 26 Oct 2019, Ben Grasset wrote:


On Sat, Oct 26, 2019 at 1:31 PM Florian Klämpfl 
wrote:


This is imo a waste of time and only clutters the code. It is much more beneficial to improve the compiler to avoid copying the variable if it can prove that it is not needed (or to improve auto inlining).



While I absolutely agree that it would be nice if FPC auto-inlined *by
default*, as most compilers do (*without* the {$AUTOINLINE} optimization
directive that essentially nobody knows exists and thus never uses
anyways), FPC doesn't do so currently, and as far as I can tell probably
won't in the foreseeable future.


Clairvoyance is a rare gift.



At the risk of sounding overly abrasive or rude, there are *enormous* amounts of code in both the RTL and packages that are almost unusably slow due to what seems like a general lack of *any kind* of concern for performance.


Absolutely.

Personally, I don't have any concern for performance in this sense. Almost zero.
I invariably favour code simplicity over performance, for sake of maintenance.

For good reason: for the kind of code which I create daily, the kind of
micro-optimizations that you seem to refer to, are utterly insignificant,
and I expect the compiler to handle them. If it currently does not, then I
think the compiler, rather than the code, must be improved.

Code should not entirely disregard optimization, but then it should be on a
higher level: don't use bubble sort when you can use a better sort. No
amount of micro-optimization will make bubble sort outperform quicksort.

Saying that the code is 'almost unusably slow' is the kind of statement that 
does
not help. I use the code almost daily in production, no complaints about
performance, so clearly it is usable.

Instead, demonstrate your claim with facts, for example by creating a patch that
demonstrably increases performance.



Far too much of it is just un-inlined heap allocation on top of un-inlined
heap allocation on top of un-inlined heap-allocation on top of for-loop
that uses "Integer" when it should really use "SizeInt" on top of utter
avoidance of pointer arithmetic even though it's always faster on top of
methods that have no reason to be marked "virtual" but are anyways on top
of blah blah blah... I'm sure you get the point.
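
To picture just the Integer-versus-SizeInt point in isolation, a made-up fragment (not actual RTL code) would be:

procedure ZeroBuffer(var Buf: array of Byte);
var
  I: SizeInt;   { SizeInt matches the native word size; Integer stays 32-bit on x86_64 }
begin
  for I := 0 to High(Buf) do
    Buf[I] := 0;
end;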


These are the kind of micro-optimizations that are irrelevant for me.

About virtual:
In general, don't condemn the use of virtual unless you know why it was put there.
Extensibility & compatibility with Delphi are 2 important reasons.

Sizeint vs. Integer. 2 points:

1. A programmer should not have to care. The programmer must care
about 'what does the logic require', not 'what does the CPU require'.
It's the job of the compiler to make sure it creates the most suitable code
for a given type.

2. The current amount of Integer types is a historical mess. Many/Most of these
types did not exist when the RTL code was written. So if today with the
whole zoo of integers we have (it's like elementary particle physics
quadrupled) there is still a lot of code that uses suboptimal integer types:
it is only to be expected. I certainly don't go over the codebase whenever a
new integer type is invented.  Can this be improved ? Certainly. Do I want
to do this ? No, I think it is more important for me to add new functionality.



And of course I haven't even mentioned the fact that in reality, *anywhere*
that an advanced record (or even object) can be used instead of a class, it
should be, because it means you're avoiding an unnecessary allocation, but
good luck convincing anyone who matters of that!


Several points here.

Most of the code was written before advanced records existed.

There is backwards and/or delphi compatibility to be considered.

Advanced records also have a disadvantage: copying them is expensive.
So when advocating this change: make sure a record is not being passed
around and/or copied a lot.

That said, I haven't seen a single proposal where you personally would
change a class to an advanced record. But maybe I missed such cases?



I'm sure you get my point.


I think I do. I don't necessarily agree with all of what you say.

If you genuinely believe that micro-optimization changes can make a difference:

Submit patches. When focused and well explained, I doubt they will be refused.

When such patches appear for code that I wrote/maintain, I almost invariably
apply them. For most, I didn't even require explicit proof that they improve speed. 
It's not because I don't care about optimization that I deny someone else the right to care and to submit patches.

Michael.


Re: [fpc-devel] Question on updating FPC packages

2019-10-27 Thread Florian Klämpfl

On 27.10.19 at 01:07, Ben Grasset wrote:
FPC doesn't do so currently, and as far as I can tell probably 
won't in the foreseeable future.


Yes, people write only lengthy mails on fpc-devel instead of writing code.


Re: [fpc-devel] Question on updating FPC packages

2019-10-27 Thread Florian Klämpfl

On 27.10.19 at 07:32, J. Gareth Moreton wrote:
I guess you're right.  It just seems weird because the System V ABI was 
designed from the start to use the MM registers fully, so long as the 
data is aligned.  In effect, it had vectorcall wrapped into its design 
from the start.  Granted, vectorcall has some advantages and can deal 
with relatively complex aggregates that the System V ABI cannot handle 
(for example, a record type that contains a normal vector and 
information relating to bump mapping).


I just hoped that making updates to uComplex, while ensuring existing 
Pascal code still compiles, would help take advantage of modern ABI designs.


Is there currently any example which shows that vectorcall has any 
advantage with FPC? Otherwise I would propose to first make FPC able to 
take advantage of it and then talk about whether we really add vectorcall. 
Currently I fear that FPC only gets into trouble when using vectorcall, as 
it first tries to push everything into one xmm register and then splits 
this up again in the callee.

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Question on updating FPC packages

2019-10-27 Thread J. Gareth Moreton
I guess you're right.  It just seems weird because the System V ABI was 
designed from the start to use the MM registers fully, so long as the 
data is aligned.  In effect, it had vectorcall wrapped into its design 
from the start.  Granted, vectorcall has some advantages and can deal 
with relatively complex aggregates that the System V ABI cannot handle 
(for example, a record type that contains a normal vector and 
information relating to bump mapping).


I just hoped that making updates to uComplex, while ensuring existing 
Pascal code still compiles, would help take advantage of modern ABI designs.


Gareth aka. Kit

On 27/10/2019 01:12, Sven Barth via fpc-devel wrote:


I don't think the compiler can be made smart and safe enough to
auto-align something like the complex type to take full advantage of the
System V ABI, and vectorcall is not the default Win64 calling convention
(and the default convention is a little badly-designed if I'm allowed to
say, since it doesn't vectorise anything at all).


It's not badly designed, it's a child of its time. Back when Win64 was 
conceived it wasn't expected that the use of SSE would become as 
widespread as it is now. And one doesn't simply change a platform ABI 
on a whim. That's why Microsoft introduced vectorcall after all...


Regards,
Sven


___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel



___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Question on updating FPC packages

2019-10-26 Thread Sven Barth via fpc-devel
>
> I don't think the compiler can be made smart and safe enough to
> auto-align something like the complex type to take full advantage of the
> System V ABI, and vectorcall is not the default Win64 calling convention
> (and the default convention is a little badly-designed if I'm allowed to
> say, since it doesn't vectorise anything at all).
>

It's not badly designed, it's a child of its time. Back when Win64 was
conceived it wasn't expected that the use of SSE would become as widespread
as it is now. And one doesn't simply change a platform ABI on a whim.
That's why Microsoft introduced vectorcall after all...

Regards,
Sven

>
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Question on updating FPC packages

2019-10-26 Thread Ben Grasset
On Sat, Oct 26, 2019 at 1:31 PM Florian Klämpfl wrote:

> This is imo a waste of time and only clutters the code. It is much more
> beneficial to improve the compiler to avoid copying the variable if it can
> prove that the copy is not needed (or to improve auto inlining).
>

While I absolutely agree that it would be nice if FPC auto-inlined *by
default*, as most compilers do (*without* the {$AUTOINLINE} optimization
directive that essentially nobody knows exists and thus never uses
anyway), FPC doesn't do so currently, and as far as I can tell probably
won't in the foreseeable future.
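
(For anyone who wants to experiment: if I remember the spelling correctly,
auto-inlining can be switched on per unit roughly like this, or with
-OoAUTOINLINE on the command line; please double-check against the
documentation, I'm quoting from memory.)

  {$mode objfpc}
  {$optimization autoinline}

  { no explicit "inline" modifier; the compiler decides for itself }
  function Square(const X : Double) : Double;
  begin
    Result := X * X;
  end;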

At the risk of sounding overly abrasive or rude, there is an *enormous*
amount of code in both the RTL and packages that is almost unusably slow
due to what seems like a general lack of *any kind* of concern for
performance.

Far too much of it is just un-inlined heap allocation on top of un-inlined
heap allocation on top of un-inlined heap allocation, on top of a for-loop
that uses "Integer" when it should really use "SizeInt", on top of utter
avoidance of pointer arithmetic even though it's always faster, on top of
methods that have no reason to be marked "virtual" but are anyway, on top
of blah blah blah... I'm sure you get the point.
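
As a contrived illustration of the loop-index point (my own sketch, not
code from the RTL):

  { On a 64-bit target, SizeInt matches the pointer width, while Integer  }
  { is 32 bits and may force extra sign-extensions when used as an index. }
  procedure SumAll(const Data : array of Double; out Total : Double);
  var
    I : SizeInt;                 { rather than Integer }
  begin
    Total := 0;
    for I := 0 to High(Data) do
      Total := Total + Data[I];
  end;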

And of course I haven't even mentioned the fact that in reality, *anywhere*
that an advanced record (or even object) can be used instead of a class, it
should be, because it means you're avoiding an unnecessary allocation, but
good luck convincing anyone who matters of that!

I'm sure you get my point.

And no, I'm not advocating for "micro-optimization", or, as I constantly
hear, "stuff that doesn't matter except in contrived benchmarks"; I'm
advocating for the bare minimum standards that average people would and do
expect from the "standard" library and packages of a modern programming
language.

People are of course free to pretend that it doesn't matter that *each and
every* use of the "inline" modifier in the Classes unit is hidden behind a
"CLASSESINLINE" define never set to true in any makefile (which, yes, does
indeed mean that absolutely nothing in Classes is inlined, under any
circumstances, ever!), but I am at the same time free to realize that
incurring the cost of *two* function calls for every single indexed access
to a TFPList, instead of zero via inlining, is utterly insane, and modify my
local makefiles to define CLASSESINLINE.
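
The idiom in question looks roughly like this; I'm using a made-up TMyList
and MYLISTINLINE define here rather than quoting the actual RTL source:

  {$mode objfpc}{$H+}
  type
    TMyList = class
    private
      FItems : array of Pointer;
      function Get(Index : SizeInt) : Pointer; {$ifdef MYLISTINLINE} inline; {$endif}
    public
      property Items[Index : SizeInt] : Pointer read Get; default;
    end;

  function TMyList.Get(Index : SizeInt) : Pointer;
  begin
    Result := FItems[Index];
  end;

  { With MYLISTINLINE undefined, every List[I] is a real call to Get;   }
  { with it defined when the unit is built, the access can be inlined   }
  { down to a direct memory read.                                       }
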
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Question on updating FPC packages

2019-10-26 Thread J. Gareth Moreton
In my experiments on i386 and x86_64 (without the alignment changes), the
complex record is always passed by reference, but without const, the
function prologue then makes a copy of it on the function's local stack,
which is then referenced in the rest of the function. Whether or not const
is present, the same reference is passed into the function unmodified (the
compiled assembly language is no different).


I think in a way, Florian and I have slightly different views. I don't
trust the compiler to make optimal code (i.e. a lazy compiler... I didn't
want to say that Florian's compiler was inefficient until he himself said
it!), so I try to give it hints where I can. Inserting "const" modifiers
seems harmless enough, since this has been a documented Pascal feature for
decades, and most of the functions don't modify the parameter, so adding
"const" just enforces that on the compiler's side.


Granted, I do seek to make improvements to the compiler where possible,
and it's something I enjoy doing. In the case of 'auto-const', I imagine it
could be done at the node level, detecting that a parameter is only read
from and never written to, but there may still be traps where you modify it
without meaning to, causing inefficiencies. Case in point, I had to make
one small change to the "cth" function because it reused the parameter as a
temporary variable. Originally, it was this:


  function cth (z : complex) : complex;
    { hyperbolic complex tangent }
    { th(x) = sinh(x) / cosh(x) }
    { cosh(x) > 1 for all x }
    var temp: complex;
    begin
      temp := cch(z);
      z := csh(z);
      cth := z / temp;
    end;

I changed it to the following because specifying "const" caused a 
compiler error:


  function cth (const z : complex) : complex;
    { hyperbolic complex tangent }
    { th(x) = sinh(x) / cosh(x) }
    { cosh(x) > 1 for all x }
    var temp, hsinz : complex;
    begin
      temp := cch(z);
      hsinz := csh(z);
      cth := hsinz / temp;
    end;

I'm assuming there's a good reason why it can't simply be written as
"cth := csh(z) / cch(z);" (and it looks easier to auto-inline), although
currently that reason eludes me.
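
For reference, the single-expression form I have in mind would be something
like this (untested):

  function cth (const z : complex) : complex;
    { hyperbolic complex tangent: th(x) = sinh(x) / cosh(x) }
    begin
      cth := csh(z) / cch(z);
    end;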


I don't think the compiler can be made smart and safe enough to auto-align
something like the complex type to take full advantage of the System V ABI,
and vectorcall is not the default Win64 calling convention (and the default
convention is a little badly-designed if I'm allowed to say, since it
doesn't vectorise anything at all). Plus, other platforms may have more
restrictive memory availability, and coarse alignment is not desired there
since it causes wastage. Granted, when it comes to maintainability, the
little tricks required to align the complex type while keeping the same
field names are very tricky to understand and get correct (hence my
suggestion of a distinct "align ##" modifier at the end of the type
declaration, but that's another story).
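
To be explicit about what I mean by that modifier, the idea is purely
hypothetical syntax along these lines (it does not compile today):

  type
    complex = record
      re : double;
      im : double;
    end align 16;   { hypothetical "align ##" modifier, not implemented }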


I think the question of whether a micro-optimisation hurts maintainability
is fairly subjective and can only be determined on a case-by-case basis. In
my mind, if someone has done the optimisation and the code is still
relatively clean, then it's okay to merge, so long as everyone accepts it
and it's fully tested.


Gareth aka. Kit


On 26/10/2019 18:02, Sven Barth via fpc-devel wrote:

On 26.10.2019 at 18:51, J. Gareth Moreton wrote:
The "const" suggestion was made by a third party, and while I went 
out of my way to ensure the functions aren't changed in Pascal code, 
Florian pointed out that it could break existing assembler code. 
Maybe I'm being a bit stubborn or unreasonable, I'm not sure, but in 
my eyes, using assembly language to directly call the uComplex 
functions and operators seems rather unrealistic.  I figured if 
you're writing in assembly language, especially if you're using 
vector registers, you'd be using your own code to play around with 
complex numbers.  Plus I figured that if you're developing on a 
non-x86_64 platform, the only difference is the 'const' modifiers, which I
don't think change the way you actually call the function, regardless of
platform.  Am I right in this?


It totally depends on how "const" is implemented for the particular
target. On some there might not be any difference; on others there might be
a similar difference as on x86, namely that something is passed as a
reference instead of a copy.


I guess a more fundamental question I should ask, and this might be 
terribly naïve of me, is this: when you call some function F(x: 
TType), is there a situation where calling F(const x: TType) produces 
different machine code or where a particular actual parameter becomes 
illegal? Note I'm talking about how you call the function, not how 
the function itself is compiled.


Didn't you provide the example yourself with your changes to the uComplex
unit? There are cases (especially with records) where "x" is passed as a
copy on the stack and "const x" is passed as a reference.

Re: [fpc-devel] Question on updating FPC packages

2019-10-26 Thread Florian Klämpfl

On 26.10.19 at 18:51, J. Gareth Moreton wrote:


The "const" suggestion was made by a third party, and while I went out 
of my way to ensure the functions aren't changed in Pascal code, Florian 
pointed out that it could break existing assembler code.  Maybe I'm 
being a bit stubborn or unreasonable, I'm not sure, but in my eyes, 
using assembly language to directly call the uComplex functions and 
operators seems rather unrealistic.  I figured if you're writing in 
assembly language, especially if you're using vector registers, you'd be 
using your own code to play around with complex numbers.  Plus I figured 
that if you're developing on a non-x86_64 platform, the only difference is
the 'const' modifiers, which I don't think change the way you actually call
the function, regardless of platform.  Am I right in this?


The intention was to make the lightweight unit even more lightweight and 
optimal, without breaking backwards compatibility.  


I do not like such (micro-)optimizations working around a lazy compiler. I
saw similar patches in Lazarus recently (adding inline). This is imo a
waste of time and only clutters the code. It is much more beneficial to
improve the compiler to avoid copying the variable if it can prove that the
copy is not needed (or to improve auto inlining if it does not work in
certain cases). And in this case it would probably be possible to find out
that a copy is not needed.

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Question on updating FPC packages

2019-10-26 Thread Sven Barth via fpc-devel

On 26.10.2019 at 18:51, J. Gareth Moreton wrote:
The "const" suggestion was made by a third party, and while I went out 
of my way to ensure the functions aren't changed in Pascal code, 
Florian pointed out that it could break existing assembler code. Maybe 
I'm being a bit stubborn or unreasonable, I'm not sure, but in my 
eyes, using assembly language to directly call the uComplex functions 
and operators seems rather unrealistic.  I figured if you're writing 
in assembly language, especially if you're using vector registers, 
you'd be using your own code to play around with complex numbers.  
Plus I figured that if you're developing on a non-x86_64 platform, the
only difference is the 'const' modifiers, which I don't think change the
way you actually call the function, regardless of platform.  Am I right in
this?


It totally depends on how "const" is implemented for the particular
target. On some there might not be any difference; on others there might be
a similar difference as on x86, namely that something is passed as a
reference instead of a copy.


I guess a more fundamental question I should ask, and this might be 
terribly naïve of me, is this: when you call some function F(x: 
TType), is there a situation where calling F(const x: TType) produces 
different machine code or where a particular actual parameter becomes 
illegal? Note I'm talking about how you call the function, not how the 
function itself is compiled.


Didn't you provide the example yourself with your changes to the 
uComplex unit? There are cases (especially with records) where "x" is 
passed as a copy on the stack and "const x" is passed as a reference.
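
In other words, something like the following sketch (the exact behaviour
depends on the target ABI and the record's size):

  type
    complex = record
      re, im : double;
    end;

  { the parameter is a value: depending on the target, the caller pushes }
  { a copy, or the callee copies it in its prologue                      }
  procedure TakesValue(z : complex);
  begin
  end;

  { with const the compiler is free to pass only a reference to the      }
  { caller's original and skip the copy entirely                         }
  procedure TakesConst(const z : complex);
  begin
  end;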


Regards,
Sven
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel