Re: [fpc-devel] Vectorization

2018-02-07 Thread J. Gareth Moreton
 Hi John,

 I am on the mailing list.  I don't actually have write access to the SVN
repository, so I can only submit patches for review and testing, especially
as Florian has his own plans in the works.

 Currently, vectorcall only really benefits assembly language programmers
because the vectorisation system in the Free Pascal Compiler is shaky at
best, although slowly improving.  The idea is that small blocks of
floating-point numbers, the most frequent example being a 4-component
vector, are passed into a function more efficiently in a single register
designed to handle such data.

 The advantage of the 'vector stuff' is that you can perform the same
mathematical routine on multiple units of data at once - a simple example
would be adding two vectors together.  Traditionally, you would have to
load, compute and store each floating-point component sequentially, one at
a time, while SIMD (Single Instruction, Multiple Data) collapses this into
a single set of instructions, thereby giving a very large performance
boost.  The vectorisation system in the compiler attempts to take
advantage of this by, for example, unrolling for-loops so it can
calculate several iterations at once (not always possible if the outcome of
one iteration depends on the ones run immediately prior).
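
 As a rough illustration of the vector-addition case, here is a minimal
sketch in C (used purely for illustration, not FPC output).  The names
add4_scalar and add4_simd are invented for this example; the SIMD path
assumes an x86 target with SSE and otherwise falls back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Scalar version: one load, one add and one store per component. */
void add4_scalar(const float *a, const float *b, float *out)
{
    assert(a && b && out);
    for (size_t i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
}

/* SIMD version: all four components are added by a single ADDPS. */
void add4_simd(const float *a, const float *b, float *out)
{
    assert(a && b && out);
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);             /* unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* add and store 4 floats at once */
#else
    add4_scalar(a, b, out);                  /* fallback when SSE is unavailable */
#endif
}
```

 Both functions produce the same four sums; the difference is only in how
many instructions are needed to get there.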

 On Linux, 'vectorcall' will be ignored, while on 64-bit Windows, there
should be no difference in performance if none of the parameters or the
return value are vector types, but if they are, there's a performance gain
because of using a CPU register instead of passing it on the stack, among
other things.

 To answer your questions:

 1) For the most part, AMD processors follow the same standard set of
opcodes as Intel.  There may be some minor differences in performance and
optimisation at the lowest level, but normally you don't have to worry
about it.

 2) Any x86-64 processor, Intel or AMD, is guaranteed to have at least SSE2
available, because SSE2 is part of the baseline 64-bit instruction set, so
every 64-bit version of Windows can rely on it.  The same holds for 64-bit
Linux, especially as the standard for passing parameters into procedures
under Linux utilises the XMM registers for floating-point arguments.

 3) Basically, the implementation of the Linux calling convention in the Free
Pascal Compiler wasn't perfect.  For the most part, it conforms to the
standard, but doesn't handle vector data properly as per the System V ABI
(https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf
- the specification).  Generally, it only passes parameters of type Single
into XMM registers in blocks of two, rather than four as is supported by
SSE.  If you had a vector of length 4 (very common in graphical
applications), the compiler would split its values across two XMM
registers, even if the data was originally aligned on a 16-byte boundary
(it's far faster to access data aligned this way).  This was inefficient
in that it required twice as many reads and writes to pass the data, but
also it used up an extra register, of which there are only 8 available
(more modern processors have 16 or sometimes 32, but the System V ABI, and
vectorcall, don't permit these to be used for parameters).  If you have a
function that passes a lot of parameters, you have to pass them on the
stack if there are no more XMM registers free.  Also, using as few
registers as possible leaves more free for the compiler to play with.

 It's a bit complicated, I'm afraid, but I hope that helps.

 Gareth aka. Kit


Re: [fpc-devel] Vectorization

2018-02-07 Thread John Lee
Hi Kit/Gareth

Thanks for this work - I've been following all your changes to win64. Can't
say I understand all of the vector ones. Be good to commit to trunk asap -
tho' I gather you are waiting for some Florian mods.

Just a few qs - maybe these could be clarified in comments/example/tests

1) How does this work on AMD processors that have vector stuff?

2) Exactly which processors (AMD/Intel) have this stuff?

3) Don't understand why this stuff doesn't work w/o mods on Linux, which I
think you say somewhere.

Thanks again
john

PS assume you are on devel list tho' doing a reply to this email in gmail
copies you and devel? Doh.


On 7 February 2018 at 08:23, J. Gareth Moreton 
wrote:

> Hi everyone,
>
> After a lot of work, I have implemented 'vectorcall' into Win64, and made
> a patch for Lazarus to recognise
> the keyword in the IDE and highlight it accordingly.
>
> FPC vectorcall patch:
>
> https://bugs.freepascal.org/view.php?id=32781
>
> Lazarus vectorcall support patch:
>
> https://bugs.freepascal.org/view.php?id=33134
>
> The vectorcall patch also contains the code in the patch for issue #27870,
> since they share a lot in common.
> So far, I have confirmed that FPC and Lazarus successfully compile on
> Win32 and Win64, but I know for a fact
> that the code changes affect Linux 64-bit as well in that the SSEUP_CLASS
> is now properly supported
> (vectorcall reuses the System V ABI code for convenience and
> compatibility), so FPC's implementation of the
> System V ABI should now properly support 128-bit SSE vectors.
>
> Note that 256-bit and 512-bit vectors are currently disabled in the code,
> since the compiler does not fully
> support vectors of this size yet, and Florian is working on this himself.
>
> I have provided 3 test programs in #32781 that should compile under both
> Win64 and Linux 64-bit (it will
> throw a custom $FATAL error if it's not one of these two platforms) in
> order to test correct code production
> and register allocation.  However, testing will have to be very extensive
> for this addition.
>
> I hope this will serve the x86-64 assembly programmers well - have fun!
>
> Gareth aka. Kit
> ___
> fpc-devel maillist  -  fpc-devel@lists.freepascal.org
> http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
>
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Vectorization

2018-02-07 Thread Adriaan van Os

J. Gareth Moreton wrote:

Hi everyone,

After a lot of work, I have implemented 'vectorcall' into Win64, and made a patch for Lazarus to recognise 
the keyword in the IDE and highlight it accordingly.


Thanks !

Adriaan van Os



Re: [fpc-devel] Vectorization

2017-12-23 Thread Sven Barth via fpc-devel
On 23.12.2017 at 11:01, "Adriaan van Os" wrote:

J. Gareth Moreton wrote:

> Hey Adriaan,
>
> I dare ask - did the patch help out your issue at all?  I did supply it to
> Florian as well, although he has his own work in progress for
> vectorization, so whether my code is compatible or not waits to be seen.
>

Didn't get to it yet. Svn needs fpc 3.0.4 to build, which isn't the
production compiler on my development system. On another OS X, I tried to
checkout svn, but it was (on that system) extremely slow. After two
multi-hour sessions, I postponed it to later. But I am very interested in
the patch, so I will follow up later.


Trunk can also be built with 3.0.2 in case you have that installed.
Everything older is not supported, however.

Regards,
Sven


Re: [fpc-devel] Vectorization

2017-12-23 Thread Adriaan van Os

J. Gareth Moreton wrote:

Hey Adriaan,

I dare ask - did the patch help out your issue at all?  I did supply it to Florian as well, although he has 
his own work in progress for vectorization, so whether my code is compatible or not waits to be seen.


Didn't get to it yet. Svn needs fpc 3.0.4 to build, which isn't the production compiler on my 
development system. On another OS X, I tried to checkout svn, but it was (on that system) extremely 
slow. After two multi-hour sessions, I postponed it to later. But I am very interested in the 
patch, so I will follow up later.


Regards,

Adriaan van Os



Re: [fpc-devel] Vectorization

2017-12-22 Thread J. Gareth Moreton
Hey Adriaan,

I dare ask - did the patch help out your issue at all?  I did supply it to 
Florian as well, although he has 
his own work in progress for vectorization, so whether my code is compatible or 
not waits to be seen.

Gareth aka. Kit


On Thu 14/12/17 20:29 , "J. Gareth Moreton" gar...@moreton-family.com sent:
> https://bugs.freepascal.org/view.php?id=27870
>
> I've made a patch that hopefully fixes this bug, as well as provide some
> future expansion for vectorization.
>
> There are a few new internal sizes such as "OS_MF128" that serve
> to ensure the most optimal move command is used (out of MOVAPS, MOVAPD
> and MOVDQA), since using the wrong one incurs a performance penalty.
>
> If the data is unaligned, the compiler will use MOVUPS/MOVUPD/MOVDQU
> instead, but if it detects the correct byte alignment (16) on a
> variable, it will use the aligned commands.
>
> Let me know how it works out.
>
> Gareth aka. Kit


Re: [fpc-devel] Vectorization

2017-12-14 Thread J. Gareth Moreton
https://bugs.freepascal.org/view.php?id=27870

I've made a patch that hopefully fixes this bug, as well as provide some future 
expansion for vectorization.

There are a few new internal sizes such as "OS_MF128" that serve to ensure the 
most optimal move command is 
used (out of MOVAPS, MOVAPD and MOVDQA), since using the wrong one incurs a 
performance penalty.

If the data is unaligned, the compiler will use MOVUPS/MOVUPD/MOVDQU instead, 
but if it detects the correct 
byte alignment (16) on a variable, it will use the aligned commands.
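
The choice between the aligned and unaligned moves can be sketched in C
as follows.  This is only an illustration under stated assumptions: the
patch makes the decision at compile time from what the compiler knows about
a variable, whereas this sketch checks the address at run time.  The names
is_aligned16 and copy4 are invented here, and the intrinsics assume an x86
target with SSE.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* _mm_load_ps, _mm_loadu_ps, _mm_store_ps, _mm_storeu_ps */
#endif

/* Returns non-zero if p sits on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Copies four floats.  The aligned move (MOVAPS) is used only when both
   addresses are 16-byte aligned; otherwise the unaligned move (MOVUPS). */
void copy4(const float *src, float *dst)
{
    assert(src && dst);
#if defined(__SSE__)
    if (is_aligned16(src) && is_aligned16(dst))
        _mm_store_ps(dst, _mm_load_ps(src));     /* MOVAPS: faults if misaligned */
    else
        _mm_storeu_ps(dst, _mm_loadu_ps(src));   /* MOVUPS: safe for any address */
#else
    for (int i = 0; i < 4; i++)                  /* plain fallback without SSE */
        dst[i] = src[i];
#endif
}
```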

Let me know how it works out.

Gareth aka. Kit


Re: [fpc-devel] Vectorization

2017-12-12 Thread J. Gareth Moreton
Thanks for the extra information - that should help.

Indeed, the SSE routines are very advanced, but the thing here is that for a
Pascal programmer, the produced machine code should be completely transparent
to them, and byte alignment for things that are not explicitly vectors, like
a static array of Singles, should not cause random and potentially
hard-to-find crashes.  In this situation, the compiler should automatically
put said variables on a 16-byte boundary on the stack.

For variables external to the procedure, one might have to use an unaligned 
move unless explicit byte 
alignment is available and the compiler can be sure of a variable's byte 
alignment.  For this I proposed the 
"align ##" modifier for type declarations, and this would allow the equivalent 
of __m128 to be cleanly 
defined, for example, although this hasn't been approved yet.
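
For comparison, C11 already lets a type carry its own alignment, which is
roughly what the proposed "align ##" modifier would express in Pascal.  The
sketch below is C, not FPC syntax, and the type name vec4 and helper
vec4_is_aligned are invented for the example.

```c
#include <assert.h>    /* static_assert */
#include <stdalign.h>  /* alignas, alignof */
#include <stdint.h>

/* Four packed Singles with a declared 16-byte alignment, so every variable
   of this type is safe to use with the aligned SSE moves. */
typedef struct {
    alignas(16) float v[4];
} vec4;

static_assert(sizeof(vec4) == 16, "vec4 should be exactly 16 bytes");
static_assert(alignof(vec4) == 16, "vec4 should be 16-byte aligned");

/* True when the object starts on a 16-byte boundary. */
int vec4_is_aligned(const vec4 *p)
{
    return ((uintptr_t)p % 16) == 0;
}
```

Because the alignment belongs to the type, locals, globals and array
elements of vec4 are all placed on a 16-byte boundary without any extra
directive at the point of declaration.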

With things like loop unrolling and optimisation as well, this will likely be a 
long-term area of research 
on my part.

Kit


On Tue 12/12/17 10:21 , "Adriaan van Os" adri...@microbizz.nl sent:
> J. Gareth Moreton wrote:
>
> > I created a Wiki page to plan things out:
> > http://wiki.lazarus.freepascal.org/Vectorization
>
> Note that Intel compilers can optimize for different processor
> architectures (and different vector size), as follows
>
>  -software-developers-intel-compiler-options-for-sse-generation-and-processor-specific-optimizations>
>
> Intel high-performance libraries use runtime-dispatching, e.g. for IPP
>
> Regards,
>
> Adriaan van Os



Re: [fpc-devel] Vectorization

2017-12-12 Thread Adriaan van Os

J. Gareth Moreton wrote:

I created a Wiki page to plan things out: 
http://wiki.lazarus.freepascal.org/Vectorization


Note that Intel compilers can optimize for different processor architectures (and different vector 
size), as follows 



Intel high-performance libraries use runtime-dispatching, e.g. for IPP


Regards,

Adriaan van Os



Re: [fpc-devel] Vectorization

2017-12-12 Thread Adriaan van Os

J. Gareth Moreton wrote:

- There is no desire to include MOVUPS instructions because, while they will work for unaligned memory, they are 
much slower than MOVAPS, but MOVAPS will cause a segmentation fault if the memory is not aligned.


Memory should be aligned when using vector code. And developers should know that. The objective of 
vectorization is speed, not to be nice to developers that don't know what they are doing. So, if 
they get a crash, it's their fault. Maybe, the compiler can issue a warning if data used in vector 
code is not aligned.


See e.g. Section 5.3 DATA ALIGNMENT of the Intel® 64 and IA-32 Architectures Optimization Reference 
Manual (where movaps and palignr are used for SSE3 optimized code).


I suggest an FPC runtime function to get an aligned block of memory on the heap. I use 
posix_memalign, but note that Mac OS X has a severe bug where posix_memalign with size 0 causes 
memory corruption ! For size 0, malloc can be used (or a nil pointer returned).


Note that 64-bit AVX-512 instructions require 64-byte alignment.
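
A minimal sketch of such an allocation helper, in C and assuming a POSIX
system, might look as follows.  The name alloc_aligned is invented for this
example; it sidesteps the size-0 problem by never passing 0 to
posix_memalign and returning a null pointer instead.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* makes posix_memalign visible */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns a block of `size` bytes aligned to `alignment`, or NULL on failure.
   `alignment` must be a power of two and a multiple of sizeof(void *),
   e.g. 16 for SSE or 64 for AVX-512.  A size of 0 yields NULL. */
void *alloc_aligned(size_t alignment, size_t size)
{
    void *p = NULL;

    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    if (size == 0)
        return NULL;             /* never hand size 0 to posix_memalign */
    if (posix_memalign(&p, alignment, size) != 0)
        return NULL;
    return p;                    /* release with free() */
}
```

A caller would request, for instance, alloc_aligned(16, n * sizeof(float))
for SSE data and release the block with free().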


Regards,

Adriaan van Os


Re: [fpc-devel] Vectorization

2017-12-12 Thread Adriaan van Os

J. Gareth Moreton wrote:

I created a Wiki page to plan things out: 
http://wiki.lazarus.freepascal.org/Vectorization


As a side bar, note what Intel writes about Optimization in the Intel Math Kernel Library Developer 
Reference



Performance Enhancements

The Intel Math Kernel Library has been optimized by exploiting both processor and system features 
and capabilities. Special care has been given to those routines that most profit from 
cache-management techniques. These especially include matrix-matrix operation routines such as dgemm().


In addition, code optimization techniques have been applied to minimize dependencies of scheduling 
integer and floating-point units on the results within the processor.


The major optimization techniques used throughout the library include:

• Loop unrolling to minimize loop management costs

• Blocking of data to improve data reuse opportunities

• Copying to reduce chances of data eviction from cache

• Data prefetching to help hide memory latency

• Multiple simultaneous operations (for example, dot products in dgemm) to eliminate stalls due to 
arithmetic unit pipelines

• Use of hardware features such as the SIMD arithmetic units, where appropriate

Regards,

Adriaan van Os


Re: [fpc-devel] Vectorization

2017-12-11 Thread Michael Schnell

On 10.12.2017 20:01, J. Gareth Moreton wrote:

Starting at the 4th command, it looks back
to find a match in the 1st command,

What about loops (such as ones using arrays of n elements as obvious vectors)?

Supposedly, here some higher level optimization would be necessary.

-Michael



Re: [fpc-devel] Vectorization

2017-12-11 Thread J. Gareth Moreton
Okay, sit back everyone - this is a long read!



I'm starting with the problem as listed in 
https://bugs.freepascal.org/view.php?id=27870 with the source 
code provided, although with {$codealign varmin=16} and {$codealign 
localmin=16} at the top.

I'm running the latest version of the compiler with the following parameters 
"-O3 -va -CfSSE64 -a -Sv".  Find attached the source file and the generated 
assembly.

First thing to note is that no vectorisation occurs for the individual setting 
of elements - e.g. the "v1[0] := 0.2" lines are assembled as follows: 

movl    _$TESTFILE$_Ld1(%rip),%eax
movl    %eax,48(%rsp)
movl    _$TESTFILE$_Ld1(%rip),%eax
movl    %eax,52(%rsp)
movl    _$TESTFILE$_Ld1(%rip),%eax
movl    %eax,56(%rsp)
movl    _$TESTFILE$_Ld1(%rip),%eax
movl    %eax,60(%rsp)

(_$TESTFILE$_Ld1 refers to the 32-bit representation of 0.2, namely $3E4CCCCD, 
stored little-endian as the bytes CD CC 4C 3E, and I'm surprised the 
optimizer doesn't notice the redundant setting of %eax)

For the line "v3 := v1 + v2;", this is vectorised because the compiler can 
identify all the operands as 
vector types, but as already suspected, there is a missing command to write 
%xmm0 to the stack.

movdqa  48(%rsp),%xmm0
addps   64(%rsp),%xmm0

The next operation is "call fpc_get_output" that begins a call to "WriteLn".

Also, there is a very slight bug with the generated code.  "movdqa" is an 
integer move, not a floating-point 
move.  With the floating-point "addps" that follows, this incurs a performance 
penalty due to switching 
between the two modes - "movaps" should be used instead.

Regarding alignment, the stack is correctly aligned because, while no stack 
frame is set up, the command 
"pushq %rbx" aligns the stack to a 16-byte boundary. Depending on how easy or 
tricky it is to enforce the 
stack alignment, it might be possible to not have to switch to using the 
unaligned move commands.

Once I've figured out how it emits the vector commands, I'll see that it 
includes the missing movaps 
command.  Initially I'll probably switch to using movups to ensure no 
segmentation faults occur, and then 
migrate back to movaps if I can automatically enforce the correct byte 
alignment with no input from the 
programmer.  This might be due to seeing the variables are vector types and 
aligning them to a 16-byte 
boundary if SSE is selected.  I'll let you know how it goes.


Kit



P.S. Depending on how the optimizer is structured, I might suggest a kind of 
"Deep Optimizer" that is a part 
of -O3 (or -O4 if it's a little risky) and is done after all of the other 
compilation and optimisation 
stages and immediately prior to writing the assembler/object file, which does 
things like remove the 
redundant writes to %eax and also other optimizations that the peephole 
optimizer misses.  In the .s file, 
there are snippets of code akin to the following:

movq    %rax,%rbx
leaq    _$TESTFILE$_Ld3(%rip),%r8
movq    %rbx,%rdx

Because of the leaq command in between, the peephole optimizer doesn't notice 
the performance penalty that comes from writing to %rbx and then immediately 
reading it again to copy into %rdx.  It could be detected and changed to the 
following:

movq    %rax,%rbx
leaq    _$TESTFILE$_Ld3(%rip),%r8
movq    %rax,%rdx

Changing %rbx to %rax in the second movq command removes the performance 
penalty and takes advantage of modern processors' multiple ALUs (leaq does 
not modify any of the registers other than the unrelated %r8 in this 
instance, so it's safe), thus likely collapsing this group of three commands 
into a single CPU cycle instead of 2.
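
The rewrite described above amounts to forwarding a register copy past
unrelated instructions.  A toy version can be sketched in C as below.  This
is not how FPC's peephole optimizer is structured; the insn type, the
opcode and register constants, and forward_copies are all invented for the
illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { OP_MOV, OP_LEA };
enum { RAX, RBX, RDX, R8 };

/* Toy instruction in AT&T order ("op src,dst").  For OP_LEA the source is a
   memory operand, so src is unused and set to -1. */
typedef struct {
    int op;
    int src;
    int dst;
} insn;

/* For each "mov A,B", rewrites later "mov B,C" into "mov A,C" for as long as
   neither A nor B has been overwritten in between.  Returns the number of
   instructions rewritten. */
size_t forward_copies(insn *code, size_t n)
{
    size_t changed = 0;

    assert(code != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (code[i].op != OP_MOV || code[i].src == code[i].dst)
            continue;
        int a = code[i].src, b = code[i].dst;
        for (size_t j = i + 1; j < n; j++) {
            if (code[j].op == OP_MOV && code[j].src == b) {
                code[j].src = a;      /* read the original register instead */
                changed++;
            }
            if (code[j].dst == a || code[j].dst == b)
                break;                /* the copy is no longer valid past here */
        }
    }
    return changed;
}
```

Fed the three-instruction sequence above (movq, leaq, movq), the sketch
leaves the first two untouched and changes the source of the final movq from
%rbx to %rax, because the intervening leaq writes only %r8.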

testfile.pp
Description: Binary data


testfile.s
Description: Binary data


Re: [fpc-devel] Vectorization

2017-12-11 Thread J. Gareth Moreton
I created a Wiki page to plan things out: 
http://wiki.lazarus.freepascal.org/Vectorization

It's a stub currently.

Kit

On Mon 11/12/17 20:34 , "J. Gareth Moreton" gar...@moreton-family.com sent:
> P.S. For design ideas and patches that need collaboration, is the Wiki
> usually the way of going about it?
> 
> 
> Kit


Re: [fpc-devel] Vectorization

2017-12-11 Thread J. Gareth Moreton
P.S. For design ideas and patches that need collaboration, is the Wiki usually 
the way of going about it?

Kit


Re: [fpc-devel] Vectorization

2017-12-11 Thread J. Gareth Moreton
I might need to set up a Wiki page of some kind containing a design spec, 
because even fixing that bug will 
require some extra features.

Case in point:
- There is no desire to include MOVUPS instructions because, while they will
  work for unaligned memory, they are much slower than MOVAPS, but MOVAPS will
  cause a segmentation fault if the memory is not aligned.
- There is currently no easy way to enforce memory alignment outside of a
  compiler directive, and that includes forcing the programmer to use it
  whenever they declare a variable of a vector type, even outside of the unit
  that contains the actual vectorised code.
- If a compiler feature will easily cause crashes unless some additional
  compiler directives are carefully specified, most users will simply not use
  it or pass it off as buggy or, at the very least, not friendly.
- Only static arrays of Singles or Doubles are identified as vectors by the
  compiler.  Record types that contain, for example, 4 packed Singles
  (representing co-ordinates) are not identified as such.

I'll dwell on this a bit, and see if a plan can be drawn up that Florian and 
the other senior developers 
agree on.

I think I found my life's calling!!

Kit


On Mon 11/12/17 13:44 , "Adriaan van Os" f...@microbizz.nl sent:
> J. Gareth Moreton wrote:
>
> > I guess fixing that might be a good
> > starting point. There's also the issue of
> > memory alignment causing crashes.
>
> I (and others I think) would much welcome the fix. With regard to memory
> alignment, you would have to test what is already fixed in the current
> (svn) compiler (and what not) as the report is from some time ago.
>
> Regards,
>
> Adriaan van Os



Re: [fpc-devel] Vectorization

2017-12-11 Thread Adriaan van Os

J. Gareth Moreton wrote:
I guess fixing that might be a good 
starting point. There's also the issue of 
memory alignment causing crashes.


I (and others I think) would much welcome the fix. With regard to memory alignment, you would have 
to test what is already fixed in the current (svn) compiler (and what not) as the report is from 
some time ago.


Regards,

Adriaan van Os


Re: [fpc-devel] Vectorization

2017-12-11 Thread J. Gareth Moreton
I guess fixing that might be a good 
starting point. There's also the issue of 
memory alignment causing crashes.

Kit

On Mon 11/12/17 12:19 , mar...@stack.nl (Marco van de Voort) sent:
> In our previous episode, Adriaan van Os said:
> > > Since I'm masochistic in my desire to understand and improve the Free
> > > Pascal Compiler, I would like to add some vectorisation support in its
> > > optimisation cycle, since that is one thing that many other compilers
> > > attempt to do these days.  But before I begin, does FPC support any
> > > kind of vectorisation already?  If it does I haven't been able to find
> > > it yet, and I don't want to end up reinventing the wheel.
> >
> > See e.g.
>
> I had the same problem recently that is mentioned in the notes of this
> report, sometimes the last store (from register to mem) is missing.



Re: [fpc-devel] Vectorization

2017-12-11 Thread Marco van de Voort
In our previous episode, Adriaan van Os said:
> > Since I'm masochistic in my desire to understand and improve the Free 
> > Pascal Compiler, I would like to add 
> > some vectorisation support in its optimisation cycle, since that is one 
> > thing that many other compilers 
> > attempt to do these days.  But before I begin, does FPC support any kind of 
> > vectorisation already?  If it 
> > does I haven't been able to find it yet, and I don't want to end up 
> > reinventing the wheel.
> 
> See e.g. 

I had the same problem recently that is mentioned in the notes of this
report, sometimes the last store (from register to mem) is missing.


Re: [fpc-devel] Vectorization

2017-12-11 Thread Adriaan van Os

J. Gareth Moreton wrote:

Hi everyone,

Since I'm masochistic in my desire to understand and improve the Free Pascal Compiler, I would like to add 
some vectorisation support in its optimisation cycle, since that is one thing that many other compilers 
attempt to do these days.  But before I begin, does FPC support any kind of vectorisation already?  If it 
does I haven't been able to find it yet, and I don't want to end up reinventing the wheel.


See e.g. 

Regards,

Adriaan van Os


Re: [fpc-devel] Vectorization

2017-12-11 Thread Sven Barth via fpc-devel
On 10.12.2017 at 20:01, "J. Gareth Moreton" wrote:

The idea I had currently (this is without
looking at any previous theory) was to use
a kind of sliding window, similar to how
ZIP and other LZ77-based algorithms work
when compressing repeating strings, to
look backwards in the current block for a
matching command and then scan forward. If
the scan gets up to the instruction right
before the starting point, then it's
potential for vectorisable code. Using the
previous example:

movss 16(%rsp),%xmm0
addss 32(%rsp),%xmm0
movss %xmm0,(%rax)
movss 20(%rsp),%xmm0
addss 36(%rsp),%xmm0
movss %xmm0,4(%rax)

Starting at the 4th command, it looks back
to find a match in the 1st command, albeit
with an address that differs only by 4.
As it scans forward, it finds similar
matches in subsequent commands, and
eventually realises the entire block could
potentially be vectorised. If it
continues, it finds the code fragment
repeats 4 times and can be vectorised with
little difficulty. Being only SSE commands
helps too.


The preferred way would be to detect this on the parser side (with the
AST)  not on the code generator side as then this can be more easily
implemented for different platforms.

Regards,
Sven


Re: [fpc-devel] Vectorization

2017-12-10 Thread J. Gareth Moreton
The idea I had currently (this is without 
looking at any previous theory) was to use 
a kind of sliding window, similar to how 
ZIP and other LZ77-based algorithms work 
when compressing repeating strings, to 
look backwards in the current block for a 
matching command and then scan forward. If 
the scan gets up to the instruction right 
before the starting point, then it's 
potential for vectorisable code. Using the 
previous example:

movss 16(%rsp),%xmm0
addss 32(%rsp),%xmm0
movss %xmm0,(%rax)
movss 20(%rsp),%xmm0
addss 36(%rsp),%xmm0
movss %xmm0,4(%rax)

Starting at the 4th command, it looks back 
to find a match in the 1st command, albeit 
with an address that differs only by 4. 
As it scans forward, it finds similar 
matches in subsequent commands, and 
eventually realises the entire block could 
potentially be vectorised. If it 
continues, it finds the code fragment 
repeats 4 times and can be vectorised with 
little difficulty. Being only SSE commands 
helps too.
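
The look-back-then-scan-forward idea can be sketched in C roughly as
follows.  This is a toy model under loud assumptions: instructions are
reduced to an opcode and the displacement of their memory operand, base
registers are ignored, and the names insn, find_period and count_repeats
are invented here rather than taken from the compiler.

```c
#include <assert.h>
#include <stddef.h>

enum { OP_MOVSS_LOAD, OP_ADDSS, OP_MOVSS_STORE };

/* Toy instruction: an opcode plus the displacement of its memory operand. */
typedef struct {
    int op;
    int disp;
} insn;

/* Looks back from instruction `start` for the nearest earlier instruction
   with the same opcode whose displacement is exactly `stride` bytes lower.
   Returns the distance back to it (the candidate group length), or 0. */
size_t find_period(const insn *code, size_t start, int stride)
{
    for (size_t back = 1; back <= start; back++) {
        const insn *prev = &code[start - back];
        if (prev->op == code[start].op &&
            prev->disp + stride == code[start].disp)
            return back;
    }
    return 0;
}

/* Scans forward from the second group and counts how many whole groups of
   `period` instructions repeat with each displacement advanced by `stride`. */
size_t count_repeats(const insn *code, size_t n, size_t period, int stride)
{
    assert(period > 0);
    if (n < period)
        return 0;
    size_t i = period;
    while (i < n && code[i].op == code[i - period].op &&
           code[i].disp == code[i - period].disp + stride)
        i++;
    return i / period;   /* whole groups only */
}
```

For the movss/addss/movss pattern above continued to four groups,
find_period called on the 4th instruction with a stride of 4 reports a
group length of 3, and count_repeats then reports 4 repetitions, which is
the case that could collapse into one packed load, add and store.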

Kit

P.S. I did look at the loop unrolling 
code, but it almost never triggers due to 
the small instruction cache that's 
assumed. For x86-64, is it safe to assume 
a cache length of 60 instead of 30, since 
almost all modern Intel and AMD processors 
have 56+ elements in their queues?

On Sun 10/12/17 13:50 , "Florian Klämpfl" flor...@freepascal.org sent:
> On 10.12.2017 at 02:29, J. Gareth Moreton wrote:
> > Hi everyone,
> >
> > Since I'm masochistic in my desire to understand and improve the Free
> > Pascal Compiler, I would like to add some vectorisation support in its
> > optimisation cycle, since that is one thing that many other compilers
> > attempt to do these days.  But before I begin, does FPC support any kind
> > of vectorisation already?  If it does I haven't been able to find it
> > yet, and I don't want to end up reinventing the wheel.
>
> I started once to work on this, but never merged it into fpc trunk, it
> might be even only in my local git check out, I can look for it.
>
> > I'm sure it's a mammoth task, but I would like to start somewhere with
> > it - however, are there any design plans that I should be adhering to so
> > I don't end up designing something that is disliked?
>
> Well, basically it means that another pass (like e.g. unroll_loop in
> optloop.pas) of the tree must be added which generates operations as they
> can be encoded by -Sv. To do this efficiently, probably some previous
> simplification of the tree is needed. But this is something for later.



Re: [fpc-devel] Vectorization

2017-12-10 Thread Florian Klämpfl
On 10.12.2017 at 02:29, J. Gareth Moreton wrote:
> Hi everyone,
> 
> Since I'm masochistic in my desire to understand and improve the Free Pascal 
> Compiler, I would like to add 
> some vectorisation support in its optimisation cycle, since that is one thing 
> that many other compilers 
> attempt to do these days.  But before I begin, does FPC support any kind of 
> vectorisation already?  If it 
> does I haven't been able to find it yet, and I don't want to end up 
> reinventing the wheel.

I started once to work on this, but never merged it into fpc trunk, it might be 
even only in my
local git check out, I can look for it.

> 
> I'm sure it's a mammoth task, but I would like to start somewhere with it - 
> however, are there any design 
> plans that I should be adhering to so I don't end up designing something that 
> is disliked?
> 

Well, basically it means that another pass (like e.g. unroll_loop in 
optloop.pas) of the tree must
be added which generated operations as they can be encoded by -Sv. To do this 
efficiently, probably
some previous simplification of the tree is needed. But this is something for 
later.
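
For readers following the thread, the effect of such a pass can be pictured
at source level.  The following is only a minimal sketch, not how optloop.pas
or any actual FPC pass is implemented: AddScalar and AddBlocked are
hypothetical names, and the blocked loop is written by hand to show the shape
a vectorising pass would be aiming for (a main loop that handles four
elements per iteration, which could then map onto one 4-wide operation, plus
a remainder loop for the leftover elements).

```pascal
program StripMineSketch;
{$mode objfpc}

const
  N = 10;  // deliberately not a multiple of 4, so the remainder loop runs

type
  TSingleArray = array[0..N - 1] of Single;

// Reference form: one element per iteration.
procedure AddScalar(const A, B: TSingleArray; out C: TSingleArray);
var
  I: Integer;
begin
  for I := 0 to N - 1 do
    C[I] := A[I] + B[I];
end;

// Hand-written picture of the same loop after blocking by 4 (hypothetical).
// Each pass through the main loop touches four adjacent elements, which is
// the group a single 4-wide vector addition could cover.
procedure AddBlocked(const A, B: TSingleArray; out C: TSingleArray);
var
  I: Integer;
begin
  I := 0;
  while I + 3 <= N - 1 do
  begin
    C[I]     := A[I]     + B[I];
    C[I + 1] := A[I + 1] + B[I + 1];
    C[I + 2] := A[I + 2] + B[I + 2];
    C[I + 3] := A[I + 3] + B[I + 3];
    Inc(I, 4);
  end;
  // Remainder loop: the elements that do not fill a whole block of four.
  while I <= N - 1 do
  begin
    C[I] := A[I] + B[I];
    Inc(I);
  end;
end;

var
  A, B, C1, C2: TSingleArray;
  I: Integer;
begin
  for I := 0 to N - 1 do
  begin
    A[I] := I;
    B[I] := 2 * I;
  end;
  AddScalar(A, B, C1);
  AddBlocked(A, B, C2);
  for I := 0 to N - 1 do
    if C1[I] <> C2[I] then
    begin
      WriteLn('mismatch at index ', I);
      Halt(1);
    end;
  WriteLn('both loops agree');
end.
```

The blocking is only valid here because no iteration reads a value written
by another iteration; a loop with such a dependency could not be regrouped
this way.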


Re: [fpc-devel] Vectorization

2017-12-10 Thread Florian Klämpfl
On 10.12.2017 at 14:03, Marco van de Voort wrote:
> In our previous episode, J. Gareth Moreton said:
>> Since I'm masochistic in my desire to understand and improve the Free
>> Pascal Compiler, I would like to add some vectorisation support in its
>> optimisation cycle, since that is one thing that many other compilers
>> attempt to do these days.  But before I begin, does FPC support any kind
>> of vectorisation already?
>
> Yes, -Sv, but it is buggy.

Well, -Sv means that operations on vectors are supported, but it does not
mean that FPC detects vectorizable operations by itself.




Re: [fpc-devel] Vectorization

2017-12-10 Thread Marco van de Voort
In our previous episode, J. Gareth Moreton said:
> Since I'm masochistic in my desire to understand and improve the Free
> Pascal Compiler, I would like to add some vectorisation support in its
> optimisation cycle, since that is one thing that many other compilers
> attempt to do these days.  But before I begin, does FPC support any kind
> of vectorisation already?

Yes, -Sv, but it is buggy.


[fpc-devel] Vectorization

2017-12-09 Thread J. Gareth Moreton
Hi everyone,

Since I'm masochistic in my desire to understand and improve the Free Pascal
Compiler, I would like to add some vectorisation support in its optimisation
cycle, since that is one thing that many other compilers attempt to do these
days.  But before I begin, does FPC support any kind of vectorisation already?
If it does I haven't been able to find it yet, and I don't want to end up
reinventing the wheel.

For example, I recall that the following is not optimised even if the
compiler is set to use SSE:

type
  TVector4f = packed record
    X, Y, Z, W: Single;
  end;

function VectorAdd(A, B: TVector4f): TVector4f;
begin
  Result.X := A.X + B.X;
  Result.Y := A.Y + B.Y;
  Result.Z := A.Z + B.Z;
  Result.W := A.W + B.W;
end;

The resultant assembler code yields an individual "MOVSS" and arithmetic for
each element rather than combining the reads and writes into a MOVUPS
instruction and reducing the number of arithmetic instructions by a factor
of 4.  For clarity, this is the assembler produced with '-CfSSE64':

.section .text.n_p$testfile_$$_addvector$tvector4f$tvector4f$$tvector4f,"x"
        .balign 16,0x90
.globl  P$TESTFILE_$$_ADDVECTOR$TVECTOR4F$TVECTOR4F$$TVECTOR4F
P$TESTFILE_$$_ADDVECTOR$TVECTOR4F$TVECTOR4F$$TVECTOR4F:
.Lc1:
.seh_proc P$TESTFILE_$$_ADDVECTOR$TVECTOR4F$TVECTOR4F$$TVECTOR4F
        leaq    -56(%rsp),%rsp
.Lc3:
.seh_stackalloc 56
.seh_endprologue
        movq    %rcx,%rax
        movq    %rdx,(%rsp)
        movq    %r8,8(%rsp)
        movq    (%rsp),%rdx
        movq    (%rdx),%rcx
        movq    %rcx,16(%rsp)
        movq    8(%rdx),%rdx
        movq    %rdx,24(%rsp)
        movq    8(%rsp),%rdx
        movq    (%rdx),%rcx
        movq    %rcx,32(%rsp)
        movq    8(%rdx),%rdx
        movq    %rdx,40(%rsp)
        movss   16(%rsp),%xmm0
        addss   32(%rsp),%xmm0
        movss   %xmm0,(%rax)
        movss   20(%rsp),%xmm0
        addss   36(%rsp),%xmm0
        movss   %xmm0,4(%rax)
        movss   24(%rsp),%xmm0
        addss   40(%rsp),%xmm0
        movss   %xmm0,8(%rax)
        movss   28(%rsp),%xmm0
        addss   44(%rsp),%xmm0
        movss   %xmm0,12(%rax)
        leaq    56(%rsp),%rsp
        ret
.seh_endproc
.Lc2:

A good vectoriser (for lack of a better name!) would be able to reduce the 12
movss/addss instructions to just
"movups 16(%rsp),%xmm0  addps 32(%rsp),%xmm0  movups %xmm0,(%rax)".  Since the
stack is aligned to a 16-byte boundary, the first movups can be swapped for a
movaps too.  I'm not sure what to do about everything being moved to the
stack first, though.
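
To make the target concrete, here is a minimal hand-written sketch of that
three-instruction sequence as inline assembler, checked against the scalar
version.  It is an illustration only, not compiler output: VectorAddScalar,
VectorAddSSE and PVector4f are names made up for this example, the assembler
path is assumed to be built for x86-64 (other targets fall back to the scalar
routine), and addps is given a register operand so that nothing depends on
the records being 16-byte aligned.

```pascal
program AddPsSketch;
{$mode objfpc}
{$ifdef CPUX86_64}
  {$asmmode att}
{$endif}

type
  TVector4f = packed record
    X, Y, Z, W: Single;
  end;
  PVector4f = ^TVector4f;  // hypothetical helper type for this sketch

// Scalar reference: one addition per component, as in the original routine.
function VectorAddScalar(const A, B: TVector4f): TVector4f;
begin
  Result.X := A.X + B.X;
  Result.Y := A.Y + B.Y;
  Result.Z := A.Z + B.Z;
  Result.W := A.W + B.W;
end;

// Hand-written SIMD form (hypothetical): two unaligned 16-byte loads, one
// packed addition of all four components, and one unaligned 16-byte store.
procedure VectorAddSSE(PA, PB, PR: PVector4f);
begin
{$ifdef CPUX86_64}
  asm
    movq   PA, %rax
    movups (%rax), %xmm0
    movq   PB, %rax
    movups (%rax), %xmm1
    addps  %xmm1, %xmm0
    movq   PR, %rax
    movups %xmm0, (%rax)
  end ['rax', 'xmm0', 'xmm1'];
{$else}
  // Assumption: no SSE on this target, so reuse the scalar routine.
  PR^ := VectorAddScalar(PA^, PB^);
{$endif}
end;

var
  A, B, R1, R2: TVector4f;
begin
  // The four Singles must be contiguous for one 16-byte load to cover them.
  if SizeOf(TVector4f) <> 16 then
    Halt(2);

  A.X := 1;  A.Y := 2;  A.Z := 3;  A.W := 4;
  B.X := 10; B.Y := 20; B.Z := 30; B.W := 40;

  R1 := VectorAddScalar(A, B);
  VectorAddSSE(@A, @B, @R2);

  if (R1.X <> R2.X) or (R1.Y <> R2.Y) or (R1.Z <> R2.Z) or (R1.W <> R2.W) then
  begin
    WriteLn('scalar and SSE results differ');
    Halt(1);
  end;
  WriteLn('scalar and SSE results match');
end.
```

The pointer parameters are a deliberate simplification: they keep the sketch
independent of how each platform passes a 16-byte record, which is exactly
the part that the copying to the stack in the listing above is about.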

I'm sure it's a mammoth task, but I would like to start somewhere with it -
however, are there any design plans that I should be adhering to so I don't
end up designing something that is disliked?

Kit