Re: [fpc-devel] x86_64 question

2020-10-15 Thread J. Gareth Moreton via fpc-devel

Hi Nikolay,

I've simplified my test as much as I can, and hopefully I have something 
that properly tests whether TEST has a false dependency or not.  I'm 
willing to admit that I may have been mistaken and the slowdown was 
caused by something else.


The test functions effectively do a population count on the lowest 8 
bits of a 32-bit integer... very inefficiently by calling TEST on each 
of the 8 bits in turn and adding to a running total! I also have a 
handful of versions that call POPCNT for comparison (needless to say it 
wipes the floor with the TEST versions).
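
As a rough illustration of what the test functions compute (not the benchmark itself, which was sent separately), here is a minimal Python sketch. The function names are made up for this example, and the bit-by-bit loop only models the "TEST one bit, add to a running total" idea; it says nothing about timing.

```python
def popcount8_by_test(value: int) -> int:
    """Count set bits in the lowest 8 bits, one bit test at a time."""
    total = 0
    for bit in range(8):
        # Models TEST reg, (1 shl bit) followed by a conditional increment.
        if value & (1 << bit):
            total += 1
    return total


def popcount8_reference(value: int) -> int:
    """Reference result, standing in for a single POPCNT on the low byte."""
    return bin(value & 0xFF).count("1")


if __name__ == "__main__":
    for v in (0x00, 0x01, 0xFF, 0x123456A5):
        print(hex(v), popcount8_by_test(v), popcount8_reference(v))
```

Both functions ignore everything above bit 7, which is the property the benchmark relies on when it leaves the upper bits of the register non-zero.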


The program appears to run correctly on Win64 and Linux64, although I 
can't vouch for the performance on Linux because of the overhead caused 
by my copy being on a virtual machine.  The program does some extensive 
error checking on the results generated, and so far they haven't thrown 
anything up.


Let me know how it goes.

Gareth aka. Kit

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel




--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus


Re: [fpc-devel] x86_64 question

2020-10-05 Thread Nikolay Nikolov via fpc-devel


On 10/4/20 2:01 PM, J. Gareth Moreton via fpc-devel wrote:

Hi Nikolay,

I've got some good code to test, but I need to double-check with 
someone to see if the licensing agreements allow it (the code is rather 
complex, but showcases the effect of the TEST instructions quite nicely).


Is your platform a Windows or a Unix machine?  I ask because I don't 
want to send you functions that use the wrong calling convention!


I dual boot Linux and Windows, but prefer testing on Linux.

Best regards,


Nikolay





Re: [fpc-devel] x86_64 question

2020-10-04 Thread J. Gareth Moreton via fpc-devel

Hi Nikolay,

I've got some good code to test, but I need to double-check with someone 
to see if the licensing agreements allow it (the code is rather complex, 
but showcases the effect of the TEST instructions quite nicely).


Is your platform a Windows or a Unix machine?  I ask because I don't 
want to send you functions that use the wrong calling convention!


Gareth aka. Kit



Re: [fpc-devel] x86_64 question

2020-10-02 Thread J. Gareth Moreton via fpc-devel
Sure, I can send you something.  It might have to be to a personal 
e-mail though depending on how big the attachments are. Watch this space.


I may be a bit of a mad scientist when it comes to my testing and 
research (and sometimes I make a stupid mistake like with the recent 
nested function proposal, completely forgetting about the return 
address), but I hope I find the occasional useful gem that can be used 
and polished by others!


Gareth aka. Kit



Re: [fpc-devel] x86_64 question

2020-10-02 Thread Nikolay Nikolov via fpc-devel


On 10/2/20 2:13 PM, J. Gareth Moreton via fpc-devel wrote:
Confirmed my suspicions.  If I zero the upper bits of the register (I 
used something akin to "AND RCX, $F"), there is no speed loss.


Therefore, I can make the hypothesis, on my Intel(R) Core(TM) 
i7-10750H, that using TEST on a sub-register causes a false dependency 
if the bits outside of the subset are not zero, even though the 
register isn't being modified.


If you send me a test program, I can run it on my Ryzen 5 2500U to see 
how AMD behaves. We don't specifically optimize for AMD (yet), but it's 
interesting to know.


Nikolay





Re: [fpc-devel] x86_64 question

2020-10-02 Thread J. Gareth Moreton via fpc-devel
Confirmed my suspicions.  If I zero the upper bits of the register (I 
used something akin to "AND RCX, $F"), there is no speed loss.


Therefore, I can make the hypothesis, on my Intel(R) Core(TM) i7-10750H, 
that using TEST on a sub-register causes a false dependency if the bits 
outside of the subset are not zero, even though the register isn't being 
modified.


Gareth aka. Kit



Re: [fpc-devel] x86_64 question

2020-10-02 Thread J. Gareth Moreton via fpc-devel
So... I've done some tests, replacing TEST RCX, $4 with TEST CL, $4 and 
the like in a number-crunching function, and it seems to cause a notable 
penalty, even though none of the instructions are in my critical loop.  
So I think it's something that needs to be avoided in most cases.  I 
think the reason why it worked in my Int and Frac functions is because 
the processor knows the upper 48 bits of the register are zero.


Long story short... best not to do it unless you have some additional 
insight into what the registers contain.


Gareth aka. Kit




Re: [fpc-devel] x86_64 question

2020-10-02 Thread J. Gareth Moreton via fpc-devel

Ah brilliant, thank you.

I have used Agner Fog's material before for cycle counting.  When I 
implemented my 3 MOV -> XCHG optimisation 
(https://bugs.freepascal.org/view.php?id=36511), I used Agner Fog's 
empirical results to determine when it's best to apply this optimisation 
where speed is concerned.  On a lot of older processors, it's not worth 
it because XCHG took 3 cycles and the 3 MOVs generally took only 2 (due 
to how the dependency chain is set up).  Only when XCHG's cycle count 
dropped to 1 or 2, or when optimising for size, does it pay off.
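
To make the "3 MOVs generally took only 2" remark concrete, here is a small Python sketch of the dependency chain. It is an idealised model with assumptions that are not from the thread or from Agner Fog's tables: every MOV has a latency of one cycle, register renaming removes all false dependencies, and `ready_times` is a hypothetical helper written for this example.

```python
def ready_times(moves, latency=1):
    """Return the cycle at which each written register's value is ready.

    `moves` is a list of (dst, src) pairs in program order. The model assumes
    ideal register renaming, so only true read-after-write dependencies count.
    """
    ready = {}
    for dst, src in moves:
        # A source never written in this sequence is available at cycle 0.
        ready[dst] = ready.get(src, 0) + latency
    return ready


# Swap a and b through a temporary: tmp := a; a := b; b := tmp
three_movs = [("tmp", "a"), ("a", "b"), ("b", "tmp")]
times = ready_times(three_movs)
print(times, "critical path:", max(times.values()))
```

Under these assumptions the first two MOVs are independent of each other and only the third has to wait, so the swap completes in two cycles. That is why a 3-cycle XCHG loses on speed and only wins on size.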


So it looks like a partial read of the lower bits is absolutely fine, 
since you're not changing anything.


Gareth aka. Kit



Re: [fpc-devel] x86_64 question

2020-10-01 Thread Nikolay Nikolov via fpc-devel


On 10/1/20 11:36 PM, J. Gareth Moreton via fpc-devel wrote:
I think you always get a read penalty when using the high-byte 
registers because the processor has to do an implicit shift operation.


I don't remember the reason, but I recall reading they are less 
efficient in Agner Fog's optimization manual. Here's the relevant quote:


"Any use of the high 8-bit registers AH, BH, CH, DH should be avoided 
because it can cause false dependences and less efficient code."


It's from the chapter "Partial registers" (page 61) of this document:

https://www.agner.org/optimize/optimizing_assembly.pdf

Highly recommended reading, as it addresses exactly the topic of partial 
registers. In general, it is the partial register writes of 16-bit or 
8-bit subregisters that cause problems - either false read dependencies 
(usually on AMD) or extra penalties for joining/splitting registers (on 
Intel, at least in the P6 era).
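
A sketch of why partial writes are the awkward case, modelled in Python on plain integers. The helper names are invented for this example; the architectural behaviour they model (8-bit and 16-bit writes preserve the rest of the register, 32-bit writes zero the upper half) is standard x86_64, but the sketch says nothing about how any particular CPU implements it.

```python
MASK64 = (1 << 64) - 1


def write_al(rax: int, value: int) -> int:
    """8-bit write: the upper 56 bits of the old RAX must be preserved."""
    return (rax & ~0xFF & MASK64) | (value & 0xFF)


def write_ax(rax: int, value: int) -> int:
    """16-bit write: the upper 48 bits of the old RAX must be preserved."""
    return (rax & ~0xFFFF & MASK64) | (value & 0xFFFF)


def write_eax(rax: int, value: int) -> int:
    """32-bit write: the result is zero-extended, the old RAX is irrelevant."""
    return value & 0xFFFFFFFF


old = 0x1122334455667788
print(hex(write_al(old, 0xAB)), hex(write_ax(old, 0xABCD)), hex(write_eax(old, 0xAB)))
```

The first two results depend on the old register value, so the hardware either has to wait for it (a false dependency) or track the pieces separately and merge them later. The 32-bit write has no such input.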


Best regards,

Nikolay



Re: [fpc-devel] x86_64 question

2020-10-01 Thread J. Gareth Moreton via fpc-devel
I thought that might be the case - thanks Nikolay.  And I meant to say 
lower bits of a REGISTER, not an instruction!


Admittedly I'm cycle-counting and byte-counting again!  I was looking 
for ways to reduce 13 bytes of padding in one of my pure assembly 
language routines and realised I could make a saving there.  The only 
thing I can think of that I have to watch out for logically is if I 
change, say, TEST EAX, $80 to TEST AL, $80: the latter will set the sign 
flag if the most-significant bit of the 8-bit result is 1 after the 
'and' operation, while the former always clears the sign flag.
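
The flag difference can be checked with a small Python model of TEST. `flags_after_test` is a hypothetical helper for this example and only models ZF and SF, not the other flags TEST affects.

```python
def flags_after_test(value: int, mask: int, width: int) -> dict:
    """Model ZF and SF after TEST on an operand that is `width` bits wide."""
    result = (value & mask) & ((1 << width) - 1)
    return {
        "ZF": result == 0,
        # SF copies the most significant bit of the width-sized result.
        "SF": bool(result >> (width - 1)),
    }


print(flags_after_test(0x80, 0x80, 32))  # TEST EAX, $80
print(flags_after_test(0x80, 0x80, 8))   # TEST AL, $80
```

ZF agrees in both cases, so a following JZ/JNZ is unaffected. Only code that looks at SF afterwards (JS/JNS, for instance) would see a difference.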


I have used such subregisters before in the FPC RTL, in fpc_int_real and 
fpc_frac_real in rtl/x86_64/math.inc, where I read AX instead of the 
larger RAX, but that's only after a call to "SHR RAX, 48" that 
guarantees that everything above the 16th bit is zero, and after testing 
other implementation candidates in a kind of informal competition. 
(Surprisingly, I think "shr $48, %rax; and $0x7ff0,%ax; cmp $0x4330,%ax" 
runs faster than moving 64-bit constants into temporary registers (since 
64-bit immediates aren't supported outside of MOV) and using 'and' and 
'cmp' on %rax directly.)
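
For readers wondering what the three-instruction sequence checks, here is a Python sketch of the same arithmetic on an IEEE 754 double. The function names are invented, and how the RTL routines branch on the comparison is not shown in this thread, so the `>=` below is an assumption.

```python
import struct


def top16_exponent_field(x: float) -> int:
    """Mirror 'shr $48, %rax; and $0x7ff0, %ax' on the bits of a double."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    top16 = bits >> 48        # sign, 11 exponent bits, top 4 mantissa bits
    return top16 & 0x7FF0     # keep only the biased exponent, still shifted


def has_no_fraction_bits(x: float) -> bool:
    """Assumed meaning of 'cmp $0x4330, %ax': is the magnitude >= 2**52?"""
    # 0x4330 >> 4 = 0x433 = 1075 = 1023 + 52, the biased exponent of 2**52.
    return top16_exponent_field(x) >= 0x4330


for x in (1.5, 2.0 ** 52 - 0.5, 2.0 ** 52, -(2.0 ** 60)):
    print(x, hex(top16_exponent_field(x)), has_no_fraction_bits(x))
```

From 2**52 upwards a double has no bits left for a fractional part, which is what makes a single 16-bit compare sufficient. Note that infinities and NaNs (field 0x7FF0) also pass this comparison.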


I think you always get a read penalty when using the high-byte registers 
because the processor has to do an implicit shift operation.


Thanks again for the answer.

Gareth aka. Kit



Re: [fpc-devel] x86_64 question

2020-10-01 Thread Nikolay Nikolov via fpc-devel


On 10/1/20 8:17 PM, J. Gareth Moreton via fpc-devel wrote:

Hi everyone,

I have a small question with assembler size optimisation that maybe 
one of you guys can give me a second opinion on:


If you are using the "test" instruction to test some of the lower bits 
of an instruction, e.g. TEST RCX, $2, is there a penalty with calling 
TEST CL, $2 instead? The instruction size is a lot smaller on account 
of the immediate only being 1 byte long instead of 4 bytes, and are 
mathematically equivalent.  I know you have to be careful with partial 
write penalties, but partial reads seem to be a bit more nebulous (the 
register is not modified with TEST).


Yes, I think the shorter TEST CL, $2 is preferred over TEST RCX, $2 on 
every x86_64 CPU. AFAIK, there's no penalty for using 8-bit subregisters 
(except perhaps AH, BH, CH and DH, but the FPC code generator doesn't 
use them). Others can correct me if I'm wrong.


Nikolay



[fpc-devel] x86_64 question

2020-10-01 Thread J. Gareth Moreton via fpc-devel

Hi everyone,

I have a small question with assembler size optimisation that maybe one 
of you guys can give me a second opinion on:


If you are using the "test" instruction to test some of the lower bits 
of an instruction, e.g. TEST RCX, $2, is there a penalty with calling 
TEST CL, $2 instead? The instruction size is a lot smaller on account of 
the immediate only being 1 byte long instead of 4 bytes, and are 
mathematically equivalent.  I know you have to be careful with partial 
write penalties, but partial reads seem to be a bit more nebulous (the 
register is not modified with TEST).
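
For reference, the size difference can be tabulated as below. The byte sequences are hand-assembled from the standard TEST encodings (F7 /0 with a 32-bit immediate, F6 /0 with an 8-bit immediate, REX.W prefix 0x48 for the 64-bit form) rather than taken from this thread, so treat them as an assumption and check them against an assembler before relying on them.

```python
# Hand-assembled machine code for each form (ModRM 0xC1 selects RCX/ECX/CL).
ENCODINGS = {
    "test rcx, 2": bytes([0x48, 0xF7, 0xC1, 0x02, 0x00, 0x00, 0x00]),
    "test ecx, 2": bytes([0xF7, 0xC1, 0x02, 0x00, 0x00, 0x00]),
    "test cl, 2": bytes([0xF6, 0xC1, 0x02]),
}

for mnemonic, code in ENCODINGS.items():
    print(f"{mnemonic:12} {len(code)} bytes  {code.hex(' ')}")

saving = len(ENCODINGS["test rcx, 2"]) - len(ENCODINGS["test cl, 2"])
print("bytes saved by the 8-bit form:", saving)
```

Unlike AND or CMP, TEST has no form with a sign-extended 8-bit immediate for wider operands, which is why the only way to shrink the immediate is to switch to the 8-bit register.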


Gareth aka. Kit

