m...@horizon.com wrote:

>>Note 3: Can anybody tell me why MSPGCC "forgets" some of the lines of 
>>the following code (same solution like #2)? E.g. the last two 
>>asm-instructions disappear.

> Because you aren't using "asm volatile" and you aren't using the
> results.  So GCC throws them away.

ACK - got it. Especially the 2nd reason is important.


asm( "<assembler_code>" : <destination_decl> : <source_decl>);

Over the last few days I have observed that if <assembler_code> consists of
more than one instruction /and reads the destination/, the destination has
to be declared not only in <destination_decl> but also in <source_decl>
(because it is read as well). Otherwise the compiler will optimize away the
assignments to the destination in the preceding C code, because it thinks
the value is never read again - a nasty pitfall (which I fell into ;-)).
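
As a minimal sketch of that rule, assuming a GCC-compatible compiler: GCC has
two standard spellings for an operand that the asm both reads and writes. The
asm templates are left empty so the sketch is target-independent, and the
function names bump_tied and bump_rw are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: 'acc' is listed as output and again as input.
 * The "0" constraint ties the input to operand 0, so the compiler knows
 * the old value is read and keeps the preceding C code alive. */
static inline uint16_t bump_tied(uint16_t acc)
{
    acc += 1;                               /* must not be optimized away */
    __asm__ ("" : "=r" (acc) : "0" (acc));  /* real instructions would go here */
    return acc;
}

/* Same idea using the read-write modifier "+" on a single operand. */
static inline uint16_t bump_rw(uint16_t acc)
{
    acc += 1;
    __asm__ ("" : "+r" (acc));
    return acc;
}
```

Both forms tell the compiler that the value computed before the asm is still
needed. Which one reads better is a matter of taste.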


 

> Remember, asm() does NOT disable GCC's optimizer.  It WILL rearrange
> the statements based on the register dependencies.  If you have any
> dependencies that you do NOT tell GCC about (like needing to write to
> __MAC before reading __RESLO), you are breaking the rules and your code
> is not guaranteed to work.

ACK - thanks for the explanation.
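
One common way to tell GCC about a dependency it cannot see is "asm volatile"
together with a "memory" clobber. The following is only a sketch under that
assumption: the template is empty so that it builds on any target, and the
variable shared and the function store_then_load are invented stand-ins, not
real multiplier registers.

```c
#include <stdint.h>

/* Hypothetical stand-in for a location the asm depends on. */
static uint16_t shared;

static inline uint16_t store_then_load(uint16_t v)
{
    shared = v;
    /* volatile: the statement is kept although it has no outputs.
     * "memory": the asm may read or write memory, so the compiler must
     * finish the store above first and reload 'shared' afterwards. */
    __asm__ volatile ("" : : : "memory");
    return shared;
}
```

For a sequence like the __MAC / __RESLO one, putting all the instructions into
a single asm() is the other way to keep them together and in order.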
 

> You ARE allowed to put multiple lines in a single asm().  Just stick
> a semicolon or a \n\t (the latter makes for prettier assembly code) in
> the string.  If you need a block of code to be emitted together, use that.

Yes, I know, but if you use objdump to check the result, the output is
formatted more nicely with separate asm statements (every asm line is
followed by its machine code).
 

> I'm not sure how you expected to return anything from the above

Simply because R12 to R15 are the registers used for passing function
results. They are used for passing function parameters as well as for the
result. Therefore I tried to avoid using other registers, because those have
to be PUSHed/POPed.

Have a look at the function calling convention
<http://mspgcc.sourceforge.net/manual/x1248.html>.


> , but
> the right way to write it is:
> 
> static uint64_t mul32(uint32_t x, uint32_t y)
> {
>       uint64_t product;
> 
>       asm("mov %A[x], &__MPY\n\t"
>           "mov %A[y], &__OP2\n\t"             // Form xl*yl
>           "mov %A[x], &__MAC\n\t"
>           "mov &__RESLO, %A[p]\n\t"           // Copy low word to product
>           "mov &__RESHI, &__RESLO\n\t"        // Shift result down
>           "mov #0, &__RESHI\n\t"
>           "mov %B[y], &__OP2\n\t"             // Add xl*yh
>           "mov %A[y], &__MAC\n\t"
>           "mov %B[x], &__OP2\n\t"             // Add yl*xh
>           "mov %B[x], &__MAC\n\t"
>           "mov &__RESLO, %B[p]\n\t"           // Copy second-lowest word to product
>           "mov &__RESHI, &__RESLO\n\t"        // Shift result down
>           "mov &__SUMEXT, &__RESHI\n\t"
>           "mov %B[y], &__OP2\n\t"             // Add xh*yh
>           "mov &__RESLO, %C[p]\n\t"
>           "mov &__RESHI, %D[p]" : [p] "=&r" (product) : [x] "%r" (x), [y] "r" (y));
>       return product;
> }

It looks good, but it is /really/ bad in terms of speed and code size.
This version needs 105 clocks per multiplication, because it PUSHes / POPs
R6 to R11. One simple reason for this is that product is a different
variable from x and y. In my solution I re-use the registers of x and y
and replace them step by step with the result.

My 2nd solution (which uses the same algorithm as the 3rd and therefore the
same as yours) uses 69 clocks. No PUSH / POP is necessary, because I
don't use any registers other than the parameter/result passing registers.

Better still is my 1st solution, which does not use MAC for the 1st
multiplication and was inspired by David Brown's nice algorithm. This 1st
solution needs 64 clocks.

David's solution is better again, because he has more registers free to
play with, which is not possible for a function.
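
Whichever variant is used, its result can be checked against a plain C
reference. The sketch below is not one of the solutions discussed here. It
only spells out the four 16x16 partial products that all of them accumulate,
and the name mul32_ref is made up for this purpose.

```c
#include <stdint.h>

/* Hypothetical reference: 32x32 -> 64 bit multiply built from four
 * 16x16 -> 32 bit partial products, with no hardware multiplier involved. */
static uint64_t mul32_ref(uint32_t x, uint32_t y)
{
    uint32_t xl = x & 0xFFFFu, xh = x >> 16;   /* low / high word of x */
    uint32_t yl = y & 0xFFFFu, yh = y >> 16;   /* low / high word of y */

    uint64_t ll = (uint64_t)xl * yl;           /* xl*yl, weight 2^0  */
    uint64_t lh = (uint64_t)xl * yh;           /* xl*yh, weight 2^16 */
    uint64_t hl = (uint64_t)xh * yl;           /* xh*yl, weight 2^16 */
    uint64_t hh = (uint64_t)xh * yh;           /* xh*yh, weight 2^32 */

    return ll + ((lh + hl) << 16) + (hh << 32);
}
```

Comparing an assembler version against this for a few corner cases (0,
0xFFFFFFFF, single-bit operands) catches most carry mistakes.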

 
> I wonder if you could speed the above up using %D0 as a temporary
> pointer and knowing that RESLO, RESHI and SUMEXT are consecutive in
> memory:
> 
>       asm("mov %A1, &__MPY\n\t"
>           "mov %A2, &__OP2\n\t"               // Form al*bl
>           "mov %A1, &__MAC\n\t"
>           "mov #__RESLO, %D0\n\t"
>           "mov @%DO+, %A0\n\t"                // Copy low word to product
>           "mov @%D0,-2(%D0)\n\t"              // Shift result down
>           "mov #0, @%D0\n\t"
>           "mov %B2, &__OP2\n\t"               // Add  al*bh
>           "mov %A2, &__MAC\n\t"
>           "mov %B1, &__OP2\n\t"               // Add bl*ah
>           "mov %B1, &__MAC\n\t"
>           "mov #__RESLO, %D0\n\t"
>           "mov @%DO+, %B0\n\t"                // Copy second-lowest word to product
>           "mov @%D0+,-4(%D0)\n\t"             // Shift result down
>           "mov @%D0,-2(%D0)\n\t"
>           "mov %B, &__OP2\n\t"                // Add ah*bh
>           "mov &__RESLO, %C0\n\t"
>           "mov &__RESHI, %D0" : "=&r" (product) : "%r" (x), "r" (y));
> 
> In the first bunch, it adds a two-word instruction and saves two
> instruction words, making a net no-op, but in the second, it saves
> three and adds two.

Hmmm ... this code does not work - you did not test it. ;-)
If you use a local pointer for accessing RESLO, RESHI and SUMEXT (using
@Rn+), you need to store that pointer somewhere -> PUSH / POP of a register
for it. I am sceptical whether this would be faster than my 1st solution.

Ralf
