On Tue, 18 Sep 2012, Richard Sandiford wrote:

> > Have you had time to think about this some more?  I am not sure I can 
> > guess how you'd like me to fix this patch now without some more specific 
> > review and/or suggestions about where the optimization should happen and 
> > what cases it should be extended to detect in addition to the dsp 
> > accumulator multiplies.
> 
> The patch below is the one I've been testing.  But I got sidetracked
> by looking into the possibility of removing the MD0_REG and MD1_REG
> classes, in order to get more sensible costs.  I think that was needed
> for the madd-9.c test to pass.

 Sorry to come up with this so late -- I have only now noticed this being 
discussed.

> @@ -4105,39 +4105,55 @@ mips_subword (rtx op, bool high_p)
>    return simplify_gen_subreg (word_mode, op, mode, byte);
>  }
>  
> -/* Return true if a 64-bit move from SRC to DEST should be split into two.  */
> +/* Return true if SRC can be moved into DEST using MULT $0, $0.  */
> +
> +static bool
> +mips_mult_move_p (rtx dest, rtx src)
> +{
> +  return (src == const0_rtx
> +       && REG_P (dest)
> +       && GET_MODE_SIZE (GET_MODE (dest)) == 2 * UNITS_PER_WORD
> +       && (ISA_HAS_DSP_MULT
> +           ? ACC_REG_P (REGNO (dest))
> +           : MD_REG_P (REGNO (dest))));
> +}
> +
> +/* Return true if a move from SRC to DEST should be split into two.  */

 Does the DSP ASE guarantee that a MULT $0, $0 is not going to be slower 
than MTHI $0/MTLO $0?  The latency of multiplication varies among 
implementations, for example the original R3000 took 12 cycles (of course 
the R3000 itself is not relevant for this change, but you see the 
picture!).  On the other hand, in some (but not all!) processors 
multiplication runs in parallel with the main pipeline, so what matters is 
the difference, if positive, between the latency of MULT run in the 
background and the number of cycles consumed by other instructions up to 
the next HI/LO access instruction.
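
 To put the arithmetic above in concrete terms, here is a minimal sketch.  
The helper name and its parameters are made up for illustration (nothing 
like this exists in GCC), and it assumes a core where the multiplier runs 
in the background:

```c
/* Hypothetical helper, not part of GCC: the number of cycles a HI/LO
   reader would stall behind a multiply running in the background.
   MULT_LATENCY is the multiply latency in cycles; INDEPENDENT_CYCLES
   is the number of cycles spent on other instructions issued between
   the MULT and the next HI/LO access.  */
int
exposed_mult_stall (int mult_latency, int independent_cycles)
{
  int stall = mult_latency - independent_cycles;

  /* If the other instructions fully cover the latency, nothing stalls.  */
  return stall > 0 ? stall : 0;
}
```

 With the 12-cycle figure quoted for the R3000 and 4 cycles of other work 
this gives a stall of 8 cycles; once the other work takes 12 cycles or more 
the multiply is effectively free.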

 From the context I am assuming none of this matters for the 74K (and 
presumably the 24KE/34K) and a MULT $0, $0 is indeed faster, but overall 
isn't it something that should be decided based on instruction costs from 
DFA schedulers?  Is there anything that I've missed here?  It doesn't 
appear to me that either your proposal or the original one takes 
instruction cost calculation into consideration.
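
 What I have in mind is roughly the comparison sketched here.  The 
structure, its fields and the function are all hypothetical, and where the 
per-core numbers would come from (presumably the scheduler descriptions) is 
exactly the open question, so treat this as an illustration only:

```c
#include <stdbool.h>

/* Hypothetical per-core costs, in cycles, of the instructions that can
   zero HI/LO.  The values would have to be supplied per processor;
   none are assumed here.  */
struct hilo_zero_costs
{
  int mult;	/* Cost of MULT $0, $0.  */
  int mthi;	/* Cost of MTHI $0.  */
  int mtlo;	/* Cost of MTLO $0.  */
};

/* Return true if zeroing HI/LO with a single MULT $0, $0 is no more
   expensive than the MTHI $0/MTLO $0 pair for the given costs.  */
bool
prefer_mult_zero (const struct hilo_zero_costs *costs)
{
  return costs->mult <= costs->mthi + costs->mtlo;
}
```

 A simple sum for the MTHI/MTLO pair is itself an assumption; a core that 
can overlap the two moves would need a different formula.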

  Maciej
