> Hi Richard, Jan and H.J.,
> 
> Thanks for all the quick responses and suggestions.
> 
> I had tested my patch when tuning for an arch without the LCP stalls,
> but it didn't hit an issue in reload because it didn't require
> rematerialization. Thanks for pointing out this issue.
> 
> Regarding the penalty, it can be >=6 cycles for core2/corei7 so I

6 cycles is indeed quite severe and may pay for an extra spill. I guess
the easiest way is to benchmark the peephole variant and see which one
wins. You may be able to see the differences better in 32-bit mode due
to register pressure issues.
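
For reference, a minimal sketch of the kind of store under discussion.
The struct and function names are made up for illustration, and the
assembly in the comments is only what one would expect for x86-64; the
actual output depends on the compiler version and -mtune setting.

```c
#include <stdint.h>

/* Hypothetical type, used only to get a 16-bit memory destination. */
struct packet {
    uint16_t tag;
};

/*
 * Storing a 16-bit constant can be encoded as a single instruction:
 *
 *     movw $0x1234, (%rdi)      # 66 C7 07 34 12
 *
 * The 0x66 prefix changes the immediate from 4 bytes to 2, which is
 * the length-changing-prefix (LCP) case that stalls the decoder.
 *
 * The split form loads the constant with a 32-bit move first:
 *
 *     movl $0x1234, %eax        # B8 34 12 00 00
 *     movw %ax, (%rdi)          # 66 89 07
 *
 * The second instruction still has 0x66 but no immediate, so the
 * prefix no longer changes the instruction length. The price is one
 * scratch register.
 */
void set_tag(struct packet *p)
{
    p->tag = 0x1234;
}
```

The scratch register is where the 32-bit register pressure concern
comes from: if none is free, the split form needs a spill.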

> thought it would be best to force the splitting even when that would
> force the use of a new register, but it is possible that the peephole2
> approach will work just fine in the majority of the cases. Thanks for
> the peephole2 patch, H.J., I will test that solution out for the case
> I was trying to solve.
> 
> Regarding the penalty on AMD, reading Agner's guide suggested that
> this could be a problem on Bulldozer, but only if there are >3
> prefixes, and I'm not sure how often that will occur for this type of

I cannot think of a case where the MOV instruction in question would have
3 prefixes. It can have an operand-size override and a REX prefix, but REX
usually does not count.  You may try to benchmark Bulldozer, but I would be
surprised if there were any benefit.
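
To make the prefix count concrete, here is a rough sketch that counts
the leading prefixes of an encoded instruction. The helper names are
invented for this example, it assumes 64-bit mode (where 0x40-0x4F is
REX), and it only looks at the bytes in front of the opcode; it is not
a real decoder.

```c
#include <stddef.h>
#include <stdint.h>

struct prefix_count {
    int legacy;  /* 0x66, 0x67, LOCK/REP, segment overrides */
    int rex;     /* 1 if a REX byte precedes the opcode */
};

/* Legacy prefix bytes of the x86 instruction encoding. */
static int is_legacy_prefix(uint8_t b)
{
    switch (b) {
    case 0x66: case 0x67:                 /* operand/address size */
    case 0xF0: case 0xF2: case 0xF3:      /* LOCK, REPNE, REP */
    case 0x26: case 0x2E: case 0x36:      /* ES, CS, SS */
    case 0x3E: case 0x64: case 0x65:      /* DS, FS, GS */
        return 1;
    default:
        return 0;
    }
}

/* Count legacy prefixes, then check for a REX byte (64-bit mode only). */
struct prefix_count count_prefixes(const uint8_t *insn, size_t len)
{
    struct prefix_count pc = { 0, 0 };
    size_t i = 0;

    while (i < len && is_legacy_prefix(insn[i])) {
        pc.legacy++;
        i++;
    }
    if (i < len && (insn[i] & 0xF0) == 0x40)
        pc.rex = 1;
    return pc;
}
```

For movw $0x1234, (%rax), encoded as 66 C7 00 34 12, this gives one
legacy prefix and no REX. With %r8 as the base register the encoding
becomes 66 41 C7 00 34 12: still one legacy prefix, plus REX.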

We need to run some benchmarks for the generic/generic32 models on an AMD
machine anyway.  I would guess that this transformation should be safe. The
cost of an extra register move is not high compared to the 16-bit store
overhead.
Harsha?

Honza
