> Hi Richard, Jan and H.J.,
> Thanks for all the quick responses and suggestions.
> I had tested my patch when tuning for an arch without the LCP stalls,
> but it didn't hit an issue in reload because it didn't require
> rematerialization. Thanks for pointing out this issue.
> Regarding the penalty, it can be >=6 cycles for core2/corei7 so I
6 cycles is indeed quite severe and may pay for an extra spill. I guess the
easiest way is to benchmark the peephole variant and see which approach comes
out ahead. You may be able to see the differences better in 32-bit mode due
to register pressure issues.
> thought it would be best to force the splitting even when that would
> force the use of a new register, but it is possible that the peephole2
> approach will work just fine in the majority of the cases. Thanks for
> the peephole2 patch, H.J., I will test that solution out for the case
> I was trying to solve.
> Regarding the penalty on AMD, reading Agner's guide suggested that
> this could be a problem on Bulldozer, but only if there are >3
> prefixes, and I'm not sure how often that will occur for this type of
I cannot think of a case where the MOV instruction in question would have 3
prefixes. It can have an operand-size override and a REX prefix, but REX
usually does not count. You may try to benchmark Bulldozer, but I would be
surprised if there were any benefit.
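
To make the prefix counting concrete, here is a rough sketch (not taken from
GCC or from either patch in this thread) that counts the prefix bytes on a
few hand-assembled 64-bit encodings. The helper name count_prefixes and the
example byte strings are my own illustration; the encodings assume the store
goes through (%rax) or (%r8) and that %ecx is the scratch register.

```python
# Illustrative sketch only: count prefix bytes at the start of a single
# x86-64 instruction encoding. Hypothetical helper, not part of GCC.

LEGACY_PREFIXES = {
    0x66,                                # operand-size override
    0x67,                                # address-size override
    0xF0, 0xF2, 0xF3,                    # LOCK, REPNE, REP
    0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65,  # segment overrides
}


def count_prefixes(encoding):
    """Return (legacy, rex) prefix counts for one 64-bit-mode instruction."""
    legacy = 0
    rex = 0
    for byte in encoding:
        if byte in LEGACY_PREFIXES:
            legacy += 1
        elif 0x40 <= byte <= 0x4F:
            rex += 1
            break  # REX immediately precedes the opcode, so stop here
        else:
            break  # first non-prefix byte is the opcode
    return legacy, rex


# movw $0x1234, (%rax): 0x66 prefix plus a 16-bit immediate (the LCP case)
STORE_IMM16 = bytes([0x66, 0xC7, 0x00, 0x34, 0x12])
# movw $0x1234, (%r8): same store, with a REX.B prefix for %r8
STORE_IMM16_REX = bytes([0x66, 0x41, 0xC7, 0x00, 0x34, 0x12])
# Split form: movl $0x1234, %ecx ; movw %cx, (%rax)
SPLIT_LOAD = bytes([0xB9, 0x34, 0x12, 0x00, 0x00])
SPLIT_STORE = bytes([0x66, 0x89, 0x08])

for name, enc in [("store imm16", STORE_IMM16),
                  ("store imm16 + REX", STORE_IMM16_REX),
                  ("split: load", SPLIT_LOAD),
                  ("split: store", SPLIT_STORE)]:
    print(name, count_prefixes(enc))
```

Even the REX variant carries only one legacy prefix plus REX, well short of
the >3 threshold quoted above. The split store still has the 0x66 prefix, but
it no longer carries an immediate whose length the prefix changes.
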
We need to run some benchmarks for the generic/generic32 models on an AMD
machine anyway. I would guess that this transformation should be safe. The
cost of an extra register move is not high compared to the 16-bit store
overhead.