> Hi Richard, Jan and H.J.,
>
> Thanks for all the quick responses and suggestions.
>
> I had tested my patch when tuning for an arch without the LCP stalls,
> but it didn't hit an issue in reload because it didn't require
> rematerialization. Thanks for pointing out this issue.
>
> Regarding the penalty, it can be >=6 cycles for core2/corei7 so I

6 cycles is indeed quite severe and may pay for an extra spill. I guess the
easiest way is to benchmark the peephole variant and see what comes first.
You may be able to see the differences better in 32-bit mode due to
register pressure issues.

> thought it would be best to force the splitting even when that would
> force the use of a new register, but it is possible that the peephole2
> approach will work just fine in the majority of the cases. Thanks for
> the peephole2 patch, H.J., I will test that solution out for the case
> I was trying to solve.
>
> Regarding the penalty on AMD, reading Agner's guide suggested that
> this could be a problem on Bulldozer, but only if there are >3
> prefixes, and I'm not sure how often that will occur for this type of

I cannot think of a case where the MOV instruction in question would have
3 prefixes. It can have an operand-size override and a REX prefix, but REX
usually does not count. You may try to benchmark Bulldozer, but I would be
surprised if there were any benefit.

We need to run some benchmarks for the generic/generic32 models on an AMD
machine anyway. I would guess that this transformation should be safe. The
cost of an extra register move is not high compared to the 16-bit store
overhead. Harsha?

Honza
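
For reference, here is a minimal sketch of the kind of source that produces the store under discussion. The struct and function names are made up for illustration, and the assembly in the comments is only what a compiler might plausibly emit on x86-64, not output from any particular GCC version or from the patch itself.

```c
#include <stdint.h>

/* Hypothetical example type with a 16-bit field. */
struct packet {
    uint16_t tag;
    uint32_t len;
};

/*
 * Store a 16-bit constant directly to memory. A compiler may emit a
 * single store with a 16-bit immediate (AT&T syntax, illustrative):
 *
 *     movw $0x1234, (%rdi)      # 66 C7 07 34 12
 *
 * The 0x66 operand-size prefix shortens the immediate from 4 bytes to
 * 2, which is the length-changing-prefix pattern that stalls the
 * decoder on core2/corei7.
 */
void set_tag_imm(struct packet *p)
{
    p->tag = 0x1234;
}

/*
 * Store a 16-bit value that already lives in a register. There is no
 * immediate here, so the 0x66 prefix does not change the instruction
 * length. The split form corresponds roughly to:
 *
 *     movl $0x1234, %eax        # B8 34 12 00 00
 *     movw %ax, (%rdi)          # 66 89 07
 *
 * with the first instruction being the extra register move.
 */
void set_tag_reg(struct packet *p, uint16_t value)
{
    p->tag = value;
}
```

Both functions leave the same value in memory; the only difference is whether the constant reaches the store as an immediate or through a register, which is the trade-off between the LCP stall and the extra register.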