On Mon, 17 Feb 2025 15:02:32 GMT, Roland Westrelin <rol...@openjdk.org> wrote:
>> @rwestrel @galderz >> >>> It seems overall, we likely win more than we loose with this intrinsic, so >>> I would integrate this change as it is and file a bug to keep track of >>> remaining issues. >> >> I'm a little scared to just accept the regressions, especially for this >> "most average looking case": >> Imagine you have an array with random numbers. Or at least numbers in a >> random order. If we take the max, then we expect the first number to be max >> with probability 1, the second 1/2, the third 1/3, the i'th 1/i. So the >> average branch probability is `n / (sum_i 1/i)`. This goes closer and closer >> to zero, the larger the array. This means that the "average" case has an >> extreme probability. And so if we do not vectorize, then this gets us a >> regression with the current patch. And vectorization is a little fragile, it >> only takes very little for vectorization not to kick in. >> >>> The Min/Max nodes are floating nodes. They can hoist out of loop and common >>> reliably in ways that are not guaranteed otherwise. >> >> I suppose we could write an optimization that can hoist out loop independent >> if-diamonds out of a loop. If the condition and all phi inputs are loop >> invariant, you could just cut the diamond out of the loop, and paste it >> before the loop entry. >> >>> Shouldn't int min/max be affected the same way? >> >> I think we should be able to see the same issue here, actually. Yes. Here a >> quick benchmark below: >> >> >> java -XX:CompileCommand=compileonly,TestIntMax::test* >> -XX:CompileCommand=printcompilation,TestIntMax::test* -XX:+TraceNewVectors >> TestIntMax.java >> CompileCommand: compileonly TestIntMax.test* bool compileonly = true >> CompileCommand: PrintCompilation TestIntMax.test* bool PrintCompilation = >> true >> Warmup >> 5225 93 % 3 TestIntMax::test1 @ 5 (27 bytes) >> 5226 94 3 TestIntMax::test1 (27 bytes) >> 5226 95 % 4 TestIntMax::test1 @ 5 (27 bytes) >> 5238 96 4 TestIntMax::test1 (27 bytes) >> Run >> Time: 542056319 >> Warmup >> 6320 101 % 3 TestIntMax::test2 @ 5 (34 bytes) >> 6322 102 % 4 TestIntMax::test2 @ 5 (34 bytes) >> 6329 103 4 TestIntMax::test2 (34 bytes) >> Run >> Time: 166815209 >> >> That's a 4x regression on random input data! >> >> With: >> >> import java.util.Random; >> >> public class TestIntMax { >> private static Random RANDOM = new Random(); >> >> public static void main(String[] args) { >> int[] a = new int[64 * 1024]; >> for (int i = 0; i < a.length; i++) { >>... > >> I think we should be able to see the same issue here, actually. Yes. Here a >> quick benchmark below: > > I observe the same: > > > Warmup > 751 3 b TestIntMax::test1 (27 bytes) > Run > Time: 360 550 158 > Warmup > 1862 15 b TestIntMax::test2 (34 bytes) > Run > Time: 92 116 170 > > > But then with this: > > > diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad > index 8cc4a970bfd..9abda8f4178 100644 > --- a/src/hotspot/cpu/x86/x86_64.ad > +++ b/src/hotspot/cpu/x86/x86_64.ad > @@ -12037,16 +12037,20 @@ instruct cmovI_reg_l(rRegI dst, rRegI src, > rFlagsReg cr) > %} > > > -instruct maxI_rReg(rRegI dst, rRegI src) > +instruct maxI_rReg(rRegI dst, rRegI src, rFlagsReg cr) > %{ > match(Set dst (MaxI dst src)); > + effect(KILL cr); > > ins_cost(200); > - expand %{ > - rFlagsReg cr; > - compI_rReg(cr, dst, src); > - cmovI_reg_l(dst, src, cr); > + ins_encode %{ > + Label done; > + __ cmpl($src$$Register, $dst$$Register); > + __ jccb(Assembler::less, done); > + __ mov($dst$$Register, $src$$Register); > + __ bind(done); > %} > + ins_pipe(pipe_cmov_reg); > %} > > // > ============================================================================ > > > the performance gap narrows: > > > Warmup > 770 3 b TestIntMax::test1 (27 bytes) > Run > Time: 94 951 677 > Warmup > 1312 15 b TestIntMax::test2 (34 bytes) > Run > Time: 70 053 824 > > > (the number of test2 fluctuates quite a bit). Does it ever make sense to > implement `MaxI` with a conditional move then? To make it more explicit: implementing long min/max in ad files as cmp will likely remove all the 100% regressions that are observed here. I'm going to repeat the same MinMaxVector int min/max reduction test above with the ad changes @rwestrel suggested to see what effect they have. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2664903731