On Feb 24, 2015, at 2:48 PM, Andrew Haley <[email protected]> wrote:
>>
>> I am all for keeping more code in Java if we can. I don't know enough about
>> assembler-based optimizations to determine if it might be possible to do
>> better on certain CPU architectures.
>
> Me either, but I have tested this on the architectures I have, and I
> suspect that C2 optimization is good enough. And we'd have to write
> assembly code for machines we haven't got; something for the future, I
> think.
>
Agreed.
>> One advantage, AFAIU, to intrinsics is they are not subject to the vagaries
>> of inlining thresholds. It's important that the loops operating over the
>> arrays be compiled efficiently, otherwise performance can drop off a
>> cliff if thresholds are reached within the loop. Perhaps these methods are
>> small enough that it is not an issue? And perhaps that is not a sufficient
>> argument to justify the cost of an intrinsic (and we should really be
>> tweaking the inlining mechanism instead)?
>
> Maybe so. There are essentially two ways to do this: new compiler
> node types which punt everything to the back end (and therefore
> require back-end authors to write them) or generic expanders, which is
> how many of the existing intrinsics are done. Generic C2 code would,
> I suspect, be worse than writing this in Java because it would be
> lacking profile data.
>
>> With that in mind is there any need to intrinsify the new methods at all
>> given those new Java methods can defer to the older ones based on a constant
>> check? Also should that anyway be done for the interpreter?
>>
>>
>> private static final boolean IS_UNALIGNED = theUnsafe.unalignedAccess();
>>
>> public void putIntUnaligned(Object o, long offset, int x) {
>>     if (IS_UNALIGNED || (offset & 3) == 0) {
>>         putInt(o, offset, x);
>>     } else if (byteOrder == BIG_ENDIAN) {
>>         putIntB(o, offset, x);
>>     } else {
>>         putIntL(o, offset, x);
>>     }
>> }
>
> Yes. It certainly could be done like this but I think C1 doesn't do
> the optimization to remove the IS_UNALIGNED test, so we'd still want
> the C1 builtins.
Hmm... if I run the following code on my Mac:

import java.nio.ByteOrder;

public class Y {
    // Just get a constant from somewhere...
    static final boolean X = ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN;

    public static void main(String[] args) {
        Y y = new Y();
        int s = 0;
        for (int i = 0; i < 1000000; i++) {
            s += y.x(i);
        }
        System.out.println(s);
    }

    public int x(int i) {
        if (X || (i & 3) == 0) {
            return i + i;
        } else {
            return Integer.sum(i, i);
        }
    }
}
With the options:
-XX:TieredStopAtLevel=1 -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly
Then Y.x is compiled to:
0x000000011155ea80: mov %eax,-0x14000(%rsp)
0x000000011155ea87: push %rbp
0x000000011155ea88: sub $0x30,%rsp ;*getstatic X
; - Y::x@0 (line 43)
0x000000011155ea8c: mov %rdx,%rax
0x000000011155ea8f: add %edx,%eax
0x000000011155ea91: add $0x30,%rsp
0x000000011155ea95: pop %rbp
0x000000011155ea96: test %eax,-0x62ca9c(%rip) # 0x0000000110f32000
; {poll_return}
0x000000011155ea9c: retq
For tiered compilation I do see code generated with a test, a jump and a de-opt:
0x00000001112fe020: mov %eax,-0x14000(%rsp)
0x00000001112fe027: push %rbp
0x00000001112fe028: sub $0x30,%rsp
0x00000001112fe02c: mov $0x125633860,%rax ; {metadata(method data for
{method} {0x0000000125633508} 'x' '(I)I' in 'Y')}
0x00000001112fe036: mov 0xdc(%rax),%edi
0x00000001112fe03c: add $0x8,%edi
0x00000001112fe03f: mov %edi,0xdc(%rax)
0x00000001112fe045: mov $0x125633508,%rax ; {metadata({method}
{0x0000000125633508} 'x' '(I)I' in 'Y')}
0x00000001112fe04f: and $0x1ff8,%edi
0x00000001112fe055: cmp $0x0,%edi
0x00000001112fe058: je 0x00000001112fe07f ;*getstatic X
; - Y::x@0 (line 43)
0x00000001112fe05e: mov $0x125633860,%rax ; {metadata(method data for
{method} {0x0000000125633508} 'x' '(I)I' in 'Y')}
0x00000001112fe068: incl 0x118(%rax) ;*ifne
; - Y::x@3 (line 43)
0x00000001112fe06e: mov %rdx,%rax
0x00000001112fe071: add %edx,%eax
0x00000001112fe073: add $0x30,%rsp
0x00000001112fe077: pop %rbp
0x00000001112fe078: test %eax,-0x62f07e(%rip) # 0x0000000110ccf000
; {poll_return}
0x00000001112fe07e: retq
0x00000001112fe07f: mov %rax,0x8(%rsp)
0x00000001112fe084: movq $0xffffffffffffffff,(%rsp)
0x00000001112fe08c: callq 0x0000000110e63c60 ; OopMap{rsi=Oop off=145}
;*synchronization entry
; - Y::x@-1 (line 43)
; {runtime_call}
But the method later gets recompiled to the shape of the former code, e.g.:
471 16 3 Y::x (22 bytes)
472 17 4 Y::x (22 bytes)
472 16 3 Y::x (22 bytes) made not entrant
> Perhaps we could do without the C2 builtins but they
> cost very little, they save C2 a fair amount of work, and they remove
> the vagaries of inlining.
'Tis true, the changes are small.
> I take your point about the interpreter,
> though.
>
>> I see you optimized the unaligned getLong by reading two aligned longs and
>> then bit twiddling. It seems harder to optimize putLong that way, since it
>> would straddle an aligned putInt with one to three required putByte calls.
>
> Sure, that's always a possibility. I have code to do it but it was
> all getting rather complicated for my taste.
I thought it might.
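For illustration, the two-aligned-reads trick discussed above can be sketched roughly like this. This is not the actual patch: the long[]-as-memory model, the method name, and the little-endian assumption are all mine.

```java
// Sketch: fetch an unaligned little-endian long by reading the two
// enclosing aligned words and shifting/or-ing the pieces together.
public class UnalignedReadSketch {
    // words[] models aligned 8-byte memory cells, little-endian byte order.
    static long getLongUnaligned(long[] words, long byteOffset) {
        int idx = (int) (byteOffset >>> 3);       // index of first aligned word
        int shift = (int) (byteOffset & 7) << 3;  // bit offset within that word
        if (shift == 0) {
            return words[idx];                    // already aligned: one read
        }
        long lo = words[idx] >>> shift;           // high bytes of first word
        long hi = words[idx + 1] << (64 - shift); // low bytes of second word
        return hi | lo;
    }

    public static void main(String[] args) {
        // Byte at offset n has value n + 1, so bytes 3..10 read back as
        // the little-endian long 0x0B0A090807060504.
        long[] words = { 0x0807060504030201L, 0x100F0E0D0C0B0A09L };
        System.out.println(Long.toHexString(getLongUnaligned(words, 3)));
    }
}
```

The shift == 0 short-circuit matters twice over: it avoids reading a second word past the end of the region, and it dodges the fact that a Java `<<` by 64 is a shift by 0, which would corrupt the result.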
>
>>> Also, these methods have the additional benefit that they are always atomic
>>> as long as the data are naturally aligned.
>>
>> We should probably document that in general access is not guaranteed to be
>> atomic, and that it is an implementation detail that access currently is
>> atomic when naturally aligned.
>
> I think that's a good idea. The jcstress tests already come up with a
> warning that the implementation is not atomic; this is not required,
> but a high-quality implementation will be.
>
>>> This does result in rather a lot of code for the methods for all sizes and
>>> endiannesses, but none of it will be used on machines with unaligned
>>> hardware support except in the interpreter. (Perhaps the interpreter too
>>> could have intrinsics?)
>>>
>>> I have changed HeapByteBuffer to use these methods, with a major
>>> performance improvement. I've also provided Unsafe methods to query
>>> endianness and alignment support.
>>
>> If we expose the endianness query via a new method in unsafe we should reuse
>> that in java.nio.Bits and get rid of the associated static code block.
>
> Sure, I already did that.
>
Locally, I guess? (just in case I missed something in the current webrev).
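For context, the static code block in java.nio.Bits that could go away is a runtime byte-order probe; a self-contained model of it (using ByteBuffer rather than Unsafe, so the names here are mine, not the actual Bits code):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Model of the kind of write-and-read-back probe an Unsafe endianness
// query would replace with a single call.
public class EndiannessSketch {
    static ByteOrder probe() {
        ByteBuffer b = ByteBuffer.allocate(4).order(ByteOrder.nativeOrder());
        b.putInt(0x01020304);
        // Big-endian stores the most significant byte first.
        return b.get(0) == 0x01 ? ByteOrder.BIG_ENDIAN : ByteOrder.LITTLE_ENDIAN;
    }

    public static void main(String[] args) {
        System.out.println(probe() == ByteOrder.nativeOrder());
    }
}
```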
Paul.