On Mon, 6 Feb 2023 17:39:42 GMT, Paul Sandoz <[email protected]> wrote:
>> Xiaohong Gong has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>> Add smaller array size for benchmark tests
>
> I think it would be useful to adjust the naming and comments of some methods
> to make it clearer the method parameter constraints.
>
> `indexInRange0Helper` is now called if the index is partially or totally out
> of range at the lower or upper ends and `indexInRange0` is called if
> partially or totally out of range at the upper end.
> e.g. a more literal naming could be:
> `AbstractMask::indexInRange0Helper` ->
> `AbstractMask::indexPartiallyInRangeHelper`
> `VectorSupport::indexInRange` -> VectorSupport::indexPartiallyInUpperRange`
> ?
>
> IIUC the performance numbers show that when the array is not a multiple of
> the vector size there is still quite an impact overall to calling
> `VectorSupport::indexInRange` for the last loop iteration. I am guessing the
> overall loop shape is different which impacts other optimizations?
>
> To do this more optimally likely requires a loop transformation where the
> last loop iteration is peeled off, but that's a harder transformation in one
> of the more complicated areas of C2 (although it already supports pre/post
> loop, so maybe its possible to leverage that?).
Thanks for looking at this PR @PaulSandoz !
> I think it would be useful to adjust the naming and comments of some methods
> to make it clearer the method parameter constraints.
>
> `indexInRange0Helper` is now called if the index is partially or totally out
> of range at the lower or upper ends and `indexInRange0` is called if
> partially or totally out of range at the upper end. e.g. a more literal
> naming could be: `AbstractMask::indexInRange0Helper` ->
> `AbstractMask::indexPartiallyInRangeHelper` `VectorSupport::indexInRange` ->
> VectorSupport::indexPartiallyInUpperRange` ?
The renaming looks good to me. Thanks!
> IIUC the performance numbers show that when the array is not a multiple of
> the vector size there is still quite an impact overall to calling
> `VectorSupport::indexInRange` for the last loop iteration. I am guessing the
> overall loop shape is different which impacts other optimizations?
I think the main influence of the benchmark result comes from the masked `
fromArray()/intoArray()` APIs, especially the masked intoArray() API. For the
tail loop part, there is the vector boxing needed on all architectures, with
the following reasons:
- If the architecture doesn't support predicate feature, it cannot be
intrinsified.
- The `checkMaskFromIndexSize` called inside the `else->if` branch may not be
inlined, and the `indexInRange()` generated mask `m` needs the boxing before it.
public final
void intoArray(double[] a, int offset,
VectorMask<Double> m) {
if (m.allTrue()) {
intoArray(a, offset);
} else {
DoubleSpecies vsp = vspecies();
if (!VectorIntrinsics.indexInRange(offset, vsp.length(), a.length))
{
checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
}
intoArray0(a, offset, m);
}
}
If the array size is aligned with the vector size, the generated `m` is all
true. Hence, the non-masked `intoArray()` is called instead, which improves the
performance a lot.
Regarding to the `indexInRange()` API implementation, if the array size is the
multiple num of vector size, the branch for the tail loop part is optimized out
to an uncommon-trap by C2 compiler, which may improves the performance as well.
Regarding to the added benchmark, since it is a testing for `indexInRange`,
maybe we can remove the calling to the masked `fromArray()/intoArray()` APIs
and directly save the mask into a boolean array instead. I guess this may
reduce the overall performance gap.
>
> To do this more optimally likely requires a loop transformation where the
> last loop iteration is peeled off, but that's a harder transformation in one
> of the more complicated areas of C2 (although it already supports pre/post
> loop, so maybe its possible to leverage that?).
Yes, it is!
-------------
PR: https://git.openjdk.org/jdk/pull/12064