On Fri, 25 Oct 2024 15:02:03 GMT, Jorn Vernee <[email protected]> wrote:
>> This PR fixes an issue where passing many by-reference parameters to
>> downcall results in escape analysis failures.
>> The problem is that, as the parameters grow, the generated code in the
>> trampoline stub we generate also grows.
>> When it reaches a certain threshold, it becomes too big, and it is no longer
>> inlined in the caller.
>> When this happens, allocations for by-reference parameters (e.g. a segment
>> constructed from `MemorySegment::ofAddress`) can no longer be eliminated.
>>
>> The solution is two-fold. First, we annotate the generated trampoline with
>> `@ForceInline`. After all, it is rather critical, to guarantee optimal
>> performance, that this code can be always inlined.
>> Second, to keep the size of the generated code under control, we also put a
>> limit on the max number of comparisons that are generated in order to
>> "deduplicate" scope acquire/release calls.
>> The deduplication logic is a bit finicky -- it was put in place because,
>> when confined and shared are passed by-ref, we need to prevent them from
>> being closed in the middle of a native call.
>> So, we save all the seen scopes in a bunch of locals, and then we compare
>> each new scope with _all_ the previous cached locals, and skip acquire if we
>> can.
>>
>> While this strategy work it does not scale when there are many by-ref
>> parameters - as a by-ref parameter N will need N - 1 comparisons - which
>> means a quadratic number of comparisons is generated.
>> This is fixed in this PR by putting a lid on the maximum number of
>> comparisons that are generated. We also make the comparisons a bit smarter,
>> by always skipping the very first by-ref argument -- the downcall address.
>> It is in fact very common for the downcall address to be in a different
>> scope than that of the other by-ref arguments anyway.
>>
>> A nice property of the new logic is that by configuring the max number of
>> comparisons we can effectively select between different strategies:
>> * max = 0, means no dedup
>> * max = 1, means one-element cache
>> * max = N, means full dedup (like before)
>>
>> Thanks to Ioannis (@spasi) for the report and the benchmark. I've beefed the
>> benchmark up by adding a case for 10 arguments, and also adding support for
>> critical downcalls, so we could also test passing by-ref heap segments.
>> Benchmark result will be provided in a separate comment.
>
> test/micro/org/openjdk/bench/java/lang/foreign/libCallByRefHighArity.c line
> 33:
>
>> 31: EXPORT void noop_params5(int param0, int param1, void *param2, void
>> *param3, void *param4) {}
>> 32: EXPORT void noop_params10(int param0, int param1, void *param2, void
>> *param3, void *param4,
>> 33: int param5, int param6, void *param7, void
>> *param8, void *param9) {}
>
> Some of these are `int` which mismatches the `ADDRESS` layout used to create
> the function descriptor.
ugh - true! Will replace with `void*`
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/21706#discussion_r1817027739