Re: RFR: 8264774: Implementation of Foreign Function and Memory API (Incubator)

Maurizio Cimadamore Tue, 27 Apr 2021 11:03:00 -0700

On Mon, 26 Apr 2021 17:10:13 GMT, Maurizio Cimadamore <mcimadam...@openjdk.org> 
wrote:

> This PR contains the API and implementation changes for JEP-412 [1]. A more
> detailed description of such changes, to avoid repetitions during the review
> process, is included as a separate comment.
>
> [1] - https://openjdk.java.net/jeps/412

Here we list the main changes introduced in this PR. As usual, a big thank you
to all who helped along the way: @ChrisHegarty, @iwanowww, @JornVernee,
@PaulSandoz and @sundararajana.

### Managing memory segments: `ResourceScope`

This PR introduces a new abstraction (first discussed
[here](https://inside.java/2021/01/25/memory-access-pulling-all-the-threads/)),
namely `ResourceScope` which is used to manage the lifecycle of resources
associated with off-heap memory (such as `MemorySegment`, `VaList`, etc). This
is probably the biggest change in the API, as `MemorySegment` is no longer
`AutoCloseable`: it instead features a *scope accessor* which can be used to
access the memory segment's `ResourceScope`; the `ResourceScope` is the new
`AutoCloseable`. In other words, code like this:

    try (MemorySegment segment = MemorySegment.allocateNative(100)) {
        ...
    }

now becomes like this:

    try (ResourceScope scope = ResourceScope.newConfinedScope()) {
        MemorySegment segment = MemorySegment.allocateNative(100, scope);
        ...
    }

While simple cases where only one segment is allocated become a little more
verbose, this new API idiom obviously scales much better when multiple segments
are created with the same lifecycle. Another important fact, which is captured
by the name of the `ResourceScope` factory in the above snippet, is that
segments no longer feature dynamic ownership changes. These were cool, but
ultimately too expensive to support in the shared case. Instead, the API now
requires clients to make a choice upfront (confined, shared or *implicit* -
where the latter means GC-managed, like direct buffers).

Implementation-wise, `ResourceScope` is implemented by a bunch of internal
classes: `ResourceScopeImpl`, `ConfinedScope` and `SharedScope`. A resource
scope impl has a so-called *resource list* which can also be shared or
confined. This is where cleanup actions are added; the resource list can be
attached to a `Cleaner` to get implicit deallocation. There is a new test
`TestResourceScope` to stress test the behavior of resource scopes, as well as
a couple of microbenchmarks to assess the cost of creating/closing scopes
(`ResourceScopeClose`) and acquiring/releasing them (`BulkMismatchAcquire`).
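
To make the resource-list idea concrete, here is a minimal sketch of a confined scope that collects cleanup actions and runs them on close. `ToyResourceList` and `addCloseAction` are hypothetical names made up for this illustration, and running the actions in reverse registration order is a choice of the sketch; none of this is the code of `ResourceScopeImpl`.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy confined scope (hypothetical): owns cleanup actions that run once, on close. */
final class ToyResourceList implements AutoCloseable {
    private final Deque<Runnable> cleanups = new ArrayDeque<>();
    private boolean closed = false; // single-threaded use only, like a confined scope

    /** Registers a cleanup action, e.g. "free this block of memory". */
    void addCloseAction(Runnable cleanup) {
        if (closed) {
            throw new IllegalStateException("Already closed");
        }
        cleanups.push(cleanup); // push to the front: the latest action runs first
    }

    /** Runs every registered action exactly once; later calls are no-ops. */
    @Override
    public void close() {
        if (closed) {
            return;
        }
        closed = true;
        while (!cleanups.isEmpty()) {
            cleanups.pop().run();
        }
    }
}
```

Used in a try-with-resources statement, every resource registered with the same `ToyResourceList` is released together when the block exits, which is the "same lifecycle" property described above.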

### IO operation on shared buffer views

In the previous iteration of the Memory Access API we introduced the
concept of *shared* segments. Shared segments are as easy to use as confined
ones, and they are as fast. One problem with shared segments was that it wasn't
clear how to support IO operations on byte buffers derived from such segments:
since the segment memory could be released at any time, there was simply no way
to guarantee that a shared segment could not be closed in the middle of a
(possibly async) IO operation.

In this iteration, shared segments are just segments backed by a *shared
resource scope*. The new API introduces a way to manage the new complexity, in
the form of two methods, `ResourceScope::acquire` and `ResourceScope::release`,
which can be used to *acquire* a resource scope and later *release* it. When a
resource scope is in the acquired state, it cannot be closed (you can think of
it as a slightly better, asymmetric form of an atomic reference counter).
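
The reference-counter analogy can be sketched as follows. This is a toy model under stated assumptions, not the `SharedScope` implementation: the class `ToyScope` and its parameterless `acquire()`/`release()` methods are hypothetical simplifications, and the real methods may have different signatures.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model (hypothetical) of a scope that cannot be closed while acquired. */
final class ToyScope implements AutoCloseable {
    private static final int CLOSED = -1;
    // >= 0: number of outstanding acquires; CLOSED: the scope has been closed.
    private final AtomicInteger state = new AtomicInteger(0);

    /** Registers one user of the scope; fails if the scope is already closed. */
    void acquire() {
        int current;
        do {
            current = state.get();
            if (current == CLOSED) {
                throw new IllegalStateException("Already closed");
            }
        } while (!state.compareAndSet(current, current + 1));
    }

    /** Drops one user previously registered with acquire(). */
    void release() {
        int current;
        do {
            current = state.get();
            if (current <= 0) {
                throw new IllegalStateException("Not acquired");
            }
        } while (!state.compareAndSet(current, current - 1));
    }

    boolean isAlive() {
        return state.get() != CLOSED;
    }

    /** Succeeds only when nobody currently holds the scope. */
    @Override
    public void close() {
        if (!state.compareAndSet(0, CLOSED)) {
            throw new IllegalStateException(
                    isAlive() ? "Scope is acquired" : "Already closed");
        }
    }
}
```

An IO operation would call `acquire()` before touching the memory and `release()` when done; a concurrent `close()` in between fails instead of freeing memory that is still in use.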

This means we are now finally in a position to add support for IO operations on
all byte buffers, including those derived from shared segments. A big thank you
to @ChrisHegarty, who led the effort here. More info on this work is included in
his [writeup](https://inside.java/2021/04/21/fma-and-nio-channels/).

Most of the implementation for this feature occurs in the internal NIO
packages; a new method on `Buffer` has been added to facilitate acquiring from
NIO code - most of the logic associated with acquiring is in the `IOUtil`
class. @ChrisHegarty has added many good tests for scoped IO operations under
the `foreign/channels` folder, to check for all possible buffer/scope flavors.

### Allocating at speed: `SegmentAllocator`

Another abstraction introduced in this JEP is that of `SegmentAllocator`. A
segment allocator is a functional interface which can be used to tell other
APIs (and, crucially, the `CLinker` API) how segments should be allocated, if
the need arises. For instance, think about some code which turns a Java string
into a C string. Such code will invariably:

1. allocate a memory segment off heap
2. bulk copy (where possible) the content of the Java string into the off-heap
segment
3. add a NULL terminator

So, in (1) such a string conversion routine needs to allocate a new off-heap
segment; how is that done? Is that a call to malloc? Or something else? In the
previous iteration of the Foreign Linker API, the method `CLinker::toCString`
had two overloads: a simple version, only taking a Java string parameter; and a
more advanced version taking a `NativeScope`. A `NativeScope` was, at its core,
a custom segment allocator - but the allocation scheme was fixed in
`NativeScope` as that class always acted as an arena-style allocator.

`SegmentAllocator` is like `NativeScope` in spirit, in that it helps programs
allocate segments - but it does so in a more general way than `NativeScope`,
since a `SegmentAllocator` is not tied to any specific allocation strategy: in
fact, the strategy is left to the user to define. As before,
`SegmentAllocator` does provide some common factories, e.g. to create an arena
allocator similar to `NativeScope` - but the `CLinker` API is now free to work
with _any_ implementation of the `SegmentAllocator` interface. This
generalization is crucial, given that, when operating with off-heap memory,
allocation performance is often the bottleneck.
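
The three string-conversion steps listed earlier, combined with a pluggable allocation strategy, can be sketched with plain NIO buffers. `ToyAllocator` and `CStrings` are hypothetical names standing in for `SegmentAllocator` and `CLinker::toCString`; the sketch only illustrates why a functional interface is enough to let the caller decide where memory comes from.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a segment allocator: byte size in, fresh buffer out. */
@FunctionalInterface
interface ToyAllocator {
    ByteBuffer allocate(int byteSize);
}

final class CStrings {
    private CStrings() {}

    /** Copies a Java string into memory obtained from the allocator, NUL-terminated. */
    static ByteBuffer toCString(String s, ToyAllocator allocator) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = allocator.allocate(bytes.length + 1); // step 1: allocate
        buf.put(bytes);                                        // step 2: bulk copy
        buf.put((byte) 0);                                     // step 3: terminator
        buf.flip(); // make the written bytes readable from position 0
        return buf;
    }
}
```

Passing `ByteBuffer::allocateDirect` places the string off-heap, while `ByteBuffer::allocate` places it on the heap; the conversion routine itself does not change.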

Not only is `SegmentAllocator` accepted by all methods in the `CLinker` API
which need to allocate memory; even the behavior of downcall method handles can
now be affected by segment allocators. When linking a native function which
returns a struct by value, the `CLinker` API will in fact need to dynamically
allocate a segment to hold the result. In such cases, the method handle
generated by `CLinker` will now accept an additional *prefix* parameter of type
`SegmentAllocator` which tells `CLinker` *how* memory should be allocated for
the result value. For instance, clients can now tell `CLinker` to return
structs by value in *heap* segments, by using a `SegmentAllocator` which
allocates memory on the heap; this might be useful if the segment is quickly
discarded after use.

There's not much implementation for `SegmentAllocator`, as most of it is
defined in terms of `default` methods in the interface itself. However, we do
have implementation classes for the arena allocation scheme
(`ArenaAllocator.java`). We support confined allocation and shared allocation.
Shared allocation is made lock-free by using a `ThreadLocal` of confined arena
allocators.
`TestSegmentAllocators` is the test which checks most of the arena allocation
flavors.
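
As a rough illustration of the arena scheme, the sketch below hands out consecutive slices of one pre-allocated block, and builds a shared variant from a `ThreadLocal` of confined arenas. `ToyArena` and `ToySharedArena` are hypothetical classes written for this example; they are not the contents of `ArenaAllocator.java`, and details such as block growth and alignment are left out.

```java
import java.nio.ByteBuffer;

/** Toy confined arena (hypothetical): slices one pre-allocated off-heap block. */
final class ToyArena {
    private final ByteBuffer block;
    private int offset = 0; // next free byte; single-threaded use only

    ToyArena(int capacity) {
        this.block = ByteBuffer.allocateDirect(capacity);
    }

    /** Returns a slice of byteSize bytes, bumping the offset past it. */
    ByteBuffer allocate(int byteSize) {
        if (byteSize < 0 || byteSize > block.capacity() - offset) {
            throw new IllegalStateException("Arena exhausted");
        }
        ByteBuffer view = block.duplicate(); // independent position and limit
        view.position(offset);
        view.limit(offset + byteSize);
        offset += byteSize;
        return view.slice();
    }

    int allocatedBytes() {
        return offset;
    }
}

/** Toy shared arena (hypothetical): one confined arena per thread, so no locking. */
final class ToySharedArena {
    private final ThreadLocal<ToyArena> perThread;

    ToySharedArena(int capacityPerThread) {
        this.perThread = ThreadLocal.withInitial(() -> new ToyArena(capacityPerThread));
    }

    ByteBuffer allocate(int byteSize) {
        return perThread.get().allocate(byteSize);
    }
}
```

Each allocation is a bounds check and an addition, which is why an arena can be much cheaper than allocating every segment separately.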

### `MemoryAddress` as scoped entities

A natural consequence of introducing the `ResourceScope` abstraction is that
not only are `MemorySegment` instances associated with a scope, but instances
of `MemoryAddress` can be too. This means extra safety, because passing
addresses which are associated with a closed scope to a native function will
throw an exception. As before, it is possible to have memory addresses which
the runtime
knows nothing about (those returned by native calls, or those created via
`MemoryAddress::ofLong`); these addresses are simply associated with the so
called *global scope* - meaning that they are not actively managed by the user
and are considered to be "always alive" by the runtime (as before).

Implementation-wise, you will now see that `MemoryAddressImpl` is no longer a
pair of `Object`/`long`. It is now a pair of `MemorySegment`/`long`. The
`MemorySegment`, if present, tells us which segment this address has been
obtained from (and hence which scope is associated with the address). If null,
it means that the address has no segment, and is therefore associated with the
global scope. The `long` part acts as an offset into the segment (if segment is
non-null), or as an absolute address. A new test `SafeFunctionAccessTest`
attempts to call native functions with (closed) scoped addresses to see if
exceptions are thrown.
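
The segment/`long` pair can be modelled in a few lines. This is a sketch under stated assumptions: `ToySegment` and `ToyAddress` are hypothetical classes, a boolean flag stands in for the resource scope, and the method names are chosen for the example rather than taken from `MemoryAddressImpl`.

```java
/** Toy segment (hypothetical): a base address plus a flag standing in for its scope. */
final class ToySegment {
    final long baseAddress;
    private boolean alive = true;

    ToySegment(long baseAddress) {
        this.baseAddress = baseAddress;
    }

    boolean isAlive() {
        return alive;
    }

    void close() {
        alive = false;
    }
}

/** Toy address (hypothetical) modelled as a (segment, long) pair. */
final class ToyAddress {
    private final ToySegment segment; // null means "global scope", always alive
    private final long offset;        // offset into segment, or absolute address

    private ToyAddress(ToySegment segment, long offset) {
        this.segment = segment;
        this.offset = offset;
    }

    /** An address the runtime knows nothing about, e.g. returned by a native call. */
    static ToyAddress ofLong(long absolute) {
        return new ToyAddress(null, absolute);
    }

    /** An address derived from a segment, and therefore tied to its lifecycle. */
    static ToyAddress inSegment(ToySegment segment, long offset) {
        return new ToyAddress(segment, offset);
    }

    /** Raw value to hand to native code; refuses if the owning segment is closed. */
    long toRawLongValue() {
        if (segment == null) {
            return offset;
        }
        if (!segment.isAlive()) {
            throw new IllegalStateException("Already closed");
        }
        return segment.baseAddress + offset;
    }
}
```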

### *Virtual* downcall method handles

There are cases where the address of the target native function cannot be
specified when a downcall method handle is linked, but can only be known
subsequently, by
doing more native calls. To better support these use cases, `CLinker` now
provides a factory for downcall method handles which does *not* require any
function entry point. Instead, such entry point will be provided *dynamically*,
via an additional prefix parameter (of type `MemoryAddress`). Many thanks to
@JornVernee who implemented this improvement.

The implementation changes for this range from tweaking the Java ABI support
(to make sure that the prefix argument is handled as expected), to low-level
hotspot changes to parameterize the generated compiled stub to use the address
(dynamic) parameter. Note that regular downcall method handles (the ones that
are bound to an address) are now simply obtained by getting a "virtual" method
handle, and inserting a `MemoryAddress` coordinate in the first argument
position. `TestVirtualCalls` has been written explicitly to test dynamic
passing of address parameters but, in reality, all existing downcall tests are
stressing the new implementation logic (since, as said before, the old logic is
expressed as an adaptation of the new virtual method handles). The benchmark we
used to test downcall performance, `CallOverhead`, has now been split into two:
`CallOverheadConstant` and `CallOverheadVirtual`.
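
The relationship between "virtual" and bound handles can be shown with the standard `java.lang.invoke` API alone. In this sketch a plain Java method, `callAt`, stands in for the native call stub and a `long` stands in for the `MemoryAddress`; both are assumptions made so the example runs without the incubator module. Only `MethodHandles.insertArguments` is the real mechanism being illustrated.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class VirtualHandleSketch {
    private VirtualHandleSketch() {}

    /** Stand-in for a native call stub: the entry point is the leading parameter. */
    static long callAt(long entryPoint, int arg) {
        return entryPoint + arg; // placeholder for "jump to entryPoint with arg"
    }

    /** A "virtual" handle of type (long, int)long: the address is supplied per call. */
    static MethodHandle virtualHandle() throws ReflectiveOperationException {
        return MethodHandles.lookup().findStatic(
                VirtualHandleSketch.class, "callAt",
                MethodType.methodType(long.class, long.class, int.class));
    }

    /** A bound handle of type (int)long, derived by fixing the leading address. */
    static MethodHandle boundHandle(long entryPoint) throws ReflectiveOperationException {
        return MethodHandles.insertArguments(virtualHandle(), 0, entryPoint);
    }
}
```

The bound handle no longer exposes the address parameter, so callers of the "regular" flavor see the same signature as before while sharing the virtual implementation underneath.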

### Optimized upcall support

The previous iteration of the Foreign Linker API supported intrinsification of
downcall handles, which allows calls to downcall method handles to perform as
fast as a regular JNI call. The dual case, calling Java code from native code
(upcalls) was left unoptimized. In this iteration, @JornVernee has added
intrinsics support for upcalls as well as downcalls, based on some prior work
from @iwanowww. As with downcalls, a lot of the adaptation now happens in Java
code, before we jump into the target method handle. As for the code which calls
such target handle, changes have been made so that the native code can jump to
the optimized entry point (if one exists) for such method handle more directly.
The performance improvements with this new approach are rather nice, with
`CLinker` upcalls being 3x-4x faster compared with regular upcalls via JNI.

Again, here we have changes in the guts of the Java ABI support, as we needed
to adjust the method handle specialization logic to be able to work in two
directions (both from Java to native and from native to Java). On the Hotspot
front, the optimization changes are in `universalUpcallHandler_x86_64.cpp`.

### Accessing restricted methods

It is still the case that some of the methods in the API are "restricted" and
access to these methods is disabled by default. In previous iterations, access
to such methods was granted by setting a JDK read-only runtime property:
`-Dforeign.restricted=permit`. In this iteration we have refined the story for
accessing restricted methods (thanks @sundararajana), by introducing a new
experimental command line option in the Java launcher, namely
`--enable-native-access=<module list>`. This option accepts a list of modules
(separated by commas), where a module name can also be `ALL-UNNAMED` (for the
unnamed module). Adding this command line flag to the launcher has the effect
of allowing access to restricted methods to a given set of modules (the list of
modules specified in the command line option). Access to restricted methods
from any other module not in the list is disallowed and will result in an
`IllegalAccessException`.

When implementing this flag we considered two options: adding some
resolution-time checks in the JVM (e.g. in `linkResolver`), or using
`@CallerSensitive` methods. In the end we opted for the latter, given that
`@CallerSensitive` methods are generally well understood and optimized, and the
general
feeling was that inventing another form of callsite-dependent check might have
been unnecessarily risky, given that the same checks can be implemented in Java
using `@CallerSensitive`. We plan (not in 17) to add `javadoc` support by means
of an annotation (like we do for preview API methods) so that the text that is
currently copied and pasted in all restricted methods can be inferred
automagically by javadoc.
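
A caller-dependent check of this kind can be approximated in ordinary Java with `StackWalker`. The sketch below is an analogy only: `@CallerSensitive` is a JDK-internal annotation that application code cannot use, and `NativeAccessSketch`, `isEnabled` and `restrictedOperation` are hypothetical names. The enabled-module set is passed in explicitly here, whereas the real flag is read by the launcher.

```java
import java.util.Set;

final class NativeAccessSketch {
    private NativeAccessSketch() {}

    /** True if the module is in the enabled list (hypothetical policy model). */
    static boolean isEnabled(Module module, Set<String> enabledModules) {
        // Unnamed modules are matched by the special name ALL-UNNAMED.
        String name = module.isNamed() ? module.getName() : "ALL-UNNAMED";
        return enabledModules.contains(name);
    }

    /** A "restricted" method: checks its immediate caller's module before proceeding. */
    static void restrictedOperation(Set<String> enabledModules)
            throws IllegalAccessException {
        Class<?> caller = StackWalker
                .getInstance(StackWalker.Option.RETAIN_CLASS_REFERENCE)
                .getCallerClass();
        if (!isEnabled(caller.getModule(), enabledModules)) {
            throw new IllegalAccessException(
                    "Restricted method called from " + caller.getModule());
        }
        // ... the actual restricted work would go here ...
    }
}
```

Because the check lives in Java, the JVM's link resolution does not need to change, which matches the trade-off described above.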

### GitHub testing status

Most platforms build and tests pass. There are a bunch of *additional* Linux
platforms which do not yet work correctly:

* Zero
* arm
* ppc
* s390

The first two can be addressed easily by stubbing out a few functions (I'll do
that shortly). The last two are harder, as this patch moves some static
functions (e.g. `long_move`, `float_move`) up to `SharedRuntime`;
unfortunately, while most platforms use the same signatures for these functions,
on ppc and s390 that's not the case: functions with the same names but
incompatible signatures are defined there, leading to build issues. We will try
to tweak the code around this, to make sure that these platforms remain
buildable.

Javadoc:
http://cr.openjdk.java.net/~mcimadamore/JEP-412/v1/javadoc/jdk/incubator/foreign/package-summary.html
Specdiff:
http://cr.openjdk.java.net/~mcimadamore/JEP-412/v1/specdiff/overview-summary.html

-------------

PR: https://git.openjdk.java.net/jdk/pull/3699
