On Mon, 26 Apr 2021 17:10:13 GMT, Maurizio Cimadamore <mcimadam...@openjdk.org> 
wrote:

> This PR contains the API and implementation changes for JEP-412 [1]. A more 
> detailed description of such changes, to avoid repetitions during the review 
> process, is included as a separate comment.
> 
> [1] - https://openjdk.java.net/jeps/412

Here we list the main changes introduced in this PR. As usual, a big thank to 
all who helped along the way: @ChrisHegarty, @iwanowww, @JornVernee, 
@PaulSandoz and @sundararajana.

### Managing memory segments: `ResourceScope`

This PR introduces a new abstraction (first discussed 
[here](https://inside.java/2021/01/25/memory-access-pulling-all-the-threads/)), 
namely `ResourceScope` which is used to manage the lifecycle of resources 
associated with off-heap memory (such as `MemorySegment`, `VaList`, etc). This 
is probably the biggest change in the API, as `MemorySegment` is no longer 
`AutoCloseable`: it instead features a *scope accessor* which can be used to 
access the memory segment's `ResourceScope`; the `ResourceScope` is the new  
`AutoCloseable`. In other words, code like this:


try (MemorySegment segment = MemorySegment.allocateNative(100)) {
   ...
}   


Now becomes like this:


try (ResourceScope scope = ResourceScope.ofConfinedScope()) {
   MemorySegment segment = MemorySegment.allocateNative(100, scope);
   ...
}   


While simple cases where only one segment is allocated become a little more 
verbose, this new API idiom obviously scales much better when multiple segments 
are created with the same lifecycle. Another important fact, which is captured 
by the name of the `ResourceScope` factory in the above snippet, is that 
segments no longer feature dynamic ownership changes. These were cool, but 
ultimately too expensive to support in the shared case. Instead, the API now 
requires clients to make a choice upfront (confined, shared or *implicit* - 
where the latter means GC-managed, like direct buffers).

Implementation-wise, `ResourceScope` is implemented by a bunch of internal 
classes: `ResourceScopeImpl`, `ConfinedScope` and `SharedScope`. A resource 
scope impl has a so called *resource list* which can be also shared or 
confined. This is where cleanup actions are added; the resource list can be 
attached to a `Cleaner` to get implicit deallocation. There is a new test 
`TestResourceScope` to stress test the behavior of resource scopes, as well as 
a couple of microbenchmarks to assess the cost of creating/closing scopes 
(`ResourceScopeClose`) and acquiring/releasing them (`BulkMismatchAcquire`).

### IO operation on shared buffer views

In the previous iteration of the Memory Access API we have introduced the 
concept of *shared* segments. Shared segments are as easy to use as confined 
ones, and they are as fast. One problem with shared segments was that it wasn't 
clear how to support IO operations on byte buffers derived from such segments: 
since the segment memory could be released at any time, there was simply no way 
to guarantee that a shared segment could not be closed in the middle of a 
(possibly async) IO operation.

In this iteration, shared segments are just segments backed by a *shared 
resource scope*. The new API introduces way to manage the new complexity, in 
the form of two methods `ResourceScope::acquire` and `ResourceScope::release`, 
respectively, which can be used to *acquire* a resource scope. When a resource 
scope is in the acquired state, it cannot be closed (you can think of it as 
some slightly better and asymmetric form of an atomic reference counter).

This means we are now finally in a position to add support for IO operations on 
all byte buffers, including those derived from shared segments. A big thank to 
@ChrisHegarty who lead the effort here. More info on this work are included in 
his [writeup](https://inside.java/2021/04/21/fma-and-nio-channels/).

Most of the implementation for this feature occurs in the internal NIO 
packages; a new method on `Buffer` has been added to facilitate acquiring from 
NIO code - most of the logic associated with acquiring is in the `IOUtil` 
class. @ChrisHegarty has added many good tests for scoped IO operations under 
the `foreign/channels` folder, to check for all possible buffer/scope flavors.

### Allocating at speed: `SegmentAllocator`

Another abstraction introduced in this JEP is that of `SegmentAllocator`. A 
segment allocator is a functional interface which can be used to tell other 
APIs (and, crucially, the `CLinker` API) how segments should be allocated, if 
the need arise. For instance, think about some code which turns a Java string 
into a C string. Such code will invariably:

1. allocate a memory segment off heap
2. bulk copy (where possible) the content of the Java string into the off-heap 
segment
3. add a NULL terminator

So, in (1) such string conversion routine need to allocate a new off-heap 
segment; how is that done? Is that a call to malloc? Or something else? In the 
previous iteration of the Foreign Linker API, the method `CLinker::toCString` 
had two overloads: a simple version, only taking a Java string parameter; and a 
more advanced version taking a `NativeScope`. A `NativeScope` was, at its core, 
a custom segment allocator - but the allocation scheme was fixed in 
`NativeScope` as that class always acted as an arena-style allocator.

`SegmentAllocator` is like `NativeScope` in spirit, in that it helps programs 
allocating segments - but it does so in a more general way than `NativeScope`, 
since a `SegmentAllocator` is not tied to any specific allocation strategy: in 
fact the strategy is left there to be defined by the user. As before, 
`SegmentAllocator` does provide some common factories, e.g. to create an arena 
allocator similar to `NativeScope` - but the `CLinker` API is now free to work 
with _any_ implementations of the `SegmentAllocator` interface. This 
generalization is crucial, given that, when operating with off-heap memory, 
allocation performance is often the bottleneck.

Not only is `SegmentAllocator` accepted by all methods in the `CLinker` API 
which need to allocate memory: even the behavior of downcall method handle can 
now be affected by segment allocators: when linking a native function which 
returns a struct by value, the `CLinker` API will in fact need to dynamically 
allocate a segment to hold the result. In such cases, the method handle 
generated by `CLinker` will now accept an additional *prefix* parameter of type 
`SegmentAllocator` which tells `CLinker` *how* should memory be allocated for 
the result value. For instance, now clients can tell `CLinker` to return 
structs by value in *heap* segments, by using a `SegmentAllocator` which 
allocates memory on the heap; this might be useful if the segment is quickly 
discarded after use.

There's not much implementation for `SegmentAllocator` as most of it is defined 
in terms of `default` methods in the interface itself. However we do have 
implementation classes for the arena allocation scheme (`ArenaAllocator.java`). 
We support confined allocation and shared allocation. The shared allocation 
achieves lock-free by using a `ThreadLocal` of confined arena allocators. 
`TestSegmentAllocators` is the test which checks most of the arena allocation 
flavors.

### `MemoryAddress` as scoped entities

A natural consequence of introducing the `ResourceScope` abstraction is that 
now not only `MemorySegment` are associated with a scope, but even instances of 
`MemoryAddress` can be. This means extra safety, because passing addresses 
which are associated with a closed scope to a native function will issue an 
exception. As before, it is possible to have memory addresses which the runtime 
knows nothing about (those returned by native calls, or those created via 
`MemoryAddress::ofLong`); these addresses are simply associated with the so 
called *global scope* - meaning that they are not actively managed by the user 
and are considered to be "always alive" by the runtime (as before).

Implementation-wise, you will now see that `MemoryAddressImpl` is no longer a 
pair of `Object`/`long`. It is now a pair of `MemorySegment`/`long`. The 
`MemorySegment`, if present, tells us which segment this address has been 
obtained from (and hence which scope is associated with the address). If null, 
if means that the address has no segment, and therefore is associated with the 
global scope. The `long` part acts as an offset into the segment (if segment is 
non-null), or as an absolute address. A new test `SafeFunctionAccessTest` 
attempts to call native functions with (closed) scoped addresses to see if 
exceptions are thrown.

### *Virtual* downcall method handles

There are cases where the address of a downcall handle cannot be specified when 
a downcall method handle is linked, but can only be known subsequently, by 
doing more native calls. To better support these use cases, `CLinker` now 
provides a factory for downcall method handles which does *not* require any 
function entry point. Instead, such entry point will be provided *dynamically*, 
via an additional prefix parameter (of type `MemoryAddress`). Many thanks to 
@JornVernee who implemented this improvement.

The implementation changes for this range from tweaking the Java ABI support 
(to make sure that the prefix argument is handled as expected), to low-level 
hotspot changes to parameterize the generated compiled stub to use the address 
(dynamic) parameter. Note that regular downcall method handles (the ones that 
are bound to an address) are now simply obtained by getting a "virtual" method 
handle, and inserting a `MemoryAddress` coordinate in the first argument 
position. `TestVirtualCalls` has been written explicitly to test dynamic 
passing of address parameters but, in reality, all existing downcall tests are 
stressing the new implementation logic (since, as said before, the old logic is 
expressed as an adaptation of the new virtual method handles). The benchmark we 
used to test downcall performances `CallOverhead` has now been split into two: 
`CallOverheadConstant` vs. `CallOverheadVirtual`.

### Optimized upcall support

The previous iteration of the Foreign Linker API supported intrinsification of 
downcall handles, which allows calls to downcall method handles to perform as 
fast as a regular JNI call. The dual case, calling Java code from native code 
(upcalls) was left unoptimized. In this iteration, @JornVernee has added 
intrinsics support for upcalls as well as downcalls, based on some prior work 
from @iwanowww. As for downcalls, a lot of the adaptation now happens in Java 
code, before we jump into the target method handle. As for the code which calls 
such target handle, changes have been made so that the native code can jump to 
the optimized entry point (if one exists) for such method handle more directly. 
The performance improvements with this new approach are rather nice, with 
`CLinker` upcalls being 3x-4x faster compared with regular upcalls via JNI.

Again, here we have changes in the guts of the Java ABI support, as we needed 
to adjust the method handle specialization logic to be able to work in two 
directions (both from Java to native and from native to Java). On the Hotspot 
front, the optimization changes are in `universalUpcallHandler_x86_64.cpp`.

### Accessing restricted methods

It is still the case that some of the methods in the API are "restricted" and 
access to these methods is disabled by default. In previous iterations, access 
to such methods was granted by setting a JDK read-only runtime property: 
`-Dforeign.restricted=permit`. In this iteration we have refined the story for 
accessing restricted methods (thanks @sundararajana ), by introducing a new 
experimental command line option in the Java launcher, namely 
`--enable-native-access=<module list>`. This options accepts a list of modules 
(separated by commas), where a module name can also be `ALL-UNNAMED` (for the 
unnamed module). Adding this command line flag to the launcher has the effect 
of allowing access to restricted methods to a given set of modules (the list of 
modules specified in the command line option). Access to restricted methods 
from any other module not in the list is disallowed and will result in an 
`IllegalAccessException`.

When implementing this flag we considered two options: adding some 
resolution-time checks in the JVM (e.g. in `linkResolver`); or use 
`@CallerSensitive` methods. In the end we opted for the latter given that 
`@CallerSensitive` are generally well understood and optimized, and the general 
feeling was that inventing another form of callsite-dependent check might have 
been unnecessarily risky, given that the same checks can be implemented in Java 
using `@CallerSensitive`. We plan (not in 17) to add `javadoc` support by means 
of an annotation (like we do for preview API methods) so that the text that is 
currently copied and pasted in all restricted methods can be inferred 
automagically by javadoc.

### GitHub testing status

Most platforms build and tests pass. There are a bunch of *additional* Linux 
platforms which do not yet work correctly:

* Zero
* arm
* ppc
* s390

The first two can be addresses easily by stubbing out few functions (I'll do 
that shortly). The last two are harder, as this patch moves some static 
functions (e.g. `long_move`, `float_move`) up to `SharedRuntime`; 
unfortunately, while most platforms use the same signatures for these function, 
on ppc and s390 that's not the case and function with same name, but 
incompatible signatures are defined there, leading to build issues. We will try 
to tweak the code around this, to make sure that these platforms remain 
buildable.

Javadoc: 
http://cr.openjdk.java.net/~mcimadamore/JEP-412/v1/javadoc/jdk/incubator/foreign/package-summary.html
Specdiff: 
http://cr.openjdk.java.net/~mcimadamore/JEP-412/v1/specdiff/overview-summary.html

-------------

PR: https://git.openjdk.java.net/jdk/pull/3699

Reply via email to