gf2121 opened a new issue, #16044:
URL: https://github.com/apache/lucene/issues/16044
### Description
> The following text was mostly written by an LLM, but I ran the benchmarks myself :)
## 1. Motivation: `NIOFSDirectory` is still relevant
In recent memory-constrained deployments (cgroup-limited containers with
large indices), `MMapDirectory` triggered severe page-fault storms:
`pgmajfault` rates spiked by an order of magnitude once the working set
exceeded the cgroup limit, and query latency degraded sharply. Switching to
`NIOFSDirectory` resolved the problem.
## 2. Problem: a JDK monitor caps `NIOFSDirectory` at ~4 threads
After moving more workloads onto `NIOFSDirectory`, we hit a hard scaling
ceiling. The bottleneck is **not** the kernel; it is a synchronized block in
`sun.nio.ch.FileChannelImpl`. Every positioned read registers the calling
thread in a `NativeThreadSet` (so a concurrent `close()` can interrupt it via
`pthread_kill`), and both the registration and the later removal take a
monitor shared by all readers of that channel.
```java
// sun.nio.ch.FileChannelImpl
private int readInternal(ByteBuffer dst, long position) throws IOException {
    int n = 0;
    int ti = -1;
    try {
        beginBlocking();
        // ↓↓↓ contention point: monitor-protected, on every single read ↓↓↓
        ti = threads.add();
        if (!isOpen()) return -1;
        do {
            // ... Blocker.begin / IOUtil.read(fd, dst, position, ...) / Blocker.end ...
        } while ((n == IOStatus.INTERRUPTED) && isOpen());
        return IOStatus.normalize(n);
    } finally {
        threads.remove(ti); // takes the same monitor again
        endBlocking(n > 0);
    }
}
```
```java
// sun.nio.ch.NativeThreadSet — the monitor every reader fights for
int add() {
    long th = NativeThread.current();
    synchronized (this) { // ← one monitor per channel, shared by all its readers
        // ... grow array, find free slot, write thread handle ...
    }
}
```
Past ~4 threads, this monitor's cache-line bouncing dominates the cost of
`pread64` itself, and throughput stops scaling. This is structurally tied to
the `Channel.close()` interruption contract and unlikely to be removed from the
JDK in the near term.
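The access pattern that runs into this monitor can be sketched as several threads issuing positional reads against one shared `FileChannel`. This is only a minimal illustration, not the JMH benchmark from this issue: the class name `SharedChannelReads` and its parameters are hypothetical, and it counts bytes rather than measuring throughput.

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: N threads doing random positional reads on ONE channel.
final class SharedChannelReads {

  /** Returns the total number of bytes read; the file must be at least readSize bytes long. */
  static long run(Path file, int threads, int readsPerThread, int readSize) throws Exception {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      long maxPos = ch.size() - readSize;
      ExecutorService pool = Executors.newFixedThreadPool(threads);
      try {
        List<Future<Long>> results = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
          Callable<Long> task = () -> {
            ByteBuffer dst = ByteBuffer.allocateDirect(readSize);
            long total = 0;
            for (int i = 0; i < readsPerThread; i++) {
              dst.clear();
              long pos = ThreadLocalRandom.current().nextLong(maxPos + 1);
              // Positional reads share no file position, but every call still
              // goes through the channel's per-read thread registration.
              while (dst.hasRemaining()) {
                int n = ch.read(dst, pos + dst.position());
                if (n < 0) break; // EOF
                total += n;
              }
            }
            return total;
          };
          results.add(pool.submit(task));
        }
        long sum = 0;
        for (Future<Long> f : results) sum += f.get();
        return sum;
      } finally {
        pool.shutdown();
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Path file = Files.createTempFile("shared-channel", ".bin");
    try {
      Files.write(file, new byte[1 << 20]); // 1 MiB of zeros
      System.out.println("bytes read: " + run(file, 8, 1_000, 16 * 1024));
    } finally {
      Files.deleteIfExists(file);
    }
  }
}
```

Timing `run` at increasing thread counts on a file that fits in the page cache is one rough way to look for the plateau, though a proper harness such as JMH is needed for trustworthy numbers.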
## 3. Benchmark: native `pread(2)` via Panama FFI scales 4× higher
JMH on Java 25, Linux x86_64, NVMe; 1 GiB file, 16 KiB random reads, 16
reads/op. Throughput in **ops/ms** (higher is better):
| Benchmark | 1 thr | 2 thr | 4 thr | 8 thr | 16 thr | 32 thr |
|---|---:|---:|---:|---:|---:|---:|
| `ffiPread` | 371.8 | 633.8 | 1104.5 | **1854.5** | **2838.1** | **2862.5** |
| `fileChannelReadDirect` | 358.9 | 428.1 | 683.4 | 637.3 | 737.0 | 737.4 |
| `fileChannelReadHeap` | 318.1 | 495.4 | 668.2 | 596.0 | 757.4 | 712.8 |
- 1 thread: FFI is ~4% faster than `fileChannelReadDirect` (~17% faster than
  `fileChannelReadHeap`): same syscall, less Java overhead.
- `FileChannel` plateaus at ~700 ops/ms from 4 threads onward; profiling
  shows time inside `NativeThreadSet`'s monitor.
- FFI keeps scaling through 16 threads (about 7.6× the single-thread rate),
  then flattens at 32, where it appears to hit the hardware ceiling.
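The ratios behind these observations can be recomputed directly from the table. The snippet below is just that arithmetic; the class name `ScalingRatios` is hypothetical and the figures are copied from the rows above.

```java
// Hypothetical helper: recomputes speedup ratios from the benchmark table.
final class ScalingRatios {

  /** Throughput ratio of numerator over denominator (both in ops/ms). */
  static double ratio(double numerator, double denominator) {
    return numerator / denominator;
  }

  public static void main(String[] args) {
    // ops/ms, copied from the table
    double ffi1 = 371.8, ffi16 = 2838.1, ffi32 = 2862.5;
    double direct1 = 358.9, direct32 = 737.4;

    System.out.printf("FFI vs FileChannel (direct), 1 thread:   %.2fx%n", ratio(ffi1, direct1));
    System.out.printf("FFI vs FileChannel (direct), 32 threads: %.2fx%n", ratio(ffi32, direct32));
    System.out.printf("FFI, 16 threads vs 1 thread:             %.2fx%n", ratio(ffi16, ffi1));
    System.out.printf("FFI, 32 threads vs 16 threads:           %.2fx%n", ratio(ffi32, ffi16));
  }
}
```

These work out to roughly 1.04×, 3.88×, 7.63× and 1.01× respectively.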
## 4. Proposal: `PreadDirectory`
A new `Directory` that performs random reads via `pread(2)` through Panama
FFI:
- **POSIX** → FFI `pread`. No `NativeThreadSet`, no monitor, stateless
syscall.
- **Non-POSIX** → fallback to `NIOFSDirectory`. Behavior never worse than
today;
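A minimal sketch of what a `pread(2)` downcall through the FFM API could look like, assuming Java 22+ and a 64-bit Linux or macOS libc. It is not the proposed `PreadDirectory`: the class `PreadSketch` and its `readAt` helper are hypothetical, the file is opened and closed on every call for brevity (a real implementation would presumably keep the descriptor open), and `errno` is not captured.

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: positional read via libc pread(2), no FileChannel involved.
final class PreadSketch {
  private static final Linker LINKER = Linker.nativeLinker();
  private static final int O_RDONLY = 0; // same value on Linux and macOS

  private static MethodHandle libc(String name, FunctionDescriptor desc, Linker.Option... opts) {
    return LINKER.downcallHandle(LINKER.defaultLookup().find(name).orElseThrow(), desc, opts);
  }

  // int open(const char *path, int flags, ...); the mode argument is variadic
  private static final MethodHandle OPEN = libc("open",
      FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS,
          ValueLayout.JAVA_INT, ValueLayout.JAVA_INT),
      Linker.Option.firstVariadicArg(2));

  // ssize_t pread(int fd, void *buf, size_t count, off_t offset); 64-bit widths assumed
  private static final MethodHandle PREAD = libc("pread",
      FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.JAVA_INT,
          ValueLayout.ADDRESS, ValueLayout.JAVA_LONG, ValueLayout.JAVA_LONG));

  // int close(int fd);
  private static final MethodHandle CLOSE = libc("close",
      FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.JAVA_INT));

  /** Reads up to length bytes at offset; the result is shorter if EOF is reached. */
  static byte[] readAt(Path file, long offset, int length) throws IOException {
    try (Arena arena = Arena.ofConfined()) {
      MemorySegment path = arena.allocateFrom(file.toString());
      int fd = (int) OPEN.invokeExact(path, O_RDONLY, 0);
      if (fd < 0) throw new IOException("open failed: " + file);
      try {
        MemorySegment buf = arena.allocate(length);
        long done = 0;
        while (done < length) { // pread may return fewer bytes than requested
          long n = (long) PREAD.invokeExact(fd, buf.asSlice(done), (long) length - done, offset + done);
          if (n < 0) throw new IOException("pread failed: " + file);
          if (n == 0) break; // EOF
          done += n;
        }
        return buf.asSlice(0, done).toArray(ValueLayout.JAVA_BYTE);
      } finally {
        int rc = (int) CLOSE.invokeExact(fd); // return value ignored in this sketch
      }
    } catch (IOException | RuntimeException e) {
      throw e;
    } catch (Throwable t) { // invokeExact is declared to throw Throwable
      throw new IOException("native call failed", t);
    }
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("pread-sketch", ".txt");
    try {
      Files.writeString(file, "hello pread");
      System.out.println(new String(readAt(file, 6, 5), StandardCharsets.US_ASCII));
    } finally {
      Files.deleteIfExists(file);
    }
  }
}
```

Because the downcall keeps no per-thread bookkeeping on the Java side, concurrent callers share nothing but the file descriptor. On recent JDKs, running this may also need `--enable-native-access` to avoid a restricted-method warning.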