On Thu, 26 Mar 2026 17:42:05 GMT, Alan Bateman <[email protected]> wrote:

>> Francesco Nigro has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Disable edge-triggered epoll for POLLER_PER_CARRIER mode
>>   
>>   Per-carrier sub-pollers have carrier affinity, which creates a
>>   scheduling conflict with edge-triggered registrations: the sub-poller
>>   competes with user VTs for the same carrier. By the time the sub-poller
>>   runs, user VTs have often already consumed data via tryRead(), causing
>>   the sub-poller to find a POLLED sentinel and waste a full park/unpark
>>   cycle on the master (each costing an epoll_ctl). Under load this
>>   causes a 2x throughput regression.
>>   
>>   VTHREAD_POLLERS mode is unaffected because its sub-pollers have no
>>   carrier affinity and can run on any available carrier, processing
>>   events before user VTs consume the data.
>
> This looks like a 1% improvement in ops/sec. I think we'll need to get a more 
> real-world benchmark. Do you have something other than the micro.
> 
> Do you agree with the proposal to put this in its own branch so that we can 
> iterate on it?

@AlanBateman here is the first round of results from the existing benchmark, 
along with an explanation of a JMH issue I found.

## Benchmark: Edge-triggered epoll vs EPOLLONESHOT for VT read sub-pollers

Machine: AMD Ryzen 9 7950X 16-Core, Linux 6.19.8


```
JVM args: -Djdk.pollerMode=2 -Djdk.virtualThreadScheduler.parallelism=P
          -Djdk.virtualThreadScheduler.maxPoolSize=2*P -Djdk.readPollers=P
          -Djmh.executor=VIRTUAL

JMH:      -f 3 -wi 3 -w 5s -i 3 -r 10s -t 100 -p readSize=1
```


### Throughput (ops/s, higher is better)

<details>
<summary>Raw JMH output</summary>


```
EPOLLONESHOT (baseline):

Benchmark                              (readSize) (serverCount)   Mode  Cnt       Score       Error  Units
# parallelism=1, readPollers=1
SocketReadPollerBench.rpcRoundTrip              1             4  thrpt    9  123365.250 ±  1783.282  ops/s
# parallelism=2, readPollers=2
SocketReadPollerBench.rpcRoundTrip              1             4  thrpt    9  221708.710 ±  4651.978  ops/s
# parallelism=4, readPollers=4
SocketReadPollerBench.rpcRoundTrip              1             8  thrpt    9  436302.988 ± 11788.896  ops/s

EPOLLET (edge-triggered):

Benchmark                              (readSize) (serverCount)   Mode  Cnt       Score       Error  Units
# parallelism=1, readPollers=1
SocketReadPollerBench.rpcRoundTrip              1             4  thrpt    9  134129.081 ±  1199.593  ops/s
# parallelism=2, readPollers=2
SocketReadPollerBench.rpcRoundTrip              1             4  thrpt    9  244213.173 ±  3760.140  ops/s
# parallelism=4, readPollers=4
SocketReadPollerBench.rpcRoundTrip              1             8  thrpt    9  467931.437 ± 17245.761  ops/s
```


</details>


Ratio (ET / baseline):

              parallelism=1:  1.087x  (+8.7%)   non-overlapping CI
              parallelism=2:  1.102x  (+10.2%)  non-overlapping CI
              parallelism=4:  1.072x  (+7.2%)   non-overlapping CI


### async-profiler CPU breakdown (parallelism=1, 30s, ~58K samples)

| Component | EPOLLONESHOT | EPOLLET | Delta |
|:---|---:|---:|:---|
| `epoll_ctl` path | 2,183 (3.8%) | 0 | **eliminated** |
| Poller loop (carrier) | 3,747 (6.5%) | 1,589 (2.7%) | **-57%** |
| Continuation mount/unmount | 29,399 | 29,417 | unchanged |

### Note: `maxPoolSize=2*P` workaround

JMH's VIRTUAL executor uses VTs for both benchmark workers and iteration 
control (timing/warmdown signaling). At `parallelism>=2` with 100 busy VTs 
doing tight blocking I/O loops, the iteration control VT can get starved and 
never signal iteration end (`awaitWarmdownReady` hangs). Setting 
`maxPoolSize=2*parallelism` provides enough carrier headroom for the JMH 
control VTs to get scheduled. This is a JMH/scheduler interaction issue, not a 
Loom bug.

-------------

PR Comment: https://git.openjdk.org/loom/pull/223#issuecomment-4141801767
