On Thu, 26 Mar 2026 14:40:16 GMT, Alan Bateman <[email protected]> wrote:

>> Ported from 7ac9ca128885c5dd561e6fbd6bbeaddb86d6264c to the latest upstream 
>> fibers branch. Adapted to the current API which renamed 
>> implRegister/implDeregister to implStartPoll/implStopPoll and added 
>> Mode/EventFD/Cleaner/PollerGroup architecture.
>
> Just an FYI that we've experimented with epoll edge-triggered mode in the 
> past. The main concerns were that it's very fragile (it only works with specific 
> usage patterns) and adds complexity by way of bookkeeping. Yes, it can 
> reduce the need to re-arm a file descriptor, but overall it was never clear 
> whether significant benefits could be proven in real-world cases to justify the 
> complexity.
> 
> I'm not opposed to trying again but I think this requires creating a new 
> branch and iterating there. Would you be okay with that?
> 
> I think it would be useful to know what testing has been done so far. I did 
> some quick testing and saw failures/timeouts with HTTP3 tests, which seem to 
> be UDP or selection ops in the context of a virtual thread. I think it would 
> also be useful to see some benchmark data.

@AlanBateman I've added a JMH benchmark in 
https://github.com/openjdk/loom/pull/223/commits/28755d93663ce722b53cea38dabbb5701c2a6e1d.
 

IMO, since this is not a CPU-bound test, its results require some care to read.
Running it produces this comparison:

  ┌─────────────────────┬──────────┬────────────────┬─────────────────┐
  │       Counter       │ Baseline │ Edge-triggered │      Ratio      │
  ├─────────────────────┼──────────┼────────────────┼─────────────────┤
  │ ops/s               │ 105,620  │ 107,168        │ 1.01x           │
  ├─────────────────────┼──────────┼────────────────┼─────────────────┤
  │ cycles/op           │ 11,724   │ 3,272          │ 3.6x less       │
  ├─────────────────────┼──────────┼────────────────┼─────────────────┤
  │ instructions/op     │ 7,009    │ 2,031          │ 3.5x less       │
  ├─────────────────────┼──────────┼────────────────┼─────────────────┤
  │ branches/op         │ 1,513    │ 438            │ 3.5x less       │
  ├─────────────────────┼──────────┼────────────────┼─────────────────┤
  │ branch-misses/op    │ 115      │ 33             │ 3.5x less       │
  ├─────────────────────┼──────────┼────────────────┼─────────────────┤
  │ L1-dcache-loads/op  │ 2,764    │ 816            │ 3.4x less       │
  ├─────────────────────┼──────────┼────────────────┼─────────────────┤
  │ L1-dcache-misses/op │ 358      │ 101            │ 3.5x less       │
  ├─────────────────────┼──────────┼────────────────┼─────────────────┤
  │ stalled-frontend/op │ 5,144    │ 1,442          │ 3.6x less       │
  ├─────────────────────┼──────────┼────────────────┼─────────────────┤
  │ CPI                 │ 1.67     │ 1.61           │ slightly better │
  └─────────────────────┴──────────┴────────────────┴─────────────────┘

In short: for this specific case (tiny reads) it is a huge CPU saving, but it 
won't improve latency, which is bound by the loopback RTT.
Still, the CPU saving is there, and it is quite significant.
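
As a quick sanity check of the table: the quoted ratios and the CPI row follow directly from the raw per-op counters (CPI = cycles/op divided by instructions/op). Nothing below is new data, just the numbers above recomputed:

```java
// Throwaway sanity check: recompute the headline ratios and the CPI row
// from the per-op counters quoted in the table above.
public class RatioCheck {
    public static void main(String[] args) {
        double baseCycles = 11_724, etCycles = 3_272;   // cycles/op
        double baseInstr  = 7_009,  etInstr  = 2_031;   // instructions/op
        System.out.printf("cycles/op ratio: %.1fx less%n", baseCycles / etCycles); // 3.6x
        System.out.printf("baseline CPI:    %.2f%n", baseCycles / baseInstr);      // 1.67
        System.out.printf("ET CPI:          %.2f%n", etCycles / etInstr);          // 1.61
    }
}
```

So the flat ops/s together with ~3.5x fewer cycles, instructions, and cache accesses per op is exactly the signature of a latency-bound workload whose per-operation CPU cost dropped.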

As per 
https://github.com/openjdk/loom/pull/223/commits/7e36c5fe089db64cb7a4921aae9a9ef5f583fbb2
 instead: I have pushed a fix for a race condition (you rightly pointed out that 
ET is complex to deal with, and I agree :D ), which disables ET for 
pollerMode=3 due to this behaviour:
- with pollerMode=2, lazy submit allows a subpoller to first enqueue an 
awakened virtual thread locally, without finding the CHM entry in POLLED state
- with pollerMode=3 there is no "local" submit (unless a custom scheduler 
implements it!), so it is likely another FJ worker that competes with the 
subpoller and finds the POLLED state: this wastes a full park/unpark cycle on 
the master poller (each costing an epoll_ctl on it!)
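
For readers unfamiliar with the race, here is a minimal, self-contained sketch of the kind of per-fd bookkeeping ET needs. All class, field, and method names are hypothetical (only the POLLED state name is taken from the description above), and a plain map stands in for the real CHM registration; the point is just that latched readiness lets the consumer skip the epoll_ctl re-arm, while the losing path forces one.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of ET bookkeeping; not the actual Poller code.
public class EtStateSketch {
    enum State { WAITING, POLLED }  // POLLED: readiness latched by the subpoller

    static final ConcurrentHashMap<Integer, State> fdState = new ConcurrentHashMap<>();
    static final AtomicInteger epollCtlCalls = new AtomicInteger();

    // Subpoller side: epoll reported readiness for fd, so latch it.
    static void onEvent(int fd) {
        fdState.put(fd, State.POLLED);
    }

    // Consumer side: a thread about to block on fd checks for latched readiness.
    static boolean tryConsume(int fd) {
        if (fdState.remove(fd, State.POLLED)) {
            return true;                      // won the race: no epoll_ctl needed
        }
        fdState.put(fd, State.WAITING);       // lost the race: must wait for an event
        epollCtlCalls.incrementAndGet();      // models the extra re-arm (epoll_ctl) cost
        return false;
    }

    public static void main(String[] args) {
        onEvent(42);
        System.out.println(tryConsume(42));       // true  (readiness consumed for free)
        System.out.println(tryConsume(42));       // false (had to re-arm)
        System.out.println(epollCtlCalls.get());  // 1
    }
}
```

Under pollerMode=2 the local enqueue makes the "won the race" branch the common one; under pollerMode=3 a competing FJ worker makes the re-arm branch, and its epoll_ctl on the master poller, far more frequent.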

So, in short, the fix at 
https://github.com/openjdk/loom/pull/223/commits/1ac6dc35942e4a71a2a6dcac400cba6d992ee85b
 is good enough for pollerMode=2 (which rarely hits this race, unless stealing 
from the FJ worker's local queue), but pollerMode=3 with the built-in scheduler 
won't fare as well, bothering the master poller far too much.

-------------

PR Comment: https://git.openjdk.org/loom/pull/223#issuecomment-4136322872
