Hi all,
Storm 3.0 (currently in development upstream) raises its Java baseline
to 21 [1]. Once we move StormCrawler onto Storm 3, we will have to lift
our own baseline (currently Java 17) anyway, so I'd like to discuss
where we should land.
My proposal: go directly to Java 25 (the current LTS) instead of
stopping at Java 21.

Why not just 21?

The main technical argument is virtual threads. Our fetch path is a
textbook use case for them: FetcherBolt today maintains a fixed pool of
platform threads (fetcher.threads.number), each spending most of its
life blocked on DNS / TLS / slow servers / timeouts. With virtual
threads we could move to a thread-per-fetch model where concurrency is
bounded by politeness rules and connection pools rather than by thread
count. For broad multi-host crawls this lifts the per-worker
concurrency ceiling from a few hundred to many thousands of in-flight
fetches, and removes fetcher.threads.number as the tuning knob that
users most often get wrong. (Single-host crawls see no difference -
politeness remains the cap there.)
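
To make the thread-per-fetch idea concrete, here is a minimal sketch of what it could look like. It is not existing StormCrawler code: the names VirtualThreadFetchSketch, fetchAll, simulateFetch and maxInFlight are all hypothetical, a Semaphore stands in for the max-in-flight cap, and a short sleep stands in for a real blocking fetch. It needs Java 21 or later.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch, not StormCrawler code: one virtual thread per fetch. */
class VirtualThreadFetchSketch {

    /** Runs one simulated fetch per URL and returns the peak number of concurrent fetches. */
    static int fetchAll(List<String> urls, int maxInFlight) {
        Semaphore inFlight = new Semaphore(maxInFlight); // assumed max-in-flight cap
        AtomicInteger current = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();

        // One new virtual thread per submitted task; there is no pool size to tune.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (String url : urls) {
                inFlight.acquireUninterruptibly(); // submitter waits once the cap is reached
                executor.submit(() -> {
                    try {
                        int now = current.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        simulateFetch(url);
                    } finally {
                        current.decrementAndGet();
                        inFlight.release();
                    }
                });
            }
        } // close() waits for all submitted fetches to finish
        return peak.get();
    }

    /** Placeholder for a blocking fetch (DNS, TLS, slow server). */
    static void simulateFetch(String url) {
        try {
            Thread.sleep(20);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In a real FetcherBolt the cap would sit next to the politeness rules and connection pool limits rather than replace them; the point of the sketch is only that concurrency is limited by an explicit counter, not by the number of threads.
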
The catch with Java 21: a virtual thread that blocks inside a
synchronized block or method pins its carrier thread. Adopting virtual
threads on a 21 baseline would mean
refactoring synchronized usage across the fetch path (FetchItemQueues,
ProtocolFactory, several external modules) to ReentrantLock. JEP 491
(JDK 24) removed this limitation, so on a Java 25 baseline most of that
refactoring simply isn't needed - we could adopt virtual threads in
FetcherBolt with a much smaller and safer change.
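
For illustration, this is roughly the kind of change a Java 21 baseline would force on us. Both classes are hypothetical stand-ins for a per-host queue, not the real FetchItemQueues code: the first uses synchronized and would stay as-is on a JDK with JEP 491, the second is the ReentrantLock rewrite needed to avoid pinning on 21.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical queue guarded by synchronized; no pinning on JDK 24+ (JEP 491). */
class SynchronizedQueueSketch {
    private final Deque<String> urls = new ArrayDeque<>();

    synchronized void add(String url) {
        urls.addLast(url);
    }

    synchronized String poll() {
        return urls.pollFirst(); // null when the queue is empty
    }
}

/** The same hypothetical queue rewritten with ReentrantLock, as Java 21 would require. */
class LockedQueueSketch {
    private final Deque<String> urls = new ArrayDeque<>();
    private final ReentrantLock lock = new ReentrantLock();

    void add(String url) {
        lock.lock();
        try {
            urls.addLast(url);
        } finally {
            lock.unlock(); // always release, even if the body throws
        }
    }

    String poll() {
        lock.lock();
        try {
            return urls.pollFirst(); // null when the queue is empty
        } finally {
            lock.unlock();
        }
    }
}
```

The behaviour is identical; the cost is that every synchronized method or block on the fetch path would need this mechanical rewrite and review.
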
Beyond that, 25 is an LTS like 21, with a longer support window.

Practical considerations:

- Users moving to Storm 3 have to upgrade their JVM to 21+ anyway; the
  additional step to 25 should be small for most, but it would exclude
  anyone whose organisation pins them to 21. Input welcome on whether
  this is a real concern for our user base.
- Storm 3 itself is built against 21; running it on a 25 JRE should be
  fine, but we'd want to validate this in our CI matrix.
- Dependency ecosystem on 25 needs a quick audit (I don't expect
  issues).

What this would enable as follow-up work (separate threads/issues):

- FetcherBolt: virtual-thread-per-fetch, deprecating
  fetcher.threads.number in favour of a (much higher) max-in-flight cap
- Dropping the per-fetcher-thread timeout ExecutorServices
- Decoupling HTTP connection pool sizing from thread count
- Pluggable async-capable DNS resolution (JEP 418 SPI)

None of this would affect the current 3.x line - it would target the
release in which we adopt Storm 3.
Looking forward to your thoughts.
Regards
Richard