+1 on Java 25. The JEP 491 argument is convincing on its own — skipping the ReentrantLock refactoring across FetchItemQueues is already worth it.
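For anyone who has not looked at what that refactoring would involve, here is a minimal sketch of the before/after shape. The class names are hypothetical stand-ins and this is not the actual FetchItemQueues code; it only illustrates the mechanical change a Java 21 baseline would push us towards.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a per-host queue guarded by a monitor.
// On Java 21, a virtual thread that blocks inside (or while waiting to
// enter) a synchronized method cannot unmount from its carrier thread.
class SynchronizedQueue {
    private final Deque<String> urls = new ArrayDeque<>();

    synchronized void add(String url) {
        urls.addLast(url);
    }

    synchronized String poll() {
        return urls.pollFirst(); // null when the queue is empty
    }
}

// The same queue after swapping the monitor for a ReentrantLock, which
// does not pin the carrier thread on Java 21.
class LockedQueue {
    private final ReentrantLock lock = new ReentrantLock();
    private final Deque<String> urls = new ArrayDeque<>();

    void add(String url) {
        lock.lock();
        try {
            urls.addLast(url);
        } finally {
            lock.unlock(); // always release, even if the body throws
        }
    }

    String poll() {
        lock.lock();
        try {
            return urls.pollFirst(); // null when the queue is empty
        } finally {
            lock.unlock();
        }
    }
}
```

Both variants behave the same from the caller's side. The point is only that every synchronized block on the fetch path would need this treatment on 21, while on 25 the first form can stay as it is.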

The OkHttp pool concern is real, but I think it is orthogonal. We already document that the pool does not scale past ~1000 total connections and point to protocol.instances.num as the way to spread load. Virtual threads would stress that limit more, but that is a problem to solve separately, regardless of which Java baseline we pick.

The same goes for the timeout executors. They are there because platform threads can get stuck in native I/O and interrupt is not always reliable. With virtual threads that improves, and we already have OkHttp call timeouts and socket timeouts underneath. I think the pattern could be simplified, though I would keep some bolt-level safety net either way.

One small thing: beingFetched in FetcherBolt is a String[] sized to threadCount, indexed by threadNum, and only used for debug logging. Minor, but a good example of the kind of thing that would need updating in the bolt when the model changes.

On Wed, 3 Jun 2026 at 09:18, Julien Nioche <[email protected]> wrote:

> Thanks Richard for the suggestion and thorough explanation.
> Two things come to my mind:
>
> - Need to check that it works with OKHttp's connection cache: IIRC
>   having too many connections made things slower because of the
>   implementation of their cache. If virtual threads means more
>   parallelism, wouldn't that be a bottleneck?
> - "Dropping the per-fetcher-thread timeout ExecutorServices" - this
>   was added recently to avoid threads getting blocked forever by the
>   protocol, which we did see in practice. Wouldn't we need a similar
>   mechanism with virtual threads?
>
> What do you think?
>
> Julien
>
> On Tue, 2 Jun 2026 at 11:55, Richard Zowalla <[email protected]> wrote:
>
> > Hi all,
> >
> > Storm 3.0 (currently in development upstream) raises its Java baseline
> > to 21 [1]. Once we move StormCrawler onto Storm 3, we will have to lift
> > our own baseline (currently Java 17) anyway, so I'd like to discuss
> > where we should land.
> >
> > My proposal: go directly to Java 25 (the current LTS) instead of
> > stopping at Java 21.
> >
> > Why not just 21?
> >
> > The main technical argument is virtual threads. Our fetch path is a
> > textbook use case for them: FetcherBolt today maintains a fixed pool of
> > platform threads (fetcher.threads.number), each spending most of its
> > life blocked on DNS / TLS / slow servers / timeouts. With virtual
> > threads we could move to a thread-per-fetch model where concurrency is
> > bounded by politeness rules and connection pools rather than by thread
> > count. For broad multi-host crawls this lifts the per-worker
> > concurrency ceiling from a few hundred to many thousands of in-flight
> > fetches, and removes fetcher.threads.number as the tuning knob that
> > users most often get wrong. (Single-host crawls see no difference -
> > politeness remains the cap there.)
> >
> > The catch with Java 21: virtual threads pin their carrier thread inside
> > synchronized blocks. Adopting them on a 21 baseline would mean
> > refactoring synchronized usage across the fetch path (FetchItemQueues,
> > ProtocolFactory, several external modules) to ReentrantLock. JEP 491
> > (JDK 24) removed this limitation, so on a Java 25 baseline most of that
> > refactoring simply isn't needed - we could adopt virtual threads in
> > FetcherBolt with a much smaller and safer change.
> >
> > Beyond that, 25 is an LTS like 21, with a longer support window.
> >
> > Practical considerations:
> >
> > - Users moving to Storm 3 have to upgrade their JVM to 21+ anyway; the
> >   additional step to 25 should be small for most, but it would exclude
> >   anyone whose organisation pins them to 21. Input welcome on whether
> >   this is a real concern for our user base.
> > - Storm 3 itself is built against 21; running it on a 25 JRE should be
> >   fine, but we'd want to validate this in our CI matrix.
> > - Dependency ecosystem on 25 needs a quick audit (I don't expect
> >   issues).
> >
> > What this would enable as follow-up work (separate threads/issues):
> >
> > - FetcherBolt: virtual-thread-per-fetch, deprecating
> >   fetcher.threads.number in favour of a (much higher) max-in-flight cap
> > - Dropping the per-fetcher-thread timeout ExecutorServices
> > - Decoupling HTTP connection pool sizing from thread count
> > - Pluggable async-capable DNS resolution (JEP 418 SPI)
> >
> > None of this would affect the current 3.x line - it would target the
> > release in which we adopt Storm 3.
> >
> > Looking forward to your thoughts.
> >
> > Regards
> > Richard
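
To make the virtual-thread-per-fetch item in the list above a bit more concrete, here is a rough sketch of what it could look like, assuming Java 21 or later. Everything in it is hypothetical: VirtualFetchSketch, fetchAll and the sleep standing in for the protocol call are illustration only, not FetcherBolt code. It shows a max-in-flight cap enforced by a Semaphore, and a concurrent set taking over the debug-logging role of beingFetched, since there is no stable threadNum to index by in this model.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: one virtual thread per fetch, bounded by a
// max-in-flight cap instead of a fixed thread count.
class VirtualFetchSketch {
    private final Semaphore inFlightCap;
    // Stands in for a String[] indexed by threadNum; assumes URLs are unique.
    private final Set<String> beingFetched = ConcurrentHashMap.newKeySet();
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    VirtualFetchSketch(int maxInFlight) {
        this.inFlightCap = new Semaphore(maxInFlight);
    }

    // Placeholder for the blocking protocol call; sleeps to simulate latency.
    private void fetch(String url) throws InterruptedException {
        Thread.sleep(20);
    }

    // Copy of the URLs currently in flight, e.g. for debug logging.
    Set<String> beingFetchedSnapshot() {
        return Set.copyOf(beingFetched);
    }

    // Fetches all URLs and returns the highest number of concurrent fetches seen.
    int fetchAll(List<String> urls) {
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (String url : urls) {
                // Blocks the submitting thread once the cap is reached.
                inFlightCap.acquireUninterruptibly();
                executor.submit(() -> {
                    beingFetched.add(url);
                    peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
                    try {
                        fetch(url);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        beingFetched.remove(url);
                        inFlight.decrementAndGet();
                        inFlightCap.release();
                    }
                });
            }
        } // close() waits for all submitted fetches to finish
        return peak.get();
    }
}
```

Calling new VirtualFetchSketch(4).fetchAll(urls) with 20 URLs should never report a peak above 4. Politeness and per-host queueing are left out entirely here; in the real bolt those would sit in front of the semaphore.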
