rzo1 commented on PR #1973: URL: https://github.com/apache/stormcrawler/pull/1973#issuecomment-4883183116
Thanks @dpol1! As long as Fable 5 is still available, here is its review — I will look at it later myself in a human review too. So take the robot's word with the appropriate grain of salt, but it did read the whole thing so I don't have to pretend I did (yet). 🤖 Overall it likes the design: parser in core, enforcement with the queue owner, good unit coverage. It found one broken link in the end-to-end chain and a config gap, though: **1. Silent no-op with default config — the big one.** `AbstractStatusUpdaterBolt.execute()` runs `metadata = mdTransfer.filter(metadata)` *before* calling `store()`, and the default `MetadataTransfer` only keeps `url.path`, `depth`, `max.depth` and the fetch error count. So `fetch.statusCode` and `<prefix>retry-after` are stripped before the queue-stream emit unless both are added to `metadata.persist`. A user wiring `HostBlockBolt` exactly as the javadoc says gets `blockUntilFor() == -1` for every tuple, with no log hint. The tests don't catch it because they hand-craft the metadata instead of going through the status updater. Either document the two required `metadata.persist` entries prominently (like `AdaptiveScheduler` does), or emit the pre-filter metadata on the queue stream. **2. `HostBlockBolt` ignores `urlfrontier.address`.** It only reads `urlfrontier.host`/`port`, while Spout and StatusUpdaterBolt resolve `urlfrontier.address` first. In a topology configured via `urlfrontier.address` (even a single node — not just the multi-node case the javadoc disclaims), the bolt silently connects to `localhost:7071`, and with `withWaitForReady()` the RPCs never fail, so nothing is ever logged. Since that address-resolution block is already duplicated between Spout and StatusUpdaterBolt, extracting it (e.g. into `ManagedChannelUtil`) and reusing it here would fix the bug and delete two copies. **3. Overflow when the cap is disabled (fail-open).** With `urlfrontier.max.retry.after: -1`, a server sending `Retry-After: 9223372036854775` passes the parser's `multiplyExact` guard (~9.2e18 ms), then `nowMs + retryAfterMs` in `blockUntilFor` wraps negative and the block is skipped — exactly when the server asked loudest for a back-off. The parser was careful with `multiplyExact`; the addition wants `Math.addExact` + clamp too. Masked by the default 24h cap. **4. `_DEFAULT_` queue can be collateral damage.** A 429 on a URL whose partition key falls back to `"_DEFAULT_"` makes the bolt call `blockQueueUntil(key="_DEFAULT_")`, stalling every unrelated URL in the shared catch-all queue for up to 24h. Cheap guard: skip the block for the sentinel key. **5. gRPC robustness.** `withWaitForReady()` + no deadline on a fire-and-forget RPC means a frontier outage during a 429 storm accumulates pending RPCs unboundedly and flushes stale blocks on reconnect — for a best-effort call, drop `waitForReady` or add a short `withDeadlineAfter`. Also, a burst of 429s from one host issues one RPC per tuple for the same key; a tiny per-key `blockedUntil` cache would make it one per back-off window. **6. The "while blocked" assertion in `HostBlockBoltTest` is vacuous.** The first `getURLs` poll hands the seeded URL out and it goes in-flight, so `await(!keys.contains(HOST))` passes within a couple of polls even if the block silently failed. Asserting the URL is *never* handed out (first call empty) would test the block and remove the race the unblock phase has with the in-flight timeout. **7. Minor / cleanup:** - The queue-stream emit is now copy-pasted verbatim from the OpenSearch updater, while `AbstractStatusUpdaterBolt` declares the stream for all seven subclasses but only two emit on it. A protected helper in the abstract class next to the `declareStream` call would keep the contract uniform. - The bespoke HTTP-date formatter is declared twice (parser + test), so the round-trip test can't catch a wrong pattern. IMF-fixdate-only strictness is a fair call per the PR description, but formatting the test inputs with the JDK's `RFC_1123_DATE_TIME` would make the test an actual cross-check. Findings 1 and 2 are the ones the robot would block on; the rest are hardening. Human review to follow — if it disagrees with the machine, the human wins. 😄 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
