rzo1 commented on PR #1973:
URL: https://github.com/apache/stormcrawler/pull/1973#issuecomment-4883183116

   Thanks @dpol1! As long as Fable 5 is still available, here is its review — I 
will look at it later myself in a human review too. So take the robot's word 
with the appropriate grain of salt, but it did read the whole thing so I don't 
have to pretend I did (yet). 🤖
   
   Overall it likes the design: parser in core, enforcement with the queue 
owner, good unit coverage. It found one broken link in the end-to-end chain and 
a config gap, though:
   
   **1. Silent no-op with default config — the big one.** 
`AbstractStatusUpdaterBolt.execute()` runs `metadata = 
mdTransfer.filter(metadata)` *before* calling `store()`, and the default 
`MetadataTransfer` only keeps `url.path`, `depth`, `max.depth` and the fetch 
error count. So `fetch.statusCode` and `<prefix>retry-after` are stripped 
before the queue-stream emit unless both are added to `metadata.persist`. A 
user wiring `HostBlockBolt` exactly as the javadoc says gets `blockUntilFor() 
== -1` for every tuple, with no log hint. The tests don't catch it because they 
hand-craft the metadata instead of going through the status updater. Either 
document the two required `metadata.persist` entries prominently (like 
`AdaptiveScheduler` does), or emit the pre-filter metadata on the queue stream.
   
   **2. `HostBlockBolt` ignores `urlfrontier.address`.** It only reads 
`urlfrontier.host`/`port`, while Spout and StatusUpdaterBolt resolve 
`urlfrontier.address` first. In a topology configured via `urlfrontier.address` 
(even a single node — not just the multi-node case the javadoc disclaims), the 
bolt silently connects to `localhost:7071`, and with `withWaitForReady()` the 
RPCs never fail, so nothing is ever logged. Since that address-resolution block 
is already duplicated between Spout and StatusUpdaterBolt, extracting it (e.g. 
into `ManagedChannelUtil`) and reusing it here would fix the bug and delete two 
copies.
   
   **3. Overflow when the cap is disabled (fail-open).** With 
`urlfrontier.max.retry.after: -1`, a server sending `Retry-After: 
9223372036854775` passes the parser's `multiplyExact` guard (~9.2e18 ms), then 
`nowMs + retryAfterMs` in `blockUntilFor` wraps negative and the block is 
skipped — exactly when the server asked loudest for a back-off. The parser was 
careful with `multiplyExact`; the addition wants `Math.addExact` + clamp too. 
Masked by the default 24h cap.
   
   **4. `_DEFAULT_` queue can be collateral damage.** A 429 on a URL whose 
partition key falls back to `"_DEFAULT_"` makes the bolt call 
`blockQueueUntil(key="_DEFAULT_")`, stalling every unrelated URL in the shared 
catch-all queue for up to 24h. Cheap guard: skip the block for the sentinel key.
   
   **5. gRPC robustness.** `withWaitForReady()` + no deadline on a 
fire-and-forget RPC means a frontier outage during a 429 storm accumulates 
pending RPCs unboundedly and flushes stale blocks on reconnect — for a 
best-effort call, drop `waitForReady` or add a short `withDeadlineAfter`. Also, 
a burst of 429s from one host issues one RPC per tuple for the same key; a tiny 
per-key `blockedUntil` cache would make it one per back-off window.
   
   **6. The "while blocked" assertion in `HostBlockBoltTest` is vacuous.** The 
first `getURLs` poll hands the seeded URL out and it goes in-flight, so 
`await(!keys.contains(HOST))` passes within a couple of polls even if the block 
silently failed. Asserting the URL is *never* handed out (first call empty) 
would test the block and remove the race the unblock phase has with the 
in-flight timeout.
   
   **7. Minor / cleanup:**
   - The queue-stream emit is now copy-pasted verbatim from the OpenSearch 
updater, while `AbstractStatusUpdaterBolt` declares the stream for all seven 
subclasses but only two emit on it. A protected helper in the abstract class 
next to the `declareStream` call would keep the contract uniform.
   - The bespoke HTTP-date formatter is declared twice (parser + test), so the 
round-trip test can't catch a wrong pattern. IMF-fixdate-only strictness is a 
fair call per the PR description, but formatting the test inputs with the JDK's 
`RFC_1123_DATE_TIME` would make the test an actual cross-check.
   
   Findings 1 and 2 are the ones the robot would block on; the rest are 
hardening. Human review to follow — if it disagrees with the machine, the human 
wins. 😄
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to