rzo1 commented on PR #1898:
URL: https://github.com/apache/stormcrawler/pull/1898#issuecomment-4377749592
> Quick question: why not have the redirection bolt point back to the
> FetcherBolt instead so that the URL gets refetched straight away? Obviously
> would have to make sure it does not get into an endless loop by checking if it
> has been fetched by Playwright.

A direct edge back to the fetcher would bypass the scheduler, which is what
enforces fetch intervals, per-host delays, and robots-aware re-queueing. With a
batch of URLs from the same domain flagged for JS rendering, that path would
re-fetch immediately and risk hammering the host. It also breaks Storm's acking
model: the original tuple from the spout never anchors cleanly, every re-fetch
extends the tuple tree, and any failure deep in the cycle replays from the
spout. Routing through the status index breaks the tree at a natural boundary:
the status update is acked, and the spout re-emits as a fresh tuple.
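
To make the pacing argument concrete, here is a toy model of the status-index path. It is a sketch under stated assumptions, not StormCrawler's scheduler or spout: the class `StatusRequeueSketch`, its methods, and the fixed per-host delay are all hypothetical, and plain maps stand in for the status index. The point it illustrates is that a batch of same-host URLs written back to the index becomes due one at a time instead of being re-fetched all at once.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: not StormCrawler's scheduler or spout. All names are hypothetical. */
final class StatusRequeueSketch {
    // Stand-in for the status index: URL -> earliest time it may be fetched again.
    private final Map<String, Instant> nextFetchByUrl = new LinkedHashMap<>();
    // Next free fetch slot per host, used to space out same-host URLs.
    private final Map<String, Instant> nextSlotByHost = new HashMap<>();
    private final Duration perHostDelay;

    StatusRequeueSketch(Duration perHostDelay) {
        this.perHostDelay = perHostDelay;
    }

    /** Models the status update: persist the URL with a next fetch date. */
    void requeue(String host, String url, Instant now) {
        Instant slot = nextSlotByHost.getOrDefault(host, now);
        if (slot.isBefore(now)) {
            slot = now; // the host has been idle, so the URL is due right away
        }
        nextFetchByUrl.put(url, slot);
        nextSlotByHost.put(host, slot.plus(perHostDelay));
    }

    /** Models the spout query: only URLs whose next fetch date has passed. */
    List<String> dueAt(Instant now) {
        List<String> due = new ArrayList<>();
        for (Map.Entry<String, Instant> e : nextFetchByUrl.entrySet()) {
            if (!e.getValue().isAfter(now)) {
                due.add(e.getKey());
            }
        }
        return due;
    }
}
```

With a 5 second delay and three URLs from one host re-queued at the same instant, only the first is due immediately; the other two become due 5 and 10 seconds later. A direct edge back to the fetcher has no equivalent of `dueAt`, which is the concern raised above.
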
There is a persistence angle too: once the "needs JS rendering" flag is
written into the status index metadata, it survives a topology restart. An
in-flight cyclic tuple would not. Routing through the status index also
matches the established pattern in StormCrawler: redirects, retries, and
re-fetches always loop via status → scheduler → spout. A sideband direct
re-fetch path would break that pattern.
Finally, the approach removes the need for explicit loop detection. The
metadata flag in the status index is itself the guard against re-rendering, so
no extra "already rendered by Playwright" check is required.
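
As a minimal sketch of how the flag doubles as the loop guard: the metadata key `needs.js.rendering`, the class `RenderFlagGuardSketch`, and both method names are hypothetical, and a plain `Map` stands in for the URL's metadata. This is not the code in this PR, only an illustration of the idea.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: key and method names are hypothetical, not taken from the PR. */
final class RenderFlagGuardSketch {
    // Hypothetical metadata key persisted in the status index.
    static final String FLAG = "needs.js.rendering";

    enum Protocol { PLAIN_HTTP, PLAYWRIGHT }

    /** Fetch side: the persisted flag alone decides which protocol handles the URL. */
    static Protocol chooseProtocol(Map<String, String> metadata) {
        return "true".equals(metadata.get(FLAG)) ? Protocol.PLAYWRIGHT : Protocol.PLAIN_HTTP;
    }

    /**
     * Detection side: returns true only when a status update requesting a
     * rendered re-fetch should be emitted. A URL that already carries the
     * flag is never flagged again, so it cannot cycle.
     */
    static boolean flagForRendering(Map<String, String> metadata, boolean looksLikeJsShell) {
        if (!looksLikeJsShell || "true".equals(metadata.get(FLAG))) {
            return false;
        }
        metadata.put(FLAG, "true");
        return true;
    }
}
```

On the first pass a URL without the flag is fetched over plain HTTP and, if it looks like a JS shell, is flagged once. On the second pass the same flag selects Playwright and also makes `flagForRendering` return false, so no separate "already rendered" marker is needed.
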
> Should any outlinks with the same hostname inherit the flag? Should we
> have a URL filter to that effect?

On the second point: agreed, propagating the flag to outlinks of the same
host (or expressing it as a URL filter) is worth doing, but I'd rather keep it
out of this PR to keep the scope tight. I'll open a follow-up issue or PR
for it so we can discuss the inheritance rules and filter shape on
their own. wdyt?