rzo1 commented on PR #1898:
URL: https://github.com/apache/stormcrawler/pull/1898#issuecomment-4377749592

   > Quick question: why not have the redirection bolt point back to the 
FetcherBolt instead so that the URL gets refetched straight away? Obviously 
would have to make sure it does not get into an endless loop by checking if it 
has been fetched by Playwright.
   
   A direct edge back to the fetcher would bypass the scheduler, which is what 
enforces fetch intervals, per-host delays, and robots-aware re-queueing. With a 
batch of URLs from the same domain flagged for JS rendering, that path would 
re-fetch immediately and risk hammering the host. It also breaks Storm's acking 
model: the original tuple from the spout never anchors cleanly, every re-fetch 
extends the tuple tree, and any failure deep in the cycle replays from the 
spout. Routing through the status index breaks the tree at a natural boundary: 
the status update is acked, and the spout re-emits as a fresh tuple.
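   
   To make the politeness argument concrete, here is a toy model of why the 
detour through the status index matters. It is only a sketch under stated 
assumptions: `PoliteRequeue`, `schedule` and the fixed per-host delay are 
hypothetical names for illustration, not StormCrawler's scheduler API.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (hypothetical, not StormCrawler code): re-queued URLs get a
// "next fetch" slot per host instead of being re-fetched immediately.
class PoliteRequeue {
    private final long delayMs;                          // assumed fixed per-host delay
    private final Map<String, Long> nextSlotByHost = new HashMap<>();

    PoliteRequeue(long delayMs) {
        this.delayMs = delayMs;
    }

    // Returns the earliest time (ms) at which a URL on this host may be fetched.
    long schedule(String host, long nowMs) {
        long slot = Math.max(nowMs, nextSlotByHost.getOrDefault(host, 0L));
        nextSlotByHost.put(host, slot + delayMs);        // reserve the following slot
        return slot;
    }

    public static void main(String[] args) {
        PoliteRequeue q = new PoliteRequeue(1000L);
        // Three URLs from one host flagged in the same batch are spread out,
        // whereas a direct edge to the fetcher would fetch all of them "now".
        for (int i = 0; i < 3; i++) {
            System.out.println("example.com slot: " + q.schedule("example.com", 0L));
        }
    }
}
```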
   
   There is a persistence angle too: once the "needs JS rendering" flag is 
written into the status index metadata, it survives a topology restart. An 
in-flight cyclic tuple would not. Routing through the status index also matches 
the established pattern in StormCrawler: redirects, retries, and re-fetches 
always loop via status → scheduler → spout. A sideband direct re-fetch path 
would break with that pattern.
   
   Finally, the approach removes the need for explicit loop detection. The 
metadata flag in the status index is itself the guard against re-rendering, so 
no extra "already rendered by Playwright" check is required.
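   
   A minimal sketch of that guard, assuming a metadata key and class names that 
are purely illustrative (`needs.js.rendering`, `RenderFlagGuard` and the 
in-memory map standing in for the status index are not taken from the PR):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the persisted flag both selects the protocol on the
// next fetch and prevents the URL from being flagged a second time.
class RenderFlagGuard {
    static final String NEEDS_JS = "needs.js.rendering"; // assumed key name

    enum Protocol { PLAIN_HTTP, PLAYWRIGHT }

    // Stand-in for the status index: URL -> metadata.
    private final Map<String, Map<String, String>> statusIndex = new HashMap<>();

    // Status update written when a page turns out to need JS rendering.
    void flagForRendering(String url) {
        statusIndex.computeIfAbsent(url, k -> new HashMap<>()).put(NEEDS_JS, "true");
    }

    // When the spout re-emits the URL, the flag alone picks the protocol.
    Protocol chooseProtocol(String url) {
        Map<String, String> md = statusIndex.getOrDefault(url, Map.of());
        return "true".equals(md.get(NEEDS_JS)) ? Protocol.PLAYWRIGHT : Protocol.PLAIN_HTTP;
    }

    // A page already routed to Playwright is never flagged again, so no
    // separate "already rendered" check is needed.
    boolean shouldFlag(String url, boolean looksLikeJsShell) {
        return looksLikeJsShell && chooseProtocol(url) == Protocol.PLAIN_HTTP;
    }

    public static void main(String[] args) {
        RenderFlagGuard guard = new RenderFlagGuard();
        String url = "https://example.com/app";
        System.out.println(guard.shouldFlag(url, true));   // first pass, plain HTTP
        guard.flagForRendering(url);
        System.out.println(guard.chooseProtocol(url));     // second pass
        System.out.println(guard.shouldFlag(url, true));   // no second flagging
    }
}
```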
   
   > Should any outlinks with the same hostname inherit the flag? Should we 
have a URL filter to that effect?
   
   On the second point: agreed, propagating the flag to outlinks of the same 
host (or expressing it as a URL filter) is worth doing, but I'd rather keep it 
out of this PR to keep the scope tight. I'll open a follow-up issue or PR for 
it so we can discuss the inheritance rules and the filter shape on their own. 
wdyt?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
