jnioche commented on PR #1944:
URL: https://github.com/apache/stormcrawler/pull/1944#issuecomment-4717937566
Yes, the split makes sense.

> One thing I'd like your take on: should the re-emitted URLs go out as `Status.ERROR` (reuses the existing path, but carries error/retry semantics), or should we set an explicit future `nextFetchDate` so the scheduler honors the exact back-off?

`Status.ERROR` is not the right status: it indicates an irremediable problem with the content of the document, such as a PDF that cannot be parsed, or a URL blocked by robots.txt.

We could set an explicit `nextFetchDate`, but I think just mimicking what is done via `crawl-delay-too-long` would be good enough.
