Henrik, yes, that's quite annoying - they respond with a 403 which *does* have HTML content which the browsers display, and that page contains JavaScript code which calls a CGI script on their server, which bounces to CF's server, which after 7(!) more requests finally sets the challenge cookie and re-directs back to winehq.org. What is truly annoying is that the response is the same whether the resource exists or not, so there is no way to verify the URL. I'm somewhat shocked that they rely on the browsers showing the error page and hijack it to quickly re-direct away from it, so the user isn't even aware that the server responded with an error.
More practically, I don't see that we can do anything about it. Those URLs truly are responding with an error, so short of emulating a full browser with JavaScript (they also do fingerprinting etc., so it's distinctly non-trivial - by design) there is no way to verify them. Given the amount of shenanigans that page does with the user's browser, I'd say your approach is probably good, since the user won't accidentally click on the link then :).

But more seriously, this is a problem, since the idea behind checking URLs is a good one - they do disappear or change quite often, so not checking them is not an answer, either. One special-case approach for cases like the one you mentioned (i.e. where you want to check a top-level domain as opposed to a specific resource) is to use a resource that is guaranteed (by design) to be accessible by direct requests, for example robots.txt. So for top-level URLs, we could fall back to checking https://winehq.org/robots.txt, which does work (since most sites do want those to be directly accessible). However, it doesn't help with URLs containing specific paths, as those will still be blocked.

Cheers,

Simon

> On 3/03/2026, at 18:08, Henrik Bengtsson <[email protected]> wrote:
>
> I've started to get:
>
> * checking CRAN incoming feasibility ... NOTE
> Found the following (possibly) invalid URLs:
>   URL: https://www.winehq.org/
>     From: inst/doc/parallelly-22-wine-workers.html
>     Status: 403
>     Message: Forbidden
>
> when R CMD check:ing 'parallelly'. The page <https://www.winehq.org/>
> works fine in the web browser, but it is blocked (by Cloudflare)
> elsewhere, e.g.
>
> $ curl --silent --head https://www.winehq.org/ | head -1
> HTTP/2 403
>
> and
>
> $ wget https://www.winehq.org/
> --2026-03-02 21:01:12--  https://www.winehq.org/
> Resolving www.winehq.org (www.winehq.org)... 104.26.8.100, 172.67.69.38, 104.26.9.100, ...
> Connecting to www.winehq.org (www.winehq.org)|104.26.8.100|:443... connected.
> HTTP request sent, awaiting response... 403 Forbidden
> 2026-03-02 21:01:12 ERROR 403: Forbidden.
>
> I can only guess, but I suspect that <https://www.winehq.org/> started
> to do this to protect against AI-scraping bots, or similar. I can
> imagine more websites doing the same.
>
> To avoid having to deal with this check NOTE everywhere (e.g. locally,
> CI, and on CRAN submission), my current strategy is to switch from
> \url{https://www.winehq.org/} to \code{https://www.winehq.org/} in the
> docs. Does anyone else have a better idea?
>
> /Henrik
>
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-package-devel
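P.S. A minimal sketch of what the robots.txt fallback could look like from the command line. This is purely hypothetical - the helper names are made up, it is not what R CMD check does, and it assumes curl is available; it probes the URL itself and, on failure, retries robots.txt at the site root:

```shell
# Extract the scheme://host part of a URL.
site_root() {
  printf '%s\n' "$1" | sed -E 's#^(https?://[^/]+).*#\1#'
}

# HTTP status code of a HEAD request ("000" if the connection fails).
head_status() {
  curl --silent --head --output /dev/null --write-out '%{http_code}' "$1"
}

# Probe the URL; if it fails, fall back to <root>/robots.txt. This only
# helps for top-level URLs - a blocked path-specific URL cannot be told
# apart from a dead link this way.
check_url() {
  url=$1
  code=$(head_status "$url")
  case $code in
    2??|3??) echo "OK $code $url"; return 0 ;;
  esac
  fallback="$(site_root "$url")/robots.txt"
  code2=$(head_status "$fallback")
  case $code2 in
    2??|3??) echo "BLOCKED $code $url (but $fallback answers $code2)"; return 0 ;;
  esac
  echo "FAILED $code $url"
  return 1
}

# Example usage: check_url "https://www.winehq.org/"
```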
