Henrik,

yes, that's quite annoying - they respond with a 403 which *does* have HTML 
content. Browsers display that page, and it contains JavaScript which calls a 
CGI script on their server, which bounces to Cloudflare's server, which after 
7(!) more requests finally sets the challenge cookie and redirects back to 
winehq.org. What is truly annoying is that the response is the same whether 
the resource exists or not, so there is no way to verify the URL. I'm somewhat 
shocked that they rely on browsers showing the error page and hijack it to 
quickly redirect away from it, so the user isn't even aware that the server 
responded with an error.

More practically, I don't see that we can do anything about it. Those URLs 
really are responding with an error, so short of emulating a full browser with 
JavaScript (they also do fingerprinting etc., so it's distinctly non-trivial - 
by design) there is no way to verify them. Given the amount of shenanigans that 
page does with the user's browser, I'd say your approach is probably good, 
since the user won't accidentally click on the link then :). But more 
seriously, this is a problem, since the idea behind checking URLs is a good 
one - they do disappear or change quite often, so not checking them is not an 
answer, either.

One special-case approach for cases like the one you mentioned (i.e. where you 
want to check a top-level domain as opposed to a specific resource) is to use 
a resource that is guaranteed (by design) to be accessible via direct 
requests, for example robots.txt. So for top-level URLs, we could fall back to 
checking https://winehq.org/robots.txt, which does work (since most sites do 
want those to be directly accessible). However, it doesn't help with URLs 
containing specific paths, as those will still be blocked.
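
For what it's worth, that fallback is easy to script. A minimal sketch of the 
idea, assuming a POSIX shell with curl and sed available (the helper name 
strip_to_robots is made up for this example):

```shell
# Reduce any URL to scheme://host/robots.txt, dropping the path.
strip_to_robots() {
  printf '%s\n' "$1" | sed -E 's#^(https?://[^/]+).*#\1/robots.txt#'
}

# Usage sketch: if the original URL returns 403, re-check its robots.txt:
#   url=https://www.winehq.org/
#   code=$(curl --silent --output /dev/null --write-out '%{http_code}' "$url")
#   [ "$code" = 403 ] && curl --silent --head "$(strip_to_robots "$url")" | head -1
```

Of course, a 200 on robots.txt only tells you the host is alive, not that the 
original page exists - which is exactly why this only helps for top-level URLs.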

Cheers,
Simon



> On 3/03/2026, at 18:08, Henrik Bengtsson <[email protected]> wrote:
> 
> I've started to get:
> 
> * checking CRAN incoming feasibility ... NOTE
>  Found the following (possibly) invalid URLs:
>    URL: https://www.winehq.org/
>      From: inst/doc/parallelly-22-wine-workers.html
>      Status: 403
>      Message: Forbidden
> 
> when R CMD check:ing 'parallelly'. The page <https://www.winehq.org/>
> works fine in the web browser, but it's blocked (by Cloudflare)
> elsewhere, e.g.
> 
> $ curl --silent --head https://www.winehq.org/ | head -1
> HTTP/2 403
> 
> and
> 
> $ wget https://www.winehq.org/
> --2026-03-02 21:01:12--  https://www.winehq.org/
> Resolving www.winehq.org (www.winehq.org)... 104.26.8.100,
> 172.67.69.38, 104.26.9.100, ...
> Connecting to www.winehq.org (www.winehq.org)|104.26.8.100|:443... connected.
> HTTP request sent, awaiting response... 403 Forbidden
> 2026-03-02 21:01:12 ERROR 403: Forbidden.
> 
> I can only guess, but I suspect that <https://www.winehq.org/> started
> to do this to protect against AI-scraping bots, or similar. I can
> imagine more websites to do the same.
> 
> To avoid having to deal with this check NOTE everywhere (e.g. locally,
> CI, and on CRAN submission), my current strategy is to switch from
> \url{https://www.winehq.org/} to \code{https://www.winehq.org/} in the
> docs. Does anyone else have a better idea?
> 
> /Henrik
> 
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-package-devel
> 
