[Rd] URL checks

Kirill Müller via R-devel Thu, 07 Jan 2021 03:26:11 -0800

Hi

The URL checks in R CMD check test all links in the README and vignettes for broken or redirected links. In many cases this improves documentation, but I see problems with this approach, which I have detailed below.

I'm writing to this mailing list because I think the change needs to happen in R's check routines. I propose to introduce an "allow-list" for URLs, to reduce the burden on both CRAN and package maintainers.


Comments are greatly appreciated.


Best regards

Kirill


# Problems with the detection of broken/redirected URLs

## 301 should often be 307, how to change?

Many web sites use a 301 redirection code that probably should be a 307. For example, https://www.oracle.com and https://www.oracle.com/ both redirect to https://www.oracle.com/index.html with a 301. I suspect the company still wants oracle.com to be recognized as the primary entry point of their web presence (to reserve the right to move the redirection to a different location later), though I haven't checked with their PR department. If that's true, the redirect probably should be a 307, which their IT department would have to fix; I haven't contacted them yet either.


$ curl -i https://www.oracle.com
HTTP/2 301
server: AkamaiGHost
content-length: 0
location: https://www.oracle.com/index.html
...

## User agent detection

twitter.com responds with a 400 error for requests without a user agent string hinting at an accepted browser.


$ curl -i https://twitter.com/
HTTP/2 400
...
<body>...<p>Please switch to a supported browser...</p>...</body>

$ curl -s -i https://twitter.com/ -A "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0" | head -n 1
HTTP/2 200

# Impact

While the latter problem *could* be fixed by supplying a browser-like user agent string, the former problem is virtually unfixable -- so many web sites should use 307 instead of 301 but don't. The above list is also incomplete -- think of unreliable links, HTTP links, other failure modes...

This affects me as a package maintainer: I have the choice to either change the links to incorrect versions, or remove them altogether.

I can also choose to explain each broken link to CRAN, but I think this subjects the team to undue burden. Submitting a package with NOTEs delays the release of a package which I must release very soon to avoid having it pulled from CRAN. I'd rather not risk that -- hence I need to remove the link and put it back later.

I'm aware of https://github.com/r-lib/urlchecker; this alleviates the problem but ultimately doesn't solve it.


# Proposed solution

## Allow-list

A file inst/URL that lists all URLs where failures are allowed -- possibly with a list of the HTTP codes accepted for that link.


Example:

https://oracle.com/ 301
https://twitter.com/drob/status/1224851726068527106 400
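
To make the idea concrete, here is a minimal sketch of how a checker might consult such a file. It is only an illustration under assumptions: the function name `url_failure_allowed` is hypothetical, nothing like it exists in R CMD check today, and I am assuming one entry per line in the form "URL [CODE ...]", where an entry without codes accepts any failure for that URL.

```shell
# Hypothetical helper, not part of R CMD check.
# Usage: url_failure_allowed ALLOW_FILE URL STATUS
# Returns 0 if STATUS is an accepted failure for URL, 1 otherwise.
# Assumed file format: one entry per line, "URL [CODE ...]".
url_failure_allowed() {
  allow_file=$1
  url=$2
  status=$3

  # No readable allow-list means nothing is allowed.
  if [ ! -r "$allow_file" ]; then
    return 1
  fi

  # The "|| [ -n ... ]" part handles a final line without a trailing newline.
  while read -r entry_url codes || [ -n "$entry_url" ]; do
    # Skip blank lines and entries for other URLs.
    if [ "$entry_url" != "$url" ]; then
      continue
    fi
    # Assumption: an entry without codes accepts any failure for that URL.
    if [ -z "$codes" ]; then
      return 0
    fi
    # Otherwise the observed status must be one of the listed codes.
    for code in $codes; do
      if [ "$code" = "$status" ]; then
        return 0
      fi
    done
  done < "$allow_file"

  return 1
}
```

With the example file above, `url_failure_allowed inst/URL https://oracle.com/ 301` would succeed, while the same URL with a 404 would still be reported. Matching is on the exact URL string, so https://oracle.com/ and https://www.oracle.com/ would need separate entries under this sketch.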

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
