Hi Pawel,
On Tue, Jul 16, 2024, at 08:52, Pawel Wojciech Glod wrote:

> Hello GNU support,
>
> I am an employee of CERN and one of my tasks is web scraping the
> internal pages of our organisation. To do this, I use wget to download
> the entire directory structure of the website along with the HTML files.
>
> I have a problem with websites whose top-level domain (TLD) is ".cern".
> An example page is https://openlab.cern/
> According to our documentation, it does not require cookies or a
> session token. Unfortunately, a single HTML file is downloaded
> containing only the code of the home page. Are you able to diagnose why
> this is happening? Perhaps the website has additional security features
> or it requires a session token or cookies.

How are you invoking Wget? When I run `wget -r https://openlab.cern/`, I
see multiple pages and their associated files being downloaded. Could you
please give a more detailed description of the problem, along with the
exact command you used and its full output? (In case it helps, a fuller
recursive invocation is sketched at the end of this mail.)

> My second question concerns the issue of when we need to download
> cookies and the session token. We have our own tool for this, but how
> do we take into account redirecting to another authentication page
> using wget so that after authentication, the wget command works
> correctly? What url address should be included?

I'm not sure I understand your question. If your authentication generates
a session token that is stored as a cookie, you can save it to a text file
in the standard Netscape cookies.txt format (both Firefox and Chrome have
ways to let you do this) and then import that cookie in Wget using the
`--load-cookies` option. There is a short example of this at the end of
this mail as well.

Is there something else you need?

> I would appreciate a prompt reply.
>
> Best regards,
> Pawel Glod
> CERN, BE-CSS
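
For the recursive download, the exact options depend on what you want the
result to look like, but a common starting point for mirroring a site for
offline browsing is something like the sketch below. Treat the option set
and the depth limit as examples to adjust, not as a recommendation:

    # Recurse up to five levels deep, fetch the images/CSS/JS needed to
    # render each page, rewrite links for local viewing, and give HTML
    # files an .html extension.
    wget --recursive \
         --level=5 \
         --page-requisites \
         --convert-links \
         --adjust-extension \
         --no-parent \
         https://openlab.cern/

If pages are still missing, the full output of the run is the best
starting point for seeing why Wget stops after the first page.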
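
For the cookie question, assuming your own tool (or a browser extension)
writes the session cookie to a file in the Netscape cookies.txt format,
the invocation would look roughly like this; the file name cookies.txt
and the URL are only placeholders:

    # Load previously saved cookies, keep any session cookies set during
    # the crawl, and write them back out for the next run.
    # The URL below is a placeholder for the protected site.
    wget --load-cookies cookies.txt \
         --save-cookies cookies.txt \
         --keep-session-cookies \
         --recursive \
         https://example.cern/protected/

Wget then sends the matching cookies with each request it makes during
the recursive download, so as long as the session token in the file is
still valid, no extra handling of the authentication redirect should be
needed.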