Hello GNU support,

I am an employee of CERN, and one of my tasks is web scraping the internal pages of our organisation. To do this, I use wget to download a website's entire directory structure along with its HTML files.
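For reference, the kind of invocation I run looks roughly like this (the exact flags on my side may differ; this is a sketch):

```shell
# Sketch of a recursive mirror of an internal site:
#   --recursive --level=inf  follow links to any depth
#   --no-parent              never ascend above the start directory
#   --page-requisites        also fetch CSS, JS, and images
#   --convert-links          rewrite links for local browsing
#   --adjust-extension       save pages with an .html extension
wget --recursive --level=inf --no-parent --page-requisites \
     --convert-links --adjust-extension \
     https://openlab.cern/
```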
My first question concerns websites under the ".cern" top-level domain (TLD). An example page is https://openlab.cern/. According to our documentation, it requires neither cookies nor a session token. Unfortunately, wget downloads only a single HTML file containing the code of the home page. Are you able to diagnose why this is happening? Perhaps the website has additional security features, or it does require a session token or cookies after all.

My second question concerns the case where we do need to supply cookies and a session token. We have our own tool for obtaining these, but how do we account for a redirect to a separate authentication page when using wget, so that the wget command works correctly after authentication? What URL should the command target?

I would appreciate a prompt reply.

Best regards,
Pawel Glod
CERN, BE-CSS
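P.S. For context on the second question, this is the sort of cookie-loading invocation I have been attempting. The cookie-file path is a placeholder for the Netscape-format file our internal tool would export, and the flags are my assumption about the intended usage:

```shell
# Sketch: feeding externally obtained cookies to wget.
# /tmp/cern-cookies.txt is a placeholder path for the cookie file
# exported by our internal authentication tool.
wget --load-cookies /tmp/cern-cookies.txt \
     --keep-session-cookies \
     --recursive --no-parent \
     https://openlab.cern/
```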