Hello GNU support,

I am an employee of CERN, and one of my tasks is web scraping the internal pages of our organisation. To do this, I use wget to download a website's entire directory structure along with its HTML files.
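For reference, the kind of invocation I run looks roughly like this (the exact flags on my side may differ; this is a sketch):

```shell
# Sketch of a recursive mirror of an internal site:
#   --recursive --level=inf  follow links to any depth
#   --no-parent              never ascend above the start directory
#   --page-requisites        also fetch CSS, JS, and images
#   --convert-links          rewrite links for local browsing
#   --adjust-extension       save pages with an .html extension
wget --recursive --level=inf --no-parent --page-requisites \
     --convert-links --adjust-extension \
     https://openlab.cern/
```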
My first question concerns websites under the ".cern" top-level domain (TLD). An example page is https://openlab.cern/. According to our documentation, it requires neither cookies nor a session token. Unfortunately, wget downloads only a single HTML file containing the code of the home page. Are you able to diagnose why this is happening? Perhaps the website has additional security features, or it does require a session token or cookies after all.

My second question concerns the case where we do need to supply cookies and a session token. We have our own tool for obtaining these, but how do we account for a redirect to a separate authentication page when using wget, so that the wget command works correctly after authentication? What URL should the command target?

I would appreciate a prompt reply.

Best regards,
Pawel Glod
CERN, BE-CSS
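P.S. For context on the second question, this is the sort of cookie-loading invocation I have been attempting. The cookie-file path is a placeholder for the Netscape-format file our internal tool would export, and the flags are my assumption about the intended usage:

```shell
# Sketch: feeding externally obtained cookies to wget.
# /tmp/cern-cookies.txt is a placeholder path for the cookie file
# exported by our internal authentication tool.
wget --load-cookies /tmp/cern-cookies.txt \
     --keep-session-cookies \
     --recursive --no-parent \
     https://openlab.cern/
```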