Hi Pawel,
On Tue, Jul 16, 2024, at 08:52, Pawel Wojciech Glod wrote:

> Hello GNU support,
>
> I am an employee of CERN and one of my tasks is web scraping the
> internal pages of our organisation. To do this, I use wget to download
> the entire directory structure of the website along with the HTML files.
>
> I have a problem with websites whose top-level domain (TLD) is ".cern".
> An example page is https://openlab.cern/
> According to our documentation, it does not require cookies or a
> session token. Unfortunately, a single HTML file is downloaded
> containing only the code of the home page. Are you able to diagnose why
> this is happening? Perhaps the website has additional security features
> or it requires a session token or cookies.

How are you invoking Wget? When I run `wget -r https://openlab.cern/`, I
see multiple pages and their associated files being downloaded. Could you
please give a more detailed description of the problem, along with the
exact command you used and its full output? (In case it helps, a fuller
recursive invocation is sketched at the end of this mail.)

> My second question concerns the issue of when we need to download
> cookies and the session token. We have our own tool for this, but how
> do we take into account redirecting to another authentication page
> using wget so that after authentication, the wget command works
> correctly? What url address should be included?

I'm not sure I understand your question. If your authentication generates
a session token that is stored as a cookie, you can save it to a text file
in the standard Netscape cookies.txt format (both Firefox and Chrome have
ways to let you do this) and then import that cookie in Wget using the
`--load-cookies` option. There is a short example of this at the end of
this mail as well.

Is there something else you need?

> I would appreciate a prompt reply.
>
> Best regards,
> Pawel Glod
> CERN, BE-CSS
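
For the recursive download, the exact options depend on what you want the
result to look like, but a common starting point for mirroring a site for
offline browsing is something like the sketch below. Treat the option set
and the depth limit as examples to adjust, not as a recommendation:

    # Recurse up to five levels deep, fetch the images/CSS/JS needed to
    # render each page, rewrite links for local viewing, and give HTML
    # files an .html extension.
    wget --recursive \
         --level=5 \
         --page-requisites \
         --convert-links \
         --adjust-extension \
         --no-parent \
         https://openlab.cern/

If pages are still missing, the full output of the run is the best
starting point for seeing why Wget stops after the first page.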
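
For the cookie question, assuming your own tool (or a browser extension)
writes the session cookie to a file in the Netscape cookies.txt format,
the invocation would look roughly like this; the file name cookies.txt
and the URL are only placeholders:

    # Load previously saved cookies, keep any session cookies set during
    # the crawl, and write them back out for the next run.
    # The URL below is a placeholder for the protected site.
    wget --load-cookies cookies.txt \
         --save-cookies cookies.txt \
         --keep-session-cookies \
         --recursive \
         https://example.cern/protected/

Wget then sends the matching cookies with each request it makes during
the recursive download, so as long as the session token in the file is
still valid, no extra handling of the authentication redirect should be
needed.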