On 12/27/22 16:18, American Citizen wrote:
> Hi
>
> I used wget recently to try to download all 26 or 27 pages of my
> website, but it seems to miss about 40% of the pages.
>
> Does anyone have the CLI command which captures 100% of a website's
> URLs?
>
> I tried the typical
>
> %wget -r --tries=10 https://my.website.com/ -o logfile
>
> as suggested in the "man wget" page, but it did NOT capture all the
> webpages. I even tried a wait parameter, but that only slowed things down
> and did not remedy the missing subpages issue.
>
> I appreciate any tips so that ALL of the website data can be captured by
> wget. Yes, I am aware of the robots.txt restricting downloadable information.
>
> - Randall
>
>

wget can be a bit tricky - it has a lot of options for downloading
websites.  For your case, how many directories deep is your website? By
default, the recursion depth (-l) is 5. Try

wget -r -l 10 --tries=10 https://my.website.com/ -o logfile

for 10 levels deep, or adjust as needed.  To make an offline copy of the
website, you can use '--mirror' instead (it implies infinite recursion
depth, among other things):

wget --mirror --tries=10 https://my.website.com/ -o logfile

or

wget --mirror          \
     --convert-links   \
     --html-extension  \
     --wait=2          \
     -o logfile        \
     https://my.website.com/

'--html-extension' is handy if some of your pages do not end in .html; wget
appends the extension so the saved files open correctly (newer versions of
wget call this '--adjust-extension'). Use '--convert-links' so the links
point at the local copies for offline viewing in a browser.

Some other options that may be handy (a combined example follows the list):

-p (--page-requisites) : download all files that are necessary to properly
display a given HTML page, such as inlined images, sounds, and referenced
stylesheets.
-H (--span-hosts) : enable spanning across hosts when doing recursive
retrieval.
--no-parent : when recursing, do not ascend to the parent directory. Useful
for restricting the download to only a portion of the site.
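
Putting a few of these together, something like the following (just a sketch;
adjust the depth, wait, and flags to suit your site):

wget -r -l 10 --page-requisites  \
     --convert-links             \
     --html-extension            \
     --no-parent                 \
     --wait=2 --tries=10         \
     -o logfile                  \
     https://my.website.com/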


Also, be aware that some Linux distros symlink wget to wget2, which behaves a
bit differently.
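
You can check which one you actually have with:

wget --version | head -n 1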

-Ed


