[Bug-wget] Updating a mirror

Luc Van der Veken Tue, 30 Dec 2008 08:26:06 -0800

I'm trying to mirror a site (http://www.gamingcommission.fgov.be) and
update the mirror at regular intervals.


This is primarily meant to detect new and changed pages and files.  While
at it, I set it up as a local mirror so I get the added benefit of that
(the site has a tendency to go down a LOT).

I'm using wget 1.10.2 on Ubuntu Server 7.10.  Apparently 1.11.x isn't in
the repository yet (I scanned for and installed updates today to make
sure).
Should I have asked somewhere else in this case?



The initial local copy works perfectly: it is fully browsable, etc.

Then when I run wget again with the exact same command line arguments, this
is what happens:

1) It checks only the top page (index.html) and direct descendants.  If
there are no changes there, it just stops.  In other words, it only detects
a change in a page if the parent of that page (in the link tree) has also
changed.

2) It messes up links in the pages it re-checks.  Any link to a page that
isn't downloaded again (because it hasn't changed) is replaced by an
absolute link to the original server, but using the filename of the local
copy.

An example of a URL mutilated by (2):

Original:
   jsp/main.jsp?lang=NL
Mirror, initial copy:
   jsp/main.jsp%3Flang=NL.html
Mirror, after one update:
   http://www.gamingcommission.fgov.be/jsp/main.jsp%3Flang=NL.html
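To see how many pages are affected after an update run, something like the
following could list the files containing such links.  This is only a
rough sketch: the function name find_mangled_links is made up for
illustration, and the pattern assumes every mangled link follows the shape
of the example above (an absolute http:// link to the original host whose
path contains "%3F" and ends in ".html").

```shell
# find_mangled_links DIR HOST
# List files under DIR that contain an absolute link to HOST whose path
# looks like a local mirror filename (contains %3F and ends in .html).
# Hypothetical helper; the pattern is an assumption based on one example.
find_mangled_links() {
  dir=$1
  # Escape the dots in the host name so they match literally in the regex.
  host=$(printf '%s' "$2" | sed 's/\./\\./g')
  # -r: recurse, -l: print file names only, -E: extended regex.
  grep -rlE "http://${host}/[^\"' >]*%3F[^\"' >]*\.html" "$dir"
}
```

Usage would be along the lines of
find_mangled_links www.gamingcommission.fgov.be www.gamingcommission.fgov.be
where the first argument is the mirror directory and the second is the
host name.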


I am using these command line parameters:
   --mirror
   --html-extension
   --convert-links
   --backup-converted
   -w 2
   --random-wait
   -p
   --no-check-certificate
   --read-timeout=60
   -e robots=off
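
For anyone who wants to reproduce this, the options above put together as
one command would look roughly like this.  It is a sketch rather than my
exact script: the variable names are made up, and the wait option is
written in its short form, -w 2.

```shell
URL="http://www.gamingcommission.fgov.be"
WGET_OPTS="--mirror --html-extension --convert-links --backup-converted \
-w 2 --random-wait -p --no-check-certificate --read-timeout=60 -e robots=off"

# Dry run: print the command.  Remove "echo" to actually start the mirror.
# $WGET_OPTS is deliberately unquoted so it splits into separate arguments.
echo wget $WGET_OPTS "$URL"
```

Running the same command a second time in the same directory is what
triggers the behaviour described above.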

Is there anything wrong with my parameters?
Is it something in the site?
Is it a bug?  Fixed in 1.11.4?


BTW, in case someone wants to test it out locally: total size of the mirror
is 116MB.  You can significantly reduce that by skipping the .pdf and .doc
files.
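
A possible way to skip those files is wget's reject list (-R / --reject),
which takes a comma-separated list of suffixes.  I have not checked that
this catches every document on the site, so treat it as a sketch; the
variable names are only for illustration.

```shell
URL="http://www.gamingcommission.fgov.be"
# Skip files whose names end in .pdf or .doc.
REJECT_OPTS="--reject=pdf,doc"

# Dry run: print the command.  Remove "echo" to run it, and add the other
# options from the list above as needed.
echo wget --mirror "$REJECT_OPTS" "$URL"
```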
