[Bug-wget] wget mirror site failing due to file / directory name clashes

Paul Beckett (ITCS) Fri, 12 Oct 2012 13:28:35 -0700

I am attempting to use wget to create a mirrored copy of a CMS (Liferay) 
website. I want to be able to failover to this static copy in case the 
application server goes offline. I therefore need the URLs to remain 
absolutely identical. The problem I have is that I cannot figure out how I can 
configure wget in a way that will cope with:
http://www.example.com/about
http://www.example.com/about/something


In this case either the file or the directory 'about' already exists and 
prevents the other from being created.
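
The clash can be reproduced without wget at all. The following is only a 
minimal sketch in a scratch directory; the variable names tmp and clash are 
made up for illustration, and the file contents are a placeholder:

```shell
# Reproduce the file/directory name clash in a throwaway directory.
tmp=$(mktemp -d)

# http://www.example.com/about is saved first, as a regular file.
printf 'page body\n' > "$tmp/about"

# http://www.example.com/about/something then needs 'about' to be a directory.
if mkdir "$tmp/about" 2>/dev/null; then
    clash=no
else
    clash=yes    # mkdir fails because a file named 'about' already exists
fi
echo "clash: $clash"    # prints "clash: yes"

rm -rf "$tmp"
```

The same failure occurs in the other order: once the directory 'about' 
exists, a file with that name cannot be written.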

Initially I thought the most obvious solution was to rely on Apache's 
DirectoryIndex and save the files as:
/about/index.html
/about/something/index.html
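
The layout above amounts to a simple mapping from URL path to file path. A 
rough sketch of that mapping follows, assuming every page URL is 
extensionless; url_to_index_path is a hypothetical helper, not a wget 
feature, and it does nothing about the links inside the saved pages:

```shell
# Hypothetical helper: map a URL path to the file a DirectoryIndex layout
# would serve it from. Assumes the path has no file extension.
url_to_index_path() {
    path=${1%/}      # drop one trailing slash, if present
    path=${path#/}   # drop the leading slash
    if [ -z "$path" ]; then
        printf 'index.html\n'              # the site root
    else
        printf '%s/index.html\n' "$path"
    fi
}

url_to_index_path /about              # prints "about/index.html"
url_to_index_path /about/something    # prints "about/something/index.html"
```

With this layout 'about' is always a directory, so the two URLs no longer 
compete for the same name on disk.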

However, I currently can't figure out how to do this without either breaking 
the relative paths to other pages or creating links that point to index.html 
rather than the original location. I need the links (a href etc.) to still go 
to /about and not explicitly reference /about/index.html; otherwise people may 
bookmark URLs that won't exist once the CMS comes back.

If anyone can offer me any advice on how I can achieve this, either with the 
correct options or by patching the source code, I would be extremely 
grateful.

Thanks,
Paul





/usr/local/bin/wget --background --append-output=/tmp/wget-log --no-verbose \
  --tries=20 --waitretry=10 --retry-connrefused --limit-rate=100m \
  --quota=10000m --timestamping \
  --directory-prefix=/usr/local/apache2/content/uk.ac.uea.www_flat2 \
  --protocol-directories --user-agent="UEA WebSite Flattener" \
  --backup-converted -e robots=off --page-requisites --convert-links \
  --recursive --level=inf --trust-server-names \
  --domains example.com www.example.com
