Thanks for the suggestions. Micah, unfortunately the CMS we're using doesn't seem to allow people to create links with a trailing slash (although it still serves the correct page if the slash is added).
Ángel, I agree this would work, but our management do not want .html extensions in the URLs. I previously experimented with the adjust-extension option to add '/index.html'. From my recollection I was able to do this as a command line option, but it meant all the links were adjusted to include /index.html, which I didn't want. I then attempted to hack the C code a little to add it without adjusting the links, but that broke all the links to CSS / JS and other HTML pages: I was moving the HTML file into a sub-directory, so its relative location changed, and the CSS / JS and other HTML links weren't being adjusted to match.

Thanks,
Paul

>-----Original Message-----
>From: Ángel González [mailto:[email protected]]
>Sent: Saturday, October 13, 2012 2:45 PM
>To: Paul Beckett (ITCS)
>Cc: [email protected]
>Subject: Re: [Bug-wget] wget mirror site failing due to file / directory name clashes
>
>On 12/10/12 15:38, Paul Beckett (ITCS) wrote:
>> I am attempting to use wget to create a mirrored copy of a CMS (Liferay) website. I want to be able to fail over to this static copy in case the application server goes offline. I therefore need the URLs to remain absolutely identical. The problem I have is that I cannot figure out how to configure wget in a way that will cope with:
>> http://www.example.com/about
>> http://www.example.com/about/something
>>
>> In this case either the file or the directory 'about' already exists and prevents the second being created.
>>
>> Initially I thought the most obvious solution was to rely on Apache's DirectoryIndex, and save the files as:
>> /about/index.html
>> /about/something/index.html
>>
>> But currently I can't figure out how to do this in a way that doesn't either break the relative paths to other pages or create links to the index.html rather than the original location. I need the links (a href etc.) to still go to /about and not explicitly call /index.html, as otherwise people may bookmark things that won't exist when the CMS comes back.
>>
>> If anyone can offer me any advice on how I can achieve this (either the correct options, or how I could patch the source code), I would be extremely grateful.
>>
>> Thanks,
>> Paul
>>
>> /usr/local/bin/wget --background --append-output=/tmp/wget-log
>> --no-verbose --tries=20 --waitretry=10 --retry-connrefused
>> --limit-rate=100m --quota=10000m --timestamping
>> --directory-prefix=/usr/local/apache2/content/uk.ac.uea.www_flat2
>> --protocol-directories --user-agent="UEA WebSite Flattener"
>> --backup-converted -e robots=off --page-requisites --convert-links
>> --recursive --level=inf --trust-server-names --domains example.com
>> www.example.com
>
>Download with --adjust-extension
>This way, you will get:
>
>/about.html
>/about/something.html
>
>Then configure the root of the static copy:
>RewriteEngine On
>RewriteCond %{SCRIPT_FILENAME} !\.html$
>RewriteRule ^(.*[^/])/?$ $1.html
>
>to append the .html extension to the requested URLs.
>If your CMS returns non-HTML content on some URLs you will need to adjust this to exclude them from the rewrite.
>
>Also, I'd remove --convert-links from the command line, since you want the same page contents as the real pages.
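
For reference, the effect of the rewrite rule quoted above can be modelled roughly in Python. This is only a sketch and not how Apache evaluates it: the helper name `rewrite_to_html` is made up for illustration, and it assumes the condition and the rule are both applied to the plain URL path, whereas the real RewriteCond tests %{SCRIPT_FILENAME}.

```python
import re

# Hypothetical helper, not part of wget or Apache. Simplified model of:
#   RewriteCond %{SCRIPT_FILENAME} !\.html$
#   RewriteRule ^(.*[^/])/?$ $1.html
# Assumption: both are applied to the URL path as given.
_RULE = re.compile(r"(.*[^/])/?")


def rewrite_to_html(path: str) -> str:
    """Return the static file path that would serve the requested path."""
    # Condition: requests that already end in .html are left alone.
    if path.endswith(".html"):
        return path
    match = _RULE.fullmatch(path)
    if match is None:
        # A path with no non-slash character (such as "/") does not match
        # the rule, so it is passed through unchanged.
        return path
    # Drop an optional trailing slash and append the extension.
    return match.group(1) + ".html"


if __name__ == "__main__":
    for url_path in ["/about", "/about/", "/about/something", "/about.html", "/"]:
        print(url_path, "->", rewrite_to_html(url_path))
```

Under this model /about and /about/something map to /about.html and /about/something.html, so the file and the directory named 'about' no longer compete for the same name on disk.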
