If I crawl a page with a URL like http://localhost/Documents/pharma/DocSamples/?C=N;O=A (which is what you get when a directory has no index.*, you've configured "Options Indexes", and you click one of the sorting options), the page presents all the files in the directory as relative links like "foo.html". Nutch then tries to fetch each file with the second part of that same parameter stuck on the end, like "foo.html;O=A", which gets a 404.
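For comparison, here is a minimal sketch of the resolution I would expect. It uses the JDK's java.net.URI, not anything from Nutch, and the class and method names are made up for illustration. The base URL's query string (including the ";O=A" part) should be dropped when a plain relative link is resolved against it.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch: resolves a relative href
// against the URL of the page it was found on, using java.net.URI.
public class ResolveRelativeLink {

    static String resolve(String base, String href) {
        // The href carries no query of its own, so the resolved URL
        // takes none from the base either.
        return URI.create(base).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://localhost/Documents/pharma/DocSamples/?C=N;O=A";
        // prints http://localhost/Documents/pharma/DocSamples/foo.html
        System.out.println(resolve(base, "foo.html"));
    }
}
```

Note that the href has to be percent-encoded already (e.g. "15%20minutes.htm"); URI.create rejects a raw space.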
Look at the parse data for http://localhost/Documents/pharma/DocSamples/?C=D;O=A

...
[java] outlink: toUrl: http://localhost/Documents/pharma/DocSamples/15%20minutes.htm;O=A anchor: 15 minutes.htm
[java] outlink: toUrl: http://localhost/Documents/pharma/DocSamples/18whistle.html;O=A anchor: 18whistle.html
[java] outlink: toUrl: http://localhost/Documents/pharma/DocSamples/2010%20brings%20changes.doc;O=A anchor: 2010 brings changes.doc
...

-- 
http://www.linkedin.com/in/paultomblin
