Isn't this a bug?

Paul Tomblin Tue, 01 Sep 2009 08:08:41 -0700

If I crawl a page with a URL like:
http://localhost/Documents/pharma/DocSamples/?C=N;O=A
(which is what you get when a directory has no index.*, you've
configured "Options Indexes", and you click one of the sorting
options) and it presents all the files in the directory as relative
links like "foo.html", Nutch ends up trying to fetch each file with
the second part of that query string appended, like "foo.html;O=A",
which gets a 404.


Look at the parse data for http://localhost/Documents/pharma/DocSamples/?C=D;O=A
...
     [java]   outlink: toUrl: http://localhost/Documents/pharma/DocSamples/15%20minutes.htm;O=A anchor: 15 minutes.htm
     [java]   outlink: toUrl: http://localhost/Documents/pharma/DocSamples/18whistle.html;O=A anchor: 18whistle.html
     [java]   outlink: toUrl: http://localhost/Documents/pharma/DocSamples/2010%20brings%20changes.doc;O=A anchor: 2010 brings changes.doc
...
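
For comparison, here is a rough sketch of what standard relative-reference
resolution gives for one of those links. It uses Python's urllib.parse
rather than anything from Nutch, so it only illustrates the expected
result and says nothing about where Nutch goes wrong. The helper names
resolve and leaks_base_query are made up for this example.

```python
from urllib.parse import urljoin, urlsplit

# Base URL from the parse data above: a directory listing whose sort
# options live in the query string.
BASE = "http://localhost/Documents/pharma/DocSamples/?C=D;O=A"


def resolve(base: str, href: str) -> str:
    """Resolve a relative href against base (hypothetical helper)."""
    return urljoin(base, href)


def leaks_base_query(base: str, resolved: str) -> bool:
    """Return True if a ';'-separated piece of the base query string
    shows up at the end of the resolved URL (hypothetical helper)."""
    query = urlsplit(base).query
    return any(resolved.endswith(";" + part) for part in query.split(";") if part)


# What a standards-following resolver produces for the relative link.
expected = resolve(BASE, "18whistle.html")

# What the parse data above reports as the outlink.
observed = "http://localhost/Documents/pharma/DocSamples/18whistle.html;O=A"

print(expected)
print(leaks_base_query(BASE, expected), leaks_base_query(BASE, observed))
```

The base URL's query string should be dropped entirely when the relative
link has no query of its own, so the resolved outlink should end in
"18whistle.html" with nothing after it.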

-- 
http://www.linkedin.com/in/paultomblin
