Hi,
my crawler got caught by a site with a URL loop. Each time I fetch a
page, the same page with one more / is added to the fetchlist. So the
URLs look like:
http://host.name/dir/page.html
http://host.name//dir/page.html
http://host.name///dir/page.html
http://host.name////dir/page.html
...
I think it should be possible to fix this with a rule in
regex-normalize.xml. How does the following rule look?
<regex>
<pattern>(.*://.*)//(.*)</pattern>
<substitution>$1/$2</substitution>
</regex>
Is this OK?
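
To check the rule outside the crawler, here is a minimal sketch using
Python's re module. It assumes the normalizer's regex engine handles
greedy ".*" the same way Python does, which I have not verified. The
helper names (apply_once, apply_until_stable, collapse_slashes) are
made up for illustration, and the second pattern is only a hypothetical
variant, not something tested in regex-normalize.xml.

```python
import re

# The rule from this message, rewritten for Python's re module.
# Assumption: the normalizer's engine treats greedy ".*" like Python does.
RULE = re.compile(r"(.*://.*)//(.*)")
SUBSTITUTION = r"\1/\2"

# Hypothetical one-pass variant: collapse any run of slashes that does
# not directly follow the ":" of the scheme or another slash.
VARIANT = re.compile(r"([^:/])/{2,}")


def apply_once(url):
    """Apply the rule a single time."""
    return RULE.sub(SUBSTITUTION, url)


def apply_until_stable(url):
    """Repeat the rule until the URL no longer changes."""
    while True:
        normalized = apply_once(url)
        if normalized == url:
            return url
        url = normalized


def collapse_slashes(url):
    """Apply the hypothetical variant a single time."""
    return VARIANT.sub(r"\1/", url)


if __name__ == "__main__":
    looped = "http://host.name////dir/page.html"
    print(apply_once(looped))          # http://host.name///dir/page.html
    print(apply_until_stable(looped))  # http://host.name/dir/page.html
    print(collapse_slashes(looped))    # http://host.name/dir/page.html
```

In this sketch the greedy ".*" makes the pattern match the whole URL, so
a single application removes only one slash. If the normalizer behaves
the same way, the rule would have to be applied repeatedly, or be
written so that it collapses a whole run of slashes at once.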
Regards
Michael
--
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general