[Nutch-general] need regex-normalize.xml help (crawler trap)

Michael Nebel Wed, 31 Aug 2005 16:31:18 -0700

Hi,

My crawler got caught by a site with a URL loop. Each time I fetch a page, the same page with one more / is added to the fetchlist. So the URLs look like:


        http://host.name/dir/page.html
        http://host.name//dir/page.html
        http://host.name///dir/page.html
        http://host.name////dir/page.html
        ...

I think it should be possible to fix this by using regex-normalize.xml. How does the following rule look?


        <regex>
          <pattern>(.*://.*)//(.*)</pattern>
          <substitution>$1/$2</substitution>
        </regex>

Is this OK?
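One way to check a rule like this before deploying it is to run it over the sample URLs. The following is only a sketch: it translates the pattern to Python's `re` module (where `$1/$2` is written `\1/\2`) and assumes the regex engine used by Nutch treats this pattern the same way. The helper names `apply_once` and `apply_until_stable` are hypothetical and not part of Nutch.

```python
import re

# Python translation of the rule above. regex-normalize.xml writes the
# substitution as $1/$2; Python's re module writes it as \1/\2.
PATTERN = re.compile(r"(.*://.*)//(.*)")
SUBSTITUTION = r"\1/\2"


def apply_once(url: str) -> str:
    """Apply the rule in a single substitution pass."""
    return PATTERN.sub(SUBSTITUTION, url)


def apply_until_stable(url: str, max_passes: int = 100) -> str:
    """Hypothetical helper: reapply the rule until the URL stops changing."""
    for _ in range(max_passes):
        normalized = apply_once(url)
        if normalized == url:
            break
        url = normalized
    return url


if __name__ == "__main__":
    for url in (
        "http://host.name/dir/page.html",
        "http://host.name//dir/page.html",
        "http://host.name////dir/page.html",
    ):
        print(url, "->", apply_once(url), "->", apply_until_stable(url))
```

Because both `.*` groups are greedy and the match spans the whole URL, a single pass removes only one slash from a run of extra slashes. In this sketch a URL with several extra slashes only collapses fully when the rule is reapplied. The pattern is also not limited to the path, so a `//` inside a query string that embeds another URL would be rewritten as well.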

Regards

        Michael


--
Michael Nebel           
http://www.nebel.de/
http://www.netluchs.de/



