I need to restrict my crawling to pages on a particular set of URLs,
e.g. http://www.myweb1.com/<allpages>, http://www.myweb2.com/<allpages>, ...
Is there a simple way of doing this?
Yes.
1. Copy conf/regex-urlfilter-default.txt to conf/regex-urlfilter-site.txt.
2. Edit conf/regex-urlfilter-site.txt to exclude pages which don't match the domains you want, e.g., replace the line reading '+.' with something like:
--snip--
# include myweb1 and myweb2 urls
+http://www.myweb1.com/
+http://www.myweb2.com/

# exclude everything else
-.
--snip--
You might also want to revisit some of the other rules in this file, e.g., to include URLs that have query strings.
3. Create a conf/nutch-site.xml containing something like:
--snip--
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<nutch-conf>

<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter-site.txt</value>
  <description>Name of file on CLASSPATH containing default regular expressions used by RegexURLFilter.</description>
</property>

</nutch-conf>
--snip--
Properties set in this file override those in nutch-default.xml, so you might want to set other options here too, like your robot agent name.
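If it helps to see why the order of the rules in step 2 matters, here is a rough sketch in Python (not Nutch code) of how a filter file like that gets evaluated. It assumes the rules are tried top to bottom, that the first pattern found anywhere in the URL decides, and that a URL matching nothing is dropped. The names parse_rules and accepts are made up for this example, and the `-[?*!@=]` line is only my stand-in for the kind of query-string rule the default file may contain; check your own copy.

```python
import re

# Hypothetical filter file: the step 2 rules, preceded by an assumed
# "skip probable queries" rule to show how an earlier rule can win.
RULES_TEXT = """\
# skip URLs that look like queries (assumed rule, check your own file)
-[?*!@=]
# include myweb1 and myweb2 urls
+http://www.myweb1.com/
+http://www.myweb2.com/
# exclude everything else
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (include, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern occurs in the URL."""
    for include, pattern in rules:
        if pattern.search(url):  # unanchored search (an assumption)
            return include
    return False  # nothing matched: treat the URL as excluded


if __name__ == "__main__":
    rules = parse_rules(RULES_TEXT)
    for url in (
        "http://www.myweb1.com/index.html",
        "http://www.myweb2.com/docs/a.html",
        "http://www.myweb1.com/page?id=3",
        "http://www.example.org/",
    ):
        print(url, accepts(rules, url))
```

Under those assumptions the query-string rule fires before the '+' lines are ever tried, which is why you may want to edit or remove it. Also, since the patterns are regular expressions, an unescaped '.' matches any character, so something like `+^http://www\.myweb1\.com/` is a tighter way to write the include rule if you care about that.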
That's it!
Doug
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
