Skip McCrystal wrote:
I need to restrict my crawling to pages on a particular set of urls.

i.e.
http://www.myweb1.com/<allpages>
http://www.myweb2.com/<allpages>
...

Is there a simple way of doing this?

Yes.


1. Copy conf/regex-urlfilter-default.txt to conf/regex-urlfilter-site.txt.

2. Edit conf/regex-urlfilter-site.txt so that it excludes pages outside the domains you want, e.g., replace the line reading '+.' with something like:

--snip--
# include myweb1 and myweb2 urls
+^http://www\.myweb1\.com/
+^http://www\.myweb2\.com/

# exclude everything else
-.
--snip--

You may also want to revisit some of the other rules in this file, e.g., to include urls that have query strings.

3. Create a conf/nutch-site.xml containing something like:

--snip--
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<nutch-conf>

 <property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter-site.txt</value>
  <description>Name of file on CLASSPATH containing default regular
  expressions used by RegexURLFilter.</description>
 </property>

</nutch-conf>
--snip--

Properties set in this file override those in nutch-default.xml, so you might want to set other options here too, like your robot agent name.
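
To sanity-check a rule list before crawling, here is a rough Python sketch of how rules like those in step 2 might be evaluated. It is not Nutch code. It assumes the rules are tried top to bottom, that the first rule whose regex is found anywhere in the url decides, and that a url matching no rule is rejected. The names parse_rules and accepts are made up for illustration, and the ^ anchors and escaped dots are a tightening chosen for this sketch.

```python
import re

# Rules modelled on step 2. The ^ anchors and escaped dots are an assumption
# of this sketch; they keep each include rule tied to the host itself.
RULES_TEXT = r"""
# include myweb1 and myweb2 urls
+^http://www\.myweb1\.com/
+^http://www\.myweb2\.com/

# exclude everything else
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule found in the url decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = parse_rules(RULES_TEXT)
for url in (
    "http://www.myweb1.com/index.html",
    "http://www.myweb2.com/a/b.html",
    "http://www.example.org/",
):
    print(url, accepts(rules, url))
```

Running a handful of sample urls through a check like this is a cheap way to catch a rule that is too loose or too strict before starting a crawl.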

That's it!

Doug

