Re: [Nutch-dev] restricting crawling to pages on a set of urls

Doug Cutting Fri, 23 Apr 2004 09:56:54 -0700

Skip McCrystal wrote:

> I need to restrict my crawling to pages on a particular set of urls.
> i.e.
> http://www.myweb1.com/<allpages>
> http://www.myweb2.com/<allpages>
> ...
> Is there a simple way of doing this?

Yes.

1. Copy conf/regex-urlfilter-default.txt to conf/regex-urlfilter-site.txt.

2. Edit conf/regex-urlfilter-site.txt to exclude pages which don't match the domains you want, e.g., replace the line reading '+.' with something like:

--snip--
# include myweb1 and myweb2 urls (anchored at the start, dots escaped)
+^http://www\.myweb1\.com/
+^http://www\.myweb2\.com/

# exclude everything else
-.
--snip--

Note that you might also revisit some of the other rules in this file, e.g., to include urls that have query strings, etc.

3. Create a conf/nutch-site.xml containing something like:

--snip--
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<nutch-conf>

 <property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter-site.txt</value>
  <description>Name of file on CLASSPATH containing the site-specific
  regular expressions used by RegexURLFilter.</description>
 </property>

</nutch-conf>
--snip--

Properties set in this file override those in nutch-default.xml, so you might want to set other options here too, like your robot agent name.
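
If you want to sanity-check a rule file before starting a crawl, the filter from step 2 can be approximated outside Nutch. What follows is only a rough sketch in Python, not Nutch's own code. It assumes that rules are read top to bottom, that the first rule whose pattern is found anywhere in the URL decides, that '+' accepts and '-' rejects, and that a URL matching no rule is rejected. The names parse_rules and accepts are made up for illustration, and the '^' anchors and escaped dots are a stricter variant of the patterns in step 2.

```python
import re

# Rules in the same +/- format as conf/regex-urlfilter-site.txt.
# Assumption: the '^' anchor and escaped dots are added here so that the
# host name must appear at the start of the URL and '.' is literal.
RULES_TEXT = r"""
# include myweb1 and myweb2 urls
+^http://www\.myweb1\.com/
+^http://www\.myweb2\.com/

# exclude everything else
-.
"""


def parse_rules(text):
    """Return a list of (accept, compiled_pattern) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule found in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: no matching rule means the URL is rejected


rules = parse_rules(RULES_TEXT)

if __name__ == "__main__":
    for url in (
        "http://www.myweb1.com/index.html",
        "http://www.myweb2.com/docs/page.html",
        "http://www.example.org/",
        "http://www.example.org/?u=http://www.myweb1.com/",
    ):
        print("+" if accepts(rules, url) else "-", url)
```

The last test URL shows why the anchor matters: without '^', a pattern that is searched for anywhere in the URL would also match a wanted host name embedded in some other site's query string.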

That's it!

Doug


_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers