Hi all,
I have looked through the archives and followed the instructions in the tutorial, but I am still having problems limiting nutch to just my site.
For instance, the tutorial reads:
2. Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.org domain, the line should read:
+^http://([a-z0-9]*\.)*nutch.org/
But when I test the above regex, following a comment in the archives from April 16, using:
cat file-with-test-urls | nutch net/nutch/net/RegexURLFilter
I get this for the output:
<snip>
+# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
-
+# limit to org site only
-+^http://([a-z0-9]*\.)*nutch.org/
-
+# do not accept anything else
++.
</snip>
So, according to the filter test, the regex in the tutorial does not work. Likewise, when I use Doug's example from another email (+^http://www.cs.princeton.edu/(people/(grad|fac)\.php)?$), I get the "-" sign when I run the test. The "[EMAIL PROTECTED]" line gets a "-" sign as well...
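To pin down what I expect the filter to do, here is a minimal Python sketch of how I assume the rules are applied. It is not nutch code. The assumptions are mine: each rule line is a "+" or "-" followed by a regex, the first rule whose regex is found anywhere in the URL decides, and a URL that matches no rule is rejected. Only the nutch.org pattern comes from the tutorial; the final "-." catch-all, the function names parse_rules and accepts, and the sample URLs are hypothetical, and Python's re module stands in for the Java regex engine.

```python
import re

# Rule lines in the style of conf/crawl-urlfilter.txt. Only the nutch.org
# pattern is from the tutorial; the "-." catch-all is an assumption.
RULES = r"""
# limit to the nutch.org domain
+^http://([a-z0-9]*\.)*nutch.org/
# reject anything else
-.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_regex) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule whose regex is found in the URL wins."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumed default when no rule matches


if __name__ == "__main__":
    rules = parse_rules(RULES)
    # Hypothetical test URLs, one per line, as in file-with-test-urls.
    for url in (
        "http://www.nutch.org/",
        "http://nutch.org/docs/",
        "http://lucene.apache.org/nutch/",
    ):
        print(("+" if accepts(rules, url) else "-") + url)
```

Under those assumptions the first two URLs come out with "+" and the third with "-", which is the behaviour I am trying to get from nutch itself.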
So, can anyone out there give me the exact syntax so that nutch will *only* crawl the domain (and subdomain(s)) for the site I want to crawl?
Many thanks.
-lucas
