Dear Rob,
Thanks for the great information about the prefix plugin. So the prefix
plugin is faster than the regex plugin.
What is the syntax of the prefix-urlfilter.txt file? I would like to enable
or disable complete domains (e.g. notebook.com and lg.notebook.com). Can
I use regular expressions in this file?
Regards,
Ferenc
Rob Pettengill wrote:
Bryan,
It takes thousands of URL rules to make a significant speed
difference between the regexp and the prefix URL filter. So if you
can be more elegant with a regexp rule, or need some of the other
file-type regexp rules, I'd suggest that you stick with the regexp
rules. I have around 5 thousand rules and use a ComboURLFilter, which
uses rules from both the regexp and prefix files.
To use the PrefixURLFilter, put your changes to the default values in
the nutch-site.xml file, which overrides the defaults file:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<nutch-conf>
  <property>
    <name>urlfilter.class</name>
    <value>net.nutch.net.PrefixURLFilter</value>
    <description>Name of the class used to filter URLs.</description>
  </property>
  <property>
    <name>urlfilter.prefix.file</name>
    <value>prefix-urlfilter.txt</value>
    <description>Name of file on CLASSPATH containing default regular
    expressions used by PrefixURLFilter.</description>
  </property>
</nutch-conf>
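
For what it's worth, here is a rough sketch of what prefix filtering
amounts to. It is not the actual net.nutch.net.PrefixURLFilter source;
the class name PrefixFilterSketch and the rule format (one literal URL
prefix per line, with blank lines and '#' comment lines skipped) are
assumptions made for illustration. The rules are compared as plain
strings, so under that assumption a regexp in the file would not be
interpreted as a pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: NOT the real net.nutch.net.PrefixURLFilter.
class PrefixFilterSketch {
    private final List<String> prefixes = new ArrayList<>();

    // Assumed rule format: one literal URL prefix per line;
    // blank lines and lines starting with '#' are skipped.
    PrefixFilterSketch(List<String> ruleLines) {
        for (String line : ruleLines) {
            String rule = line.trim();
            if (rule.isEmpty() || rule.startsWith("#")) {
                continue;
            }
            prefixes.add(rule);
        }
    }

    // Returns the URL unchanged if it starts with any prefix, otherwise null.
    String filter(String url) {
        for (String prefix : prefixes) {
            if (url.startsWith(prefix)) {
                return url;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical rules covering the two hosts mentioned in the thread.
        PrefixFilterSketch f = new PrefixFilterSketch(Arrays.asList(
            "# accept both hosts",
            "http://notebook.com/",
            "http://lg.notebook.com/"));
        System.out.println(f.filter("http://lg.notebook.com/index.html"));
        System.out.println(f.filter("http://example.org/"));
    }
}
```

In this sketch each host needs its own line, since "http://notebook.com/"
is not a string prefix of "http://lg.notebook.com/". A linear scan is
used here only for brevity.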
--
Robert C. Pettengill, Ph.D.
[EMAIL PROTECTED]
Questions about petroleum?
Goto: http://AskAboutOil.com/
On 2005, Jul 10, at 9:38 PM, Bryan Woliner wrote: