Dear Rob,

Thanks for the great information about the prefix plugin (that the prefix plugin is faster than the regex plugin). What is the syntax of the prefix-urlfilter.txt file? I would like to enable or disable complete domains (e.g. notebook.com and lg.notebook.com). Can I use regular expressions in this file?

Regards,
   Ferenc

Rob Pettengill wrote:

Bryan,

It takes thousands of URL rules to make a significant speed difference between the regexp and the prefix URL filters. So if you can be more elegant with a regexp rule, or need some of the other file-type regexp rules, I'd suggest that you stick with the regexp rules. I have around five thousand rules and use a ComboURLFilter, which uses rules from both the regexp and prefix files.

To use the PrefixURLFilter, put your changes to the default values in the nutch-site.xml file, which overrides the defaults file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<nutch-conf>

<property>
  <name>urlfilter.class</name>
  <value>net.nutch.net.PrefixURLFilter</value>
  <description>Name of the class used to filter URLs.</description>
</property>

<property>
  <name>urlfilter.prefix.file</name>
  <value>prefix-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing the URL prefixes
  used by PrefixURLFilter.</description>
</property>

</nutch-conf>
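
For illustration, here is a minimal Java sketch of how a prefix-based URL filter behaves. It is not the Nutch source. The class name PrefixFilterSketch and the sample prefixes are hypothetical, and the file format it assumes (one URL prefix per line, with blank lines and lines starting with '#' skipped) should be checked against your Nutch version. Under those assumptions, a URL passes only if it starts with one of the listed prefixes, so the file works as an allow list and regular expressions in it would be treated as literal text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of prefix-based URL filtering; not Nutch's PrefixURLFilter. */
class PrefixFilterSketch {
    private final List<String> prefixes = new ArrayList<>();

    /**
     * Assumed file format: one URL prefix per line.
     * Blank lines and lines starting with '#' are skipped.
     */
    PrefixFilterSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            prefixes.add(trimmed);
        }
    }

    /** Returns the URL unchanged if some prefix matches, otherwise null. */
    String filter(String url) {
        for (String prefix : prefixes) {
            // Plain string comparison: the prefix is never read as a regex.
            if (url.startsWith(prefix)) {
                return url;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in for the contents of a prefix file (sample values only).
        PrefixFilterSketch f = new PrefixFilterSketch(Arrays.asList(
            "# domains to crawl",
            "http://notebook.com/",
            "http://lg.notebook.com/"));
        System.out.println(f.filter("http://lg.notebook.com/page.html"));
        System.out.println(f.filter("http://example.org/"));
    }
}
```

This sketch scans the prefixes linearly to keep it short; a filter meant for thousands of rules would more likely use a trie or a sorted structure, which is where the speed advantage over sequential regexp rules comes from.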

--
Robert C. Pettengill, Ph.D.
   [EMAIL PROTECTED]

Questions about petroleum?
    Goto:   http://AskAboutOil.com/

On 2005, Jul 10, at 9:38 PM, Bryan Woliner wrote:

