Hilkiah,

Thanks! This is the first time I have set eyes on this file!

Based on my tests, I can conclude that the override order is:
nutch-default.xml --> crawl-tool.xml --> nutch-site.xml

I just found an old post by Dennis Kubes
(http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200703.mbox/
[EMAIL PROTECTED]) explaining how to set the
suffix-urlfilter.txt and the prefix-urlfilter.txt and guess what, it
solved half of my problems! Thanks Dennis!

And thanks Hilkiah! 


David




-----Original Message-----
From: Hilkiah Lavinier [mailto:[EMAIL PROTECTED] 
Sent: Friday, 25 April 2008 15:33
To: [email protected]
Subject: Re: crawl command & urlfilter

Hi, take a look at crawl-tool.xml, in particular:

<property>
  <name>urlfilter.regex.file</name>
  <value>crawl-urlfilter.txt</value>
</property>

Thus you can specify which file will contain your crawl regex
expressions.

However, as pointed out in this file:

<!-- Do not modify this file directly.  Instead, copy entries that you -->
<!-- wish to modify from this file into nutch-site.xml and change them -->
<!-- there.  If nutch-site.xml does not already exist, create it.      -->

Thus you should modify nutch-site.xml instead! This also means you can
use prefix or suffix filtering for your URLs with the crawl command.
Just ensure the appropriate plugin is being loaded. For example, I use
suffix filtering, so the relevant section of my nutch-site.xml looks
like:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-suffix|parse-(text|html)|index-(basic|anchor)|query-(basic|site|url)|summary-lucene|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>...shortened...</description>
</property>

Lastly, nutch-site.xml overrides both nutch-default.xml and
crawl-tool.xml, which is why the above works.
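
To make the override concrete, here is a minimal sketch of a
nutch-site.xml that redirects the regex filter to another file. The
property name urlfilter.regex.file is the one quoted from
crawl-tool.xml above; the file name my-crawl-urlfilter.txt is
hypothetical and only there for illustration. As far as I understand,
the setting only has an effect while urlfilter-regex is listed in
plugin.includes.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Takes precedence over the value set in crawl-tool.xml. -->
  <property>
    <name>urlfilter.regex.file</name>
    <!-- Hypothetical file name, used for illustration only. -->
    <value>my-crawl-urlfilter.txt</value>
  </property>
</configuration>
```

Because nutch-site.xml is applied last, only the entries you want to
change need to appear in it; everything else keeps its value from the
other two files.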

Regards,

 Hilkiah G. Lavinier MEng (Hons), ACGI 
6 Winston Lane, 
Goodwill, 
Roseau, Dominica


Mbl: (767) 275 3382
Hm : (767) 440 3924
Fax: (767) 440 4991
VoIP USA: (646) 432 4487


Email: [EMAIL PROTECTED]
Email: [EMAIL PROTECTED]
IM: Yahoo hilkiah / MSN [EMAIL PROTECTED]
IM: ICQ #8978201  / AOL hilkiah21


----- Original Message ----
From: POIRIER David <[EMAIL PROTECTED]>
To: [email protected]
Sent: Friday, April 25, 2008 8:24:03 AM
Subject: crawl command & urlfilter

Hello,

A quick question!

I am crawling different sources using the "crawl" command. As you know,
I can define my crawling space by editing the series of regexes in
crawl-urlfilter.txt. Based on my tests, I concluded that this file
is actually used by the "urlfilter-regex" plugin. But in my
nutch-default.xml file, this plugin is configured to read its rules
from the regex-urlfilter.txt file.

Am I right in saying that crawl-urlfilter.txt overrides
regex-urlfilter.txt, just as nutch-site.xml overrides
nutch-default.xml?

But then, what happens if I use urlfilter-prefix? Are the regexes in
my crawl-urlfilter.txt file still used?

Thank you,

David


 