Hilkiah,

Thanks! This is the first time I have set eyes on this file!
Based on my tests, I can conclude that the order is:

nutch-default.xml --> crawl-tool.xml --> nutch-site.xml

I just found an old post by Dennis Kubes (http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200703.mbox/ [EMAIL PROTECTED]) explaining how to set up suffix-urlfilter.txt and prefix-urlfilter.txt and, guess what, it solved half of my problems!

Thanks Dennis! And thanks Hilkiah!

David

-----Original Message-----
From: Hilkiah Lavinier [mailto:[EMAIL PROTECTED]
Sent: Friday, 25 April 2008 15:33
To: [email protected]
Subject: Re: crawl command & urlfilter

Hi,

Take a look at crawl-tool.xml, in particular:

<property>
  <name>urlfilter.regex.file</name>
  <value>crawl-urlfilter.txt</value>
</property>

This lets you specify which file contains your crawl regex expressions. However, as pointed out in this file:

<!-- Do not modify this file directly. Instead, copy entries that you -->
<!-- wish to modify from this file into nutch-site.xml and change them -->
<!-- there. If nutch-site.xml does not already exist, create it. -->

So you should modify nutch-site.xml instead! This also means you can use a prefix or suffix URL filter with the crawl command. Just ensure the appropriate plugin is being loaded. For example, I use suffix filtering, so the relevant section of my nutch-site.xml looks like:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-suffix|parse-(text|html)|index-(basic|anchor)|query-(basic|site|url)|summary-lucene|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>...shortened...</description>
</property>

Lastly, nutch-site.xml overrides nutch-default.xml AND crawl-tool.xml, which is why the above works.

Regards,

Hilkiah G.
Lavinier MEng (Hons), ACGI
6 Winston Lane, Goodwill, Roseau, Dominica
Mbl: (767) 275 3382
Hm : (767) 440 3924
Fax: (767) 440 4991
VoIP USA: (646) 432 4487
Email: [EMAIL PROTECTED]
Email: [EMAIL PROTECTED]
IM: Yahoo hilkiah / MSN [EMAIL PROTECTED]
IM: ICQ #8978201 / AOL hilkiah21

----- Original Message ----
From: POIRIER David <[EMAIL PROTECTED]>
To: [email protected]
Sent: Friday, April 25, 2008 8:24:03 AM
Subject: crawl command & urlfilter

Hello,

A quick question! I am crawling different sources using the "crawl" command. As you know, I can define my crawling space by editing a series of regexes in crawl-urlfilter.txt. Based on my tests, I concluded that this file is actually used by the "urlfilter-regex" plugin. But in my nutch-default.xml file, this plugin is configured to read from the regex-urlfilter.txt file.

Am I right in saying that crawl-urlfilter.txt overrides regex-urlfilter.txt in the same way that nutch-site.xml overrides nutch-default.xml?

But then, what happens if I use urlfilter-prefix? Are the regexes in my crawl-urlfilter.txt file still used?

Thank you,

David
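
As a minimal sketch of the override order discussed in this thread (nutch-default.xml, then crawl-tool.xml, then nutch-site.xml), the fragment below shows how a nutch-site.xml could point the regex filter at a different file. The file name my-urlfilter.txt is hypothetical, and the sketch assumes that urlfilter-regex is still listed in plugin.includes; if that plugin is not loaded, this property would presumably have no effect.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: illustrative sketch only.
     my-urlfilter.txt is a hypothetical file name, not one shipped with Nutch. -->
<configuration>
  <property>
    <name>urlfilter.regex.file</name>
    <!-- Read last, so this value should win over crawl-urlfilter.txt
         (set in crawl-tool.xml) and regex-urlfilter.txt (set in nutch-default.xml). -->
    <value>my-urlfilter.txt</value>
  </property>
</configuration>
```

If the crawl then filters URLs according to my-urlfilter.txt rather than crawl-urlfilter.txt, that would be consistent with the order reported above. The same approach should apply to the prefix and suffix filters, provided the matching plugin appears in plugin.includes.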
