Doug Cutting wrote:

David Spencer wrote:

crawl-urlfilter.txt needs at least 1 change, to allow file: URLs (this line is the culprit: '-^(file|ftp|mailto|https):'), and maybe a 2nd change if it only allows http: at the bottom

and

nutch-site.xml needs an entry for the plugin.includes property to allow the file plugin...the value will be like this: 'protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)'
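To see what that plugin.includes value selects, here is a minimal Python sketch. It assumes the value is treated as a regular expression that must match a plugin's whole id, which is my reading of how the property behaves and is not stated in this thread. The function name and the candidate ids in the usage example are hypothetical, chosen only for illustration.

```python
import re

# Value quoted in the message above for the plugin.includes property.
PLUGIN_INCLUDES = (
    "protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)"
)


def enabled_plugins(plugin_ids, includes=PLUGIN_INCLUDES):
    """Return the ids that the includes expression matches in full.

    Assumption: the whole id must match, so 'protocol-ftp' is not
    enabled just because it shares a prefix with an allowed id.
    """
    pattern = re.compile(includes)
    return [pid for pid in plugin_ids if pattern.fullmatch(pid)]


if __name__ == "__main__":
    # Hypothetical list of plugin ids, for illustration only.
    candidates = ["protocol-file", "protocol-http", "protocol-ftp",
                  "parse-text", "parse-html", "parse-pdf", "index-basic"]
    for pid in enabled_plugins(candidates):
        print("enabled:", pid)
```

Under that assumption, adding 'protocol-file' to the alternation is what lets file: URLs be fetched at all; the URL filter change is a separate step.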

[1] Is this accurate (it worked for me), and should I add a wiki entry, or will things be easier soon? (I'm on 0.6.)


This is accurate. Please add a wiki entry.

Oops, I caught something - the crawl jumps off the disk unless you disable the http scheme, so this line:
-^(file|ftp|mailto|https):
changes to this:
-^(http|ftp|mailto|https):
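The effect of that one-line change can be sketched in Python. This is a rough model, not the Nutch filter itself: it assumes rules are tried top to bottom, the first pattern found in the URL decides ('+' accepts, '-' rejects), and a URL matching no rule is rejected. The final '+.' accept-everything rule is also an assumption, since the rest of the file is not shown in this thread.

```python
import re

# The modified reject line from above, plus a hypothetical catch-all accept rule.
FILE_ONLY_RULES = """
# skip everything that is not on the local disk
-^(http|ftp|mailto|https):
# accept anything else (assumed; not quoted in the thread)
+.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First rule whose pattern is found in the URL wins; no match rejects."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


if __name__ == "__main__":
    rules = parse_rules(FILE_ONLY_RULES)
    for url in ["file:///home/dave/notes.txt", "http://www.nutch.org/"]:
        print(url, "->", "crawl" if accepts(rules, url) else "skip")
```

With 'http' in the reject list, links from local HTML files to web sites are dropped, so the crawl stays on the disk.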



I added an entry to the FAQ, but if this is the only time this question has ever been answered then maybe it isn't so frequent, so I can break it out into its own little page.


http://www.nutch.org/cgi-bin/twiki/view/Main/FAQ#FaqIndexing

[2] Seems kinda lame that I have to make an entry in nutch-site.xml to enable the plugin - why not just make it easy for users and allow (load) *all* plugins by default? Is there some security or performance reason for not doing so?


We had a number of performance and reliability problems when all plugins were enabled (in 0.5) and instead opted for excluding all but the essential plugins. Then when folks have problems after enabling a plugin it's much easier to identify the cause.

Nearly every use of Nutch requires some config changes. The url filters and plugins are probably among the most common things that need to be altered.


Sorry there wasn't better documentation!

Well it made for a good exercise to try to figure this out w/o asking the question...



How did it work once you got past these problems?

Simple question, but I realized that I brashly forgot to test, and I rediscovered something I seem to rediscover, um, about 1x/year. I have a side project I work on, on and off, to develop a Lucene desktop indexer, and I hit the problem w/ that too: Mozilla won't, by default, load file: URLs from an http: URL, for security reasons. See the bottom of the FAQ entry for the 2 relevant URLs. IE5 works fine. The general "fix" for this is to go thru some work to feed these local file: URLs thru the web container. I can elaborate on this if anyone wants, but it's a bit off topic from the main goals of Nutch anyway...


thx,
 Dave







Cheers,

Doug


------------------------------------------------------- SF email is sponsored by - The IT Product Guide Read honest & candid reviews on hundreds of IT Products from real users. Discover which products truly live up to the hype. Start reading now. http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click _______________________________________________ Nutch-general mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/nutch-general


