David Spencer wrote:
crawl-urlfilter.txt needs at least one change to allow file URLs (this is the culprit: '-^(file|ftp|mailto|https):'), and maybe a second change if it's only allowing http: at the bottom

and

nutch-site.xml needs an entry for the plugin.includes property to allow the file plugin...the value will be like this: 'protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)'

[1] Is this accurate (it worked for me), and should I add a Wiki entry, or will things be easier soon (I'm on 0.6)?

This is accurate. Please add a wiki entry.
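
For anyone writing up the wiki entry, here is a rough Python sketch of why the quoted filter line blocks file URLs. It is not Nutch code. It assumes the filter takes the first rule whose regex is found in the URL, treats '+' as accept and '-' as reject, and rejects URLs that match no rule. The two '+' rules and the accepts() helper are hypothetical stand-ins for illustration.

```python
import re

# Rule set modeled on the crawl-urlfilter.txt line quoted above.
# Each rule is a sign ('+' accept, '-' reject) followed by a regex.
ORIGINAL_RULES = [
    "-^(file|ftp|mailto|https):",  # the culprit line from the thread
    "+^http://",                   # hypothetical stand-in for the http-only rule
]

EDITED_RULES = [
    "-^(ftp|mailto|https):",       # 'file' removed so file: URLs are not rejected here
    "+^(http|file):",              # hypothetical rule that also accepts file: URLs
]

def accepts(rules, url):
    """Return True if the first rule whose regex is found in the URL is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected

if __name__ == "__main__":
    for url in ("file:///home/docs/a.txt", "http://example.com/"):
        print(url, accepts(ORIGINAL_RULES, url), accepts(EDITED_RULES, url))
```

Under those assumptions, both changes are needed: dropping 'file' from the reject line stops the early rejection, and the accept rule has to cover file: as well, or the URL falls through and is rejected anyway.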

[2] Seems kinda lame that I have to make an entry in nutch-site.xml to enable the plugin. Why not just make it easy for users and allow (load) *all* plugins by default? Is there some security or performance reason for not doing so?

We had a number of performance and reliability problems when all plugins were enabled (in 0.5), so we opted instead to exclude all but the essential plugins. Then, when folks have problems after enabling a plugin, it's much easier to identify the cause.
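
To make the effect of the plugin.includes value concrete, here is a small Python sketch, not Nutch code. It assumes the value is a regular expression that must match a plugin's whole id for that plugin to load. The enabled() helper and the sample id list are made up for illustration.

```python
import re

# The plugin.includes value quoted earlier in the thread.
PLUGIN_INCLUDES = (
    "protocol-file|protocol-http|parse-(text|html)|"
    "index-basic|query-(basic|site|url)"
)

def enabled(plugin_ids, includes=PLUGIN_INCLUDES):
    """Return the ids that the includes regex matches in full (assumed semantics)."""
    pattern = re.compile(includes)
    return [pid for pid in plugin_ids if pattern.fullmatch(pid)]

if __name__ == "__main__":
    # Example ids only; the real set depends on the plugins directory.
    sample = ["protocol-file", "protocol-ftp", "parse-html", "parse-pdf", "query-site"]
    print(enabled(sample))
```

Under that assumption, anything not named in the expression stays off, which is what keeps a misbehaving optional plugin from affecting a default install.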


Nearly every use of Nutch requires some config changes. The url filters and plugins are probably among the most common things that need to be altered. Sorry there wasn't better documentation!

How did it work once you got past these problems?

Cheers,

Doug


Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
