Hello all,

I'd like to apologize in advance for asking redundant questions. I've been searching for information for a while now and it's all Greek to me. Normally I'm great with computers/programming/etc., but this Nutch thing just has me beat.

So, here's what I'm doing.

I'm setting up a vertical search engine which will search a handful (no more than 10) of web forums that relate to Honda cars. I already have a number of the forums picked out and have the appropriate URLs in a flat url file.

I have run the initial crawl command, which in my case is as follows:

bin/nutch crawl urls/honda-all -dir crawl -depth 10 -threads 40

This crawls and indexes the sites correctly. I can post my url file (honda-all) if you want.

As many of you know, internet forums update rapidly. I would like to recrawl these sites each night to pick up any new or updated content, but I honestly have no idea how. I have a cron job set up to run the recrawl.sh script, but I don't know whether it is actually adding new content or not. It's all a bit confusing to me.
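For reference, here's my rough understanding of what a nightly recrawl should be doing, pieced together from the wiki. The install path, crawl dir, and segment handling below are my own guesses, not a tested script, so please correct me if I've got the cycle wrong:

#!/bin/sh
# Nightly recrawl sketch -- my best guess at the generate/fetch/updatedb
# cycle. Paths are assumptions based on my setup, not something I've
# verified works.
NUTCH_HOME=/opt/nutch          # assumed install location
CRAWL_DIR=crawl                # same -dir I used for the initial crawl

cd $NUTCH_HOME

# Select pages due for refetching and create a new segment
bin/nutch generate $CRAWL_DIR/crawldb $CRAWL_DIR/segments

# Grab the segment that generate just created (newest directory)
SEGMENT=$CRAWL_DIR/segments/`ls $CRAWL_DIR/segments | tail -1`

# Fetch the pages, then fold the results back into the crawldb
bin/nutch fetch $SEGMENT
bin/nutch updatedb $CRAWL_DIR/crawldb $SEGMENT

I assume the link inversion and indexing steps would then follow on the new segment, but that's exactly the part I'm fuzzy on.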

I'm also a bit confused as to how to add new sites to my list of urls (urls/honda-all). Do I add new sites by just adding them to the url file, or do I have to inject the URLs into the webdb?
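From what I can tell, inject is the command for this, so I'm imagining something like the following. The forum URL is just a made-up placeholder, and I'm not sure whether updating the flat file alone is enough or whether the inject step is required:

# Add the new site to my flat url file first...
echo "http://www.example-honda-forum.com/" >> urls/honda-all

# ...then (I think) inject it into the crawldb so the next
# generate/fetch cycle picks it up:
bin/nutch inject crawl/crawldb urls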

The more I read, the more confused I get, which I guess is my own fault for having so many little "projects" I keep taking on.

If anyone can help I'd appreciate it.

In summary:

1. How do I recrawl the URLs in my flat url file each night to pick up new content? Each site may have hundreds of new pages each day.
2. Can anyone give me a simple, layman's example of how to inject new URLs into the database to be crawled?

Thanks in advance. Hopefully I won't be flamed for not searching the list archives or the Nutch wiki more thoroughly. Both appear too cumbersome for me right now, and I'm just hoping a few people will help me out.

Matt
