Hello all,

I'd like to apologize in advance for asking redundant questions. I've been searching for information for a while now and it's all Greek to me. Normally I'm great with computers/programming/etc., but this Nutch thing just has me beat.

So, here's what I'm doing.

I'm setting up a vertical search engine which will search a handful (no more than 10) of web forums that relate to Honda cars. I already have a number of the forums picked out and have the appropriate URLs in a flat url file.

I have run the initial crawl command, which in my case is as follows:

bin/nutch crawl urls/honda-all -dir crawl -depth 10 -threads 40

This crawls and indexes the sites correctly. I can post my url file (honda-all) if you want.

As many of you know, internet forums update rapidly. I would like to recrawl these sites each night to pick up any new or updated content, but I honestly have no idea how. I have a cron job set up to run the recrawl.sh script, but I don't know whether it is actually adding new content or not. It's all a bit confusing to me.
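For reference, here's my rough understanding of what a nightly recrawl should be doing, pieced together from the wiki. The install path, crawl dir, and segment handling below are my own guesses, not a tested script, so please correct me if I've got the cycle wrong:

#!/bin/sh
# Nightly recrawl sketch -- my best guess at the generate/fetch/updatedb
# cycle. Paths are assumptions based on my setup, not something I've
# verified works.
NUTCH_HOME=/opt/nutch          # assumed install location
CRAWL_DIR=crawl                # same -dir I used for the initial crawl

cd $NUTCH_HOME

# Select pages due for refetching and create a new segment
bin/nutch generate $CRAWL_DIR/crawldb $CRAWL_DIR/segments

# Grab the segment that generate just created (newest directory)
SEGMENT=$CRAWL_DIR/segments/`ls $CRAWL_DIR/segments | tail -1`

# Fetch the pages, then fold the results back into the crawldb
bin/nutch fetch $SEGMENT
bin/nutch updatedb $CRAWL_DIR/crawldb $SEGMENT

I assume the link inversion and indexing steps would then follow on the new segment, but that's exactly the part I'm fuzzy on.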

I'm also a bit confused as to how to add new sites to my list of urls (urls/honda-all). Do I add new sites by just adding them to the url file, or do I have to inject the URLs into the webdb?
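From what I can tell, inject is the command for this, so I'm imagining something like the following. The forum URL is just a made-up placeholder, and I'm not sure whether updating the flat file alone is enough or whether the inject step is required:

# Add the new site to my flat url file first...
echo "http://www.example-honda-forum.com/" >> urls/honda-all

# ...then (I think) inject it into the crawldb so the next
# generate/fetch cycle picks it up:
bin/nutch inject crawl/crawldb urls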

The more I read, the more confused I get, which I guess is my own fault for having so many little "projects" I keep taking on.

If anyone can help I'd appreciate it.

In summary:

1. How do I recrawl the URLs in my flat url file each night to pick up new content? Each site may have hundreds of new pages each day.
2. Can anyone give me a simple, layman's example of how to inject new URLs into the database to be crawled?

Thanks in advance. Hopefully I won't be flamed for not searching the list archives or the Nutch wiki more thoroughly. Both appear too cumbersome for me right now, and I'm just hoping a few people will help me out.

Matt
