Hello all,

I'd like to apologize in advance for asking redundant questions.  I've been 
searching for information for a while now and it's all Greek to me. 
Normally I'm great with computers/programming/etc.  This Nutch thing just 
has me beat.

So, here is what I'm doing.

I'm setting up a vertical search engine that will search a handful (no more 
than 10) of web forums related to Honda cars.  I have already picked out a 
number of the forums and have the appropriate URLs in a flat URL file.

I have run the initial crawl command, which in my case is as follows:

bin/nutch crawl urls/honda-all -dir crawl -depth 10 -threads 40

This crawls and indexes the sites correctly.  I can post my URL file 
(honda-all) if you want.

As many of you know, internet forums update rapidly.  I would like to recrawl 
these sites each night to pick up any new/updated content, but I honestly 
have no idea how.  I have a cron job set up for the recrawl.sh script, but I 
don't know whether it is adding new content or not.  It's all a bit confusing 
to me.

I'm also a bit confused as to how to add new sites to my list of URLs 
(urls/honda-all).  Do I add new sites by just adding them to the URL file, 
or do I have to inject the URLs into the webdb?

The more I read, the more confused I get, which I guess is my own fault for 
having so many little "projects" I keep taking on.

If anyone can help I'd appreciate it.

In summary,

1.  How do I recrawl the URLs in my flat URL file each night to pick up new 
content?  Each site may have hundreds of new pages each day.
2.  Can anyone give me a simple, layman's example of how to inject new URLs 
into the database to be crawled?

Thanks in advance.  Hopefully I won't be flamed for not searching the list 
already, or for not just using the Nutch wiki.  Both appear to be too 
cumbersome for me right now and I'm just hoping that a few people will help 
me out.

Matt 



_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general