Hello all. I'd like to apologize in advance for asking redundant questions. I've been searching for information for a while now and it's all Greek to me. Normally I'm great with computers/programming/etc., but this Nutch thing just has me beat.
So, here's what I'm doing: I'm setting up a vertical search engine that will search a handful (no more than 10) of web forums relating to Honda cars. I already have a number of the forums picked out and the appropriate URLs in a flat url file. I have run the initial crawl command, which in my case is:

    bin/nutch crawl urls/honda-all -dir crawl -depth 10 -threads 40

This crawls and indexes the sites correctly. I can post my url file (honda-all) if you want.

As many of you know, internet forums update rapidly. I would like to recrawl these sites each night to pick up any new or updated content, but I honestly have no idea how. I have a cron job set up for the recrawl.sh script, but I don't know whether it is actually picking up new content. It's all a bit confusing to me.

I'm also confused about how to add new sites to my list of urls (urls/honda-all). Do I add new sites by just appending them to the url file, or do I have to inject the urls into the webdb? The more I read, the more confused I get, which I guess is my own fault for having so many little "projects" I keep taking on. If anyone can help, I'd appreciate it.

In summary:

1. How do I recrawl the urls in my flat url file each night to pick up new content? Each site may have hundreds of new pages each day.

2. Can anyone give me a simple, layman's example of how to inject new URLs into the database to be crawled?

Thanks in advance. Hopefully I won't be flamed for not searching the list archives first, or for not just using the Nutch wiki; both seem too cumbersome for me right now, and I'm hoping a few people will help me out.

Matt

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
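P.S. In case it helps anyone correct me: from what I've pieced together so far, a nightly recrawl seems to involve injecting the seed file into the crawldb and then looping generate/fetch/updatedb, followed by re-indexing. Below is my rough guess at what recrawl.sh should look like. I'm assuming Nutch 0.8-style sub-commands here, and the segment handling and paths are guesses on my part, so please tell me where this is wrong:

```shell
#!/bin/bash
# recrawl.sh -- my GUESS at a nightly recrawl for a 0.8-style Nutch install.
# Paths (crawl/crawldb, crawl/segments, ...) match the -dir crawl layout
# from my initial crawl command; the sub-command usage is my assumption.
# Set DRYRUN=echo to print the commands instead of running them.

NUTCH=${NUTCH:-bin/nutch}
RUN=${DRYRUN:-}
CRAWL=crawl
DEPTH=${DEPTH:-10}
THREADS=40

# 1. Inject any new seed URLs from the flat file into the crawldb.
#    (This is also, as far as I can tell, how you add brand-new sites:
#    append them to urls/honda-all, then inject.)
$RUN $NUTCH inject $CRAWL/crawldb urls/honda-all

# 2. One generate/fetch/updatedb round per depth level.
for i in $(seq 1 $DEPTH); do
  $RUN $NUTCH generate $CRAWL/crawldb $CRAWL/segments
  # Newest segment directory should be the one generate just created.
  segment=$(ls -d $CRAWL/segments/* 2>/dev/null | tail -1)
  $RUN $NUTCH fetch $segment -threads $THREADS
  $RUN $NUTCH updatedb $CRAWL/crawldb $segment
done

# 3. Rebuild the link database and the index over all segments.
$RUN $NUTCH invertlinks $CRAWL/linkdb -dir $CRAWL/segments
$RUN $NUTCH index $CRAWL/indexes $CRAWL/crawldb $CRAWL/linkdb $CRAWL/segments/*
```

If that's roughly right, I'd then just point my cron entry at this script each night. Again, this is only what I've cobbled together from reading around, not something I know works.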
