A few notes:

* You need to change 'nutch_home' to point to your Nutch installation
* Running with no arguments will print the usage
* Only supports the local fs
To initiate a crawl (where myurls is a file with a list of URLs, one per line):

  ./crawl.sh -initdb myurls /tmp/index

To recrawl using the same web db (for updated pages, or just to go deeper):

  ./crawl.sh /tmp/index

Steven

Richard Braman wrote:
Steven,

Could you share those shell scripts?

-----Original Message-----
From: Steven Yelton [mailto:[EMAIL PROTECTED]]
Sent: Sunday, March 05, 2006 10:22 AM
To: nutch-user@lucene.apache.org
Subject: Re: how can i go deep?

Yes! I have abandoned the 'crawl' command even for my single-site searches. I wrote shell scripts that accomplish (generally) the same tasks the crawl does.

The only piece I had to watch out for: one of the first things the 'crawl' class does is load 'crawl-tool.xml'. So to get exactly the same behavior, I cut and pasted the contents of 'crawl-tool.xml' into my 'nutch-site.xml' (these configuration parameters do things like include crawl-urlfilter.txt, pay attention to internal links, try not to kill your host, and so on...)

Steven

Richard Braman wrote:

Stefan,

I think I know what you're saying. When you are new to Nutch and you read the tutorial, it kind of leads you to believe (incorrectly) that whole-web crawling is different from intranet crawling and that the steps are somehow different and independent of one another.
In fact it looks like using the crawl command is some kind of consolidated way of doing each of the steps involved in whole-web crawling. I think what I didn't understand is that you don't ever have to use the crawl command, even if you are limiting your crawling to a limited list of URLs. Instead you can:

- create your list of URLs (put them in a urls.txt file)
- create the URL filter, to make sure the fetcher stays within the bounds of the URLs you want to crawl
- inject the URLs into the crawl database:
    bin/nutch inject crawl/crawldb urls.txt
- generate a fetchlist, which creates a new segment:
    bin/nutch generate crawl/crawldb crawl/segments
- fetch the segment:
    bin/nutch fetch <segmentname>
- update the db:
    bin/nutch updatedb crawl/crawldb crawl/segments/<segmentname>
- index the segment:
    bin/nutch index crawl/indexdb crawl/segments/<segmentname>

Then you could repeat the steps from generate through index, which would generate, fetch, update (the db of fetched segments), and index a new segment.

When you do the generate, the -topN parameter generates a fetchlist based on what? I think the answer is the top-scoring pages already in the crawldb, but I am not 100% positive.

Rich

-----Original Message-----
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]]
Sent: Saturday, March 04, 2006 3:27 PM
To: nutch-user@lucene.apache.org
Subject: Re: how can i go deep?

The crawl command creates a crawlDB for each call, so as Richard mentioned, try a higher depth. In case you would like Nutch to go deeper with each iteration, try the whole-web tutorial but change the URL filter in a manner that it only crawls your webpage. This will go as deep as the number of iterations you run.

Stefan

On Mar 4, 2006, at 9:09 PM, Richard Braman wrote:

Try using depth=n when you do the crawl. Post-crawl I don't know, but I have the same question. How you make the index go deeper when you do your next round of fetching is still something I haven't figured out.
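The inject/generate/fetch/updatedb/index sequence above can be strung together as one script. This is only a sketch, not the crawl.sh shared in this thread: the DRY_RUN guard and the newest-segment lookup are additions here, and the paths are the ones from the mail.

```shell
#!/bin/sh
# Sketch of one whole-web crawl iteration, following the steps in the mail.
# With DRY_RUN=1 (the default here) the commands are only printed, so the
# script can be inspected without a Nutch installation.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

iteration() {
  run bin/nutch generate crawl/crawldb crawl/segments
  # The fetch/update/index steps need the segment 'generate' just created.
  # Picking the newest directory under crawl/segments is an assumption about
  # the layout (segments are named by timestamp, so the last one sorts last).
  segment="crawl/segments/$(ls crawl/segments 2>/dev/null | sort | tail -n 1)"
  run bin/nutch fetch "$segment"
  run bin/nutch updatedb crawl/crawldb "$segment"
  run bin/nutch index crawl/indexdb "$segment"
}

# First run only: inject the seed URLs into a fresh crawldb.
run bin/nutch inject crawl/crawldb urls.txt

# Each additional call of 'iteration' goes one level deeper, as Stefan describes.
iteration
```

Set DRY_RUN=0 (and run from the Nutch directory) to actually execute the commands.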
-----Original Message-----
From: Peter Swoboda [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 03, 2006 4:28 AM
To: nutch-user@lucene.apache.org
Subject: how can i go deep?

Hi. I've done a whole-web crawl like it is shown in the tutorial. There is just "http://www.kreuztal.de/" in the urls.txt, and I did the fetching three times. But unfortunately the crawl hasn't gone deep: while searching, I can only find keywords from the first (home) page. For example, I couldn't find anything on "http://www.kreuztal.de/impressum.php". How can I configure the depth? Thanks for helping.

Greetings,
Peter

---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com
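For the original question, Richard's "depth=n" suggestion maps to the crawl command's -depth argument (and -topN limits how many pages are fetched per round). A sketch with example values — the seed file, depth, and topN below are illustrations, not the poster's actual settings, and the command is only echoed, not executed:

```shell
#!/bin/sh
# One-shot crawl with an explicit depth, per Richard's suggestion above.
# Flag names follow the Nutch tutorial of this era; adjust for your version.
URLS=urls.txt
CRAWL_DIR=crawl
DEPTH=3      # number of generate/fetch/update rounds (how "deep" the crawl goes)
TOPN=50      # fetch at most this many top-scoring pages per round

CMD="bin/nutch crawl $URLS -dir $CRAWL_DIR -depth $DEPTH -topN $TOPN"
echo "$CMD"  # echoed rather than run, since this is only a sketch
```

With -depth 3, pages linked from the start page (like /impressum.php) should be reached on the second round.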
[Attachment: crawl.sh, application/shellscript]
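The attached crawl.sh itself is not reproduced in the archive. The following is a hypothetical reconstruction from the notes and usage at the top of the thread; the argument handling matches what was described, but the nutch invocations simply mirror the step list earlier in the thread and may well differ from the real script.

```shell
#!/bin/sh
# Hypothetical reconstruction of crawl.sh; the real attached script may differ.
# Only the local fs is assumed, per the notes at the top of the thread.
nutch_home=/usr/local/nutch   # change this to point to your Nutch installation

usage() {
  echo "Usage: $0 [-initdb urlfile] dir"
  echo "  -initdb urlfile  start a new crawl from a list of URLs (one per line)"
  echo "  dir              directory holding the crawl db, segments, and index"
}

main() {
  # Running with no arguments prints the usage, as the notes say.
  if [ $# -eq 0 ]; then
    usage
    return 0
  fi
  cd "$nutch_home" || return 1
  if [ "$1" = "-initdb" ]; then
    bin/nutch inject "$3/crawldb" "$2"   # seed a fresh db, then fall through
    shift 2
  fi
  dir="$1"
  bin/nutch generate "$dir/crawldb" "$dir/segments"
  # Assume the newest (last-sorting) segment is the one generate just created.
  segment="$dir/segments/$(ls "$dir/segments" | sort | tail -n 1)"
  bin/nutch fetch "$segment"
  bin/nutch updatedb "$dir/crawldb" "$segment"
  bin/nutch index "$dir/indexdb" "$segment"
}

main "$@"
```

Recrawling with the same db is then just `./crawl.sh /tmp/index` again: each call generates, fetches, updates, and indexes one more segment, going one level deeper.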