Steven,

Could you share those shell scripts?

-----Original Message-----
From: Steven Yelton [mailto:[EMAIL PROTECTED]
Sent: Sunday, March 05, 2006 10:22 AM
To: [email protected]
Subject: Re: how can i go deep?
Yes! I have abandoned the 'crawl' command even for my single-site searches. I
wrote shell scripts that accomplish (generally) the same tasks the crawl does.
The only piece I had to watch out for is: one of the first things the 'crawl'
class does is load 'crawl-tool.xml'. So to get the exact same behavior I cut
and pasted the contents of 'crawl-tool.xml' into my 'nutch-site.xml' (these
configuration parameters do things like include the crawl-urlfilter.txt, pay
attention to internal links, try not to kill your host, and so on...).

Steven
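For reference, the kind of overrides Steven is describing (crawl-tool.xml
settings carried over into nutch-site.xml) look roughly like this. The
property names come from the stock Nutch configuration, but the values here
are only illustrative; check your own conf/crawl-tool.xml for the real ones.

  <!-- pasted inside the existing root element of conf/nutch-site.xml -->

  <!-- use the crawl-urlfilter.txt rules (this is what keeps the fetcher on your site) -->
  <property>
    <name>urlfilter.regex.file</name>
    <value>crawl-urlfilter.txt</value>
  </property>

  <!-- follow links within the same site instead of ignoring them -->
  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
  </property>

  <!-- pause between requests so the one host you are crawling isn't hammered -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>

With overrides like these in nutch-site.xml, the plain whole-web-style
commands behave much the same way the crawl command does.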
Richard Braman wrote:

>Stefan,
>
>I think I know what you're saying. When you are new to Nutch and you
>read the tutorial, it kind of leads you to believe (incorrectly) that
>whole-web crawling is different from intranet crawling and that the
>steps are somehow different and independent of one another. In fact it
>looks like using the crawl command is some kind of consolidated way of
>doing each of the steps involved in whole-web crawling.
>
>I think what I didn't understand is that you don't ever have to use
>the crawl command, even if you are limiting your crawling to a limited
>list of URLs.
>
>Instead you can:
>
>- create your list of URLs (put them in a urls.txt file)
>- create the URL filter, to make sure the fetcher stays within the
>  bounds of the URLs you want to crawl
>
>- inject the URLs into the crawl database:
>bin/nutch inject crawl/crawldb urls.txt
>
>- generate a fetchlist, which creates a new segment:
>bin/nutch generate crawl/crawldb crawl/segments
>
>- fetch the segment:
>bin/nutch fetch <segmentname>
>
>- update the db:
>bin/nutch updatedb crawl/crawldb crawl/segments/<segmentname>
>
>- index the segment:
>bin/nutch index crawl/indexdb crawl/segments/<segmentname>
>
>Then you could repeat the steps from generate to index again, which
>would generate, fetch, update (the db of fetched segments), and index
>a new segment.
>
>When you do the generate, what does the -topN parameter base the
>fetchlist on? I think the answer is the top-scoring pages already in
>the crawldb, but I am not 100% positive.
>
>Rich
>
>-----Original Message-----
>From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
>Sent: Saturday, March 04, 2006 3:27 PM
>To: [email protected]
>Subject: Re: how can i go deep?
>
>
>The crawl command creates a crawlDB for each call, so, as Richard
>mentioned, try a higher depth.
>If you would like Nutch to go deeper with each iteration, follow the
>whole-web tutorial but change the URL filter so that it only crawls
>your website. This will go as deep as the number of iterations you run.
>
>Stefan
>
>On Mar 4, 2006, at 9:09 PM, Richard Braman wrote:
>
>>Try using depth=n when you do the crawl. Post-crawl I don't know, but
>>I have the same question; how to make the index go deeper on your
>>next round of fetching is still something I haven't figured out.
>>
>>-----Original Message-----
>>From: Peter Swoboda [mailto:[EMAIL PROTECTED]
>>Sent: Friday, March 03, 2006 4:28 AM
>>To: [email protected]
>>Subject: how can i go deep?
>>
>>
>>Hi.
>>I've done a whole-web crawl like it is shown in the tutorial. There
>>is just "http://www.kreuztal.de/" in the urls.txt, and I did the
>>fetching three times. But unfortunately the crawl hasn't gone deep.
>>While searching, I can only find keywords from the first (home) page;
>>for example, I couldn't find anything on
>>"http://www.kreuztal.de/impressum.php".
>>How can I configure the depth?
>>Thanks for helping.
>>
>>Greetings,
>>Peter
>
>---------------------------------------------
>blog: http://www.find23.org
>company: http://www.media-style.com
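Putting the pieces of this thread together, a minimal sketch of the kind of
script Steven describes might look like the following. It is not his actual
script: the crawl directory layout, the DEPTH/TOPN values, and the way the
newest segment is picked up are placeholders, the command syntax is the
0.8-style syntax quoted above (other versions differ), and depending on the
version the indexing step may also need a linkdb/invertlinks pass, as in the
whole-web tutorial.

  #!/bin/sh
  # Sketch only -- assumes urls.txt, crawl-urlfilter.txt, and the
  # nutch-site.xml overrides discussed above are already in place.

  DEPTH=3      # number of generate/fetch/updatedb rounds ("how deep to go")
  TOPN=1000    # -topN: only the N top-scoring unfetched URLs go into each fetchlist

  # put the seed URLs into the crawl database
  bin/nutch inject crawl/crawldb urls.txt

  i=1
  while [ $i -le $DEPTH ]; do
    # generate a fetchlist (creates a new segment under crawl/segments)
    bin/nutch generate crawl/crawldb crawl/segments -topN $TOPN

    # the segment just created is the newest (timestamp-named) directory
    segment=`ls -d crawl/segments/* | tail -1`

    # fetch it, then fold the newly discovered links back into the crawldb
    # so the next round can go one level deeper
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment

    i=`expr $i + 1`
  done

  # index the fetched segments (your Nutch version may also want the linkdb here)
  bin/nutch index crawl/indexdb crawl/segments/*

Each pass through the loop only fetches pages discovered (and scored) in
earlier rounds, which is why running more rounds is what makes the crawl go
deeper, and why -topN caps each round at the highest-scoring unfetched URLs
in the crawldb.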
