Steven, could you share those shell scripts?

-----Original Message-----
From: Steven Yelton [mailto:[EMAIL PROTECTED] 
Sent: Sunday, March 05, 2006 10:22 AM
To: [email protected]
Subject: Re: how can i go deep?


Yes!  I have abandoned the 'crawl' command even for my single-site
searches.  I wrote shell scripts that accomplish (generally) the same
tasks the crawl does.

The only piece I had to watch out for is this: one of the first things the
'crawl' class does is load 'crawl-tool.xml'.  So to get the exact same
behavior, I cut and pasted the contents of 'crawl-tool.xml' into my
'nutch-site.xml'.  (These configuration parameters do things like
include the crawl-urlfilter.txt, pay attention to internal links, try
not to kill your host, and so on...)
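
For what it's worth, here is a rough sketch of that kind of script -- not my
actual scripts, just the same idea.  It mirrors the commands from Richard's
message below (crawl/crawldb, crawl/segments, and so on); the exact arguments
can differ between Nutch versions, and grabbing the newest segment with
ls/tail is just one way to do it.

  #!/bin/sh
  # Sketch of running the crawl steps by hand instead of 'bin/nutch crawl'.
  # Assumes a urls.txt seed file and the crawl/ directory layout used below.

  DEPTH=3

  bin/nutch inject crawl/crawldb urls.txt

  i=1
  while [ $i -le $DEPTH ]; do
      bin/nutch generate crawl/crawldb crawl/segments
      # generate creates a new timestamped segment dir; pick the newest one
      segment=`ls -d crawl/segments/* | tail -1`
      bin/nutch fetch $segment
      bin/nutch updatedb crawl/crawldb $segment
      i=`expr $i + 1`
  done

  # finally, index what was fetched (arguments vary by Nutch version)
  bin/nutch index crawl/indexdb crawl/segments/*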

Steven

Richard Braman wrote:

>Stefan,
>
>I think I know what you're saying.  When you are new to Nutch and you
>read the tutorial, it kind of leads you to believe (incorrectly) that
>whole-web crawling is different from intranet crawling and that the
>steps are somehow different and independent of one another.  In fact, it
>looks like the crawl command is just some kind of consolidated way of
>doing each of the steps involved in whole-web crawling.
>
>What I didn't understand is that you don't ever have to use the crawl
>command, even if you are limiting your crawling to a limited list of
>URLs.
>
>Instead you can:
>
>-create your list of urls (put them in a urls.txt file)
>
>-create the url filter, to make sure the fetcher stays within the bounds
>of the urls you want to crawl
>
>-inject the urls into the crawl database
>bin/nutch inject crawl/crawldb urls.txt
>
>-generate a fetchlist, which creates a new segment
>bin/nutch generate crawl/crawldb crawl/segments
>
>-fetch the segment
>bin/nutch fetch <segmentname>
>
>-update the db
>bin/nutch updatedb crawl/crawldb crawl/segments/<segmentname>
>
>-index the segment
>bin/nutch index crawl/indexdb crawl/segments/<segmentname>
>
>Then you could repeat the steps from generate to index again, which would
>generate, fetch, update (the db of fetched segments) and index a new
>segment.
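>
>One thing the <segmentname> placeholder glosses over: generate creates a
>new timestamped directory under crawl/segments each time, so a script
>needs some way to pick up the newest one.  One possible (assumed, not the
>only) way:
>
>segment=`ls -d crawl/segments/* | tail -1`
>bin/nutch fetch $segment
>bin/nutch updatedb crawl/crawldb $segment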
>
>When you run generate, the -topN parameter generates a fetchlist based on
>what?  I think the answer is the top-scoring pages already in the crawldb,
>but I am not 100% positive.
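>
>For reference, the flag goes on the end of the generate call, e.g.
>(1000 is just an arbitrary example value):
>
>bin/nutch generate crawl/crawldb crawl/segments -topN 1000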
>
>Rich
>
>
>
>
>
>-----Original Message-----
>From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
>Sent: Saturday, March 04, 2006 3:27 PM
>To: [email protected]
>Subject: Re: how can i go deep?
>
>
>The crawl command creates a new crawlDB for each call.  So, as Richard
>mentioned, try a higher depth.
>If you would like Nutch to go deeper with each iteration, follow the
>whole-web tutorial but change the url filter so that it only
>crawls your website.
>This will go as deep as the number of iterations you run.
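>
>For Peter's site that would mean something like this in
>conf/crawl-urlfilter.txt (treat it as a sketch; the exact file and its
>default rules depend on your Nutch setup):
>
># accept pages on kreuztal.de
>+^http://([a-z0-9]*\.)*kreuztal\.de/
># skip everything else
>-.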
>
>
>Stefan
>
>On Mar 4, 2006, at 9:09 PM, Richard Braman wrote:
>
>  
>
>>Try using depth=n when you do the crawl.  Post-crawl I don't know,
>>but I have the same question.  How you make the index go deeper when
>>you do your next round of fetching is still something I haven't
>>figured out.
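>>
>>For the crawl command itself it looks like this (5 is just an example
>>value, and depending on your Nutch version the first argument is the
>>url file or a directory containing it):
>>
>>bin/nutch crawl urls.txt -dir crawl -depth 5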
>>
>>-----Original Message-----
>>From: Peter Swoboda [mailto:[EMAIL PROTECTED]
>>Sent: Friday, March 03, 2006 4:28 AM
>>To: [email protected]
>>Subject: how can i go deep?
>>
>>
>>Hi.
>>I've done a whole-web crawl as shown in the tutorial. There is
>>just "http://www.kreuztal.de/" in the urls.txt, and I did the fetching
>>three times. But unfortunately the crawl hasn't gone deep. While
>>searching, I can only find keywords from the first (home) page; for
>>example, I couldn't find anything on
>>"http://www.kreuztal.de/impressum.php".
>>How can I configure the depth?
>>Thanks for helping.
>>
>>greetings
>>Peter
>>
>>
>>
>>    
>>
>
>---------------------------------------------
>blog: http://www.find23.org
>company: http://www.media-style.com
>
>  
>
