Stefan,

I think I know what you're saying.  When you are new to Nutch and you
read the tutorial, it kind of leads you to believe (incorrectly) that
whole-web crawling is different from intranet crawling and that the
steps are somehow different and independent of one another.  In fact,
the crawl command is just a consolidated way of doing each of the steps
involved in whole-web crawling.

I think what I didn't understand is that you never have to use the
crawl command at all, even if you are limiting your crawling to a
fixed list of URLs.

Instead you can:

-create your list of URLs (put them in a urls.txt file)
-create the URL filter, to make sure the fetcher stays within the
bounds of the URLs you want to crawl (see the example filter after
this list)

-inject the URLs into the crawl database:
bin/nutch inject crawl/crawldb urls.txt

-generate a fetchlist (this creates a new segment):
bin/nutch generate crawl/crawldb crawl/segments

-fetch the segment:
bin/nutch fetch crawl/segments/<segmentname>

-update the db:
bin/nutch updatedb crawl/crawldb crawl/segments/<segmentname>

-index the segment:
bin/nutch index crawl/indexdb crawl/segments/<segmentname>
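
For example, here is roughly what that URL filter could look like (it
goes in conf/crawl-urlfilter.txt for the crawl command, or
conf/regex-urlfilter.txt for the step-by-step tools; the kreuztal.de
domain from Peter's mail is just a placeholder):

# skip image and other binary suffixes
-\.(gif|jpg|png|css|zip|pdf)$
# accept anything within the one site we want to crawl
+^http://([a-z0-9]*\.)*kreuztal.de/
# reject everything else
-.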

Then you can repeat the steps from generate through index, which would
generate, fetch, update (the crawldb with the fetched segment), and
index a new segment.
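
Putting the whole thing together, something like this should work as
the repeat loop (an untested sketch; it assumes the paths above, and
that the newest directory under crawl/segments is the one generate
just created):

#!/bin/sh
# one round = generate, fetch, updatedb, index
for round in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments
  # segment names are timestamps, so the last one listed is the newest
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment
  bin/nutch index crawl/indexdb $segment
done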

When you do the generate step, what does the -topN parameter base the
fetchlist on?  I think the answer is the top-scoring pages already in
the crawldb, but I am not 100% positive.
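
If that's right, something like this would cap the next fetchlist at
the 1000 best-scoring unfetched pages (the 1000 is arbitrary):

bin/nutch generate crawl/crawldb crawl/segments -topN 1000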

Rich

-----Original Message-----
From: Stefan Groschupf [mailto:[EMAIL PROTECTED] 
Sent: Saturday, March 04, 2006 3:27 PM
To: nutch-user@lucene.apache.org
Subject: Re: how can i go deep?


The crawl command creates a crawlDB for each call. So, as Richard  
mentioned, try a higher depth.
If you would like Nutch to go deeper with each iteration, try the  
whole-web tutorial, but change the URL filter in a manner that it only  
crawls your website.
That will go as deep as the number of iterations you run.


Stefan

Am Mar 4, 2006 um 9:09 PM schrieb Richard Braman:

> Try using depth=n when you do the crawl.  Post-crawl I don't know,
> but I
> have the same question.  How to make the index go deeper when  
> you do
> your next round of fetching is still something I haven't figured out.
>
> -----Original Message-----
> From: Peter Swoboda [mailto:[EMAIL PROTECTED]
> Sent: Friday, March 03, 2006 4:28 AM
> To: nutch-user@lucene.apache.org
> Subject: how can i go deep?
>
>
> Hi.
> I've done a whole-web crawl like it is shown in the tutorial. There is
> just "http://www.kreuztal.de/" in the urls.txt. I did the fetching
> three
> times. But unfortunately the crawl hasn't gone deep. While  
> searching, I
> can only find keywords from the first (home) page. For example, I  
> couldn't
> find anything on "http://www.kreuztal.de/impressum.php".
> How can I configure the depth?
> Thanx for helping.
>
> greetings
> Peter
>

---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com
