Re: Intranet crawl and re-fetch - newbie question

Konstantin Ott Fri, 03 Jun 2005 08:12:14 -0700

Hi,

I can't get around this problem either. But maybe I didn't understand the documentation well. This is how I understood it:

- WebDBInjector injects URLs into the webdb

- FetchListTool gets all the URLs from the webdb and generates the segment for the Fetcher

- Fetcher crawls the URLs in that segment

- UpdateDatabaseTool gets the crawled URLs from the segments and saves them into the webdb

This is what I want for the first crawl, but when I inject another URL and want to crawl only that one, the FetchListTool generates segments for all the URLs in the webdb, including the old ones. So I started injecting into a second webdb and fetching only for that one. After that I updated the original webdb, deleted the second webdb and merged the segments. This works fine under Linux. But under Windows some files stay locked, so I can't delete the second webdb, and I can't recrawl because the WebDBWriter is waiting for the lock to be released, which never happens.
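
The cycle and the second-webdb workaround described above can be modelled with a small toy sketch. It is only an illustration of the behaviour as described in this message: the function names (`inject`, `generate`, `fetch`, `update`), the dict-based "webdb" and the intranet URLs are all hypothetical and are not Nutch's actual API or data format.

```python
# Toy model of the inject / generate / fetch / update cycle.
# Every name here is hypothetical; none of this is Nutch's real API.

def inject(webdb, urls):
    """Add URLs to a db (a plain dict), leaving known entries untouched."""
    for url in urls:
        webdb.setdefault(url, {"fetched": False})

def generate(webdb):
    """Build a fetch list ("segment") from every URL in the db,
    which is the behaviour described in the message above."""
    return sorted(webdb)

def fetch(segment):
    """Stand-in for the Fetcher: returns fake content for each URL."""
    return {url: "content of " + url for url in segment}

def update(webdb, fetched):
    """Record the fetched URLs in the db."""
    for url in fetched:
        webdb.setdefault(url, {})["fetched"] = True

# First crawl: everything in the db gets fetched, as wanted.
main_db = {}
inject(main_db, ["http://intranet/a", "http://intranet/b"])
update(main_db, fetch(generate(main_db)))

# The problem: injecting one new URL into the same db makes the
# next fetch list contain the old URLs too (shown on a copy).
probe_db = dict(main_db)
inject(probe_db, ["http://intranet/c"])
all_urls_segment = generate(probe_db)

# The workaround: inject into a second, temporary db, fetch only
# from it, then update the original db and discard the temporary one.
temp_db = {}
inject(temp_db, ["http://intranet/c"])
only_new_segment = generate(temp_db)
update(main_db, fetch(only_new_segment))
temp_db.clear()
```

In this model the final step, discarding the temporary db, is the one that fails under Windows in the real setup because of the locked files.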

So did I understand you correctly, and is it possible to:
fetch new pages (with a depth of 3 for example)
and
refetch/update existing ones
as independent tasks?

I can't read the tutorial that way. So could you please explain it a little more?

thx Konstantin

Piotr Kosiorowski wrote:

Hello,
I am not sure if I understood you correctly, but if you use the technique described as "whole web crawling" in the tutorial, you are not starting from scratch: you can fetch new pages and refetch and update existing ones. But I probably misunderstood your question, so please give us more details on what you want to achieve, e.g. do you plan to fetch from a limited number of sites?
Regards
Piotr
carmmello wrote:
I have been using Nutch for over a year now, and that is a question I have always asked without getting an answer. I have tried a lot of things and looked in the mailing lists, the tutorial, everywhere, but found no answer. So, for me, it seems that the only way to stay up to date is to start everything all over again. It seems (as far as I know) that Nutch was not designed to let you update an existing set of index, db and segments with only new or modified pages. If someone knows something about this issue, let us know, because this point seems to me the biggest obstacle to really using Nutch on a regular basis on a "production site".
Thanks


