Thanks for clearing that up, Piotr.  The reason I asked
is that for my project most of the pages I'm fetching
are only useful for a short period of time, and after
that I never want to refetch them.  So over time I'm
going to end up with a big webdb in which a large
percentage of the pages are obsolete.

However, it looks like the webdb architecture is quite
efficient, so this might never be a problem.  I did
notice that you can manually delete pages with the
WebDBWriter class.  Perhaps a tool based on this that
would let you prune the webdb would be useful in the
future.
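
To make the pruning idea a bit more concrete, here is a rough
sketch of what I have in mind.  It does not touch the real webdb
or WebDBWriter at all: a plain in-memory map stands in for the db,
and the class name, method name, and "useful until" timestamps are
all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: an in-memory map stands in for the webdb.
// None of these names come from Nutch.
class WebDbPruneSketch {

    /**
     * Removes every entry whose "useful until" timestamp lies before
     * nowMillis and returns the removed URLs in iteration order.
     */
    static List<String> prune(Map<String, Long> usefulUntilMillis, long nowMillis) {
        List<String> removed = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = usefulUntilMillis.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (entry.getValue() < nowMillis) {
                removed.add(entry.getKey());
                // A real tool would delete the page from the webdb at this point.
                it.remove();
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        Map<String, Long> pages = new LinkedHashMap<>();
        pages.put("http://example.com/old", 1_000L);
        pages.put("http://example.com/current", 9_000L);

        List<String> removed = prune(pages, 5_000L);
        System.out.println("pruned: " + removed);
        System.out.println("kept: " + pages.keySet());
    }
}
```

A real version would need some way to decide which pages are
obsolete; the expiry timestamp here is just a stand-in for that.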

best regards,
raymond

--- Piotr Kosiorowski <[EMAIL PROTECTED]> wrote:

> Hello,
> Pages from WebDB are not deleted automatically. Nutch does not
> check whether a page has inlinks during fetchlist generation, so
> an "orphaned" page would be refetched. It will stop refetching
> the page if the page is unavailable for some number of fetch
> attempts.
> Regards
> Piotr
> 
> On 8/10/05, Raymond Creel <[EMAIL PROTECTED]> wrote:
> > I have a question about the webdb and fetching.  When
> > a page that used to have incoming links is found to be
> > "orphaned" (i.e. there are no longer any pages that
> > have links to it), is it deleted from the webdb?  Or
> > is it left in the webdb but set not to be refetched?
> > Or will it continue to be refetched anyway (this
> > doesn't seem right to me)?
> > 
> > Conversely, what will happen when a link to it
> > reappears later?
> > 
> > One more thing - are pages injected with the webdb
> > injector treated any differently (as I see them as
> > being sort of the "root nodes" of the webdb - they
> > should never be deleted)?
> > 
> > Thanks much for any clarity on this!
> > 
> > raymond



                
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
