Thanks for clearing that up, Piotr. The reason I asked is that for my project most of the pages I'm fetching are only useful for a short period of time, and after that I never want to refetch them. So over time I'm going to end up with a big webdb in which a large percentage of the pages are obsolete.
However it looks like the webdb architecture is quite efficient so this might not ever be a problem. I did notice that you can manually delete pages with the WebDBWriter class - perhaps a tool based on this that would let you prune the webdb would be useful in the future.

best regards,
raymond

--- Piotr Kosiorowski <[EMAIL PROTECTED]> wrote:

> Hello,
> Pages from WebDB are not deleted automatically. Nutch does not check
> if page has inlinks during fetchlist generation - so "orphaned" page
> would be refetched. It will stop to refetch the page if page becomes
> unavailable for some number of fetch attempts.
> Regards
> Piotr
>
> On 8/10/05, Raymond Creel <[EMAIL PROTECTED]> wrote:
> > I have a question about the webdb and fetching. When a page that
> > used to have incoming links is found to be "orphaned" (i.e. there
> > are no longer any pages that have links to it), is it deleted from
> > the webdb? Or is it left in the webdb but set not to be refetched?
> > Or will it continue to be refetched anyway (this doesn't seem right
> > to me)?
> >
> > Conversely, what will happen when a link to it reappears later?
> >
> > One more thing - are pages injected with the webdb injector treated
> > any differently (as I see them as being sort of the "root nodes" of
> > the webdb - they should never be deleted)?
> >
> > Thanks much for any clarity on this!
> >
> > raymond
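
The pruning tool suggested above could look roughly like the sketch below. It is only an illustration under stated assumptions, not Nutch code: PageRecord, select_obsolete, prune and the delete_page callback are all hypothetical names, and delete_page stands in for whatever deletion call the WebDBWriter class actually offers. "Obsolete" is assumed here to mean "last fetched longer ago than some cutoff".

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class PageRecord:
    """Hypothetical stand-in for a webdb page entry."""
    url: str
    fetched_at: float  # time of the last fetch, in epoch seconds


def select_obsolete(pages: Iterable[PageRecord], now: float,
                    max_age_seconds: float) -> List[str]:
    """Return the URLs whose last fetch is older than max_age_seconds."""
    return [p.url for p in pages if now - p.fetched_at > max_age_seconds]


def prune(pages: Iterable[PageRecord], now: float, max_age_seconds: float,
          delete_page: Callable[[str], None]) -> int:
    """Call delete_page(url) for every obsolete page; return how many were deleted.

    delete_page is an assumption: a callback wrapping the real deletion
    mechanism (for example, a thin wrapper around a webdb writer).
    """
    obsolete = select_obsolete(pages, now, max_age_seconds)
    for url in obsolete:
        delete_page(url)
    return len(obsolete)
```

Passing the deletion step in as a callback keeps the selection rule separate from the storage layer, so the cutoff logic can be tried out against a plain list before it is pointed at a real webdb.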
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
