Dalton, Jeffery wrote:
My point above is merely that the web is very dynamic and that it is
important to be able to update the database very frequently.  In other
words, I am arguing for a system capable of performing real-time (or
near real-time) updates with an acceptable performance level.  In the
above I am not arguing that you should crawl all of those urls as they
change, because there are scarce resources that need to be balanced.

That's fine. What I was pointing out is that if you only need to update your index daily, or even hourly, it is probably more efficient to perform batch updates to the crawl db. It's not until minute-by-minute updates of very large collections are required that random access to the crawl db would be faster.
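
For concreteness, a rough sketch of such a batch update is below (not Nutch code; the file names and the one-url-per-line record format are made up for illustration). The newly discovered urls are sorted and then merged with the existing, already-sorted crawl db in a single sequential pass, so no per-url seeks are needed and already-known urls are dropped during the merge:

    // Hypothetical batch update: stream the sorted crawl db and a sorted,
    // de-duplicated batch of new urls, writing a merged db sequentially.
    import java.io.*;

    public class BatchMerge {
      public static void main(String[] args) throws IOException {
        try (BufferedReader db = new BufferedReader(new FileReader("crawldb.txt"));
             BufferedReader batch = new BufferedReader(new FileReader("new-urls-sorted.txt"));
             PrintWriter out = new PrintWriter(new FileWriter("crawldb.new"))) {
          String a = db.readLine(), b = batch.readLine();
          while (a != null || b != null) {
            int cmp = (a == null) ? 1 : (b == null) ? -1 : a.compareTo(b);
            if (cmp < 0)      { out.println(a); a = db.readLine(); }      // only in old db
            else if (cmp > 0) { out.println(b); b = batch.readLine(); }   // genuinely new url
            else              { out.println(a); a = db.readLine(); b = batch.readLine(); } // already known
          }
        }
      }
    }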

I would like to advocate a middle-of-the-road architecture where updates
don't require one seek per document and also don't require re-streaming
the entire database.

It's not just update, but access. Adding new entries to the set of known urls is fairly easy--just add some new files. The expensive bit is detecting whether you've already seen referenced urls. If you're accessing the crawl db page-by-page, rather than in batches, then, in general, determining whether a url has been seen before requires a seek. Some cases can be optimized, and some things may be cached, but as the number of known urls grows, the number of seeks required becomes proportional to the number of urls checked.
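
To make the contrast concrete, here is a rough sketch of that page-at-a-time membership test (again, not actual Nutch code; the sorted fixed-width record file, the record size, and the cache bound are all assumptions). A cache absorbs repeated checks, but every miss has to seek into the on-disk url set, so the seek count grows with the number of distinct urls checked:

    // Hypothetical per-url check against a sorted, fixed-width url file.
    // Cache hits cost nothing; each miss costs O(log n) disk seeks.
    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.util.*;

    public class SeenUrlChecker implements Closeable {
      private static final int RECORD = 256;             // assumed fixed record width
      private static final int CACHE_SIZE = 100000;      // assumed cache bound
      private final RandomAccessFile db;                 // sorted, fixed-width url records
      private final Map<String, Boolean> cache =
          new LinkedHashMap<String, Boolean>(CACHE_SIZE, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
              return size() > CACHE_SIZE;                // simple LRU eviction
            }
          };

      public SeenUrlChecker(File file) throws IOException {
        this.db = new RandomAccessFile(file, "r");
      }

      public boolean seen(String url) throws IOException {
        Boolean cached = cache.get(url);
        if (cached != null) return cached;               // no disk access
        boolean found = binarySearch(url);               // seeks happen here
        cache.put(url, found);
        return found;
      }

      private boolean binarySearch(String url) throws IOException {
        long lo = 0, hi = db.length() / RECORD - 1;
        byte[] buf = new byte[RECORD];
        while (lo <= hi) {
          long mid = (lo + hi) >>> 1;
          db.seek(mid * RECORD);                         // the expensive part
          db.readFully(buf);
          int cmp = new String(buf, StandardCharsets.UTF_8).trim().compareTo(url);
          if (cmp == 0) return true;
          if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
      }

      public void close() throws IOException { db.close(); }
    }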

Another approach might be to blindly re-fetch and index high-priority urls and perform duplicate elimination at search time.
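
A minimal sketch of that search-time approach (the Hit class and its fields are hypothetical, not a real Nutch API): the index may hold several fetches of the same url, so the ranked results are collapsed by url at query time, keeping only the most recent fetch of each:

    // Hypothetical search-time duplicate elimination: collapse ranked hits
    // by url, keeping the newest fetch and preserving the original order.
    import java.util.*;

    public class SearchTimeDedup {
      static class Hit {
        final String url; final long fetchTime; final float score;
        Hit(String url, long fetchTime, float score) {
          this.url = url; this.fetchTime = fetchTime; this.score = score;
        }
      }

      static List<Hit> dedup(List<Hit> ranked) {
        Map<String, Hit> newest = new HashMap<>();
        for (Hit h : ranked) {                           // pick the newest version of each url
          Hit prev = newest.get(h.url);
          if (prev == null || h.fetchTime > prev.fetchTime) newest.put(h.url, h);
        }
        List<Hit> out = new ArrayList<>();
        for (Hit h : ranked) {                           // keep rank order, drop older duplicates
          if (newest.get(h.url) == h) out.add(h);
        }
        return out;
      }
    }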

Doug
