Dalton, Jeffery wrote:
My point above is merely that the web is very dynamic and that it is
important to be able to update the database very frequently.  In other
words, I am arguing for a system capable of performing real-time (or
near real-time) updates with an acceptable performance level.  In the
above I am not arguing that you should crawl all of those urls as they
change, because there are scarce resources that need to be balanced.

That's fine. What I was pointing out is that if you only need to update your index daily, or even hourly, it is probably more efficient to perform batch updates to the crawl db. It's not until minute-by-minute updates of very large collections are required that random access to the crawl db would be faster.
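
For concreteness, a rough sketch of such a batch update is below (not Nutch code; the file names and the one-url-per-line record format are made up for illustration). The newly discovered urls are sorted and then merged with the existing, already-sorted crawl db in a single sequential pass, so no per-url seeks are needed and already-known urls are dropped during the merge:

    // Hypothetical batch update: stream the sorted crawl db and a sorted,
    // de-duplicated batch of new urls, writing a merged db sequentially.
    import java.io.*;

    public class BatchMerge {
      public static void main(String[] args) throws IOException {
        try (BufferedReader db = new BufferedReader(new FileReader("crawldb.txt"));
             BufferedReader batch = new BufferedReader(new FileReader("new-urls-sorted.txt"));
             PrintWriter out = new PrintWriter(new FileWriter("crawldb.new"))) {
          String a = db.readLine(), b = batch.readLine();
          while (a != null || b != null) {
            int cmp = (a == null) ? 1 : (b == null) ? -1 : a.compareTo(b);
            if (cmp < 0)      { out.println(a); a = db.readLine(); }      // only in old db
            else if (cmp > 0) { out.println(b); b = batch.readLine(); }   // genuinely new url
            else              { out.println(a); a = db.readLine(); b = batch.readLine(); } // already known
          }
        }
      }
    }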

I would like to advocate a middle-of-the-road architecture where updates
don't require one seek per document and also don't require re-streaming
the entire database.

It's not just update, but access. Adding new entries to the set of known urls is fairly easy--just add some new files. The expensive bit is detecting whether you've already seen referenced urls. If you're accessing the crawl db page-by-page, rather than in batches, then, in general, determining whether a url has been seen before requires a seek. Some cases can be optimized, and some things may be cached, but as the number of known urls grows, the number of seeks required becomes proportional to the number of urls checked.
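
To make the contrast concrete, here is a rough sketch of that page-at-a-time membership test (again, not actual Nutch code; the sorted fixed-width record file, the record size, and the cache bound are all assumptions). A cache absorbs repeated checks, but every miss has to seek into the on-disk url set, so the seek count grows with the number of distinct urls checked:

    // Hypothetical per-url check against a sorted, fixed-width url file.
    // Cache hits cost nothing; each miss costs O(log n) disk seeks.
    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.util.*;

    public class SeenUrlChecker implements Closeable {
      private static final int RECORD = 256;             // assumed fixed record width
      private static final int CACHE_SIZE = 100000;      // assumed cache bound
      private final RandomAccessFile db;                 // sorted, fixed-width url records
      private final Map<String, Boolean> cache =
          new LinkedHashMap<String, Boolean>(CACHE_SIZE, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
              return size() > CACHE_SIZE;                // simple LRU eviction
            }
          };

      public SeenUrlChecker(File file) throws IOException {
        this.db = new RandomAccessFile(file, "r");
      }

      public boolean seen(String url) throws IOException {
        Boolean cached = cache.get(url);
        if (cached != null) return cached;               // no disk access
        boolean found = binarySearch(url);               // seeks happen here
        cache.put(url, found);
        return found;
      }

      private boolean binarySearch(String url) throws IOException {
        long lo = 0, hi = db.length() / RECORD - 1;
        byte[] buf = new byte[RECORD];
        while (lo <= hi) {
          long mid = (lo + hi) >>> 1;
          db.seek(mid * RECORD);                         // the expensive part
          db.readFully(buf);
          int cmp = new String(buf, StandardCharsets.UTF_8).trim().compareTo(url);
          if (cmp == 0) return true;
          if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
      }

      public void close() throws IOException { db.close(); }
    }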

Another approach might be to blindly re-fetch and index high-priority urls and perform duplicate elimination at search time.
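
A minimal sketch of that search-time approach (the Hit class and its fields are hypothetical, not a real Nutch API): the index may hold several fetches of the same url, so the ranked results are collapsed by url at query time, keeping only the most recent fetch of each:

    // Hypothetical search-time duplicate elimination: collapse ranked hits
    // by url, keeping the newest fetch and preserving the original order.
    import java.util.*;

    public class SearchTimeDedup {
      static class Hit {
        final String url; final long fetchTime; final float score;
        Hit(String url, long fetchTime, float score) {
          this.url = url; this.fetchTime = fetchTime; this.score = score;
        }
      }

      static List<Hit> dedup(List<Hit> ranked) {
        Map<String, Hit> newest = new HashMap<>();
        for (Hit h : ranked) {                           // pick the newest version of each url
          Hit prev = newest.get(h.url);
          if (prev == null || h.fetchTime > prev.fetchTime) newest.put(h.url, h);
        }
        List<Hit> out = new ArrayList<>();
        for (Hit h : ranked) {                           // keep rank order, drop older duplicates
          if (newest.get(h.url) == h) out.add(h);
        }
        return out;
      }
    }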

Doug
