> On the robots side, I'd be interested to know what techniques
> people are using to store URLs in the queue for later processing.
> (i.e., since folks want a delay between requests, it makes sense
> to have multiple input queues or use a database of some sort
> to store the URLs until they are needed for processing.) Of course,
> the idea is to put them in a queue but to evenly distribute the output
> of the queue over the hosts being crawled so that the requests
> do not center on any one given host.
> -Art
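A minimal in-memory sketch of the multiple-queue idea Art describes: one FIFO per host, with a per-host timestamp so no host is hit twice within the delay window. The HostQueues class, the 60-second default, and the method names here are illustrative assumptions, not anything from the thread.

import time
from collections import deque
from urllib.parse import urlparse

class HostQueues:
    """One FIFO queue per host, plus a last-request timestamp per host,
    so no single server is hit twice within the delay window."""

    def __init__(self, delay=60.0):
        self.queues = {}      # host -> deque of URLs waiting for that host
        self.last_hit = {}    # host -> time of the last request to it
        self.delay = delay    # minimum seconds between requests to one host

    def put(self, url):
        host = urlparse(url).netloc
        self.queues.setdefault(host, deque()).append(url)

    def get(self):
        """Return a URL from the first host whose delay has elapsed,
        or None if every host is still cooling down."""
        now = time.time()
        for host, q in list(self.queues.items()):
            if q and now - self.last_hit.get(host, 0) >= self.delay:
                self.last_hit[host] = now
                url = q.popleft()
                if not q:
                    del self.queues[host]   # drop empty queues
                return url
        return None

Because a host that was just served stays ineligible until its delay expires, successive get() calls naturally spread requests across the other hosts rather than centering on any one of them.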
I was wondering how other people are doing this too. Are most people distributing the queue across database tables, as Art suggests? Are people creating their own file structures? Keeping the queue in memory?

Right now I am using a single database table, and it's not working that well, especially as the table grows. What kinds of databases are people using? I am using ODBC, so I can switch between different databases, but even in the best case, once my URL queue grows to around 100,000 entries, inserting new URLs becomes painfully slow and I need to purge.

To distribute requests among hosts, I grab the next URL for processing by doing a round-robin over a set of root URLs. That doesn't yet guarantee that consecutive connections won't hit the same host, but I am working on it. -Corey
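A rough sketch of the round-robin selection Corey describes, using Python's sqlite3 as a stand-in for an ODBC connection; the urls table, its columns, and the next_url function are hypothetical names I've made up for illustration. Each call tries each root site once, starting where the previous call left off, and returns the oldest queued URL it finds.

import sqlite3

def make_queue(path="queue.db"):
    """Single-table URL queue; 'root' records which seed site a URL
    came from, so selection can rotate across the roots."""
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS urls (
                        id   INTEGER PRIMARY KEY,
                        root TEXT NOT NULL,
                        url  TEXT NOT NULL)""")
    conn.execute("CREATE INDEX IF NOT EXISTS by_root ON urls(root)")
    return conn

def next_url(conn, roots, cursor=0):
    """Round-robin over root sites: try each root once, starting after
    the previous pick, and return (url, new_cursor)."""
    for i in range(len(roots)):
        root = roots[(cursor + i) % len(roots)]
        row = conn.execute(
            "SELECT id, url FROM urls WHERE root = ? ORDER BY id LIMIT 1",
            (root,)).fetchone()
        if row is not None:
            conn.execute("DELETE FROM urls WHERE id = ?", (row[0],))
            conn.commit()
            return row[1], (cursor + i + 1) % len(roots)
    return None, cursor    # every root's queue is empty

As Corey notes, cycling over roots alone doesn't prevent back-to-back hits on one host (two roots can live on the same server, or one root's URLs can dominate); tracking a last-request timestamp per host, as in the earlier sketch, would close that gap.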