> On the robots side, I'd be interested to know what techniques
> people are using to store URLs in the queue for later processing.
> (i.e., since folks want a delay between requests, it makes sense
> to have multiple input queues or use a database of some sort
> to store the URLs until they are needed for processing.)  Of course,
> the idea is to put them in a queue but to evenly distribute the output
> of the queue over the hosts being crawled so that the requests
> do not center on any one given host.
> -Art

I was wondering how other people are doing this too. Are most people
distributing the queue in different database tables, as Art says? Are people
creating their own file structures? Using a queue in memory?
Right now, I am using a single database table. It's not working that great,
especially as the size of that table grows. What types of databases are
people using? I am using ODBC, so I can switch between different databases.
Even in the best case, when my URL queue grows to around 100,000 entries,
inserting new URLs becomes painfully slow and I need to purge.
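For reference, here is roughly the kind of single-table setup I mean, sketched
with Python's built-in sqlite3 module rather than my actual ODBC code (table
and column names are made up just for illustration). Batching inserts into one
transaction and indexing the lookup columns are the usual suggestions for
keeping inserts and "next URL" queries tolerable as the table grows:

import sqlite3

# Hypothetical single-table URL queue (names are made up for illustration).
conn = sqlite3.connect("crawl_queue.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS url_queue (
        id      INTEGER PRIMARY KEY,
        url     TEXT NOT NULL,
        host    TEXT NOT NULL,
        fetched INTEGER NOT NULL DEFAULT 0
    )
""")
# Index the columns used to find "the next unfetched URL for host X" so
# lookups stay fast as the table grows.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_host_fetched ON url_queue (host, fetched)"
)

def enqueue(urls_with_hosts):
    # Insert in one batched transaction instead of committing per URL.
    with conn:
        conn.executemany(
            "INSERT INTO url_queue (url, host) VALUES (?, ?)",
            urls_with_hosts,
        )

enqueue([("http://example.com/a", "example.com"),
         ("http://example.org/b", "example.org")])
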
To distribute requests among hosts, I grab the next URL for processing by
doing a round-robin over a set of root URLs. That doesn't yet guarantee that
consecutive connections won't hit the same host, but I am working on it.
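
For the curious, here is a minimal sketch (Python, illustrative only, not my
actual crawler code) of the per-host round-robin I am aiming for: one
in-memory queue per hostname, with hosts rotated so that two consecutive
fetches only hit the same host when no other host has work queued:

from collections import deque
from urllib.parse import urlparse

host_queues = {}          # hostname -> deque of pending URLs
host_rotation = deque()   # hosts in round-robin order

def enqueue(url):
    host = urlparse(url).netloc
    if host not in host_queues:
        host_queues[host] = deque()
        host_rotation.append(host)
    host_queues[host].append(url)

def next_url():
    # Take the next host in the rotation that still has work, pop one of its
    # URLs, and send that host to the back of the line so the following call
    # picks a different host whenever one is available.
    while host_rotation:
        host = host_rotation.popleft()
        queue = host_queues[host]
        if queue:
            host_rotation.append(host)
            return queue.popleft()
        del host_queues[host]   # host exhausted; forget it
    return None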

-Corey
