Implementing a FIFO queue will certainly work for the crawler, but it is not
friendly toward the websites being crawled.  Using a FIFO queue, as you
mentioned, means that you are doing a breadth-first search through the site,
so it is very likely that you will send hundreds of page requests to the
same server in a very short amount of time.  Depending on how you design
your data structures, you should be able to record the time of the last
request made to any particular server and pace the requests so that you
don't fetch more than one page from that server every few minutes.
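As a rough sketch of that idea in Python (the class name and the five-second
default delay are my own choices, not anything you need to copy): keep a map
from hostname to the time of the last request, and ask it how long to wait
before hitting that host again.

```python
import time
from urllib.parse import urlparse


class PolitenessTracker:
    """Illustrative per-host pacing: remembers when each server was
    last contacted and reports how long the crawler should still wait."""

    def __init__(self, min_delay=5.0):
        self.min_delay = min_delay   # seconds to leave between hits to one host
        self.last_request = {}       # hostname -> monotonic time of last fetch

    def wait_time(self, url):
        """Seconds to wait before this URL's host may be fetched again."""
        host = urlparse(url).netloc
        last = self.last_request.get(host)
        if last is None:
            return 0.0               # never seen this server before
        return max(0.0, self.min_delay - (time.monotonic() - last))

    def record(self, url):
        """Note that a request was just made to this URL's host."""
        self.last_request[urlparse(url).netloc] = time.monotonic()
```

The crawler would call wait_time() before popping a URL off the queue
(re-queueing it if the answer is nonzero) and record() after each fetch;
that keeps the FIFO behavior for different servers while spacing out
requests to any single one.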

It sounds like you are implementing this as a recursive call to a crawl
function.  It seems to me that you should parse each URL into its scheme,
server, port, path, and filename, and store all of that information in a
database of your choice, along with other important data such as the number
of times you've visited the site, when the last visit was made, whether the
site is still active, etc.
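If you're working in Python, the standard library already does most of that
splitting for you.  A small sketch (the field names are illustrative, not a
fixed schema) might look like:

```python
from urllib.parse import urlparse


def split_url(url):
    """Break a URL into the pieces the crawler's database would store.
    The dictionary keys here are just example column names."""
    parts = urlparse(url)
    path = parts.path or "/"
    # Separate the directory part of the path from the trailing filename.
    directory, _, filename = path.rpartition("/")
    return {
        "scheme": parts.scheme,
        "server": parts.hostname,
        # urlparse reports None for an implicit port, so fill in the default.
        "port": parts.port or (443 if parts.scheme == "https" else 80),
        "path": directory + "/",
        "filename": filename,
    }
```

Each record could then be inserted as a row alongside the visit count,
last-visit timestamp, and active flag you mention, so the crawler can query
by server when deciding what to fetch next.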

Corey


--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send "help" in the body of a message 
to "[EMAIL PROTECTED]".