Implementing a FIFO queue will certainly work for the crawler, but it is not
friendly toward the websites being crawled.  Using a FIFO queue, as you
mentioned, means that you are doing a breadth-first search through the site,
so it is very likely that you will send hundreds of page requests to the
same server in a very short amount of time.  Depending on how you design
your data structures, you should be able to record the time of the last
request made to any particular server and pace the requests so that you
don't fetch more than one page from that server every few minutes.
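As a rough sketch of that idea in Python (the class name and the five-second
default delay are my own choices, not anything you need to copy): keep a map
from hostname to the time of the last request, and ask it how long to wait
before hitting that host again.

```python
import time
from urllib.parse import urlparse


class PolitenessTracker:
    """Illustrative per-host pacing: remembers when each server was
    last contacted and reports how long the crawler should still wait."""

    def __init__(self, min_delay=5.0):
        self.min_delay = min_delay   # seconds to leave between hits to one host
        self.last_request = {}       # hostname -> monotonic time of last fetch

    def wait_time(self, url):
        """Seconds to wait before this URL's host may be fetched again."""
        host = urlparse(url).netloc
        last = self.last_request.get(host)
        if last is None:
            return 0.0               # never seen this server before
        return max(0.0, self.min_delay - (time.monotonic() - last))

    def record(self, url):
        """Note that a request was just made to this URL's host."""
        self.last_request[urlparse(url).netloc] = time.monotonic()
```

The crawler would call wait_time() before popping a URL off the queue
(re-queueing it if the answer is nonzero) and record() after each fetch;
that keeps the FIFO behavior for different servers while spacing out
requests to any single one.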

It sounds like you are implementing this as a recursive call to a crawl
function.  It seems to me that you should parse each URL into its scheme,
server, port, path, and filename, and store all of that information in a
database of your choice, along with other important data such as the number
of times you've visited the site, when the last visit was made, whether the
site is still active, etc.
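If you're working in Python, the standard library already does most of that
splitting for you.  A small sketch (the field names are illustrative, not a
fixed schema) might look like:

```python
from urllib.parse import urlparse


def split_url(url):
    """Break a URL into the pieces the crawler's database would store.
    The dictionary keys here are just example column names."""
    parts = urlparse(url)
    path = parts.path or "/"
    # Separate the directory part of the path from the trailing filename.
    directory, _, filename = path.rpartition("/")
    return {
        "scheme": parts.scheme,
        "server": parts.hostname,
        # urlparse reports None for an implicit port, so fill in the default.
        "port": parts.port or (443 if parts.scheme == "https" else 80),
        "path": directory + "/",
        "filename": filename,
    }
```

Each record could then be inserted as a row alongside the visit count,
last-visit timestamp, and active flag you mention, so the crawler can query
by server when deciding what to fetch next.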

Corey


--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send "help" in the body of a message 
to "[EMAIL PROTECTED]".