Hi,

I don't know if this helps, but I've done my spider in a slightly
different way. I started out trying to store as much as possible in memory
(for speed's sake) and then realised that the bottleneck is almost never
the machine, it's normally the 'Net... so it's quite efficient to store
URLs in a dbm of some sort.

I've been using GNU dbm for this task, with a small wrapper around it
that simulates a simple queue (push/pop). This works by storing each URL
as a value with a numerical key and then storing the current key-index as
'currentIndex' in the same Gdbm.
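For what it's worth, the wrapper amounts to something like this (a minimal
Java sketch, not my actual code: a HashMap stands in for the GDBM file, the
map operations would really be gdbm_store/gdbm_fetch calls, and the
'tailIndex' key for the write end is illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// FIFO queue layered over a key/value store: each URL is stored under a
// numeric key, and the head/tail indices live in the same store.
class DbmQueue {
    private final Map<String, String> dbm = new HashMap<>(); // stand-in for GDBM

    private long index(String key) {
        String v = dbm.get(key);
        return v == null ? 0L : Long.parseLong(v);
    }

    // Append a URL under the next numeric key and advance the tail index.
    public void push(String url) {
        long tail = index("tailIndex");
        dbm.put(Long.toString(tail), url);
        dbm.put("tailIndex", Long.toString(tail + 1));
    }

    // Remove and return the URL at 'currentIndex', or null if the queue is empty.
    public String pop() {
        long head = index("currentIndex");
        if (head >= index("tailIndex")) return null;   // queue exhausted
        String url = dbm.remove(Long.toString(head));
        dbm.put("currentIndex", Long.toString(head + 1));
        return url;
    }
}
```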

My point is that you could use something similar, but add the
hostname (or a hostId) to the key, and a counter for each host.
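That per-host variant could look like this (again a sketch with a HashMap
standing in for the dbm; the "host:head" / "host:tail" key scheme is just one
way to do it):

```java
import java.util.HashMap;
import java.util.Map;

// One key/value store holding a separate FIFO per host: the key combines
// the host name with a per-host counter, and each host has its own
// head/tail counters stored alongside the URLs.
class PerHostQueue {
    private final Map<String, String> dbm = new HashMap<>(); // stand-in for the dbm

    private long counter(String key) {
        String v = dbm.get(key);
        return v == null ? 0L : Long.parseLong(v);
    }

    public void push(String host, String url) {
        long tail = counter(host + ":tail");
        dbm.put(host + ":" + tail, url);
        dbm.put(host + ":tail", Long.toString(tail + 1));
    }

    // Returns null when this host's queue is exhausted.
    public String pop(String host) {
        long head = counter(host + ":head");
        if (head >= counter(host + ":tail")) return null;
        String url = dbm.remove(host + ":" + head);
        dbm.put(host + ":head", Long.toString(head + 1));
        return url;
    }
}
```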

Another clear advantage is persistence - as long as you make an effort to
prevent dbm corruption, you can start up exactly where your spider
stopped last time, with minimal startup time.

I hope that made sense :-)

Mike Davis



On Wed, 13 Jun 2001, Otis Gospodnetic wrote:

>
> Hello,
>
> Some members of this list probably had to write
> something like this before....
>
> I am trying to create/pick a data structure that will
> allow me to store a fixed number of unique hostnames
> (call it HostData object) each of which has a
> fixed-sized list of unique URLs associated with it.
>
> For example:
>
> host1 -> (url1-1, url1-2, url1-3)
> host2 -> (url2-1, url2-2, url2-3)
> host3 -> (url3-1, url3-2, url3-3)
> host4 -> (url4-1, url4-2, url4-3)
>
> I also need to be able to add elements to this list of
> hosts or to the list of URLs for a specific host, but
> because that could cause either list to grow very
> large (huge heap, memory hog) I would like to store
> any 'overflow' (inserting when either list is full)
> elements on the disk.
>
> Ideally this would be transparent: somebody would
> just call some add(host) or add(url) method and the
> logic for whether to just add to the list or store on
> the disk would be hidden...but
> this is not required. I could also do .isFull() and
> save things myself.
>
> Then when a list of URLs for any one host is
> exhausted I need to:
>
> 1. Look for more URLs for the exhausted host on the
>    disk
> 2. If there are some URLs then load a fixed-sized
>    batch of URLs into the URL list
> 3. Else remove that exhausted HostData from the
>    list all together, and place a new HostData
>    object in its place
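
Those three steps could be sketched roughly like this (the on-disk store is
faked here with an in-memory deque, and all class and method names are made
up, not a real API):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Per-host URL holder: a fixed-size in-memory list, with overflow spilled
// to "disk" (here just another deque) and refilled in fixed-size batches.
class HostData {
    static final int CAPACITY = 3;                      // fixed in-memory list size
    final String host;
    private final Deque<String> inMemory = new ArrayDeque<>();
    private final Deque<String> onDisk = new ArrayDeque<>();  // stand-in for disk

    HostData(String host) { this.host = host; }

    // add(url): fill the in-memory list first, spill the overflow to disk.
    public void add(String url) {
        if (inMemory.size() < CAPACITY) inMemory.addLast(url);
        else onDisk.addLast(url);
    }

    // next(): pop from memory; when the list is exhausted, refill a batch
    // from disk (steps 1 and 2); a null return means the host is done and
    // this HostData should be replaced (step 3).
    public String next() {
        if (inMemory.isEmpty()) {
            for (int i = 0; i < CAPACITY && !onDisk.isEmpty(); i++)
                inMemory.addLast(onDisk.removeFirst());
        }
        return inMemory.pollFirst();
    }
}
```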
>
> I can keep track of known host names in a
> LinkedList queue, so this is where I can pick a new
> host if I get to step 3.
>
> I'm not sure if a LinkedList is the best data
> structure for that.
> I'm thinking about using a LinkedList queue because
> its size is not fixed, which will allow me to add new
> hosts as I discover them, and because it will act as a
> queue and allow me to easily get the next host to
> crawl from the beginning of the queue.
> The drawback is that Lists allow duplicates, and I
> wouldn't want to have the same host listed more than
> once in the queue.
>
> Questions:
>
> Lists allow duplicates.
> Maps and Sets do not.
> Would it be better to use Map or Set instead of the
> List then?
> But then how do I easily grab a single element from a
> Map or a Set?
>
> How about implementing a LinkedList queue that
> enforces uniqueness of its elements by overriding
> add(Object) method and doing queue.add(object) only if
> (!queue.contains(Object))?
> Would this be a bad way of doing it?
> Would Circular Array queue be better suited for
> this?
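
On the uniqueness question: a LinkedHashSet gives you both properties at
once, Set semantics plus insertion (FIFO) order, so contains() is O(1)
instead of the O(n) scan that queue.contains(Object) would do on a
LinkedList. A rough sketch (class and method names are mine, not from any
library):

```java
import java.util.Iterator;
import java.util.LinkedHashSet;

// FIFO queue of host names that silently rejects duplicates, built on
// LinkedHashSet: iteration order is insertion order, membership test is O(1).
class UniqueHostQueue {
    private final LinkedHashSet<String> hosts = new LinkedHashSet<>();

    // Returns false (and changes nothing) if the host is already queued.
    public boolean add(String host) { return hosts.add(host); }

    // Remove and return the oldest host, or null if the queue is empty.
    public String poll() {
        Iterator<String> it = hosts.iterator();
        if (!it.hasNext()) return null;
        String head = it.next();
        it.remove();
        return head;
    }
}
```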
>
> Can you recommend an appropriate/efficient data
> structure for the above situation?
>
> Thanks,
>
> Otis
> P.S.
> I'm writing this in Java, although that shouldn't
> matter as these are really generic ADT questions.
> P.P.S.
> Does anyone know of a document/paper/book/etc. that
> describes data structures useful for implementing
> crawlers? (efficient storage, RAM + disk storage
> combinations, etc.)
>
>
>
> --
> This message was sent by the Internet robots and spiders discussion list 
>([EMAIL PROTECTED]).  For list server commands, send "help" in the body of a 
>message to "[EMAIL PROTECTED]".
>


