Hello,

Some members of this list probably had to write
something like this before....

I am trying to create/pick a data structure that will
allow me to store a fixed number of unique hostnames
(call it HostData object) each of which has a
fixed-sized list of unique URLs associated with it.

For example:

host1 -> (url1-1, url1-2, url1-3)
host2 -> (url2-1, url2-2, url2-3)
host3 -> (url3-1, url3-2, url3-3)
host4 -> (url4-1, url4-2, url4-3)

I also need to be able to add elements to this list of
hosts or to the list of URLs for a specific host, but
because that could cause either list to grow very
large (huge heap, memory hog) I would like to store
any 'overflow' (inserting when either list is full)
elements on the disk.

Ideally this would be transparent: somebody would
just call some add(host) or add(url) method and the
logic for whether to just add to the list or store on
the disk would be hidden...but
this is not required. I could also do .isFull() and
save things myself.

Then when a list of URLs for any one host is
exhausted I need to:

1. Look for more URLs for the exhausted host on the
   disk
2. If there are some URLs then load a fixed-sized
   batch of URLs into the URL list
3. Else remove that exhausted HostData from the
   list all together, and place a new HostData 
   object in its place

I can keep track of known host names in a
LinkedList queue, so this is where I can pick a new
host if I get to step 3.

I'm not sure if a LinkedList is the best data
structure for that.
I'm thinking about using a LinkedList queue because
its size if not fixed, which will allow me to add new
hosts as I discover them, and because it will act as a
queue and allow me to easily get the next host to
crawl from the beginning of the queue.
The drawback is that Lists allow duplicates, and I
wouldn't want to have the same host listed more than
once in the queue.

Questions:

Lists allow duplicates.
Maps and Sets do not.
Would it be better to use Map or Set instead of the
List then?
But then how do I easily grab a single element from a
Map or a Set?

How about implementing a LinkedList queue that
enforces uniquness of its elements by overriding
add(Object) method and doing queue.add(object) only if
(!queue.contains(Object))?
Would this be a bad way of doing it?
Would Circular Array queue be better suited for
this?

Can you recommend an appropriate/efficient data
structure for the above situation?

Thanks,

Otis
P.S.
I'm writing this in Java, although that shouldn't
matter as these are really generic ADT questions.
P.P.S.
Does anyone know of a document/paper/book/etc. that
describes data structures useful for implemented
crawlers? (efficient storage, RAM + disk storage
combinations, etc.)


__________________________________________________
Do You Yahoo!?
Get personalized email addresses from Yahoo! Mail - only $35 
a year!  http://personal.mail.yahoo.com/

--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send "help" in the body of a message 
to "[EMAIL PROTECTED]".

Reply via email to