Hi Otis Gospodnetic,
I am also facing the same problem with the data structure for the crawler. For now I have implemented a FIFO queue of crawler objects, each holding a URL and its link depth; an object is added to the queue when a URL is discovered and removed when that URL has been crawled. I don't find this efficient: although I am doing breadth-first searching, which yields the maximum number of URLs from the same host while crawling, I have no reference to the host, since I am adding the complete URL. So please forward me any relevant information you come across.

Thank you,
Mohan

--- Mike Davis <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I don't know if this helps, but I've done my spider in a slightly
> different way. I started out trying to store as much as possible in
> memory (for speed's sake) and then realised that the bottleneck is
> almost never the machine, it's normally the 'Net... so it's quite
> efficient to store URLs in a dbm of some sort.
>
> I've been using Gnu DBM for this task, with a simple wrapper around it
> that simulates a simple queue (push/pop). This works by storing each
> URL as a value with a numerical key and then storing the current
> key-index as 'currentIndex' in the same Gdbm.
>
> My point is that you could use something similar, but add the hostname
> (or a hostId) to the key, and a counter for each host.
>
> Another clear advantage is persistence - as long as you make an effort
> to prevent dbm corruption, you can start up exactly where your spider
> stopped last time, with minimum startup time.
>
> I hope that made sense :-)
>
> Mike Davis
>
> On Wed, 13 Jun 2001, Otis Gospodnetic wrote:
>
> > Hello,
> >
> > Some members of this list probably had to write something like this
> > before....
> >
> > I am trying to create/pick a data structure that will allow me to
> > store a fixed number of unique hostnames (call it a HostData object),
> > each of which has a fixed-sized list of unique URLs associated with
> > it.
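For what it's worth, the push/pop scheme Mike describes could be sketched roughly like this in Java. All the names here are my own, and a plain HashMap stands in for the GDBM file; a real version would back the map with a dbm so the queue survives restarts.

```java
import java.util.HashMap;
import java.util.Map;

// Rough sketch of the dbm-as-queue idea: each URL is stored under a
// numeric key, and the head/tail indices live alongside the data. A
// real version would back this Map with GDBM (or any dbm) so the
// spider can resume where it stopped.
class DbmQueue {
    private final Map<String, String> store = new HashMap<>(); // stand-in for the dbm file
    private long head = 0; // key of the next URL to pop
    private long tail = 0; // key under which the next URL is pushed

    void push(String url) {
        store.put(Long.toString(tail++), url);
    }

    String pop() {
        if (head == tail) return null;   // queue is empty
        String key = Long.toString(head++);
        String url = store.get(key);
        store.remove(key);               // done with this entry
        return url;
    }

    boolean isEmpty() { return head == tail; }
}
```

Adding the hostname (or a hostId) to the key, as Mike suggests, would mean keeping one head/tail counter pair per host and using keys like "host1:42" instead of bare numbers.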
> > For example:
> >
> > host1 -> (url1-1, url1-2, url1-3)
> > host2 -> (url2-1, url2-2, url2-3)
> > host3 -> (url3-1, url3-2, url3-3)
> > host4 -> (url4-1, url4-2, url4-3)
> >
> > I also need to be able to add elements to this list of hosts, or to
> > the list of URLs for a specific host, but because that could cause
> > either list to grow very large (huge heap, memory hog) I would like
> > to store any 'overflow' elements (inserted when either list is full)
> > on the disk.
> >
> > Ideally this would be transparent: somebody would just call some
> > add(host) or add(url) method, and the logic for whether to add to the
> > list or store on the disk would be hidden... but this is not
> > required. I could also do .isFull() and save things myself.
> >
> > Then, when the list of URLs for any one host is exhausted, I need to:
> >
> > 1. Look for more URLs for the exhausted host on the disk.
> > 2. If there are some URLs, load a fixed-sized batch of them into the
> >    URL list.
> > 3. Else remove that exhausted HostData from the list altogether, and
> >    place a new HostData object in its place.
> >
> > I can keep track of known host names in a LinkedList queue, so that
> > is where I can pick a new host if I get to step 3.
> >
> > I'm not sure if a LinkedList is the best data structure for that.
> > I'm thinking about using a LinkedList queue because its size is not
> > fixed, which will allow me to add new hosts as I discover them, and
> > because it will act as a queue and allow me to easily get the next
> > host to crawl from the beginning of the queue. The drawback is that
> > Lists allow duplicates, and I wouldn't want the same host listed
> > more than once in the queue.
> >
> > Questions:
> >
> > Lists allow duplicates; Maps and Sets do not. Would it be better to
> > use a Map or a Set instead of the List then?
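To make the structure described above concrete, here is a minimal sketch in Java. The class and method names are my invention, and the "disk" side is faked with an in-memory map where a real version would read and write files or a dbm; it implements the bounded per-host URL list, the overflow spill, and the three refill steps.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the HostData structure: a bounded in-memory URL list per
// host, with overflow "spilled to disk" (faked here with a Map), and
// the three refill steps applied when a host's list runs dry.
class HostData {
    static final int MAX_URLS = 3;                 // fixed size of the in-memory list
    final String host;
    final Deque<String> urls = new ArrayDeque<>(); // in-memory URL list

    HostData(String host) { this.host = host; }
}

class CrawlFrontier {
    private final Map<String, HostData> hosts = new LinkedHashMap<>();
    private final Map<String, Deque<String>> disk = new HashMap<>(); // stand-in for on-disk overflow

    void add(String host, String url) {
        HostData hd = hosts.computeIfAbsent(host, HostData::new);
        if (hd.urls.size() < HostData.MAX_URLS) {
            hd.urls.add(url);                                             // room in memory
        } else {
            disk.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url); // overflow to "disk"
        }
    }

    // Pop the next URL for a host, applying steps 1-3 when the list is exhausted.
    String next(String host) {
        HostData hd = hosts.get(host);
        if (hd == null) return null;
        if (hd.urls.isEmpty()) {
            Deque<String> spilled = disk.get(host);       // step 1: look on disk
            if (spilled != null && !spilled.isEmpty()) {
                for (int i = 0; i < HostData.MAX_URLS && !spilled.isEmpty(); i++)
                    hd.urls.add(spilled.poll());          // step 2: load a fixed-sized batch
            } else {
                hosts.remove(host);                       // step 3: retire the exhausted host
                return null;
            }
        }
        return hd.urls.poll();
    }
}
```

This keeps the "transparent" flavour Otis asks for: callers only see add() and next(), and the spill/refill logic is hidden inside.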
> > But then how do I easily grab a single element from a Map or a Set?
> >
> > How about implementing a LinkedList queue that enforces the
> > uniqueness of its elements by overriding the add(Object) method and
> > doing queue.add(object) only if (!queue.contains(object))?
> > Would this be a bad way of doing it?
> > Would a circular-array queue be better suited for this?
> >
> > Can you recommend an appropriate/efficient data structure for the
> > above situation?
> >
> > Thanks,
> >
> > Otis
> >
> > P.S. I'm writing this in Java, although that shouldn't matter, as
> > these are really generic ADT questions.
> > P.P.S. Does anyone know of a document/paper/book/etc. that describes
> > data structures useful for implementing crawlers? (efficient
> > storage, RAM + disk storage combinations, etc.)

--
This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send "help" in the body of a message to "[EMAIL PROTECTED]".
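On the duplicate-host question above: a LinkedList whose add() checks contains() first would work, but LinkedList.contains() is a linear scan, so every insert gets slower as the queue grows. A LinkedHashSet gives the same FIFO iteration order with constant-time membership tests. A rough sketch (the class name is mine):

```java
import java.util.Iterator;
import java.util.LinkedHashSet;

// A FIFO queue that rejects duplicates: LinkedHashSet preserves
// insertion order (so the oldest host comes out first) while
// membership tests are O(1), unlike LinkedList.contains(), which
// scans the whole list.
class UniqueQueue {
    private final LinkedHashSet<String> set = new LinkedHashSet<>();

    // Returns false (and changes nothing) if the element is already queued.
    boolean add(String host) { return set.add(host); }

    // Removes and returns the oldest element, or null if the queue is empty.
    String poll() {
        Iterator<String> it = set.iterator();
        if (!it.hasNext()) return null;
        String head = it.next();
        it.remove();
        return head;
    }

    boolean isEmpty() { return set.isEmpty(); }
}
```

The host queue and Otis's host -> URL-list map could then share the same key set: the queue decides which host to crawl next, and the map holds that host's URLs.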