Hi Otis Gospodnetic,
I am also facing the same problem with the data structure for the crawler. For now I have implemented a FIFO queue of crawler objects, each holding a URL and its link depth; an object is added to the queue when a URL is discovered and removed when that URL has been crawled. I don't find this efficient: although I am doing breadth-first searching, which yields the maximum number of URLs from the same host while crawling, I have no reference to the host, since I am adding the complete URL. So please forward me any relevant information you come across.

Thank you,
Mohan

--- Mike Davis <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I don't know if this helps, but I've done my spider in a slightly
> different way. I started out trying to store as much as possible in
> memory (for speed's sake) and then realised that the bottleneck is
> almost never the machine, it's normally the 'Net... so it's quite
> efficient to store URLs in a dbm of some sort.
>
> I've been using Gnu DBM for this task, with a simple wrapper around it
> that simulates a simple queue (push/pop). This works by storing each
> URL as a value with a numerical key and then storing the current
> key-index as 'currentIndex' in the same Gdbm.
>
> My point is that you could use something similar, but add the hostname
> (or a hostId) to the key, and a counter for each host.
>
> Another clear advantage is persistence - as long as you make an effort
> to prevent dbm corruption, you can start up exactly where your spider
> stopped last time, with minimum startup time.
>
> I hope that made sense :-)
>
> Mike Davis
>
> On Wed, 13 Jun 2001, Otis Gospodnetic wrote:
>
> > Hello,
> >
> > Some members of this list probably had to write something like this
> > before....
> >
> > I am trying to create/pick a data structure that will allow me to
> > store a fixed number of unique hostnames (call it a HostData object),
> > each of which has a fixed-sized list of unique URLs associated with
> > it.
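For what it's worth, the push/pop scheme Mike describes could be sketched roughly like this in Java. All the names here are my own, and a plain HashMap stands in for the GDBM file; a real version would back the map with a dbm so the queue survives restarts.

```java
import java.util.HashMap;
import java.util.Map;

// Rough sketch of the dbm-as-queue idea: each URL is stored under a
// numeric key, and the head/tail indices live alongside the data. A
// real version would back this Map with GDBM (or any dbm) so the
// spider can resume where it stopped.
class DbmQueue {
    private final Map<String, String> store = new HashMap<>(); // stand-in for the dbm file
    private long head = 0; // key of the next URL to pop
    private long tail = 0; // key under which the next URL is pushed

    void push(String url) {
        store.put(Long.toString(tail++), url);
    }

    String pop() {
        if (head == tail) return null;   // queue is empty
        String key = Long.toString(head++);
        String url = store.get(key);
        store.remove(key);               // done with this entry
        return url;
    }

    boolean isEmpty() { return head == tail; }
}
```

Adding the hostname (or a hostId) to the key, as Mike suggests, would mean keeping one head/tail counter pair per host and using keys like "host1:42" instead of bare numbers.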
> > For example:
> >
> > host1 -> (url1-1, url1-2, url1-3)
> > host2 -> (url2-1, url2-2, url2-3)
> > host3 -> (url3-1, url3-2, url3-3)
> > host4 -> (url4-1, url4-2, url4-3)
> >
> > I also need to be able to add elements to this list of hosts, or to
> > the list of URLs for a specific host, but because that could cause
> > either list to grow very large (huge heap, memory hog) I would like
> > to store any 'overflow' elements (inserted when either list is full)
> > on the disk.
> >
> > Ideally this would be transparent: somebody would just call some
> > add(host) or add(url) method, and the logic for whether to add to the
> > list or store on the disk would be hidden... but this is not
> > required. I could also do .isFull() and save things myself.
> >
> > Then, when the list of URLs for any one host is exhausted, I need to:
> >
> > 1. Look for more URLs for the exhausted host on the disk.
> > 2. If there are some URLs, load a fixed-sized batch of them into the
> >    URL list.
> > 3. Else remove that exhausted HostData from the list altogether, and
> >    place a new HostData object in its place.
> >
> > I can keep track of known host names in a LinkedList queue, so that
> > is where I can pick a new host if I get to step 3.
> >
> > I'm not sure if a LinkedList is the best data structure for that.
> > I'm thinking about using a LinkedList queue because its size is not
> > fixed, which will allow me to add new hosts as I discover them, and
> > because it will act as a queue and allow me to easily get the next
> > host to crawl from the beginning of the queue. The drawback is that
> > Lists allow duplicates, and I wouldn't want the same host listed
> > more than once in the queue.
> >
> > Questions:
> >
> > Lists allow duplicates; Maps and Sets do not. Would it be better to
> > use a Map or a Set instead of the List then?
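To make the structure described above concrete, here is a minimal sketch in Java. The class and method names are my invention, and the "disk" side is faked with an in-memory map where a real version would read and write files or a dbm; it implements the bounded per-host URL list, the overflow spill, and the three refill steps.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the HostData structure: a bounded in-memory URL list per
// host, with overflow "spilled to disk" (faked here with a Map), and
// the three refill steps applied when a host's list runs dry.
class HostData {
    static final int MAX_URLS = 3;                 // fixed size of the in-memory list
    final String host;
    final Deque<String> urls = new ArrayDeque<>(); // in-memory URL list

    HostData(String host) { this.host = host; }
}

class CrawlFrontier {
    private final Map<String, HostData> hosts = new LinkedHashMap<>();
    private final Map<String, Deque<String>> disk = new HashMap<>(); // stand-in for on-disk overflow

    void add(String host, String url) {
        HostData hd = hosts.computeIfAbsent(host, HostData::new);
        if (hd.urls.size() < HostData.MAX_URLS) {
            hd.urls.add(url);                                             // room in memory
        } else {
            disk.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url); // overflow to "disk"
        }
    }

    // Pop the next URL for a host, applying steps 1-3 when the list is exhausted.
    String next(String host) {
        HostData hd = hosts.get(host);
        if (hd == null) return null;
        if (hd.urls.isEmpty()) {
            Deque<String> spilled = disk.get(host);       // step 1: look on disk
            if (spilled != null && !spilled.isEmpty()) {
                for (int i = 0; i < HostData.MAX_URLS && !spilled.isEmpty(); i++)
                    hd.urls.add(spilled.poll());          // step 2: load a fixed-sized batch
            } else {
                hosts.remove(host);                       // step 3: retire the exhausted host
                return null;
            }
        }
        return hd.urls.poll();
    }
}
```

This keeps the "transparent" flavour Otis asks for: callers only see add() and next(), and the spill/refill logic is hidden inside.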
> > But then how do I easily grab a single element from a Map or a Set?
> >
> > How about implementing a LinkedList queue that enforces the
> > uniqueness of its elements by overriding the add(Object) method and
> > doing queue.add(object) only if (!queue.contains(object))?
> > Would this be a bad way of doing it?
> > Would a circular-array queue be better suited for this?
> >
> > Can you recommend an appropriate/efficient data structure for the
> > above situation?
> >
> > Thanks,
> >
> > Otis
> >
> > P.S. I'm writing this in Java, although that shouldn't matter, as
> > these are really generic ADT questions.
> > P.P.S. Does anyone know of a document/paper/book/etc. that describes
> > data structures useful for implementing crawlers? (efficient
> > storage, RAM + disk storage combinations, etc.)

--
This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send "help" in the body of a message to "[EMAIL PROTECTED]".
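On the duplicate-host question above: a LinkedList whose add() checks contains() first would work, but LinkedList.contains() is a linear scan, so every insert gets slower as the queue grows. A LinkedHashSet gives the same FIFO iteration order with constant-time membership tests. A rough sketch (the class name is mine):

```java
import java.util.Iterator;
import java.util.LinkedHashSet;

// A FIFO queue that rejects duplicates: LinkedHashSet preserves
// insertion order (so the oldest host comes out first) while
// membership tests are O(1), unlike LinkedList.contains(), which
// scans the whole list.
class UniqueQueue {
    private final LinkedHashSet<String> set = new LinkedHashSet<>();

    // Returns false (and changes nothing) if the element is already queued.
    boolean add(String host) { return set.add(host); }

    // Removes and returns the oldest element, or null if the queue is empty.
    String poll() {
        Iterator<String> it = set.iterator();
        if (!it.hasNext()) return null;
        String head = it.next();
        it.remove();
        return head;
    }

    boolean isEmpty() { return set.isEmpty(); }
}
```

The host queue and Otis's host -> URL-list map could then share the same key set: the queue decides which host to crawl next, and the map holds that host's URLs.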