Hi,

I am working on a search engine, and I am facing two
major problems:
1. I need to store the indexed data without using any
   traditional database, so I should save it in text
   files or XML files. Can you please suggest how to
   do this?
2. The search engine takes user feedback on the search
   results: the user has an option to indicate which
   documents he liked from the search result, and the
   search result is then refined based on that
   feedback. So I need to do some user profiling, and
   we are planning to use Bayesian inference. Please
   let me know if you have any suggestions.
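
One possible answer to question 1 (a hedged sketch, not something from the thread): write the inverted index out as plain "term<TAB>doc1,doc2,..." lines. The file format, class name, and method names below are all illustrative assumptions.

```java
import java.io.*;
import java.util.*;

// Hedged sketch for question 1: persist an inverted index as plain
// "term<TAB>doc1,doc2,..." text lines instead of using a database.
// The file format and names here are illustrative assumptions.
public class TextIndex {

    // write each term and its comma-separated document list on one line
    static void save(Map<String, List<String>> index, File file) {
        try (PrintWriter out = new PrintWriter(new FileWriter(file))) {
            for (Map.Entry<String, List<String>> e : index.entrySet())
                out.println(e.getKey() + "\t" + String.join(",", e.getValue()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // read the same tab-separated format back into a map
    static Map<String, List<String>> load(File file) {
        Map<String, List<String>> index = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                index.put(parts[0],
                          new ArrayList<>(Arrays.asList(parts[1].split(","))));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return index;
    }
}
```

An XML serialization would work the same way, just with more markup per line; the plain-text variant is easier to inspect and to stream through.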

Thanks in advance,
Mohan
 




--- Mike Davis <[EMAIL PROTECTED]> wrote:
> 
> Hi,
> 
> I don't know if this helps, but I've done my spider in
> a slightly different way. I started out trying to store
> as much as possible in memory (for speed's sake) and
> then realised that the bottleneck is almost never the
> machine, it's normally the 'Net... so it's quite
> efficient to store URLs in a dbm of some sort.
> 
> I've been using GNU DBM for this task, with a simple
> wrapper around it that simulates a simple queue
> (push/pop). This works by storing each URL as a value
> with a numerical key and then storing the current key
> index as 'currentIndex' in the same GDBM file.
> 
> My point is that you could use something similar, but
> add the hostname (or a hostId) to the key, and a
> counter for each host.
> 
> Another clear advantage is persistence - as long as you
> make an effort to prevent dbm corruption, you can start
> up exactly where your spider stopped last time, with
> minimal startup time.
> 
> I hope that made sense :-)
> 
> Mike Davis
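
Mike's push/pop wrapper could be sketched roughly like this. A plain in-memory HashMap stands in for the GDBM file (assumption: any string-to-string dbm with get/put/remove behaves the same), and the 'head'/'tail' key names are illustrative, corresponding to Mike's 'currentIndex'.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a push/pop URL queue on top of a key-value store.
// A HashMap stands in for GDBM here; with a real dbm the queue
// survives restarts, which is the persistence advantage Mike notes.
public class UrlQueue {
    private final Map<String, String> dbm = new HashMap<>();

    private long getIndex(String key) {
        String v = dbm.get(key);
        return v == null ? 0L : Long.parseLong(v);
    }

    // push: store the URL under the next numeric key, advance 'tail'
    public void push(String url) {
        long tail = getIndex("tail");
        dbm.put(Long.toString(tail), url);
        dbm.put("tail", Long.toString(tail + 1));
    }

    // pop: fetch the URL at 'head', advance it; null when empty
    public String pop() {
        long head = getIndex("head");
        if (head >= getIndex("tail")) return null;
        String url = dbm.remove(Long.toString(head));
        dbm.put("head", Long.toString(head + 1));
        return url;
    }
}
```

Per Mike's suggestion, prefixing the numeric key with a hostId (and keeping one head/tail pair per host) turns this into one queue per host in the same file.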
> 
> 
> 
> On Wed, 13 Jun 2001, Otis Gospodnetic wrote:
> 
> >
> > Hello,
> >
> > Some members of this list probably had to write
> > something like this before....
> >
> > I am trying to create/pick a data structure that
> > will allow me to store a fixed number of unique
> > hostnames (call it a HostData object), each of which
> > has a fixed-size list of unique URLs associated with
> > it.
> >
> > For example:
> >
> > host1 -> (url1-1, url1-2, url1-3)
> > host2 -> (url2-1, url2-2, url2-3)
> > host3 -> (url3-1, url3-2, url3-3)
> > host4 -> (url4-1, url4-2, url4-3)
> >
> > I also need to be able to add elements to this list
> > of hosts or to the list of URLs for a specific host,
> > but because that could cause either list to grow
> > very large (huge heap, memory hog) I would like to
> > store any 'overflow' elements (inserted when either
> > list is full) on the disk.
> >
> > Ideally this would be transparent: somebody would
> > just call some add(host) or add(url) method, and the
> > logic for whether to add to the list or store on the
> > disk would be hidden... but this is not required. I
> > could also do .isFull() and save things myself.
> >
> > Then, when the list of URLs for any one host is
> > exhausted, I need to:
> >
> > 1. Look for more URLs for the exhausted host on the
> >    disk
> > 2. If there are some URLs, load a fixed-size batch
> >    of URLs into the URL list
> > 3. Else remove that exhausted HostData from the
> >    list altogether, and place a new HostData
> >    object in its place
> >
> > I can keep track of known host names in a LinkedList
> > queue, so that is where I can pick a new host if I
> > get to step 3.
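
The fixed-size URL list with overflow, and the refill in steps 1-3, might look something like the sketch below. An in-memory ArrayDeque stands in for the on-disk overflow store, and the names and list size are illustrative assumptions, not anything fixed in the thread.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Sketch of a per-host URL list with overflow. An ArrayDeque stands
// in for the on-disk overflow store; swap in a dbm-backed queue for
// real spill-to-disk behaviour.
public class HostData {
    static final int MAX_URLS = 3;  // fixed in-memory list size (illustrative)
    final String host;
    final List<String> urls = new ArrayList<>();             // in-memory batch
    final ArrayDeque<String> overflow = new ArrayDeque<>();  // "disk"

    HostData(String host) { this.host = host; }

    // add(url): keep in memory until full, then spill to overflow
    void add(String url) {
        if (urls.size() < MAX_URLS) urls.add(url);
        else overflow.add(url);
    }

    // next(): when the in-memory list is exhausted, reload a batch
    // from overflow (steps 1-2); returns null when truly empty, at
    // which point the caller replaces this HostData (step 3)
    String next() {
        if (urls.isEmpty()) {
            for (int i = 0; i < MAX_URLS && !overflow.isEmpty(); i++)
                urls.add(overflow.poll());
        }
        return urls.isEmpty() ? null : urls.remove(0);
    }
}
```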
> >
> > I'm not sure if a LinkedList is the best data
> > structure for that. I'm thinking about using a
> > LinkedList queue because its size is not fixed,
> > which will allow me to add new hosts as I discover
> > them, and because it will act as a queue and let me
> > easily get the next host to crawl from the beginning
> > of the queue. The drawback is that Lists allow
> > duplicates, and I wouldn't want the same host listed
> > more than once in the queue.
> >
> > Questions:
> >
> > Lists allow duplicates; Maps and Sets do not. Would
> > it be better to use a Map or Set instead of the
> > List, then? But then how do I easily grab a single
> > element from a Map or a Set?
> >
> > How about implementing a LinkedList queue that
> > enforces uniqueness of its elements by overriding
> > the add(Object) method and doing queue.add(object)
> > only if (!queue.contains(object))? Would this be a
> > bad way of doing it? Would a circular-array queue be
> > better suited for this?
> >
> > Can you recommend an appropriate/efficient data
> > structure for the above situation?
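
One answer to the uniqueness question above (an editor's sketch, not from the thread): a LinkedHashSet gives both properties at once. It iterates in insertion order, so it behaves as a FIFO queue, and add() rejects duplicates in roughly O(1), avoiding the O(n) queue.contains() scan.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;

// Sketch of a FIFO host queue that enforces uniqueness: a
// LinkedHashSet preserves insertion order and rejects duplicates
// in O(1), so no linear contains() scan is needed.
public class HostQueue {
    private final LinkedHashSet<String> hosts = new LinkedHashSet<>();

    // returns false (and is a no-op) for hosts already queued
    public boolean add(String host) {
        return hosts.add(host);
    }

    // remove and return the oldest host, or null if the queue is empty
    public String poll() {
        Iterator<String> it = hosts.iterator();
        if (!it.hasNext()) return null;
        String host = it.next();
        it.remove();
        return host;
    }
}
```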
> >
> > Thanks,
> >
> > Otis
> > P.S.
> > I'm writing this in Java, although that shouldn't
> > matter, as these are really generic ADT questions.
> > P.P.S.
> > Does anyone know of a document/paper/book/etc. that
> > describes data structures useful for implementing
> > crawlers? (efficient storage, RAM + disk storage
> > combinations, etc.)
> >
> >
> >
> 
> 



--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send "help" in the body of a message 
to "[EMAIL PROTECTED]".
