> -----Original Message-----
> From: Clemens Marschner [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, April 24, 2002 11:14 PM
> To: Lucene Developers List; [EMAIL PROTECTED]
> Subject: Re: Web Crawler
>
> Another thing I have in mind is to compress the URLs in
> memory. First of all, a URL can be divided into several parts, some of
> which occur in many URLs (e.g. the host name). And since URLs contain
> only a limited number of different characters, Huffman encoding is
> probably quite efficient.
>
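A rough sketch of that split-plus-Huffman idea (a hypothetical illustration, not code from Lucene or any crawler; the `UrlStore` class and its names are made up): host names repeat across many URLs, so they can be interned once and referenced by id, while the path part is Huffman-coded over the small URL character alphabet.

```python
# Hypothetical sketch: intern host names, Huffman-code the path part.
# The code table is built from a sample of paths; in a real crawler you
# would need to handle characters that fall outside the sample.
import heapq
from collections import Counter
from itertools import count
from urllib.parse import urlsplit

def build_huffman_codes(text):
    """Return a char -> bitstring code table for the given sample text."""
    freq = Counter(text)
    tiebreak = count()  # keeps heap comparisons away from non-comparable nodes
    heap = [(f, next(tiebreak), ch) for ch, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: a single distinct character
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, str):
            codes[node] = prefix
        else:
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
    walk(heap[0][2], "")
    return codes

class UrlStore:
    """Store URLs as (interned host id, Huffman-coded path bits)."""
    def __init__(self, sample_paths):
        self.codes = build_huffman_codes("".join(sample_paths))
        self.hosts = {}   # host name -> id (each host stored once)
        self.urls = []    # list of (host_id, encoded_path_bitstring)

    def add(self, url):
        parts = urlsplit(url)
        host_id = self.hosts.setdefault(parts.netloc, len(self.hosts))
        bits = "".join(self.codes[ch] for ch in parts.path)
        self.urls.append((host_id, bits))
        return len(self.urls) - 1
```

The bitstrings here stand in for packed bits; an actual implementation would store them in byte arrays, and would likely code the host list and query strings too.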
See this: http://www.almaden.ibm.com/cs/k53/www9.final/

"In CS2, each URL is stored in 10 bytes. In CS1, each link requires 8 bytes to store
as both an in-link and out-link; in CS2, an average of only 3.4 bytes are used.
Second, CS2 provides additional functionality in the form of a host database. For
example, in CS2, it is easy to get all the in-links for a given node, or just the
in-links from remote hosts.
Like CS1, CS2 is designed to give high-performance access to all this data on a
high-end machine with enough RAM to store the database in memory. On a 465 MHz Compaq
AlphaServer 4100 with 12GB of RAM, it takes 70-80 ms to convert a URL into an internal
id or vice versa, and then only 0.15 ms/link to retrieve each in-link or out-link. On
a uniprocessor machine, a BFS that reaches 100M nodes takes about 4 minutes; on a
2-processor machine we were able to complete a BFS every two minutes."
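The interface the paper describes, mapping URLs to dense integer ids and storing links as integer adjacency lists, can be sketched in miniature like this (a toy illustration only, not the CS2 implementation; `LinkDB` and its methods are invented names):

```python
# Toy link database: URLs get dense integer ids, links are stored as
# integer adjacency lists, so each link costs a few bytes instead of
# two full URL strings.
from urllib.parse import urlsplit

class LinkDB:
    def __init__(self):
        self.id_of = {}      # url -> internal id
        self.url_of = []     # internal id -> url
        self.out_links = []  # id -> list of target ids
        self.in_links = []   # id -> list of source ids

    def node(self, url):
        """Convert a URL to its internal id, creating it on first sight."""
        if url not in self.id_of:
            self.id_of[url] = len(self.url_of)
            self.url_of.append(url)
            self.out_links.append([])
            self.in_links.append([])
        return self.id_of[url]

    def add_link(self, src_url, dst_url):
        s, d = self.node(src_url), self.node(dst_url)
        self.out_links[s].append(d)
        self.in_links[d].append(s)

    def in_links_from_remote_hosts(self, url):
        """In-links whose source is on a different host (the CS2 example query)."""
        my_host = urlsplit(url).netloc
        return [self.url_of[s]
                for s in self.in_links[self.node(url)]
                if urlsplit(self.url_of[s]).netloc != my_host]
```

With ids in place, a BFS over the graph touches only small integers, which is what makes the traversal rates quoted above plausible on an in-memory database.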
peter