> -----Original Message-----
> From: Clemens Marschner [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, April 24, 2002 11:14 PM
> To: Lucene Developers List; [EMAIL PROTECTED]
> Subject: Re: Web Crawler
>
> Another thing I have in mind is to compress the URLs in
> memory. First of all, a URL can be divided into several parts, some of
> which occur in many URLs (e.g. the host name). And since URLs contain
> only a limited number of different characters, Huffman encoding is
> probably quite efficient.
>
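A rough sketch of that split-plus-Huffman idea (a hypothetical illustration, not code from Lucene or any crawler; the `UrlStore` class and its names are made up): host names repeat across many URLs, so they can be interned once and referenced by id, while the path part is Huffman-coded over the small URL character alphabet.

```python
# Hypothetical sketch: intern host names, Huffman-code the path part.
# The code table is built from a sample of paths; in a real crawler you
# would need to handle characters that fall outside the sample.
import heapq
from collections import Counter
from itertools import count
from urllib.parse import urlsplit

def build_huffman_codes(text):
    """Return a char -> bitstring code table for the given sample text."""
    freq = Counter(text)
    tiebreak = count()  # keeps heap comparisons away from non-comparable nodes
    heap = [(f, next(tiebreak), ch) for ch, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: a single distinct character
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, str):
            codes[node] = prefix
        else:
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
    walk(heap[0][2], "")
    return codes

class UrlStore:
    """Store URLs as (interned host id, Huffman-coded path bits)."""
    def __init__(self, sample_paths):
        self.codes = build_huffman_codes("".join(sample_paths))
        self.hosts = {}   # host name -> id (each host stored once)
        self.urls = []    # list of (host_id, encoded_path_bitstring)

    def add(self, url):
        parts = urlsplit(url)
        host_id = self.hosts.setdefault(parts.netloc, len(self.hosts))
        bits = "".join(self.codes[ch] for ch in parts.path)
        self.urls.append((host_id, bits))
        return len(self.urls) - 1
```

The bitstrings here stand in for packed bits; an actual implementation would store them in byte arrays, and would likely code the host list and query strings too.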
See this: http://www.almaden.ibm.com/cs/k53/www9.final/

"In CS2, each URL is stored in 10 bytes. In CS1, each link requires 8 bytes to store
as both an in-link and out-link; in CS2, an average of only 3.4 bytes are used.
Second, CS2 provides additional functionality in the form of a host database. For
example, in CS2, it is easy to get all the in-links for a given node, or just the
in-links from remote hosts.
Like CS1, CS2 is designed to give high-performance access to all this data on a
high-end machine with enough RAM to store the database in memory. On a 465 MHz Compaq
AlphaServer 4100 with 12GB of RAM, it takes 70-80 ms to convert a URL into an internal
id or vice versa, and then only 0.15 ms/link to retrieve each in-link or out-link. On
a uniprocessor machine, a BFS that reaches 100M nodes takes about 4 minutes; on a
2-processor machine we were able to complete a BFS every two minutes."
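The interface the paper describes, mapping URLs to dense integer ids and storing links as integer adjacency lists, can be sketched in miniature like this (a toy illustration only, not the CS2 implementation; `LinkDB` and its methods are invented names):

```python
# Toy link database: URLs get dense integer ids, links are stored as
# integer adjacency lists, so each link costs a few bytes instead of
# two full URL strings.
from urllib.parse import urlsplit

class LinkDB:
    def __init__(self):
        self.id_of = {}      # url -> internal id
        self.url_of = []     # internal id -> url
        self.out_links = []  # id -> list of target ids
        self.in_links = []   # id -> list of source ids

    def node(self, url):
        """Convert a URL to its internal id, creating it on first sight."""
        if url not in self.id_of:
            self.id_of[url] = len(self.url_of)
            self.url_of.append(url)
            self.out_links.append([])
            self.in_links.append([])
        return self.id_of[url]

    def add_link(self, src_url, dst_url):
        s, d = self.node(src_url), self.node(dst_url)
        self.out_links[s].append(d)
        self.in_links[d].append(s)

    def in_links_from_remote_hosts(self, url):
        """In-links whose source is on a different host (the CS2 example query)."""
        my_host = urlsplit(url).netloc
        return [self.url_of[s]
                for s in self.in_links[self.node(url)]
                if urlsplit(self.url_of[s]).netloc != my_host]
```

With ids in place, a BFS over the graph touches only small integers, which is what makes the traversal rates quoted above plausible on an in-memory database.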
peter