LARM web crawler: use lucene itself for visited URLs

Ype Kingma Wed, 30 Oct 2002 13:58:09 -0800

I managed to loose some recent messages on the LARM crawler and the lucene
file formats, so I don't know whom to address.


Anyway, I noticed this on the LARM crawler info page
http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html
<<<
Something worth while would be to compress the URLs. A lot of parts of URLs 
are the same between hundreds of URLs (i.e. the host name). And since only a 
limited number of characters are allowed in URLs, Huffman compression will 
lead to a good compression rate. 
>>>

and this on the file formats page
http://jakarta.apache.org/lucene/docs/fileformats.html
<<<
Term text prefixes are shared. The PrefixLength is the number of initial 
characters from the previous term which must be pre-pended to a term's suffix 
in order to form the term's text. Thus, if the previous term's text was 
"bone" and the term is "boy", the PrefixLength is two and the suffix is "y". 
>>>

Somehow I get the impression that lucene itself would be quite helpful for 
the crawler by using indexed, non stored fields for the normalized visited 
URLs.

Have fun,
Ype

--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@;jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@;jakarta.apache.org>

LARM web crawler: use lucene itself for visited URLs

Reply via email to