I managed to loose some recent messages on the LARM crawler and the lucene file formats, so I don't know whom to address.
Anyway, I noticed this on the LARM crawler info page http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html <<< Something worth while would be to compress the URLs. A lot of parts of URLs are the same between hundreds of URLs (i.e. the host name). And since only a limited number of characters are allowed in URLs, Huffman compression will lead to a good compression rate. >>> and this on the file formats page http://jakarta.apache.org/lucene/docs/fileformats.html <<< Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be pre-pended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y". >>> Somehow I get the impression that lucene itself would be quite helpful for the crawler by using indexed, non stored fields for the normalized visited URLs. Have fun, Ype -- To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@;jakarta.apache.org> For additional commands, e-mail: <mailto:lucene-user-help@;jakarta.apache.org>
