A few months ago, I did a quick-and-dirty experiment using the java.util.zip compression utilities to compress stored text fields in Lucene 1.3. Unfortunately, I no longer have the data, but as I recall it was not clear that compression was always a benefit. In particular, if the text fields are short (like paper titles), the overhead of the compression's embedded dictionary can make the compressed string longer than the uncompressed one. The CPU overhead was also non-trivial compared to Lucene's already fast searches. There are probably better compression algorithms than the ZIP approach, but that's the only one I tried. If one were to use an "expensive" method like ZIP, it might make sense to have a threshold length before compression kicks in: the "isCompressed" flag would only take effect if that threshold were exceeded. Alternately, the user could be responsible for setting the isCompressed flag based on the field's length.
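To make the threshold idea concrete, here is a minimal sketch using java.util.zip's Deflater. The class name, method name, and the MIN_LENGTH value are illustrative assumptions, not measured values or proposed Lucene API:

import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.util.zip.Deflater;

/**
 * Hypothetical sketch of threshold-based compression: only deflate a
 * stored field's value when it is long enough for the dictionary
 * overhead to pay off. The 200-byte threshold is a guess.
 */
public class ThresholdCompressor {

    private static final int MIN_LENGTH = 200; // illustrative threshold

    /** Returns compressed bytes, or null if the field should be stored as-is. */
    public static byte[] maybeCompress(String value)
            throws UnsupportedEncodingException {
        byte[] raw = value.getBytes("UTF-8");
        if (raw.length < MIN_LENGTH) {
            return null; // too short: store uncompressed, isCompressed stays false
        }
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length);
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        byte[] compressed = out.toByteArray();
        // Even above the threshold, keep the original if deflate did not help.
        return compressed.length < raw.length ? compressed : null;
    }
}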
David McCallie

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Friday, May 14, 2004 11:23 AM
To: Lucene Developers List
Subject: Re: stored field compression

Doug Cutting wrote:
> A more elaborate approach would be to lazily decompress fields when
> values are accessed.

Another big advantage of this approach (as Peter Cipollone reminded me) is that it will make indexing faster, as decompression will be avoided when merging.

Doug
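For illustration, a minimal sketch of what lazy decompression could look like: hold the compressed bytes and only inflate them on first access, so a merge that copies raw bytes through never pays the decompression cost. The class LazyCompressedValue and its methods are hypothetical, not Lucene API:

import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

/**
 * Hypothetical sketch of a lazily decompressed field value.
 */
public class LazyCompressedValue {

    private final byte[] compressed;
    private String value; // decoded on first access

    public LazyCompressedValue(byte[] compressed) {
        this.compressed = compressed;
    }

    /** Raw bytes, e.g. for copying during a merge -- no decompression needed. */
    public byte[] rawBytes() {
        return compressed;
    }

    /** Inflates on first call, then caches the decoded string. */
    public synchronized String stringValue()
            throws DataFormatException, UnsupportedEncodingException {
        if (value == null) {
            Inflater inflater = new Inflater();
            inflater.setInput(compressed);
            ByteArrayOutputStream out =
                new ByteArrayOutputStream(compressed.length * 4);
            byte[] buf = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    break; // guard against truncated input
                }
                out.write(buf, 0, n);
            }
            inflater.end();
            value = new String(out.toByteArray(), "UTF-8");
        }
        return value;
    }
}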
