A few months ago, I did a quick-and-dirty experiment using the java.util.zip compression utilities to compress stored text fields in Lucene 1.3. Unfortunately, I no longer have the data, but as I recall it was not clear that compression was always a benefit. In particular, if the text fields are short (like paper titles), the overhead of the compression's embedded dictionary can make the compressed string longer than the uncompressed one. The CPU overhead was also non-trivial compared to Lucene's already fast searches. There are probably better compression algorithms than the ZIP approach, but that's the only one I tried. If one were to use an "expensive" method like ZIP, it might make sense to have a threshold length before compression kicks in: the "isCompressed" flag would only take effect if that threshold were exceeded. Alternately, the user could be responsible for setting the isCompressed flag based on the field's length.
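To make the threshold idea concrete, here is a minimal sketch using java.util.zip's Deflater. The class name, method name, and the MIN_LENGTH value are illustrative assumptions, not measured values or proposed Lucene API:

import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.util.zip.Deflater;

/**
 * Hypothetical sketch of threshold-based compression: only deflate a
 * stored field's value when it is long enough for the dictionary
 * overhead to pay off. The 200-byte threshold is a guess.
 */
public class ThresholdCompressor {

    private static final int MIN_LENGTH = 200; // illustrative threshold

    /** Returns compressed bytes, or null if the field should be stored as-is. */
    public static byte[] maybeCompress(String value)
            throws UnsupportedEncodingException {
        byte[] raw = value.getBytes("UTF-8");
        if (raw.length < MIN_LENGTH) {
            return null; // too short: store uncompressed, isCompressed stays false
        }
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length);
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        byte[] compressed = out.toByteArray();
        // Even above the threshold, keep the original if deflate did not help.
        return compressed.length < raw.length ? compressed : null;
    }
}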
David McCallie

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Friday, May 14, 2004 11:23 AM
To: Lucene Developers List
Subject: Re: stored field compression

Doug Cutting wrote:
> A more elaborate approach would be to lazily decompress fields when
> values are accessed.

Another big advantage of this approach (as Peter Cipollone reminded me) is that it will make indexing faster, as decompression will be avoided when merging.

Doug
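For illustration, a minimal sketch of what lazy decompression could look like: hold the compressed bytes and only inflate them on first access, so a merge that copies raw bytes through never pays the decompression cost. The class LazyCompressedValue and its methods are hypothetical, not Lucene API:

import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

/**
 * Hypothetical sketch of a lazily decompressed field value.
 */
public class LazyCompressedValue {

    private final byte[] compressed;
    private String value; // decoded on first access

    public LazyCompressedValue(byte[] compressed) {
        this.compressed = compressed;
    }

    /** Raw bytes, e.g. for copying during a merge -- no decompression needed. */
    public byte[] rawBytes() {
        return compressed;
    }

    /** Inflates on first call, then caches the decoded string. */
    public synchronized String stringValue()
            throws DataFormatException, UnsupportedEncodingException {
        if (value == null) {
            Inflater inflater = new Inflater();
            inflater.setInput(compressed);
            ByteArrayOutputStream out =
                new ByteArrayOutputStream(compressed.length * 4);
            byte[] buf = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    break; // guard against truncated input
                }
                out.write(buf, 0, n);
            }
            inflater.end();
            value = new String(out.toByteArray(), "UTF-8");
        }
        return value;
    }
}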
