I also tried Drew Farris's binary patch. It works fine with a few test cases of mine. However, I didn't have enough time to do a thorough performance comparison. I suggest checking the patch into CVS.

On Wed, 01 Sep 2004 22:42:54 +0200, Bernhard Messer <[EMAIL PROTECTED]> wrote:
> Doug Cutting wrote:
>
> > Bernhard Messer wrote:
> >
> >> A few months ago, there was a very interesting discussion about field
> >> compression and the possibility to store binary field values within a
> >> Lucene document. Regarding this topic, Drew Farris came up with a
> >> patch to add the necessary functionality. I ran all the necessary
> >> tests on his implementation and didn't find one problem. So the
> >> original implementation from Drew could now be enhanced to compress
> >> the binary field data (maybe even the text fields if they are stored
> >> only) before writing to disc. I made some simple statistical
> >> measurements using the java.util.zip package for data compression.
> >> Enabling it, we could save about 40% data when compressing plain text
> >> files with a size from 1KB to 4KB. If there is still some interest,
> >> we could first try to update the patch, because it's outdated due to
> >> several changes within the Fields class. After finishing that,
> >> compression could be added to the updated version of the patch.
> >
> > I like this patch and support upgrading it and adding it to Lucene.
>
> Having a single, huge patch, implementing all the functionality, seems
> to be very difficult to maintain through Bugzilla. So I would suggest
> splitting the whole implementation into maybe 3 different steps:
> 1) updating the binary field patch and adding it to Lucene
> 2) making FieldsReader and FieldsWriter more readable using private
>    static finals, and adding compression
> 3) additional thoughts about compressing whole documents instead of
>    single fields.
>
> > I imagine a public API like:
> >
> >   public static final class Store {
> >     [ ... ]
> >     public static final Store COMPRESS = new Store();
> >   }
> >
> >   new Field(String, byte[]) // stored, not compressed or indexed
> >   new Field(String, byte[], Store)
> >
> > Also, in Field.java, perhaps we could replace:
> >
> >   String stringValue;
> >   Reader readerValue;
> >   byte[] binaryValue;
> >
> > with:
> >
> >   Object value;
> >
> > And in FieldsReader.java and FieldsWriter.java, some package-private
> > constants would make the code more readable, like:
> >
> >   static final int FieldWriter.IS_TOKENIZED = 1;
> >   static final int FieldWriter.IS_BINARY = 2;
> >   static final int FieldWriter.IS_COMPRESSED = 4;
> >
> > Note that it makes sense to compress non-binary values. One could use
> > String.getBytes("UTF-8") and compress that.
>
> I'm totally with you. Compressing string values would make sense if the
> length reaches a certain size (the same for byte[]). This limit is
> something we have to figure out: what the minimum size of a compression
> candidate has to be. During my tests, I saw that everything up to 100
> bytes is a perfect candidate for compression. But there is much more
> work to do in that area.
>
> > I wonder if it might make more sense to compress entire document
> > records, rather than individual fields. This would probably do better
> > when documents have lots of short text fields, as is not uncommon, and
> > would also minimize the fixed compression/decompression setup costs
> > (i.e., inflator/deflator allocations). We could instead add an
> > "isCompressed" flag to Document, and then, in Field{Reader,Writer},
> > store a bit per document indicating whether it is compressed.
> > Document records could first be serialized uncompressed to a buffer
> > which is then compressed and written. Thoughts?
>
> Interesting idea. I think this strongly depends on the fields, the
> options they have and at least their values. Would it make sense to
> compress a field which is tokenized and indexed but not stored? Maybe
> we could think of some kind of algorithm that checks the document's
> field settings and decides whether it is a candidate for compression.
> Just a thought ;-)
>
> > Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
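
For illustration, the per-field scheme discussed in the quoted thread could be sketched roughly as follows. This is only a sketch under stated assumptions, not code from Drew Farris's patch or from Lucene: the class name FieldCompressionSketch, the Stored holder, and the MIN_COMPRESS_SIZE value are hypothetical. Only the IS_* flag values and the idea of running java.util.zip over the UTF-8 bytes of a stored string come from the messages above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of Lucene or of the patch under discussion.
final class FieldCompressionSketch {
    // Flag bits with the values proposed in the thread.
    static final int IS_TOKENIZED = 1;
    static final int IS_BINARY = 2;
    static final int IS_COMPRESSED = 4;

    // Assumed cutoff; the thread leaves the real minimum size open.
    static final int MIN_COMPRESS_SIZE = 100;

    // Hypothetical holder for what would be written per stored field.
    static final class Stored {
        final int flags;
        final byte[] data;
        Stored(int flags, byte[] data) { this.flags = flags; this.data = data; }
    }

    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] inflate(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No progress and nothing left to read: the data is truncated.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated compressed data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Compress only when the value is large enough and compression actually helps.
    static Stored encode(String value) {
        byte[] raw = value.getBytes(StandardCharsets.UTF_8);
        if (raw.length >= MIN_COMPRESS_SIZE) {
            byte[] packed = deflate(raw);
            if (packed.length < raw.length) {
                return new Stored(IS_COMPRESSED, packed);
            }
        }
        return new Stored(0, raw);
    }

    static String decode(Stored stored) {
        boolean compressed = (stored.flags & IS_COMPRESSED) != 0;
        byte[] raw = compressed ? inflate(stored.data) : stored.data;
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("stored field text ");
        }
        Stored stored = encode(sb.toString());
        System.out.println("flags=" + stored.flags + " bytes=" + stored.data.length);
        System.out.println("round trip ok: " + decode(stored).equals(sb.toString()));
    }
}
```

Each call allocates a fresh Deflater and Inflater, which is exactly the fixed setup cost that the whole-document variant in the thread is meant to reduce; for that variant the same deflate/inflate helpers would be applied once to a buffer holding the serialized document record instead of once per field.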