Doug Cutting wrote:

Dmitry Serebrennikov wrote:

A different approach would be to just allow binary data in fields. That way applications can compress and decompress as they see fit, plus they would be able to store numerical and other data more efficiently.


That's an interesting idea. One could, for convenience and compatibility, add accessor methods to Field that, when you add a String, convert it to UTF-8 bytes, and make stringValue() parse (and possibly cache) a UTF-8 string from the binary value. There'd be another allocation per field read: FieldReader would construct a byte[], then stringValue() would construct a String with a char[]. Right now we only construct a String with a char[] per stringValue(). Perhaps this is moot, especially if we're lazy about constructing the strings and they're cached. That way, for all the fields you don't access you save an allocation.
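A minimal sketch of the lazy, cached decoding described above. `LazyStringField` and its method names are hypothetical stand-ins rather than Lucene's actual `Field` class; the only assumption is that the stored value arrives as UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class for illustration only; not Lucene's Field.
final class LazyStringField {
    private final String name;
    private final byte[] bytes;   // raw value, as it would come from the index
    private String cached;        // decoded on first access, then reused

    LazyStringField(String name, byte[] bytes) {
        this.name = name;
        this.bytes = bytes;
    }

    // Convenience factory: adding a String converts it to UTF-8 bytes.
    static LazyStringField fromString(String name, String value) {
        return new LazyStringField(name, value.getBytes(StandardCharsets.UTF_8));
    }

    String name() {
        return name;
    }

    // Decodes only on first access; fields that are never read
    // never pay for the String allocation.
    String stringValue() {
        if (cached == null) {
            cached = new String(bytes, StandardCharsets.UTF_8);
        }
        return cached;
    }

    // Exposed only so the laziness can be observed in a test.
    boolean isDecoded() {
        return cached != null;
    }
}
```

With this shape, the extra byte[] allocation is paid for every field read, but the String is built only for fields the application actually asks for.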

Actually, I was thinking of something simpler... Something like a special case where one could supply binary data directly to a stored field. Something like:
public class Field {
    public static Field Binary(String name, byte[] value);
    public boolean isBinary();
    public byte[] binaryValue();
}


This would automatically become a stored field. Lucene wouldn't need to know what the data means - just carry it around. The binaryValue() can return null unless isBinary() is true, in which case you'd get the data back and stringValue() would return null instead.
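To make the intended semantics concrete, here is a rough sketch. `SketchField` is a hypothetical stand-in for `Field`, and the `Text` factory and `isStored()` accessor are assumptions added only so the two kinds of field can be compared side by side.

```java
// Hypothetical stand-in for the proposed additions to Field.
final class SketchField {
    private final String name;
    private final String text;    // null for binary fields
    private final byte[] binary;  // null for text fields
    private final boolean stored;

    private SketchField(String name, String text, byte[] binary, boolean stored) {
        this.name = name;
        this.text = text;
        this.binary = binary;
        this.stored = stored;
    }

    // Assumed text factory, included only for comparison.
    static SketchField Text(String name, String value, boolean stored) {
        return new SketchField(name, value, null, stored);
    }

    // Binary fields are always stored; the bytes are carried around opaquely.
    static SketchField Binary(String name, byte[] value) {
        return new SketchField(name, null, value, true);
    }

    String name() { return name; }

    boolean isStored() { return stored; }

    boolean isBinary() { return binary != null; }

    // Returns null for binary fields.
    String stringValue() { return text; }

    // Returns null unless isBinary() is true.
    byte[] binaryValue() { return binary; }
}
```

A caller would check `isBinary()` first and then read either `binaryValue()` or `stringValue()`, never both.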

This would be a start. If we want to provide special handling for ints, floats, and so on, we provide a BinaryField class, a la DateField.

We might lose some efficiency because ints and longs would be better off if they were stored as ints and longs rather than a byte[]...

Actually, we might be able to represent binary data fields as offsets into the complete byte[] that was read from the index file in the first place. That way we wouldn't need to copy the data until the binaryValue() method is called. The BinaryField class could also do the byte[] -> int conversion directly from the offsets into the main byte[] buffer, again saving a byte[] allocation.
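A sketch of the offset idea, under assumptions: `BinarySlice` is a hypothetical name, and the big-endian four-byte int layout is chosen only for illustration, since nothing here fixes an on-disk format.

```java
import java.util.Arrays;

// Hypothetical view onto a region of a larger buffer read from the index.
final class BinarySlice {
    private final byte[] buffer;  // shared buffer, not owned by this slice
    private final int offset;
    private final int length;

    BinarySlice(byte[] buffer, int offset, int length) {
        if (offset < 0 || length < 0 || offset > buffer.length - length) {
            throw new IndexOutOfBoundsException("slice outside buffer");
        }
        this.buffer = buffer;
        this.offset = offset;
        this.length = length;
    }

    // Decodes a big-endian int in place, without allocating a byte[].
    int intValue() {
        if (length < 4) {
            throw new IllegalStateException("slice shorter than 4 bytes");
        }
        return ((buffer[offset] & 0xFF) << 24)
             | ((buffer[offset + 1] & 0xFF) << 16)
             | ((buffer[offset + 2] & 0xFF) << 8)
             | (buffer[offset + 3] & 0xFF);
    }

    // The copy happens only here, when the caller asks for the bytes.
    byte[] binaryValue() {
        return Arrays.copyOfRange(buffer, offset, offset + length);
    }
}
```

The trade-off is that every slice keeps the whole shared buffer reachable, so this only pays off if the buffer's lifetime is already tied to the document being read.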

Would binary fields only be useful for stored fields? I can't really see how binary data could be usefully tokenized, but maybe in some multimedia applications? Binary keyword fields might be interesting. These could allow searching on integer ranges, more straightforward date ranges, and more efficient data storage in some cases. That's a big change though. We'd have to change all searching to be based on binary tokens instead of strings.
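One way integer range searching over binary keywords could work, sketched under assumptions: if terms are compared as unsigned bytes, flipping the sign bit of a big-endian int makes byte order agree with numeric order. `SortableInt` is a hypothetical helper, and this encoding is a common general technique rather than anything Lucene is confirmed to do here.

```java
// Hypothetical helper: encodes ints so that unsigned byte-wise
// comparison of the encodings matches numeric comparison of the values.
final class SortableInt {
    private SortableInt() {}

    static byte[] encode(int value) {
        int u = value ^ 0x80000000;  // flip sign bit so negatives sort first
        return new byte[] {
            (byte) (u >>> 24), (byte) (u >>> 16), (byte) (u >>> 8), (byte) u
        };
    }

    static int decode(byte[] bytes) {
        int u = ((bytes[0] & 0xFF) << 24)
              | ((bytes[1] & 0xFF) << 16)
              | ((bytes[2] & 0xFF) << 8)
              | (bytes[3] & 0xFF);
        return u ^ 0x80000000;
    }

    // Lexicographic comparison treating each byte as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }
}
```

With terms ordered this way, a range query becomes a scan between two encoded endpoints, with no string padding needed.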



Of course, this would then be a per-value compression and probably not as effective as a whole index compression that could be done with the other approaches.


But, since documents are accessed randomly, we can't easily do a lot better for field data.

I don't know much about how the Zip algorithm works internally, but it seems that there could be a parallel between a zip file with its zip entries and a Lucene index with its Lucene documents.


This feature is primarily intended to make life easier for folks who want to store whole documents in the index. Selective use of gzip would be a huge improvement over the present situation. Alternate compression algorithms might make things a bit better yet, but probably not hugely.

I agree, unless one can figure out how to share the dictionary across documents.
If we go now with the simple binary data-bucket design described above, applications can implement whatever clever scheme they choose. The BinaryField class will provide helper methods for the most common things. Perhaps GZipField is another good candidate for the immediate future.
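As a rough idea of what a GZipField-style helper could do on the application side, here is a sketch using the standard `java.util.zip` streams. `GzipFieldCodec` is a hypothetical name, and the UTF-8 text encoding is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: turns text into gzip bytes suitable for a binary field.
final class GzipFieldCodec {
    private GzipFieldCodec() {}

    static byte[] compress(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    static String decompress(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The result of `compress` would be handed to the binary field, and `decompress` applied to whatever `binaryValue()` returns. Because each value is compressed on its own, very short values can come out larger than the input.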


Going forward, perhaps there is a way to do compression such that a dictionary is managed for each segment of the index, and merged when the segments are merged? If this is possible, it would be a good argument for Lucene to be compression-aware.
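For what it's worth, `java.util.zip` already supports preset dictionaries through `Deflater.setDictionary` and `Inflater.setDictionary`, so the per-value half of this can be sketched. `SharedDictionaryCodec` is a hypothetical name, and how a per-segment dictionary would be chosen, stored, or merged is left open here.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical codec: every value is compressed against one shared dictionary,
// which would have to be kept alongside the segment.
final class SharedDictionaryCodec {
    private final byte[] dictionary;

    SharedDictionaryCodec(byte[] dictionary) {
        this.dictionary = dictionary.clone();
    }

    byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setDictionary(dictionary);
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    byte[] decompress(byte[] data) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0) {
                    if (inflater.needsDictionary()) {
                        // The stream asks for the same dictionary used to compress.
                        inflater.setDictionary(dictionary);
                    } else if (inflater.needsInput()) {
                        throw new IllegalArgumentException("truncated input");
                    }
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException(e);
        } finally {
            inflater.end();
        }
    }
}
```

Values compressed this way can only be read back with the same dictionary, so merging segments with different dictionaries would presumably mean recompressing the stored values.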

How does all of this sound?

Dmitry.


