On Aug 16, 2006, at 8:32 AM, Nicolas Lalevée wrote:

Hi,

In the issue, you wrote that "This way the indexing level just stores opaque binary fields, and then Document handles compress/uncompressing as needed."

I have looked into the Lucene code, and it seems to me that it is Field that should take care of compressing/uncompressing, and that FieldsReader and FieldsWriter should only see binary data.
Or do you mean that compression should be completely external to Lucene?


I believe the consensus is it should be done externally.
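
To make that concrete, here is a minimal sketch of what "external" means (the class and method names are just for illustration, and it assumes the Field constructor that takes a byte[] for binary stored fields): the application compresses before indexing, and Lucene only ever sees opaque bytes.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class ExternalCompression {
  // Compress a value before it ever reaches Lucene; the index stores only
  // the opaque bytes, and the application inflates them on retrieval
  // (e.g. with java.util.zip.InflaterInputStream).
  public static byte[] deflate(byte[] raw) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    DeflaterOutputStream out =
        new DeflaterOutputStream(bos, new Deflater(Deflater.BEST_COMPRESSION));
    out.write(raw);
    out.close(); // finishes the deflater so all compressed bytes are flushed
    return bos.toByteArray();
  }
}

// At index time, something like:
//   doc.add(new Field("body", ExternalCompression.deflate(bytes), Field.Store.YES));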

In fact, at the end of the other thread, "Flexible index format / Payloads Cont'd", I was discussing how to customize the way data are stored. So I have looked deeper into the code, and I think I have found a way to do so. And since you can change the way the data are stored, you can also define the compression level, or plug in your own compression algorithm. I will show you a patch, but I have modified so much code during my several tries that I first need to remove the unnecessary changes. To describe it shortly (a rough sketch follows the list):
- I have provided a way to supply your own FieldsReader and FieldsWriter (via a factory). To create an IndexReader, you have to provide that factory; the current API just uses a default one.
- I have moved the code of FieldsReader and FieldsWriter that does the actual field data reading and writing into a new class, FieldData. The FieldsReader instantiates a FieldData, does a fieldData.read(input), and then does a new Field(fieldData, ...). The FieldsWriter does a field.getFieldData().write(output);
- So, by extending FieldsReader, you can provide your own implementation of FieldData, and thus control how the data are stored and read.
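
Roughly, the shape is something like this (only a sketch of my work in progress; the factory name and signatures are provisional, and it assumes FieldsReader/FieldsWriter are made public):

import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

// Provisional factory: an IndexReader/IndexWriter would be handed one of
// these instead of constructing FieldsReader/FieldsWriter directly.
public interface FieldsStreamFactory {
  FieldsReader newFieldsReader(Directory dir, String segment) throws IOException;
  FieldsWriter newFieldsWriter(Directory dir, String segment) throws IOException;
}

// FieldData owns the on-disk representation of one field's stored data.
public class FieldData {
  public void read(IndexInput input) throws IOException {
    // default: decode the field exactly as the current file format does
  }
  public void write(IndexOutput output) throws IOException {
    // default: encode the field exactly as the current file format does
  }
}

A custom FieldData subclass could then, for example, deflate in write() and inflate in read(), which is exactly where a pluggable compression level or algorithm fits.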
The tests pass successfully, but I have an issue with that design. One thing that I think is important is that with the current design, you can read an index in an old format and just do a writer.addIndexes() into a new format. With the new design you cannot, because the writer will use the FieldData.write provided by the reader.
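
For reference, the old-to-new conversion that works today is just this (a minimal sketch against the Lucene 2.0 API; the paths are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ConvertIndex {
  public static void main(String[] args) throws Exception {
    // placeholder paths: an existing old-format index and a fresh target
    Directory oldDir = FSDirectory.getDirectory("/path/to/old-index", false);
    Directory newDir = FSDirectory.getDirectory("/path/to/new-index", true);

    IndexWriter writer = new IndexWriter(newDir, new StandardAnalyzer(), true);
    writer.addIndexes(new Directory[] { oldDir }); // rewrites segments in the writer's format
    writer.close();
  }
}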
To be continued...

I would love to see this patch. I think one could make a pretty good argument for this kind of implementation being done "cleanly"; that is, it shouldn't necessarily involve reworking the internals, but instead could represent the foundation for a new, codec-based indexing mechanism (with an implementation that can read/write the existing file format).



cheers,
Nicolas



--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886



