From: Grant Ingersoll <[EMAIL PROTECTED]>
Sent: Aug 16, 2006 6:51 AM
To: java-dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP
compression level for compressed fields
On Aug 16, 2006, at 8:32 AM, Nicolas Lalevée wrote:
Hi,
In the issue, you wrote that "This way the indexing level just stores
opaque binary fields, and then Document handles compress/uncompressing
as needed."
I have looked into the Lucene code, and it seems to me that it is Field
that should take care of compressing/uncompressing, and that
FieldsReader and FieldsWriter should only see binary data.
Or do you mean that compression should be completely external to
Lucene?
I believe the consensus is it should be done externally.
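To make "done externally" concrete, here is a minimal sketch, assuming the application compresses the value itself with java.util.zip at whatever level it likes and hands the resulting bytes to Lucene as an opaque binary stored field. The class name ExternalCompression and its two methods are hypothetical helpers, not part of Lucene.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical application-side helper; Lucene would only ever see the byte[]. */
final class ExternalCompression {

    /** Deflates raw at the given level (Deflater.BEST_SPEED .. Deflater.BEST_COMPRESSION). */
    static byte[] compress(byte[] raw, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(raw);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(chunk);
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native resources
        }
    }

    /** Inflates bytes produced by compress(). */
    static byte[] decompress(byte[] packed) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(packed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0 && !inflater.finished()) {
                    // no progress and not done: input is truncated or needs a dictionary
                    throw new DataFormatException("cannot inflate stored value");
                }
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } finally {
            inflater.end();
        }
    }
}
```

With this approach the compression level is purely an application decision, and the index format does not need to know about it.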
In fact, at the end of the other thread, "Flexible index format /
Payloads Cont'd", I was discussing how to customize the way data is
stored. So I looked deeper into the code, and I think I have found a
way to do it. And since you can change the way data is stored, you can
also define the compression level, or plug in your own compression
algorithm. I will show you a patch, but I have modified so much code
over my several attempts that I first need to remove the unnecessary
changes. To describe it briefly:
- I have added a way to provide your own FieldsReader and FieldsWriter
(via a factory). To create an IndexReader, you have to provide that
factory; the current API just uses a default factory.
- I have moved the code of FieldsReader and FieldsWriter that reads and
writes the field data into a new class, FieldData. The FieldsReader
instantiates a FieldData, calls fielddata.read(input), and creates a
new Field(fielddata,...). The FieldsWriter calls
field.getFieldData().write(output);
- So by extending FieldsReader, you can provide your own implementation
of FieldData, and thus control exactly how data is stored and read.
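The patch itself is not shown in this thread, so the following is only a sketch of the design described in the list above, under stated assumptions. The names FieldData, read and write come from the description; FieldDataFactory, DeflatedFieldData and StoredFieldsSketch are hypothetical, and java.io.DataInput/DataOutput stand in for Lucene's own input and output classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/** Owns the stored encoding of one field value (shape assumed from the email). */
interface FieldData {
    void write(DataOutput out) throws IOException;
    void read(DataInput in) throws IOException;
    byte[] value();
}

/** Hypothetical: what a custom FieldsReader would use to create FieldData instances. */
interface FieldDataFactory {
    FieldData newFieldData();
}

/** Hypothetical FieldData that deflates its value at a caller-chosen level. */
final class DeflatedFieldData implements FieldData {
    private final int level;
    private byte[] value;

    DeflatedFieldData(int level, byte[] value) {
        this.level = level;
        this.value = value;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        Deflater deflater = new Deflater(level);
        try (DeflaterOutputStream z = new DeflaterOutputStream(buf, deflater)) {
            z.write(value);
        } finally {
            deflater.end(); // the stream does not release a caller-supplied Deflater
        }
        out.writeInt(buf.size()); // length prefix, then the compressed bytes
        out.write(buf.toByteArray());
    }

    @Override
    public void read(DataInput in) throws IOException {
        byte[] compressed = new byte[in.readInt()];
        in.readFully(compressed);
        try (InflaterInputStream z =
                 new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            value = z.readAllBytes(); // requires Java 9 or later
        }
    }

    @Override
    public byte[] value() {
        return value;
    }
}

/** Stand-ins for the write and read paths of FieldsWriter and FieldsReader. */
final class StoredFieldsSketch {
    static byte[] writeField(FieldData data) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        data.write(new DataOutputStream(bytes)); // like field.getFieldData().write(output)
        return bytes.toByteArray();
    }

    static FieldData readField(FieldDataFactory factory, byte[] stored) throws IOException {
        FieldData data = factory.newFieldData(); // reader picks the implementation
        data.read(new DataInputStream(new ByteArrayInputStream(stored)));
        return data;
    }
}
```

Here the reader and writer only move bytes; the compression level and algorithm live entirely in the FieldData implementation that the factory hands out.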
The tests pass, but I have an issue with that design: one important
property of the current design, I think, is that we can read an index
in an old format and just do a writer.addIndexes() into a new format.
With the new design you cannot, because the writer will use the
FieldData.write provided by the reader.
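The format-coupling problem can be shown in a few lines. This is an illustrative sketch only: OldFormatFieldData, NewFormatFieldData, their one-byte tags and MergeSketch are all hypothetical, and copyReencoded is just one conceivable way around the problem, not something proposed in this thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

/** Write side of the FieldData idea only; the shape is assumed. */
interface FieldData {
    void write(DataOutput out) throws IOException;
    byte[] value();
}

/** Hypothetical old encoding: tag 'O', 2-byte length, raw bytes. */
final class OldFormatFieldData implements FieldData {
    static final byte TAG = 'O';
    private final byte[] value;

    OldFormatFieldData(byte[] value) { this.value = value; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeByte(TAG);
        out.writeShort(value.length);
        out.write(value);
    }

    @Override
    public byte[] value() { return value; }
}

/** Hypothetical new encoding: tag 'N', 4-byte length, raw bytes. */
final class NewFormatFieldData implements FieldData {
    static final byte TAG = 'N';
    private final byte[] value;

    NewFormatFieldData(byte[] value) { this.value = value; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeByte(TAG);
        out.writeInt(value.length);
        out.write(value);
    }

    @Override
    public byte[] value() { return value; }
}

final class MergeSketch {
    /** The issue: the writer calls write() on the reader's FieldData, so the old encoding leaks through. */
    static byte[] copyAsIs(FieldData fromReader) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        fromReader.write(new DataOutputStream(bytes));
        return bytes.toByteArray();
    }

    /** One conceivable workaround: the writer re-wraps the decoded value in its own FieldData. */
    static byte[] copyReencoded(FieldData fromReader) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new NewFormatFieldData(fromReader.value()).write(new DataOutputStream(bytes));
        return bytes.toByteArray();
    }
}
```

In copyAsIs the target index ends up holding old-format bytes, which is the addIndexes() regression described above.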
To be continued...
I would love to see this patch. I think one could make a pretty good
argument for this kind of implementation being done "cleanly", that
is, it shouldn't necessarily involve reworking the internals, but
instead could represent the foundation for a new, codec based
indexing mechanism (with an implementation that can read/write the
existing file format.)
cheers,
Nicolas
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org
Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886