I agree. I would vote for deprecating the compression stuff. I am still interested in the flexible indexing part mentioned later in Nicolas' response, but that is a separate thread.


On Aug 16, 2006, at 8:33 PM, Robert Engels wrote:

I just think the compressed field type should be removed from Lucene altogether. Only the binary field type should remain, and the application can compress/uncompress fields externally, using a facade/containment hierarchy around Document.

That is, something like:

class MyDocument {
    Document doc;

    // isCompressed(), decompress(), and the accessors below are schematic,
    // application-level helpers, not Lucene API.
    String getField(String name) {
        if (isCompressed(name)) {
            return decompress(doc.getBinaryField(name));
        } else {
            return doc.getField(name);
        }
    }
}

Or some such thing, and not deal with compression at the Lucene level. In order for Lucene to deal with compression, you would really need to settle on the compression type and parameters and how they would be stored - otherwise cross-platform implementations (or Plucene) would never be able to access the index. If the compression were external, all an implementation would need is binary field support, and it would only be unable to access the compressed fields if it lacked a suitable way to decompress them.
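For example, the compression could live entirely in application code using java.util.zip - this is purely a sketch of that external approach, and the helper class and names are made up, not anything in Lucene:

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Application-owned compression: the index only ever sees the resulting
// bytes through its binary field support and never needs to know the format.
class ExternalCompression {

    static byte[] compress(String text) throws Exception {
        byte[] input = text.getBytes("UTF-8");
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        return out.toByteArray();
    }

    static String decompress(byte[] stored) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(stored);
        ByteArrayOutputStream out = new ByteArrayOutputStream(stored.length * 4);
        byte[] buf = new byte[1024];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        return out.toString("UTF-8");
    }
}

The compressed bytes would then be stored and retrieved through whatever binary field support the implementation has, and a wrapper like MyDocument above would call decompress() on read.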

Otherwise, I think you need a much more advanced compression scheme - similar to the PDF specification - because different fields would ideally be compressed using different algorithms, and forcing a one-size-fits-all approach doesn't normally work well in such a low-level library.
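If per-field algorithms really were wanted, that choice could also stay in the application, for instance with a small registry keyed by field name. Again this is purely illustrative; the Compressor interface and class names are invented for this sketch:

import java.util.HashMap;
import java.util.Map;

// Hypothetical application-side registry: each field name maps to its own
// compression strategy, so the index format itself stays algorithm-agnostic.
interface Compressor {
    byte[] compress(byte[] raw) throws Exception;
    byte[] decompress(byte[] stored) throws Exception;
}

class FieldCompressors {
    private final Map<String, Compressor> byField = new HashMap<String, Compressor>();

    void register(String fieldName, Compressor compressor) {
        byField.put(fieldName, compressor);
    }

    boolean isCompressed(String fieldName) {
        return byField.containsKey(fieldName);
    }

    // Applied before the bytes are handed to the index as a binary field.
    byte[] forStorage(String fieldName, byte[] raw) throws Exception {
        Compressor c = byField.get(fieldName);
        return c == null ? raw : c.compress(raw);
    }

    // Applied after the bytes are read back from the binary field.
    byte[] fromStorage(String fieldName, byte[] stored) throws Exception {
        Compressor c = byField.get(fieldName);
        return c == null ? stored : c.decompress(stored);
    }
}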



-----Original Message-----
From: Grant Ingersoll <[EMAIL PROTECTED]>
Sent: Aug 16, 2006 6:51 AM
To: java-dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields


On Aug 16, 2006, at 8:32 AM, Nicolas Lalevée wrote:

Hi,

In the issue, you wrote that "This way the indexing level just stores opaque binary fields, and then Document handles compress/uncompressing as needed."

I have looked into the Lucene code, and it seems to me that it is Field that should take care of compressing/uncompressing, and that FieldsReader and FieldsWriter should only see binary data. Or do you mean that compression should be completely external to Lucene?


I believe the consensus is it should be done externally.

In fact, at the end of the other thread "Flexible index format / Payloads Cont'd", I was discussing how to customize the way data are stored. So I have looked deeper into the code and I think I have found a way to do it. And since you can change the way the data are stored, you can also define the compression level, or plug in your own compression algorithm. I will show you a patch, but I have modified so much code over my several tries that I first need to remove the unnecessary changes. To describe it shortly:
- I have provided a way to supply your own FieldsReader and FieldsWriter (via a factory). To create an IndexReader, you have to provide that factory; the current API just uses a default factory.
- I have moved the field-data reading and writing code out of FieldsReader and FieldsWriter into a new class, FieldData. The FieldsReader instantiates a FieldData, does a fieldData.read(input), and then a new Field(fieldData, ...). The FieldsWriter does a field.getFieldData().write(output).
- So by extending FieldsReader, you can provide your own implementation of FieldData, and thus control however you want how the data are stored and read (see the sketch below).
The tests pass successfully, but I have an issue with that design: one thing that I think is important is that with the current design you can read an index in an old format and just do a writer.addIndexes() into a new format. With the new design you cannot, because the writer will use the FieldData.write provided by the reader.
To be continued...
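A rough sketch of the FieldData/factory split described above might look like the following. All names here are guesses inferred from the description, since the actual patch was not posted in this thread:

import java.io.IOException;

import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

// Hypothetical: one FieldData instance owns how a single stored field's
// data is laid out on disk.
abstract class FieldData {
    // The FieldsReader would instantiate a FieldData and call read(input)
    // before wrapping it in a new Field(fieldData, ...).
    public abstract void read(IndexInput input) throws IOException;

    // The FieldsWriter would call field.getFieldData().write(output).
    public abstract void write(IndexOutput output) throws IOException;
}

// Hypothetical factory handed to the IndexReader so that a custom
// FieldsReader/FieldsWriter pair (and hence custom FieldData) can be plugged in.
interface FieldsReaderFactory {
    FieldData newFieldData(String fieldName);
}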

I would love to see this patch.  I think one could make a pretty good argument for this kind of implementation being done "cleanly", that is, it shouldn't necessarily involve reworking the internals, but instead could represent the foundation for a new, codec-based indexing mechanism (with an implementation that can read/write the existing file format).



cheers,
Nicolas







--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886



