I agree. I would vote for deprecating the compression stuff. I am still interested in the flexible indexing part mentioned later in Nicolas' response, but that is a separate thread.


On Aug 16, 2006, at 8:33 PM, Robert Engels wrote:

I just think the compressed field type should be removed from Lucene altogether. Only the binary field type should remain, and the application can compress/uncompress fields externally, using a facade/containment hierarchy around Document.

That is, something like:

class MyDocument {
    Document doc;

    // isCompressed(), decompress(), and the accessors below are schematic,
    // application-level helpers, not Lucene API.
    String getField(String name) {
        if (isCompressed(name)) {
            return decompress(doc.getBinaryField(name));
        } else {
            return doc.getField(name);
        }
    }
}

Or some such thing, and not deal with compression at the Lucene level. In order for Lucene to deal with compression, you would really need to settle on the compression type and parameters and how they would be stored - otherwise cross-platform implementations (or Plucene) would never be able to access the index. If the compression were external, all an implementation would need is binary field support, and it would only be unable to access the compressed fields if it lacked a suitable way to decompress them.
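For example, the compression could live entirely in application code using java.util.zip - this is purely a sketch of that external approach, and the helper class and names are made up, not anything in Lucene:

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Application-owned compression: the index only ever sees the resulting
// bytes through its binary field support and never needs to know the format.
class ExternalCompression {

    static byte[] compress(String text) throws Exception {
        byte[] input = text.getBytes("UTF-8");
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        return out.toByteArray();
    }

    static String decompress(byte[] stored) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(stored);
        ByteArrayOutputStream out = new ByteArrayOutputStream(stored.length * 4);
        byte[] buf = new byte[1024];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        return out.toString("UTF-8");
    }
}

The compressed bytes would then be stored and retrieved through whatever binary field support the implementation has, and a wrapper like MyDocument above would call decompress() on read.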

Otherwise, I think you need a much more advanced compression scheme - similar to the PDF specification - because different fields would ideally be compressed using different algorithms, and forcing a one-size-fits-all approach doesn't normally work well in such a low-level library.
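If per-field algorithms really were wanted, that choice could also stay in the application, for instance with a small registry keyed by field name. Again this is purely illustrative; the Compressor interface and class names are invented for this sketch:

import java.util.HashMap;
import java.util.Map;

// Hypothetical application-side registry: each field name maps to its own
// compression strategy, so the index format itself stays algorithm-agnostic.
interface Compressor {
    byte[] compress(byte[] raw) throws Exception;
    byte[] decompress(byte[] stored) throws Exception;
}

class FieldCompressors {
    private final Map<String, Compressor> byField = new HashMap<String, Compressor>();

    void register(String fieldName, Compressor compressor) {
        byField.put(fieldName, compressor);
    }

    boolean isCompressed(String fieldName) {
        return byField.containsKey(fieldName);
    }

    // Applied before the bytes are handed to the index as a binary field.
    byte[] forStorage(String fieldName, byte[] raw) throws Exception {
        Compressor c = byField.get(fieldName);
        return c == null ? raw : c.compress(raw);
    }

    // Applied after the bytes are read back from the binary field.
    byte[] fromStorage(String fieldName, byte[] stored) throws Exception {
        Compressor c = byField.get(fieldName);
        return c == null ? stored : c.decompress(stored);
    }
}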



-----Original Message-----
From: Grant Ingersoll <[EMAIL PROTECTED]>
Sent: Aug 16, 2006 6:51 AM
To: java-dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields


On Aug 16, 2006, at 8:32 AM, Nicolas Lalevée wrote:

Hi,

In the issue, you wrote that "This way the indexing level just stores opaque binary fields, and then Document handles compress/uncompressing as needed."

I have looked into the Lucene code, and it seems to me that it is Field that should take care of compressing/uncompressing, and that FieldsReader and FieldsWriter should only see binary data. Or do you mean that compression should be completely external to Lucene?


I believe the consensus is it should be done externally.

In fact, at the end of the other thread "Flexible index format / Payloads Cont'd", I was discussing how to customize the way data are stored. So I have looked deeper into the code and I think I have found a way to do it. And since you can change the way the data are stored, you can also define the compression level, or plug in your own compression algorithm. I will show you a patch, but I have modified so much code over my several tries that I first need to remove the unnecessary changes. To describe it shortly:
- I have provided a way to supply your own FieldsReader and FieldsWriter (via a factory). To create an IndexReader, you have to provide that factory; the current API just uses a default factory.
- I have moved the field-data reading and writing code out of FieldsReader and FieldsWriter into a new class, FieldData. The FieldsReader instantiates a FieldData, does a fieldData.read(input), and then a new Field(fieldData, ...). The FieldsWriter does a field.getFieldData().write(output).
- So by extending FieldsReader, you can provide your own implementation of FieldData, and thus control however you want how the data are stored and read (see the sketch below).
The tests pass successfully, but I have an issue with that design: one thing that I think is important is that with the current design you can read an index in an old format and just do a writer.addIndexes() into a new format. With the new design you cannot, because the writer will use the FieldData.write provided by the reader.
To be continued...
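A rough sketch of the FieldData/factory split described above might look like the following. All names here are guesses inferred from the description, since the actual patch was not posted in this thread:

import java.io.IOException;

import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

// Hypothetical: one FieldData instance owns how a single stored field's
// data is laid out on disk.
abstract class FieldData {
    // The FieldsReader would instantiate a FieldData and call read(input)
    // before wrapping it in a new Field(fieldData, ...).
    public abstract void read(IndexInput input) throws IOException;

    // The FieldsWriter would call field.getFieldData().write(output).
    public abstract void write(IndexOutput output) throws IOException;
}

// Hypothetical factory handed to the IndexReader so that a custom
// FieldsReader/FieldsWriter pair (and hence custom FieldData) can be plugged in.
interface FieldsReaderFactory {
    FieldData newFieldData(String fieldName);
}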

I would love to see this patch.  I think one could make a pretty good argument for this kind of implementation being done "cleanly", that is, it shouldn't necessarily involve reworking the internals, but instead could represent the foundation for a new, codec-based indexing mechanism (with an implementation that can read/write the existing file format).



cheers,
Nicolas







--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886



