Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP compression level for compressed fields

Robert Engels Wed, 16 Aug 2006 17:33:47 -0700

I just think the compressed field type should be removed from Lucene 
altogether. Only the binary field type should remain, and the application can 
externally compress/uncompress fields using a facade/containment hierarchy 
around Document.


That is

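class MyDocument {
    Document doc;

    // isCompressed() and decompress() are application-supplied helpers,
    // not part of Lucene.
    String getField(String name) {
        if (isCompressed(name)) {
            return decompress(doc.getBinaryValue(name));
        } else {
            return doc.get(name);
        }
    }
}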

Or some such thing, and not deal with the compression at the Lucene level. In 
order to have Lucene deal with the compression, you would really need to settle 
on the compression type and parameters, and how they would be stored - 
otherwise cross-platform implementations (or Plucene) would never be able to 
read the index. If the compression were external, all an implementation needs 
is binary field support, and it would only be unable to access the compressed 
fields if it did not have a suitable way to decompress them.
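
As a rough sketch of the external approach described above, the application 
could do the compression itself with java.util.zip and store the result in a 
plain binary field. The class name FieldCompression and its two methods are 
made up for illustration; nothing here is Lucene API, and the compression 
level is simply whatever the application chooses to pass in.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical application-side helper; not part of Lucene.
final class FieldCompression {

    // Compress text with the given java.util.zip level (0-9, or
    // Deflater.DEFAULT_COMPRESSION). The result is meant to be stored
    // as an opaque binary field.
    static byte[] compress(String text, int level) {
        byte[] input = text.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverse of compress(). Rejects truncated or non-zlib input.
    static String decompress(byte[] stored) {
        Inflater inflater = new Inflater();
        inflater.setInput(stored);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not zlib-compressed data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The index then only ever sees bytes, and the choice of level (or of a 
different algorithm entirely) stays with the application.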

Otherwise, I think you need a much more advanced compression scheme - similar 
to the PDF specification - because different fields would ideally be compressed 
using different algorithms, and forcing a one-size-fits-all approach doesn't 
normally work well in such a low-level library.
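
To illustrate the "different fields, different algorithms" point, here is a 
small sketch of a per-field codec lookup done on the application side. The 
names FieldCodec, IdentityCodec, DeflateCodec and FieldCodecRegistry are all 
hypothetical and are not Lucene classes; this is only one way it could be 
arranged.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical per-field codec; not a Lucene interface.
interface FieldCodec {
    byte[] encode(byte[] raw);
    byte[] decode(byte[] stored);
}

// Stores bytes unchanged, for fields not worth compressing.
final class IdentityCodec implements FieldCodec {
    public byte[] encode(byte[] raw) { return raw.clone(); }
    public byte[] decode(byte[] stored) { return stored.clone(); }
}

// zlib/deflate at a level chosen per field.
final class DeflateCodec implements FieldCodec {
    private final int level;

    DeflateCodec(int level) { this.level = level; }

    public byte[] encode(byte[] raw) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public byte[] decode(byte[] stored) {
        Inflater inflater = new Inflater();
        inflater.setInput(stored);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not zlib-compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}

// Maps field names to codecs; unregistered fields are stored as-is.
final class FieldCodecRegistry {
    private final Map<String, FieldCodec> byField = new HashMap<>();
    private final FieldCodec fallback = new IdentityCodec();

    void register(String fieldName, FieldCodec codec) {
        byField.put(fieldName, codec);
    }

    FieldCodec codecFor(String fieldName) {
        return byField.getOrDefault(fieldName, fallback);
    }
}
```

A reader on another platform would still need to know which codec was used 
for which field, which is exactly the storage question raised above; this 
sketch leaves that to the application.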



-----Original Message-----
>From: Grant Ingersoll <[EMAIL PROTECTED]>
>Sent: Aug 16, 2006 6:51 AM
>To: [email protected]
>Subject: Re: [jira] Commented: (LUCENE-648) Allow changing of ZIP compression 
>level for compressed fields
>
>
>On Aug 16, 2006, at 8:32 AM, Nicolas Lalevée wrote:
>
>> Hi,
>>
>> In the issue, you wrote that "This way the indexing level just stores
>> opaque binary fields, and then Document handles compress/uncompressing as
>> needed."
>>
>> I have looked into the Lucene code, and it seems to me that it is Field
>> that should take care of compress/uncompress, and it is the FieldsReader
>> and FieldsWriter that should only view binary data.
>> Or do you mean that compression should be completely external to Lucene?
>>
>
>I believe the consensus is it should be done externally.
>
>> In fact, at the end of the other thread, "Flexible index format /
>> Payloads Cont'd", I was discussing how to customize the way data are
>> stored. So I have looked deeper into the code and I think I have found a
>> way to do so. And as you can change the way it is stored, you can also
>> define the compression level, or handle your own compression algorithm. I
>> will show you a patch, but I have modified so much code during my several
>> attempts that I first need to remove the unnecessary changes. To describe
>> it briefly:
>> - I have provided a way to supply your own FieldsReader and FieldsWriter
>> (via a factory). To create an IndexReader, you have to provide that
>> factory; the current API just uses a default factory.
>> - I have moved the code of FieldsReader and FieldsWriter that reads and
>> writes the field data to a new class, FieldData. The FieldsReader
>> instantiates a FieldData, does a fielddata.read(input), and does a
>> new Field(fielddata,...). The FieldsWriter does a
>> field.getFieldData().write(output);
>> - so by extending FieldsReader, you can provide your own implementation of
>> FieldData, and thus implement the way you want data to be stored and
>> read.
>> The tests pass successfully, but I have an issue with that design: one
>> thing that I think is important is that in the current design, we can read
>> an index in an old format, and just do a writer.addIndexes() into a new
>> format. With the new design, you cannot, because the writer will use the
>> FieldData.write provided by the reader.
>> To be continued...
>
>I would love to see this patch.  I think one could make a pretty good  
>argument for this kind of implementation being done "cleanly", that  
>is, it shouldn't necessarily involve reworking the internals, but  
>instead could represent the foundation for a new, codec based  
>indexing mechanism (with an implementation that can read/write the  
>existing file format.)
>
>
>>
>> cheers,
>> Nicolas
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>
>--------------------------
>Grant Ingersoll
>Sr. Software Engineer
>Center for Natural Language Processing
>Syracuse University
>335 Hinds Hall
>Syracuse, NY 13244
>http://www.cnlp.org
>
>Voice: 315-443-5484
>Skype: grant_ingersoll
>Fax: 315-443-6886
>
>
>
>
>




