Re: Toggling compression for stored fields

Vitaly Funstein Wed, 15 May 2013 16:12:13 -0700

Yes, I thought about inlining an anonymous subclass of Lucene41Codec but
unfortunately all of its methods are final, which effectively rules out
this approach. I think I may have to do the latter, since I am obviously in
control of internal JAR packaging anyway...
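
For context on why a new codec name has to be registered: the codec's name is recorded with each segment when it is written, and it is resolved again by name when the index is opened. In Lucene that resolution goes through Java's service-provider mechanism, so a named codec class is normally listed in META-INF/services/org.apache.lucene.codecs.Codec inside the JAR. The following is only a minimal stdlib model of that name lookup, not Lucene code; NamedCodec, CodecRegistry and CodecLookupDemo are hypothetical names used purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a codec: only its name matters in this sketch. */
class NamedCodec {
  private final String name;

  NamedCodec(String name) {
    this.name = name;
  }

  String getName() {
    return name;
  }
}

/** Simplified name-to-codec registry; an assumption-level model, not Lucene's NamedSPILoader. */
class CodecRegistry {
  private final Map<String, NamedCodec> byName = new LinkedHashMap<>();

  void register(NamedCodec codec) {
    byName.put(codec.getName(), codec);
  }

  NamedCodec forName(String name) {
    NamedCodec codec = byName.get(name);
    if (codec == null) {
      throw new IllegalArgumentException(
          "No codec named '" + name + "'; known names: " + byName.keySet());
    }
    return codec;
  }
}

class CodecLookupDemo {
  public static void main(String[] args) {
    CodecRegistry registry = new CodecRegistry();
    registry.register(new NamedCodec("Lucene40"));
    registry.register(new NamedCodec("Lucene41"));

    // Write side: an ad hoc codec with a new name is used, but never registered.
    NamedCodec adHoc = new NamedCodec("DisableStoreFieldCompressionCodec");
    String recordedName = adHoc.getName(); // only the name is stored with the segment

    // Read side: the recorded name cannot be resolved, so opening fails.
    try {
      registry.forName(recordedName);
    } catch (IllegalArgumentException e) {
      System.out.println("Lookup failed: " + e.getMessage());
    }

    // Once the name is registered, the same lookup succeeds.
    registry.register(adHoc);
    System.out.println("Resolved: " + registry.forName(recordedName).getName());
  }
}
```

In the real setup the "register" step is the META-INF services entry, which is also why the codec has to be a named class with a no-argument constructor rather than an anonymous subclass.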


On Wed, May 15, 2013 at 4:06 PM, Uwe Schindler <[email protected]> wrote:

> You don't change the Codec at all, just the stored fields implementation,
> so you don't need to give it a new name. The simplest way is to subclass
> Lucene41Codec anonymously, without FilterCodec.
>
> If your codec gets a new name, that name must be registered with the codec
> manager by adding META-INF files to your JAR, and the codec cannot be an
> anonymous subclass.
>
>
>
> Vitaly Funstein <[email protected]> wrote:
>
> >Uwe,
> >
> >I may not be doing this correctly, but I tried to see what would happen if
> >I were to reopen an index created with a custom codec that disables
> >stored fields compression, and it doesn't seem to work. Here's how I
> >configure the writer to disable compression, prior to indexing:
> >
> >    // Use the uncompressed Lucene 4.0 stored fields format.
> >    final StoredFieldsFormat sfFmt = new Lucene40StoredFieldsFormat();
> >    // Delegate everything else to Lucene41Codec, but under a new codec name.
> >    idxWriterCfg.setCodec(
> >        new FilterCodec("DisableStoreFieldCompressionCodec", new Lucene41Codec()) {
> >          @Override
> >          public StoredFieldsFormat storedFieldsFormat() {
> >            return sfFmt;
> >          }
> >        });
> >
> >However, when an index that was created with this writer configuration is
> >opened, I get this exception:
> >
> >Exception in thread "main" java.lang.IllegalArgumentException: A SPI class
> >of type org.apache.lucene.codecs.Codec with name
> >'DisableStoreFieldCompressionCodec' does not exist. You need to add the
> >corresponding JAR file supporting this SPI to your classpath.The current
> >classpath supports the following names: [Lucene40, Lucene3x, Lucene41]
> >    at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:104)
> >    at org.apache.lucene.codecs.Codec.forName(Codec.java:95)
> >    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:299)
> >    at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:347)
> >    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:783)
> >    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:630)
> >    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:343)
> >    at org.apache.lucene.index.DirectoryReader.indexExists(DirectoryReader.java:322)
> >
> >
> >I also tried instantiating Lucene40Codec directly to avoid using a named
> >FilterCodec, but that codec apparently disallows writing to an index in
> >Lucene 4.1:
> >
> >java.lang.UnsupportedOperationException: this codec can only be used for
> >reading
> >    at org.apache.lucene.codecs.lucene40.Lucene40PostingsFormat.fieldsConsumer(Lucene40PostingsFormat.java:246)
> >    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.addField(PerFieldPostingsFormat.java:130)
> >    at org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:336)
> >    at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:85)
> >    at org.apache.lucene.index.TermsHash.flush(TermsHash.java:116)
> >    at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53)
> >    at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:81)
> >    at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:487)
> >    at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:422)
> >    at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:559)
> >    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:357)
> >    at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:270)
> >    at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:245)
> >    at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:235)
> >    at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:169)
> >    at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:118)
> >    at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
> >    at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:154)
> >    at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:233)
> >
> >What am I doing wrong here?
> >
> >Thx,
> >Vitaly
> >
> >On Wed, May 15, 2013 at 2:47 PM, Uwe Schindler <[email protected]> wrote:
> >
> >> Yes. You can also force this by using IW.forceMerge(1), unless your index
> >> already consists of only one segment. Another alternative is to use
> >> IndexUpgrader, but this one would only merge if there are segments
> >> created with an older Lucene version. You can change this by overriding
> >> IndexUpgrader's merge policy to use all segments.
> >>
> >> You reminded me to open an issue to add the possibility to IndexUpgrader
> >> to also "upgrade" segments using a different codec configuration, not just
> >> those coming from an older Lucene version (which is possible to do).
> >>
> >> Uwe
> >>
> >> -----
> >> Uwe Schindler
> >> H.-H.-Meier-Allee 63, D-28213 Bremen
> >> http://www.thetaphi.de
> >> eMail: [email protected]
> >>
> >>
> >> > -----Original Message-----
> >> > From: Vitaly Funstein [mailto:[email protected]]
> >> > Sent: Wednesday, May 15, 2013 11:36 PM
> >> > To: [email protected]
> >> > Subject: Re: Toggling compression for stored fields
> >> >
> >> > Thanks for the quick reply, this is certainly good news. So just to
> >> > clarify - doing a manual segment merge is optional when changing codecs,
> >> > correct? I mean, I can just restart my application with a new codec
> >> > config and let the regular, background merging task do the work of
> >> > eventually converting all the data to the new format?
> >> >
> >> > On Wed, May 15, 2013 at 2:30 PM, Uwe Schindler <[email protected]> wrote:
> >> >
> >> > > Hi Vitaly,
> >> > >
> >> > > what you call an "index" is just a collection (a CompositeReader) of
> >> > > atomic readers. They can be mixed regarding compression, just like you
> >> > > could have a MultiReader with different indexes using different codecs.
> >> > > Every atomic segment of an index can only have one stored fields format.
> >> > > Once merging occurs, the uncompressed fields of e.g. an older atomic
> >> > > segment get merged into a new segment with compression enabled. The
> >> > > same can happen in the other direction. The codec is responsible for
> >> > > encoding the data on disk and this includes the compression. When
> >> > > merging segments, the data is uncompressed and recompressed as needed.
> >> > > To improve performance, there are shortcuts to copy the data directly
> >> > > if the codec does not change while merging.
> >> > >
> >> > > With Lucene 4.x, you are free to open an IndexWriter with a different
> >> > > codec configuration and e.g. use IndexUpgrader or do a force merge
> >> > > manually to merge all "old" segments and "recompress" them to a
> >> > > different codec config. This has nothing to do with "reindexing" as
> >> > > you are just changing the encoding of the exact same data on disk.
> >> > >
> >> > > Uwe
> >> > >
> >> > > -----
> >> > > Uwe Schindler
> >> > > H.-H.-Meier-Allee 63, D-28213 Bremen
> >> > > http://www.thetaphi.de
> >> > > eMail: [email protected]
> >> > >
> >> > >
> >> > > > -----Original Message-----
> >> > > > From: Vitaly Funstein [mailto:[email protected]]
> >> > > > Sent: Wednesday, May 15, 2013 10:38 PM
> >> > > > To: [email protected]
> >> > > > Subject: Toggling compression for stored fields
> >> > > >
> >> > > > Is it possible to have a mix of compressed and uncompressed
> >> > > > documents within a single index? That is, can I load an index
> >> > > > created with Lucene 4.0 into 4.1 and defer the decision of whether
> >> > > > or not to use CompressingStoredFieldsFormat until a later time, or
> >> > > > even go back and forth between compressed and uncompressed codecs,
> >> > > > if needed? I thought at first the answer would be an unequivocal
> >> > > > "no", but then how would one migrate data from 4.0 to 4.1 without a
> >> > > > full reindex?
> >> > >
> >> > >
> >> > >
> >> > > ---------------------------------------------------------------------
> >> > > To unsubscribe, e-mail: [email protected]
> >> > > For additional commands, e-mail: [email protected]
> >> > >
> >> > >
>
> --
> Uwe Schindler
> H.-H.-Meier-Allee 63, 28213 Bremen
> http://www.thetaphi.de
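
Uwe's description quoted above (each segment has exactly one stored fields format, and merging rewrites the data in the writer's current format) can be pictured with a small toy model. This is only a sketch in plain Java, not Lucene code: Encoding, Segment, ToyIndex and ToyMergeDemo are hypothetical names, and a real merge decodes and re-encodes bytes on disk rather than copying strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a stored fields format. */
enum Encoding { UNCOMPRESSED, COMPRESSED }

/** A segment holds documents written with exactly one encoding. */
class Segment {
  final Encoding encoding;
  final List<String> docs;

  Segment(Encoding encoding, List<String> docs) {
    this.encoding = encoding;
    this.docs = new ArrayList<>(docs);
  }
}

/** Toy index: a list of segments that may use different encodings. */
class ToyIndex {
  final List<Segment> segments = new ArrayList<>();

  /** Rewrites all segments into a single segment using the target encoding. */
  void forceMergeToOne(Encoding target) {
    List<String> merged = new ArrayList<>();
    for (Segment s : segments) {
      merged.addAll(s.docs); // stands in for "decode with the old format"
    }
    segments.clear();
    segments.add(new Segment(target, merged)); // "encode with the new format"
  }
}

class ToyMergeDemo {
  public static void main(String[] args) {
    ToyIndex index = new ToyIndex();
    // Mixed index: one uncompressed and one compressed segment side by side.
    index.segments.add(new Segment(Encoding.UNCOMPRESSED, Arrays.asList("doc1", "doc2")));
    index.segments.add(new Segment(Encoding.COMPRESSED, Arrays.asList("doc3")));

    // Reopen "with a different codec config" and merge everything.
    index.forceMergeToOne(Encoding.UNCOMPRESSED);

    Segment only = index.segments.get(0);
    System.out.println(index.segments.size() + " segment(s), encoding=" + only.encoding
        + ", docs=" + only.docs);
  }
}
```

The documents themselves are unchanged by the merge; only the encoding of the resulting segment differs, which is the sense in which this is not a reindex.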
