Hi Robert,

Thank you for your reply! I used the same data set for both versions.
There are mainly two changes:

1. The codec class.

Before:

    package com.ea.eadp.data.aem.audience.indexer.data.extension;

    import com.ea.eadp.data.aem.audience.shared.IndexField;
    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.codecs.DocValuesFormat;
    import org.apache.lucene.codecs.diskdv.DiskDocValuesFormat;
    import org.apache.lucene.codecs.lucene42.Lucene42Codec;
    import org.apache.lucene.codecs.lucene42.Lucene42DocValuesFormat;

    public class DiskDocValuesCodec {
        public static final Codec CODEC = new Lucene42Codec() {
            final Lucene42DocValuesFormat memoryDVFormat = new Lucene42DocValuesFormat();
            final DiskDocValuesFormat diskDVFormat = new DiskDocValuesFormat();

            @Override
            public DocValuesFormat getDocValuesFormatForField(String field) {
                if (field.contains("freq")) {
                    // use Disk for boot/game session frequency data
                    return diskDVFormat;
                } else {
                    // use Lucene42 otherwise
                    return memoryDVFormat;
                }
            }
        };
    }

After:

    package com.ea.eadp.data.aem.audience.indexer.data.extension;

    import com.ea.eadp.data.aem.audience.shared.IndexField;
    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.codecs.DocValuesFormat;
    import org.apache.lucene.codecs.diskdv.DiskDocValuesFormat;
    import org.apache.lucene.codecs.lucene45.Lucene45Codec;
    import org.apache.lucene.codecs.lucene45.Lucene45DocValuesFormat;

    public class DiskDocValuesCodec {
        public static final Codec CODEC = new Lucene45Codec() {
            final Lucene45DocValuesFormat memoryDVFormat = new Lucene45DocValuesFormat();
            final DiskDocValuesFormat diskDVFormat = new DiskDocValuesFormat();

            @Override
            public DocValuesFormat getDocValuesFormatForField(String field) {
                if (field.contains("freq")) {
                    // use Disk for frequency data
                    return diskDVFormat;
                } else {
                    // use Lucene45 otherwise
                    return memoryDVFormat;
                }
            }
        };
    }

2.
Changed IndexField.LUCENE_VERSION from Version.LUCENE_44 to Version.LUCENE_45 in the following code:

    Directory lucene_dir = FSDirectory.open(index_dir);
    Analyzer analyzer = new StandardAnalyzer(IndexField.LUCENE_VERSION);
    IndexWriterConfig lucene_iwc = new IndexWriterConfig(
            IndexField.LUCENE_VERSION, analyzer);
    lucene_iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    lucene_iwc.setCodec(DiskDocValuesCodec.CODEC);
    // default memory buffer size is 16MB
    lucene_iwc.setRAMBufferSizeMB(configuration.getIndexerMembufSizeMB());
    IndexWriter lucene_writer = new IndexWriter(lucene_dir, lucene_iwc);

Did I do anything wrong? Any advice is appreciated!

-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Saturday, June 14, 2014 6:27 AM
To: java-user
Subject: Re: Indexing size increase 20% after switching from lucene 4.4 to 4.5 or 4.8 with BinaryDocValuesField

They are still encoded the same way, so likely you aren't testing apples to apples (e.g. different number of segments or whatever).

On Fri, Jun 13, 2014 at 8:28 PM, Zhao, Gang <gz...@ea.com> wrote:
> I used Lucene 4.4 to create an index for some documents. One of the
> indexed fields is a BinaryDocValuesField. After I changed the dependency
> to Lucene 4.5, the index size for 1 million documents increased from
> 293MB to 357MB.
>
> If I do not use BinaryDocValuesField, the index size increases by only
> about 2%. I also tried Lucene 4.8. The index size is similar to the
> index size with Lucene 4.5.
>
> I am wondering what changed in the handling of BinaryDocValuesField from
> 4.4 to 4.5 or 4.8.
>
> Gang Zhao
> Software Engineer - EA Digital Platform
> 207 Redwood Shores Parkway
> Redwood City, CA 94065
> Direct Line: 650-628-3719
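
One way to check whether the two builds are being compared like for like is to break each index directory down by file extension and see which files account for the extra size. The following is a minimal sketch using only the JDK; the class name IndexSizeByExtension is hypothetical, and the idea that doc values data shows up under extensions such as .dvd and .dvm (or inside a .cfs compound file) is an assumption that should be checked against the files actually present in each directory.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: totals file sizes per extension.
final class IndexSizeByExtension {

    /** Sums the sizes of the regular files directly inside dir, keyed by extension. */
    static Map<String, Long> sizeByExtension(Path dir) throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue; // skip subdirectories and special files
                }
                String name = file.getFileName().toString();
                int dot = name.lastIndexOf('.');
                // Files without a dot (e.g. segments_N) are grouped under "(none)".
                String ext = dot < 0 ? "(none)" : name.substring(dot + 1);
                totals.merge(ext, Files.size(file), Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: IndexSizeByExtension <index-dir> [<index-dir> ...]");
            return;
        }
        // Print one breakdown per directory so the two indexes can be compared by eye.
        for (String arg : args) {
            System.out.println(arg);
            for (Map.Entry<String, Long> e : sizeByExtension(Paths.get(arg)).entrySet()) {
                System.out.printf("  %-8s %,d bytes%n", e.getKey(), e.getValue());
            }
        }
    }
}
```

Running it against the 4.4 and 4.5 index directories prints one table per directory. Counting the files per extension in the same output also gives a rough idea of whether the two indexes ended up with a different number of segments; merging each index down to a single segment with IndexWriter.forceMerge(1) before measuring is one possible way to remove that variable.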