Hi Robert,
Thank you for your reply! I used the same data set for both versions.
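There are two main changes:
1. Changed the codec from Lucene42Codec/Lucene42DocValuesFormat to
Lucene45Codec/Lucene45DocValuesFormat in DiskDocValuesCodec: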
Before:
package com.ea.eadp.data.aem.audience.indexer.data.extension;

import com.ea.eadp.data.aem.audience.shared.IndexField;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.DocValuesFormat;
import org.apache.lucene.codecs.diskdv.DiskDocValuesFormat;
import org.apache.lucene.codecs.lucene42.Lucene42Codec;
import org.apache.lucene.codecs.lucene42.Lucene42DocValuesFormat;

public class DiskDocValuesCodec {
    public static final Codec CODEC = new Lucene42Codec() {
        final Lucene42DocValuesFormat memoryDVFormat =
            new Lucene42DocValuesFormat();
        final DiskDocValuesFormat diskDVFormat =
            new DiskDocValuesFormat();

        @Override
        public DocValuesFormat getDocValuesFormatForField(String field) {
            if (field.contains("freq")) {
                // use Disk for boot/game session frequency data
                return diskDVFormat;
            } else {
                // use Lucene42 otherwise
                return memoryDVFormat;
            }
        }
    };
}
After:
package com.ea.eadp.data.aem.audience.indexer.data.extension;

import com.ea.eadp.data.aem.audience.shared.IndexField;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.DocValuesFormat;
import org.apache.lucene.codecs.diskdv.DiskDocValuesFormat;
import org.apache.lucene.codecs.lucene45.Lucene45Codec;
import org.apache.lucene.codecs.lucene45.Lucene45DocValuesFormat;

public class DiskDocValuesCodec {
    public static final Codec CODEC = new Lucene45Codec() {
        final Lucene45DocValuesFormat memoryDVFormat =
            new Lucene45DocValuesFormat();
        final DiskDocValuesFormat diskDVFormat =
            new DiskDocValuesFormat();

        @Override
        public DocValuesFormat getDocValuesFormatForField(String field) {
            if (field.contains("freq")) {
                // use Disk for frequency data
                return diskDVFormat;
            } else {
                // use Lucene45 otherwise
                return memoryDVFormat;
            }
        }
    };
}
2. Changed IndexField.LUCENE_VERSION from Version.LUCENE_44 to
Version.LUCENE_45 in the following code:
Directory lucene_dir = FSDirectory.open(index_dir);
Analyzer analyzer = new StandardAnalyzer(IndexField.LUCENE_VERSION);
IndexWriterConfig lucene_iwc = new IndexWriterConfig(
    IndexField.LUCENE_VERSION, analyzer);
lucene_iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
lucene_iwc.setCodec(DiskDocValuesCodec.CODEC);
// default memory buffer size is 16MB
lucene_iwc.setRAMBufferSizeMB(configuration.getIndexerMembufSizeMB());
IndexWriter lucene_writer = new IndexWriter(lucene_dir, lucene_iwc);
Did I do anything wrong? Any advice is appreciated!
-----Original Message-----
From: Robert Muir [mailto:[email protected]]
Sent: Saturday, June 14, 2014 6:27 AM
To: java-user
Subject: Re: Indexing size increase 20% after switching from lucene 4.4 to 4.5 or 4.8 with BinaryDocValuesField
They are still encoded the same way, so likely you aren't testing apples to
apples (e.g. a different number of segments or whatever).
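
One way to check this is to compare the two index directories file type by file type rather than by total size. The following is only a minimal sketch using the JDK alone: the class name IndexSizeByExtension and its sizeByExtension method are made up for illustration and are not part of Lucene. It assumes all index files sit directly in one directory, as with FSDirectory.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not a Lucene class): per-extension size totals
// for the files in an index directory.
final class IndexSizeByExtension {

    // Sums the sizes of the regular files directly under dir,
    // grouped by file extension (the text after the last '.').
    static Map<String, Long> sizeByExtension(Path dir) throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue; // skip subdirectories and special files
                }
                String name = file.getFileName().toString();
                int dot = name.lastIndexOf('.');
                // Files without a dot (e.g. segments_N) go under "(none)".
                String ext = dot < 0 ? "(none)" : name.substring(dot + 1);
                totals.merge(ext, Files.size(file), Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java IndexSizeByExtension <index-dir>");
            return;
        }
        for (Map.Entry<String, Long> e : sizeByExtension(Paths.get(args[0])).entrySet()) {
            System.out.printf("%-8s %,d bytes%n", e.getKey(), e.getValue());
        }
    }
}
```

Running it against the 4.4 and the 4.5 index, ideally after merging each down to the same number of segments (for example with IndexWriter.forceMerge(1)), should show which file types account for the growth.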
On Fri, Jun 13, 2014 at 8:28 PM, Zhao, Gang <[email protected]> wrote:
>
>
> I used Lucene 4.4 to create an index for some documents. One of the
> indexed fields is a BinaryDocValuesField. After I changed the dependency
> to Lucene 4.5, the index size for 1 million documents increased from
> 293MB to 357MB.
> If I do not use BinaryDocValuesField, the index size increases by only
> about 2%. I also tried Lucene 4.8; the index size is similar to the
> index size with Lucene 4.5.
>
>
>
> I am wondering what changed in the handling of BinaryDocValuesField
> between 4.4 and 4.5 or 4.8.
>
>
>
> Gang Zhao
>
> Software Engineer - EA Digital Platform
>
> 207 Redwood Shores Parkway
> Redwood City, CA 94065
>
> Direct Line: 650-628-3719