Hi experts,
I'm upgrading to Lucene 4.4 and trying to use DocValues instead of stored
fields for performance reasons. Because the index size is unknown (it depends
on the customer), I plan to use DiskDocValuesFormat, especially for some
binary fields. So I wrote a custom Codec:
    final Codec codec = new Lucene42Codec() {
        private final Lucene42DocValuesFormat memoryDVFormat = new Lucene42DocValuesFormat();
        private final DiskDocValuesFormat diskDVFormat = new DiskDocValuesFormat();

        @Override
        public DocValuesFormat getDocValuesFormatForField(String field) {
            // Keep these three fields on disk; everything else stays in memory.
            if (LucenePluginConstants.INDEX_STORED_RETURNABLE_FIELD.equals(field)
                    || LucenePluginConstants.PAYLOAD_FIELD_NAME.equals(field)
                    || LucenePluginConstants.INDEX_NODE_ID_DOC_VALUE.equals(field)) {
                return diskDVFormat;
            } else {
                return memoryDVFormat;
            }
        }
    };
    iwc.setCodec(codec);
Here the field LucenePluginConstants.INDEX_NODE_ID_DOC_VALUE is a numeric
field of type long. The other two are binary.
Then I read the DocValues as in the pseudo-code below:
    NumericDocValues nodeIDDocValuesSource =
            MultiDocValues.getNumericValues(searcher.getIndexReader(),
                    LucenePluginConstants.INDEX_NODE_ID_DOC_VALUE);
    ......
    long nodeId = nodeIDDocValuesSource.get(scoreDoc.doc);
The nodeId I get back is definitely wrong: the upper-layer logic verifies it
and treats it as data corruption.
But if I switch the long field to memoryDVFormat, everything is OK.
Also, to support upgrading legacy data, I keep two index formats (DocValues
or stored fields), selected by a version flag. If I use stored fields,
everything is OK as well.
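Since both index formats are available, one way to narrow this down might be to compare the node IDs from the two sources document by document. Below is a minimal standalone sketch. It assumes the values read through DocValues and the values read from stored fields have already been copied into two long arrays indexed by doc ID. The class and method names are hypothetical and not part of Lucene.

```java
// Hypothetical helper, not Lucene API: compares node IDs read via DocValues
// with node IDs read from stored fields, both indexed by doc ID.
class NodeIdCrossCheck {

    /**
     * Returns the first doc ID at which the two sources disagree,
     * or -1 if they agree for every document.
     */
    static int firstMismatch(long[] fromDocValues, long[] fromStoredFields) {
        if (fromDocValues.length != fromStoredFields.length) {
            throw new IllegalArgumentException("both sources must cover the same documents");
        }
        for (int doc = 0; doc < fromDocValues.length; doc++) {
            if (fromDocValues[doc] != fromStoredFields[doc]) {
                return doc;
            }
        }
        return -1;
    }
}
```

The first mismatching doc ID, together with the two values seen there, would be a reasonable starting point for a pure Lucene test case.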
So I suspect there is a bug in DiskDocValuesFormat for numeric fields.
Could it be related to byte-aligned numeric compression?
Or am I using DiskDocValuesFormat incorrectly? It doesn't seem to take any
other parameters.
Sorry that I don't have a pure Lucene test case yet. I hope someone can shed
some light on this.
Best regards,
Duke
If not now, when? If not me, who?