Chuck Williams wrote:
> It appears that termIndexInterval is factored into the stored index and
> thus cannot be changed dynamically to work around the problem after an
> index has become polluted. Other than identifying the documents
> containing binary data, deleting them, and then optimizing the whole
> index, has anybody found a better way to recover from this problem?

Hadoop's MapFile is similar to Lucene's term index, and supports a
feature where only a subset of the index entries are loaded (determined
by io.map.index.skip). It would not be difficult to add such a feature
to Lucene by changing TermInfosReader#ensureIndexIsRead().

Here's a (totally untested) patch.

Doug

Index: src/java/org/apache/lucene/index/TermInfosReader.java
===================================================================
--- src/java/org/apache/lucene/index/TermInfosReader.java (revision 592857)
+++ src/java/org/apache/lucene/index/TermInfosReader.java (working copy)
@@ -41,6 +41,8 @@
private SegmentTermEnum indexEnum;
+ private int indexDivisor = 1;
+
TermInfosReader(Directory dir, String seg, FieldInfos fis)
throws CorruptIndexException, IOException {
this(dir, seg, fis, BufferedIndexInput.BUFFER_SIZE);
@@ -63,6 +65,12 @@
readBufferSize), fieldInfos, true);
success = true;
+
+ String divisorString = System.getProperty("lucene.term.index.divisor");
+ if (divisorString != null) {
+ indexDivisor = Integer.parseInt(divisorString);
+ }
+
} finally {
// With lock-less commits, it's entirely possible (and
// fine) to hit a FileNotFound exception above. In
@@ -109,7 +117,7 @@
if (indexTerms != null) // index already read
return; // do nothing
try {
- int indexSize = (int)indexEnum.size; // otherwise read index
+ int indexSize = ((int)indexEnum.size + indexDivisor - 1) / indexDivisor; // round up so a trailing partial group still has a slot
indexTerms = new Term[indexSize];
indexInfos = new TermInfo[indexSize];
@@ -119,6 +127,12 @@
indexTerms[i] = indexEnum.term();
indexInfos[i] = indexEnum.termInfo();
indexPointers[i] = indexEnum.indexPointer;
+
+ for (int j = 1; j < indexDivisor; j++) { // only load a subset?
+ if (!indexEnum.next()) {
+ break;
+ }
+ }
}
} finally {
indexEnum.close();
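
For anyone who wants to see the subsampling idea in isolation, here is a minimal sketch in plain Java. It is not Lucene code: the class SubsampledTermIndex, its methods, and the property name demo.term.index.divisor are made up for illustration, and a sorted String array stands in for the on-disk term index. It keeps every divisor-th entry in memory, rounds the array size up so a trailing partial group still fits, and treats a missing or invalid divisor as 1. A lookup then binary-searches the kept entries and starts its sequential scan from the matching position in the full index.

```java
import java.util.Arrays;

/**
 * Illustration only: keeps every divisor-th entry of a sorted index.
 * All names here are hypothetical and do not exist in Lucene.
 */
class SubsampledTermIndex {

    /** Reads the divisor from a system property; missing or invalid values fall back to 1. */
    static int readDivisor(String propertyName) {
        String value = System.getProperty(propertyName);
        if (value == null) {
            return 1;
        }
        try {
            int divisor = Integer.parseInt(value.trim());
            return divisor >= 1 ? divisor : 1; // 0 or negative would break the division below
        } catch (NumberFormatException e) {
            return 1;
        }
    }

    /** Keeps entries 0, divisor, 2*divisor, ... of a sorted index. */
    static String[] subsample(String[] sortedIndex, int divisor) {
        // Round up so the last partial group still gets a slot.
        int size = (sortedIndex.length + divisor - 1) / divisor;
        String[] kept = new String[size];
        for (int i = 0; i < size; i++) {
            kept[i] = sortedIndex[i * divisor];
        }
        return kept;
    }

    /**
     * Returns the position in the FULL index where a sequential scan for
     * target should start: the largest kept entry <= target, or 0 if none.
     */
    static int scanStart(String[] kept, int divisor, String target) {
        int pos = Arrays.binarySearch(kept, target);
        if (pos < 0) {
            pos = -pos - 2; // insertion point minus one
        }
        return Math.max(pos, 0) * divisor;
    }

    public static void main(String[] args) {
        String[] index = {"a", "c", "e", "g", "i", "k", "m", "o", "q", "s"};
        int divisor = 3;
        String[] kept = subsample(index, divisor);
        System.out.println(Arrays.toString(kept));           // [a, g, m, s]
        System.out.println(scanStart(kept, divisor, "h"));   // 3, i.e. start scanning at "g"
    }
}
```

The trade-off is the usual one: the in-memory index shrinks by roughly the divisor, while each lookup may have to scan up to divisor times as many entries before it finds its term.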