Chuck Williams wrote:
It appears that termIndexInterval is factored into the stored index and thus cannot be changed dynamically to work around the problem after an index has become polluted. Other than identifying the documents containing binary data, deleting them, and then optimizing the whole index, has anybody found a better way to recover from this problem?
Hadoop's MapFile is similar to Lucene's term index, and supports a feature where only a subset of the index entries are loaded (determined by io.map.index.skip). It would not be difficult to add such a feature to Lucene by changing TermInfosReader#ensureIndexIsRead().
Here's a (totally untested) patch.

Doug
Index: src/java/org/apache/lucene/index/TermInfosReader.java
===================================================================
--- src/java/org/apache/lucene/index/TermInfosReader.java (revision 592857)
+++ src/java/org/apache/lucene/index/TermInfosReader.java (working copy)
@@ -41,6 +41,8 @@
   private SegmentTermEnum indexEnum;
+  private int indexDivisor = 1;
+
   TermInfosReader(Directory dir, String seg, FieldInfos fis)
        throws CorruptIndexException, IOException {
     this(dir, seg, fis, BufferedIndexInput.BUFFER_SIZE);
@@ -63,6 +65,12 @@
           readBufferSize), fieldInfos, true);
       success = true;
+
+      String divisorString = System.getProperty("lucene.term.index.divisor");
+      if (divisorString != null) {
+        indexDivisor = Integer.parseInt(divisorString);
+      }
+
     } finally {
       // With lock-less commits, it's entirely possible (and
       // fine) to hit a FileNotFound exception above. In
@@ -109,7 +117,7 @@
     if (indexTerms != null) // index already read
       return; // do nothing
     try {
-      int indexSize = (int)indexEnum.size; // otherwise read index
+      int indexSize = (int)indexEnum.size/indexDivisor;
       indexTerms = new Term[indexSize];
       indexInfos = new TermInfo[indexSize];
@@ -119,6 +127,12 @@
         indexTerms[i] = indexEnum.term();
         indexInfos[i] = indexEnum.termInfo();
         indexPointers[i] = indexEnum.indexPointer;
+
+        for (int j = 1; j < indexDivisor; j++) { // only load a subset?
+          if (!indexEnum.next()) {
+            break;
+          }
+        }
       }
     } finally {
       indexEnum.close();
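
For anyone who wants to try the idea outside of Lucene first, here is a small standalone sketch of the same sampling loop. It is not Lucene code: the class SparseTermIndexSketch, the method loadEveryNth, and the use of a plain Iterator<String> in place of SegmentTermEnum are all made up for illustration. One deliberate difference from the patch is that the array size is rounded up rather than down. If the enclosing loop in ensureIndexIsRead() runs until the enumeration is exhausted (an assumption, since the loop header is not shown in the diff), then 10 index entries with a divisor of 3 would get only 3 slots from floor division while the loop tries to store a fourth entry.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Lucene.
public class SparseTermIndexSketch {

    // Keeps entries 0, divisor, 2*divisor, ... and skips the rest.
    static String[] loadEveryNth(Iterator<String> entries, int totalEntries, int divisor) {
        if (divisor < 1) {
            throw new IllegalArgumentException("divisor must be >= 1");
        }
        // Round up so a trailing partial group still gets a slot.
        int indexSize = totalEntries == 0 ? 0 : 1 + (totalEntries - 1) / divisor;
        String[] kept = new String[indexSize];
        for (int i = 0; i < indexSize && entries.hasNext(); i++) {
            kept[i] = entries.next();
            // Skip the next divisor-1 entries, stopping early at the end.
            for (int j = 1; j < divisor && entries.hasNext(); j++) {
                entries.next();
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> full = Arrays.asList("a", "b", "c", "d", "e", "f", "g", "h", "i", "j");
        // Prints [a, d, g, j]
        System.out.println(Arrays.toString(loadEveryNth(full.iterator(), full.size(), 3)));
    }
}
```

With a divisor of 1 every entry is kept, which matches the default in the patch. A larger divisor trades memory for a longer sequential scan between the loaded index entries at lookup time.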