I like this approach: at search time, you can choose to further subsample the set of terms that was already subsampled during indexing for the TermInfosReader index. That gives you an easy knob for trading off memory usage vs IO cost/latency during searching.
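To make the knob concrete, here is a minimal sketch of the subsampling step. It is not the patch itself: the class and method names (IndexSubsampler, subsample, effectiveInterval) are made up for illustration, and a plain List stands in for whatever arrays TermInfosReader really holds.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: illustrates keeping only every
// divisor-th entry of a term index that was already subsampled at index time.
final class IndexSubsampler {

    private IndexSubsampler() {}

    /** Returns entries 0, divisor, 2*divisor, ... of the given index entries. */
    static <T> List<T> subsample(List<T> indexEntries, int divisor) {
        if (divisor < 1) {
            throw new IllegalArgumentException("divisor must be >= 1");
        }
        List<T> kept = new ArrayList<>((indexEntries.size() + divisor - 1) / divisor);
        for (int i = 0; i < indexEntries.size(); i += divisor) {
            kept.add(indexEntries.get(i));
        }
        return kept;
    }

    /** Gap, in terms, between consecutive entries held in memory after subsampling. */
    static int effectiveInterval(int termIndexInterval, int divisor) {
        return Math.multiplyExact(termIndexInterval, divisor);
    }
}
```

With, say, an indexing-time interval of 128 and a divisor of 4, only one term in 512 stays in RAM, so memory drops by roughly 4x while each lookup may have to scan up to 4x as many terms on disk.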
I'll open an issue and work through this patch.

One thing: I'd prefer not to use a system property for this, since it's so global, but I'm not sure how to do it better. A static int on the class would likewise be global. Passing an argument down to the ctor would be good, except it would have to be threaded up into SegmentReader, IndexReader, etc., multiplying the ctors these classes already have. We can't add a "setIndexDivisor(...)" method because the terms are already loaded (consuming too much RAM) during the ctor.

This would be the perfect time to use optional named/keyword arguments, but Java does not support them (grrrr).

What if, instead, we passed down a Properties instance to IndexReader ctors? Or alternatively a dedicated class, e.g., "IndexReaderInitParameters"? The advantage of a dedicated class is that it's strongly typed at compile time, and you could put things in there like an optional DeletionPolicy instance as well. I think there is a growing list of these sorts of "advanced optional parameters used during init" that could be handled with such an approach?

Any other options here?

Mike

"Doug Cutting" <[EMAIL PROTECTED]> wrote:

> Chuck Williams wrote:
> > It appears that termIndexInterval is factored into the stored index and
> > thus cannot be changed dynamically to work around the problem after an
> > index has become polluted. Other than identifying the documents
> > containing binary data, deleting them, and then optimizing the whole
> > index, has anybody found a better way to recover from this problem?
>
> Hadoop's MapFile is similar to Lucene's term index, and supports a
> feature where only a subset of the index entries are loaded (determined
> by io.map.index.skip). It would not be difficult to add such a feature
> to Lucene by changing TermInfosReader#ensureIndexIsRead().
>
> Here's a (totally untested) patch.
>
> Doug
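
P.S. For the dedicated-class idea in my reply above, a rough sketch of what I have in mind follows. Everything in it is hypothetical: IndexReaderInitParameters does not exist yet, SketchReader is only a placeholder for a reader ctor, and the DeletionPolicy interface here is an empty stand-in rather than any real Lucene type.

```java
import java.util.Optional;

// Empty stand-in for illustration only; not a real Lucene interface.
interface DeletionPolicy {}

// Hypothetical holder for "advanced optional parameters used during init".
final class IndexReaderInitParameters {
    private int indexDivisor = 1;          // 1 = load every stored index entry
    private DeletionPolicy deletionPolicy; // null = fall back to the default

    IndexReaderInitParameters setIndexDivisor(int indexDivisor) {
        if (indexDivisor < 1) {
            throw new IllegalArgumentException("indexDivisor must be >= 1");
        }
        this.indexDivisor = indexDivisor;
        return this; // allow chained setters in place of keyword arguments
    }

    IndexReaderInitParameters setDeletionPolicy(DeletionPolicy deletionPolicy) {
        this.deletionPolicy = deletionPolicy;
        return this;
    }

    int getIndexDivisor() {
        return indexDivisor;
    }

    Optional<DeletionPolicy> getDeletionPolicy() {
        return Optional.ofNullable(deletionPolicy);
    }
}

// Placeholder reader: shows a single ctor taking the parameter object, so the
// divisor is known before any terms would be loaded.
final class SketchReader {
    private final int indexDivisor;

    SketchReader(IndexReaderInitParameters params) {
        this.indexDivisor = params.getIndexDivisor();
        // A real reader would load its (subsampled) term index here.
    }

    int indexDivisor() {
        return indexDivisor;
    }
}
```

A caller would then write something like new SketchReader(new IndexReaderInitParameters().setIndexDivisor(4)), and adding another optional setting later means one more setter rather than one more ctor on every class in the chain.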