[ 
https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544136
 ] 

Chuck Williams commented on LUCENE-1052:
----------------------------------------

I can report that in our application having a formula is critical.  We have no 
control over the content our users index, nor in fact do they.  These are 
arbitrary documents.  We find a surprising number of them contain embedded 
encoded binary data.  When those are indexed, Lucene's memory consumption 
skyrockets, either bringing the whole app down with an OOM or slowing 
performance to a crawl as excessive GCs try to reclaim the tiny amount of 
working memory that remains.

Our users won't accept a solution like "wait until the problem occurs and then 
increment your termIndexDivisor".  They expect our app to manage this 
automatically.

I agree that making TermInfosReader, SegmentReader, etc. public classes is not 
a great solution.  The current patch does not do that.  It simply adds a 
configurable class that can be used to provide formula parameters as opposed to 
just value parameters.  At least for us, this special case is sufficiently 
important to outweigh any considerations of the complexity of an additional 
class.

A single configuration class could be used at the IndexReader level that 
provides for both static and dynamically-varying properties through getters, 
some of which take parameters.

Here is another possible solution.  My current thought is that the bound should 
always be a multiple of sqrt(numDocs).  E.g., see Heaps' law here:  
http://nlp.stanford.edu/IR-book/html/htmledition/heaps-law-estimating-the-number-of-terms-1.html

I'm currently using this formula in my TermInfosConfigurer:

            int bound = (int) (1 + TERM_BOUNDING_MULTIPLIER * Math.sqrt(1 + segmentNumDocs) / TERM_INDEX_INTERVAL);

This has Heaps' law as its foundation.  I provide TERM_BOUNDING_MULTIPLIER as 
the config parameter, with 0 meaning that dynamic bounding is disabled.  I also 
provide a TERM_INDEX_DIVISOR_OVERRIDE that overrides the dynamic bounding with 
a manually specified constant amount.
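
To make the arithmetic concrete, here is a minimal standalone sketch of that 
bound calculation.  The class name, method name, and the -1 "disabled" return 
value are hypothetical choices for illustration only; they are not taken from 
the attached patch, and the override parameter is left out.

```java
// Sketch only: the names and the -1 sentinel are assumptions, not the patch's API.
class TermBoundSketch {

    /**
     * Computes 1 + multiplier * sqrt(1 + segmentNumDocs) / termIndexInterval,
     * truncated to an int. A multiplier of 0 means "don't do this", which this
     * sketch reports as -1.
     */
    static int termBound(double multiplier, int termIndexInterval, int segmentNumDocs) {
        if (termIndexInterval <= 0) {
            throw new IllegalArgumentException("termIndexInterval must be positive");
        }
        if (multiplier == 0) {
            return -1; // dynamic bounding disabled
        }
        // 1.0 + ... keeps the addition in double arithmetic, avoiding int overflow.
        double sqrtDocs = Math.sqrt(1.0 + segmentNumDocs);
        return (int) (1 + multiplier * sqrtDocs / termIndexInterval);
    }

    public static void main(String[] args) {
        // 9,999 docs, interval 128, multiplier 1000:
        // sqrt(10000) = 100, 1000 * 100 / 128 = 781.25, plus 1, truncated.
        System.out.println(termBound(1000.0, 128, 9999)); // prints 782
    }
}
```

Because the bound grows with the square root of the segment size, quadrupling 
the number of documents only roughly doubles it.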

If that approach would be acceptable to Lucene in general, then we just need 
two static parameters.  However, I don't have enough experience with how well 
this formula works in our user base yet to know whether or not we'll tune it 
further.




> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set during indexing time, lets you trade off
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371
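
The sub-sampling described in the issue can be illustrated with a simplified 
model.  The sketch below is not Lucene code: the class and method names are 
hypothetical, and it assumes the first term of a segment is always an index 
entry.  It only shows which term positions stay in RAM for a given 
termIndexInterval and indexDivisor.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model only: hypothetical names, not Lucene's TermInfosReader.
class IndexDivisorSketch {

    /**
     * Returns the positions of the terms a reader would hold in RAM when it
     * keeps every (termIndexInterval * indexDivisor)-th term, starting at 0.
     */
    static List<Integer> loadedTermPositions(int numTerms, int termIndexInterval, int indexDivisor) {
        if (termIndexInterval <= 0 || indexDivisor <= 0) {
            throw new IllegalArgumentException("interval and divisor must be positive");
        }
        // long arithmetic so a large interval * divisor cannot overflow.
        long step = (long) termIndexInterval * indexDivisor;
        List<Integer> positions = new ArrayList<>();
        for (long pos = 0; pos < numTerms; pos += step) {
            positions.add((int) pos);
        }
        return positions;
    }

    public static void main(String[] args) {
        // 1000 terms, interval 128: divisor 1 keeps 8 entries, divisor 2 keeps 4.
        System.out.println(loadedTermPositions(1000, 128, 1).size());
        System.out.println(loadedTermPositions(1000, 128, 2).size());
    }
}
```

Doubling the divisor halves the number of entries held in memory, at the cost 
of scanning up to twice as many terms after each index lookup.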

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.