[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Chuck Williams (JIRA) Sun, 18 Nov 2007 09:51:13 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543383
 ]


Chuck Williams commented on LUCENE-1052:
----------------------------------------

I believe this needs to be a formula, because a reasonable bound on the number 
of cached terms is in general a function of the number of documents in the 
segment and the nature of the index (e.g., types of fields).  A common policy 
would be to enforce that RAM usage for cached terms grows no faster than 
logarithmically in the number of documents.  The appropriate formula will 
depend on the index, i.e. on the application.  It might be of the form 
c*ln(numdocs+k), where c and k are constants dependent on the index.
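
As a rough illustration of the formula, the sketch below computes a cap of 
c*ln(numdocs+k) terms and derives the smallest divisor that respects it.  The 
class and method names are hypothetical and not part of Lucene; the constants 
c and k would have to be tuned per index.

```java
// Hypothetical helper, not part of Lucene: a logarithmic bound on cached terms.
class LogTermBound {

    // Cap on cached terms for a segment: c * ln(numDocs + k), at least 1.
    // Assumes numDocs + k > 0 so the logarithm is defined.
    static long maxCachedTerms(double c, double k, long numDocs) {
        return Math.max(1L, (long) Math.floor(c * Math.log(numDocs + k)));
    }

    // Smallest divisor d such that ceil(indexedTerms / d) <= maxTerms,
    // i.e. d = ceil(indexedTerms / maxTerms), and never less than 1.
    static int divisorFor(long indexedTerms, long maxTerms) {
        if (maxTerms < 1) {
            throw new IllegalArgumentException("maxTerms must be >= 1");
        }
        if (indexedTerms <= maxTerms) {
            return 1; // everything fits, no sub-sampling needed
        }
        long d = (indexedTerms + maxTerms - 1) / maxTerms;
        return (int) Math.min(Integer.MAX_VALUE, d);
    }
}
```

For example, a segment with 1000 indexed terms and a cap of 300 would get a 
divisor of 4, which keeps 250 terms in RAM.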

One consequence of this approach, or any approach along these lines, is that 
the indexDivisor will vary across the segments, both in a single index and 
across indexes.  It seems to me from the code that this should work fine.

This leaves the issue of how to best specify an arbitrary formula.  This 
requires a method to compute the max cached terms allowed for a segment based 
on the number of docs in the segment, the number of terms in the segment's 
index, and possibly other factors.  The most direct way to do this is to 
introduce an interface, e.g. TermInfosConfigurer, to define the method 
signature, and to add setTermInfosConfigurer as an alternative to 
setTermInfosIndexDivisor.  It would need to be in all the same places.
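
A minimal sketch of what such an interface might look like follows.  Only the 
name TermInfosConfigurer comes from the proposal above; the method name, its 
parameters, and both implementations are assumptions made for illustration.

```java
// Hypothetical interface: the method name and parameters are assumptions.
interface TermInfosConfigurer {
    // Maximum number of index terms a reader may cache for one segment.
    long getMaxTermsCached(int numDocs, long numIndexTerms);
}

// Reproduces the behaviour of a fixed divisor: keep every divisor-th term.
final class FixedDivisorConfigurer implements TermInfosConfigurer {
    private final int divisor;

    FixedDivisorConfigurer(int divisor) {
        if (divisor < 1) {
            throw new IllegalArgumentException("divisor must be >= 1");
        }
        this.divisor = divisor;
    }

    public long getMaxTermsCached(int numDocs, long numIndexTerms) {
        return (numIndexTerms + divisor - 1) / divisor; // ceil division
    }
}

// Dynamic formula: cap grows logarithmically with the segment's doc count.
final class LogConfigurer implements TermInfosConfigurer {
    private final double c;
    private final double k;

    LogConfigurer(double c, double k) {
        this.c = c;
        this.k = k;
    }

    public long getMaxTermsCached(int numDocs, long numIndexTerms) {
        long bound = Math.max(1L, (long) (c * Math.log(numDocs + k)));
        return Math.min(bound, numIndexTerms); // never more than exist
    }
}
```

The fixed-divisor implementation suggests that the existing setter could be 
expressed as a special case of the configurer.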

A more general approach would be to introduce an IndexConfigurer class which 
over time could hold additional methods like this.  It could even replace the 
current setters on IndexReader (as well as IndexWriter, etc.) with a more 
general mechanism that would allow dynamic parameters used to configure any 
classes in the index structure.  Each constructor would be passed the 
IndexConfigurer and call getters or other methods on it to obtain its config.  
The methods could provide constant values or dynamic formulas.
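
The constructor-passing idea could look roughly like the sketch below.  Every 
name other than IndexConfigurer is hypothetical, SegmentTermIndex is only a 
stand-in for a real index-structure class, and the getter signature is an 
assumption.

```java
// Hypothetical base class: getters return constant defaults and can be
// overridden with dynamic formulas.
class IndexConfigurer {
    int getTermInfosIndexDivisor(int numDocs, long numIndexTerms) {
        return 1; // constant default: load every indexed term
    }
}

// Dynamic variant: picks the smallest divisor that keeps at most maxTerms.
final class CappedIndexConfigurer extends IndexConfigurer {
    private final long maxTerms;

    CappedIndexConfigurer(long maxTerms) {
        if (maxTerms < 1) {
            throw new IllegalArgumentException("maxTerms must be >= 1");
        }
        this.maxTerms = maxTerms;
    }

    @Override
    int getTermInfosIndexDivisor(int numDocs, long numIndexTerms) {
        long d = (numIndexTerms + maxTerms - 1) / maxTerms; // ceil division
        return (int) Math.max(1L, Math.min(Integer.MAX_VALUE, d));
    }
}

// Stand-in for an index-structure class that reads its config on construction.
final class SegmentTermIndex {
    final int divisor;

    SegmentTermIndex(IndexConfigurer config, int numDocs, long numIndexTerms) {
        this.divisor =
            Math.max(1, config.getTermInfosIndexDivisor(numDocs, numIndexTerms));
    }
}
```

Because each segment asks the configurer with its own counts, the divisor can 
differ from segment to segment.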

I'm going to implement the straightforward solution at the moment in our older 
version of Lucene, then will sync up to whatever you guys decide is best for 
the trunk later.
 

> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1052.patch
>
>
> The termIndexInterval, set at indexing time, lets you trade off
> how much RAM is used by a reader to load the indexed terms vs. the cost
> of seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the termIndexInterval to use
> less RAM.  E.g., a setting of 2 means every 2 * termIndexInterval-th
> term is loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


