[ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544136 ]
Chuck Williams commented on LUCENE-1052:
----------------------------------------

I can report that in our application having a formula is critical. We have no control over the content our users index, nor in fact do they. These are arbitrary documents, and a surprising number of them contain embedded encoded binary data. When those are indexed, Lucene's memory consumption skyrockets, either bringing the whole app down with an OOM or slowing performance to a crawl as excessive GCs reclaim a tiny remaining working memory space.

Our users won't accept a solution like "wait until the problem occurs and then increment your termIndexDivisor". They expect our app to manage this automatically.

I agree that making TermInfosReader, SegmentReader, etc. public classes is not a great solution. The current patch does not do that. It simply adds a configurable class that can be used to provide formula parameters as opposed to just value parameters. At least for us, this special case is sufficiently important to outweigh any considerations of the complexity of an additional class.

A single configuration class could be used at the IndexReader level that provides for both static and dynamically-varying properties through getters, some of which take parameters.

Here is another possible solution. My current thought is that the bound should always be a multiple of sqrt(numDocs). E.g., see Heaps' law here:

http://nlp.stanford.edu/IR-book/html/htmledition/heaps-law-estimating-the-number-of-terms-1.html

I'm currently using this formula in my TermInfosConfigurer:

    int bound = (int) (1+TERM_BOUNDING_MULTIPLIER*Math.sqrt(1+segmentNumDocs)/TERM_INDEX_INTERVAL);

This has Heaps' law as its foundation. I provide TERM_BOUNDING_MULTIPLIER as the config parameter, with 0 meaning don't do this. I also provide a TERM_INDEX_DIVISOR_OVERRIDE that overrides the dynamic bounding with a manually specified constant amount.
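As a rough illustration of how such a bound might be turned into a divisor, here is a minimal sketch. It is not the code from termInfosConfigurer.patch. The class and method names are hypothetical, and it assumes (this is not stated in the comment) that the bound caps the number of term index entries held in RAM, that the divisor is the smallest integer keeping the loaded entries within that bound, and that an override of 0 means "not set". For background, Heaps' law estimates vocabulary size as M = k * T^b with b typically near 0.5, where T counts tokens; the formula above uses the segment's document count in that role.

```java
/**
 * Hypothetical sketch only; not part of Lucene or of the attached patch.
 * Assumes the bound limits how many term index entries are kept in RAM.
 */
final class TermIndexBoundSketch {

    private TermIndexBoundSketch() {}

    /**
     * The formula quoted in the comment, with the constants passed as
     * arguments. 1.0 is used instead of 1 so that 1 + segmentNumDocs
     * cannot overflow an int.
     */
    static int bound(double termBoundingMultiplier, int segmentNumDocs, int termIndexInterval) {
        return (int) (1 + termBoundingMultiplier * Math.sqrt(1.0 + segmentNumDocs) / termIndexInterval);
    }

    /**
     * Smallest divisor d >= 1 such that ceil(indexEntryCount / d) <= bound.
     * This mapping from bound to divisor is an assumption of the sketch.
     */
    static int divisorFor(long indexEntryCount, int bound) {
        if (bound <= 0) {
            throw new IllegalArgumentException("bound must be positive");
        }
        long d = (indexEntryCount + bound - 1) / bound; // ceiling division
        return (int) Math.max(1L, Math.min((long) Integer.MAX_VALUE, d));
    }

    /**
     * Combines the two configuration values described in the comment:
     * a positive override wins; a multiplier of 0 disables dynamic bounding.
     */
    static int chooseDivisor(double termBoundingMultiplier, int divisorOverride,
                             int segmentNumDocs, long indexEntryCount, int termIndexInterval) {
        if (divisorOverride > 0) {
            return divisorOverride;          // manually specified constant
        }
        if (termBoundingMultiplier <= 0) {
            return 1;                        // 0 means "don't do this"
        }
        int b = bound(termBoundingMultiplier, segmentNumDocs, termIndexInterval);
        return divisorFor(indexEntryCount, b);
    }
}
```

For example, with a multiplier of 1280, 9999 documents in the segment and a term index interval of 128, the bound is 1 + 1280 * 100 / 128 = 1001. A segment with 5000 index entries would then get a divisor of 5 under the assumptions above.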
If that approach would be acceptable to Lucene in general, then we just need two static parameters. However, I don't yet have enough experience with how well this formula works across our user base to know whether or not we'll tune it further.

> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
> Key: LUCENE-1052
> URL: https://issues.apache.org/jira/browse/LUCENE-1052
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.2
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set during indexing time, lets you trade off
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the termIndexInterval to use
> less RAM. EG a setting of 2 means every (2 * termIndexInterval)th term
> is loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
> http://www.gossamer-threads.com/lists/lucene/java-dev/54371
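To make the sub-sampling in the issue summary concrete, here is a small sketch of the idea. It is an illustration only, not the TermInfosReader code. The class and method names are hypothetical, and it assumes the on-disk term index is available as a list with one entry per termIndexInterval terms, starting with the first term.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of an index divisor; not Lucene code. */
final class IndexDivisorSketch {

    private IndexDivisorSketch() {}

    /**
     * Keeps every indexDivisor-th entry of the term index, starting with
     * the first, so roughly 1/indexDivisor of the entries stay in RAM.
     */
    static <T> List<T> subsample(List<T> indexEntries, int indexDivisor) {
        if (indexDivisor < 1) {
            throw new IllegalArgumentException("indexDivisor must be >= 1");
        }
        List<T> kept = new ArrayList<>();
        for (int i = 0; i < indexEntries.size(); i += indexDivisor) {
            kept.add(indexEntries.get(i));
        }
        return kept;
    }

    /** Spacing, in terms, between two entries that remain loaded. */
    static long effectiveInterval(int termIndexInterval, int indexDivisor) {
        return (long) termIndexInterval * indexDivisor;
    }
}
```

With a termIndexInterval of 128 and a divisor of 2, the loaded entries are 256 terms apart, so a lookup may have to scan up to twice as many terms on disk in exchange for about half the RAM.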