Thanks Mike,

At some point maybe the File Formats document could be updated to make it clear 
that the tii has an entry similar to every IndexInterval'th tis entry, but instead 
of holding frq/prx deltas it holds absolute pointers.  Is it worth entering a 
JIRA issue?  I would be happy to update the doc myself, but I don't think I 
have enough of an in-depth understanding.

As you have probably guessed, I'm trying to understand the performance impact 
of the more than 2.4 billion unique terms in our indexes 
(https://issues.apache.org/jira/browse/LUCENE-2257).  We suspect that a very 
large percentage of these terms are due to dirty OCR, but have not yet found a 
good way to eliminate a significant amount of dirty OCR.

I assume that these cause a few extra steps in the binary search of the tii 
file in memory but we probably won't notice that performance impact since our 
bottleneck is disk I/O for reading long postings lists for frequently occurring 
terms.

Am I correct in assuming that even if we have a very large number of garbage 
terms in our prx file, the overall size of the file does not significantly 
affect the number of disk seeks or amount of data to be read since Lucene can 
seek to the beginning of the postings for any particular term?

>> I would love to get ahold of your terms dict :)  I'd have a field day
>>testing Lucene against it... I'm very curious how the flex improvements 
>>affect your usage.

Sometime in the next month or so we will get our new test server and after I 
get the backup of testing jobs under control, I'd love to do some testing with 
flex and our data.  

Tom

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Tuesday, April 13, 2010 5:27 AM
To: java-user@lucene.apache.org
Subject: Re: Understanding lucene indexes and disk I/O

Hi Tom,

Fear not: we only scan up to 128 terms, to find the specific term.

First, the terms dict index (tii) is fully loaded into RAM, and then a
binary search is done on this (in-RAM) to find the nearest index term
just before the term you want.  Then, we seek to that spot in the
main terms dict (tis) file, and scan (at most 128 entries) to find the
term.

On the frq/prx deltas: the tii holds absolute pointers.  So, on
seeking to that first spot in the tis, we know the absolute frq/prx
(long) offsets, and then during scanning we just add the deltas we
see to those base absolutes.
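
To make the steps above concrete, here is a rough sketch in Java.  It is a
toy model and not Lucene's actual code: the class TermsDictSketch, its
arrays and the lookup method are hypothetical names made up for
illustration, the tis file is modeled as in-memory arrays instead of a file
on disk, and the index interval of 128 is assumed from the "at most 128
entries" figure above.

```java
import java.util.Arrays;

/** Toy model of the tii/tis lookup; hypothetical, not Lucene's real classes. */
final class TermsDictSketch {
    // Assumed interval: every 128th term of the main dict gets an index entry.
    static final int INDEX_INTERVAL = 128;

    // Stand-in for the tis file: sorted terms, each with frq/prx pointer deltas
    // relative to the previous term.
    private final String[] terms;
    private final long[] frqDelta;
    private final long[] prxDelta;

    // Stand-in for the tii file: every INDEX_INTERVAL-th term, with ABSOLUTE pointers.
    private final String[] indexTerms;
    private final long[] indexFrq;
    private final long[] indexPrx;

    TermsDictSketch(String[] sortedTerms, long[] frqDelta, long[] prxDelta) {
        this.terms = sortedTerms;
        this.frqDelta = frqDelta;
        this.prxDelta = prxDelta;
        int n = (sortedTerms.length + INDEX_INTERVAL - 1) / INDEX_INTERVAL;
        indexTerms = new String[n];
        indexFrq = new long[n];
        indexPrx = new long[n];
        long frq = 0, prx = 0;
        for (int i = 0; i < sortedTerms.length; i++) {
            frq += frqDelta[i];
            prx += prxDelta[i];
            if (i % INDEX_INTERVAL == 0) {
                int k = i / INDEX_INTERVAL;
                indexTerms[k] = sortedTerms[i];
                indexFrq[k] = frq;   // absolute, so no earlier entry is needed
                indexPrx[k] = prx;
            }
        }
    }

    /** Returns {frqPointer, prxPointer} for the term, or null if it is absent. */
    long[] lookup(String term) {
        // 1. Binary search the in-RAM index for the last index term <= term.
        int pos = Arrays.binarySearch(indexTerms, term);
        int k = (pos >= 0) ? pos : -pos - 2;   // insertion point minus one
        if (k < 0) {
            return null;                        // sorts before the first term
        }
        // 2. "Seek" to that spot in the main dict, starting from absolute pointers.
        long frq = indexFrq[k];
        long prx = indexPrx[k];
        int start = k * INDEX_INTERVAL;
        int end = Math.min(start + INDEX_INTERVAL, terms.length);
        // 3. Scan at most INDEX_INTERVAL entries, adding deltas to the base.
        for (int i = start; i < end; i++) {
            if (i > start) {
                frq += frqDelta[i];
                prx += prxDelta[i];
            }
            int cmp = terms[i].compareTo(term);
            if (cmp == 0) {
                return new long[] {frq, prx};
            }
            if (cmp > 0) {
                return null;                    // passed where the term would be
            }
        }
        return null;
    }
}
```

The point of the sketch is that the cost of a lookup is one binary search
over the index entries plus a scan of at most 128 entries, however many
terms the main dict holds, and the returned pointers let the reader jump
straight to that term's postings.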



Mike


