Thanks Mike,

At some point maybe the File Formats document could be updated to make it clear that the tii has an entry similar to the IndexInterval'th tis entry, but instead of holding frq/prx deltas it holds absolute pointers. Is it worth entering a JIRA issue? I would be happy to update the doc myself, but I don't think I have enough of an in-depth understanding.

As you have probably guessed, I'm trying to understand the impact of the over 2.4 billion unique terms in our indexes on performance (https://issues.apache.org/jira/browse/LUCENE-2257). We suspect that a very large percentage of these terms are due to dirty OCR, but have not yet found a good way to eliminate a significant amount of dirty OCR. I assume that these cause a few extra steps in the binary search of the tii file in memory, but we probably won't notice that performance impact since our bottleneck is disk I/O for reading long postings lists for frequently occurring terms.

Am I correct in assuming that even if we have a very large number of garbage terms in our prx file, the overall size of the file does not significantly affect the number of disk seeks or amount of data to be read, since Lucene can seek to the beginning of the postings for any particular term?

>> I would love to get ahold of your terms dict :) I'd have a field day
>> testing Lucene against it... I'm very curious how the flex improvements
>> affect your usage.

Sometime in the next month or so we will get our new test server, and after I get the backup of testing jobs under control, I'd love to do some testing with flex and our data.

Tom

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Tuesday, April 13, 2010 5:27 AM
To: java-user@lucene.apache.org
Subject: Re: Understanding lucene indexes and disk I/O

Hi Tom,

Fear not: we only scan up to 128 terms, to find the specific term.

First, the terms dict index (tii) is fully loaded into RAM, and then a binary search is done on this (in-RAM) to find the nearest index term just before the term you want. Then, we seek to that spot in the main terms dict (tis) file, and scan (at most 128 entries) to find the term.

On the frq/prx deltas: the tii holds absolute pointers.
So, on seeking to that first spot in the tis, we know the absolute frq/prx (long) offsets, and then during scanning we just add the deltas we see to those base absolutes.

Mike
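
The lookup described above (binary search of the in-RAM tii, then a bounded scan of the tis that adds deltas to absolute base pointers) can be sketched roughly as follows. This is a simplified in-memory model, not Lucene's code or its on-disk format: the names `build` and `lookup`, the tuple layouts, and the use of Python lists in place of files are all assumptions made for illustration. Only the interval of 128 and the absolute-versus-delta split come from the thread.

```python
from bisect import bisect_right

INDEX_INTERVAL = 128  # one index (tii) entry per 128 terms, as in the thread


def build(entries):
    """Build toy 'tis' and 'tii' lists (hypothetical layout).

    entries: list of (term, frq_abs, prx_abs), sorted by term.
    tis holds deltas against the previous term; tii holds absolute pointers.
    """
    tis = []  # (term, frq_delta, prx_delta)
    tii = []  # (term, position_in_tis, frq_abs, prx_abs)
    prev_frq = prev_prx = 0
    for i, (term, frq, prx) in enumerate(entries):
        tis.append((term, frq - prev_frq, prx - prev_prx))
        if i % INDEX_INTERVAL == 0:
            tii.append((term, i, frq, prx))
        prev_frq, prev_prx = frq, prx
    return tis, tii


def lookup(tis, tii, target):
    """Return (frq_abs, prx_abs, entries_scanned), or None if not found."""
    # Binary search the in-RAM index for the last index term <= target.
    # (The key list is rebuilt on every call only to keep the sketch short.)
    index_terms = [entry[0] for entry in tii]
    slot = bisect_right(index_terms, target) - 1
    if slot < 0:
        return None  # target sorts before the first term
    _, pos, frq, prx = tii[slot]  # absolute base pointers from the tii

    # "Seek" to pos in the tis and scan at most INDEX_INTERVAL entries.
    end = min(pos + INDEX_INTERVAL, len(tis))
    for i in range(pos, end):
        term, frq_delta, prx_delta = tis[i]
        if i > pos:
            # Past the index term: add each delta to the running absolutes.
            frq += frq_delta
            prx += prx_delta
        if term == target:
            return frq, prx, i - pos + 1
        if term > target:
            break  # terms are sorted, so the target is absent
    return None


if __name__ == "__main__":
    demo = [(f"term{n:06d}", n * 10, n * 25 + 3) for n in range(1000)]
    demo_tis, demo_tii = build(demo)
    print(lookup(demo_tis, demo_tii, "term000130"))
```

In this model the cost of a lookup is a binary search over roughly N/128 in-memory entries plus a scan of at most 128 entries, so additional unique terms deepen the binary search slightly but do not lengthen the scan.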