On Sat, Apr 27, 2013 at 8:41 PM, Manuel Le Normand <manuel.lenorm...@gmail.com> wrote:
> Hi, real thanks for the previous reply.
> For now I'm not able to make a separation between these useless words,
> whether they contain words or digits.
> I liked the idea of iterating with TermsEnum. Will it also delete the
> occurrences of these terms in the other file formats (termVectors etc.)?
Yes, it will. But since Lucene only marks documents as deleted, you will
need to force a merge in order to expunge the deletes (there is a rough
sketch of this at the end of this mail).

> As I understand, the StrField implementation is a kind of TrieField ordered
> by the leading char (as searches support wildcards); every term in the
> dictionary points to the inverted file (frq) to find the list (not bitmap)
> of the docs containing the term.

These details are codec-specific, but they are correct for the current
postings format. You can have a look at
https://builds.apache.org/job/Lucene-Artifacts-trunk/javadoc/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Termdictionary
for more information.

> Let's say I query for the term "hello" many times within different queries,
> the OS will load into memory the matching 4k chunk from the dictionary and
> frq. If most of my terms are garbage, much of the dictionary chunk will be
> useless, whereas the frq chunk will be more efficiently used as it contains
> all the <termFreq> list. Still I'm not sure a typical <termFreqs,skipData>
> chunk per term gets to 4k.

Postings lists are compressed, and most terms usually occur in only a few
documents, so most postings lists are likely much smaller than 4 KB.

> If my assumption's right, I should lower the memory chunks (through
> the OS) to about the 0.9th percentile of the <termFreq,skipData> chunk for
> a single term in the frq (neglecting for instance the use of prx and
> termVectors). Any cons to the idea? Do you have any estimation of the
> magnitude of a frq chunk for an N-times occurring term, or how can I check
> it on my own?

I have never tuned this myself. I guess the main issue is that it could
increase bookkeeping (to keep track of the pages) and thus CPU usage.
Unfortunately, the size of the postings lists is hard to predict because it
depends on the data: they compress better when they are large and evenly
distributed across all doc IDs. You could try to compare the sum of your
doc freqs with the total byte size of the postings lists to get a rough
estimate (see the second sketch at the end of this mail).

--
Adrien
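For the deletion part, here is a rough, untested sketch against the
Lucene 4.x APIs. The field name "body" and the isGarbage() heuristic are
placeholders for whatever you end up using; note that this deletes every
document containing a garbage term, and that those documents only disappear
from the postings, term vectors etc. once the deletes have been merged away:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class PurgeGarbageTerms {

  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File(args[0]));
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
        Version.LUCENE_42, new StandardAnalyzer(Version.LUCENE_42)));
    DirectoryReader reader = DirectoryReader.open(dir);

    // Walk the term dictionary of the field and mark every document that
    // contains a term we consider garbage as deleted.
    Terms terms = MultiFields.getTerms(reader, "body");
    if (terms != null) {
      TermsEnum termsEnum = terms.iterator(null);
      BytesRef term;
      while ((term = termsEnum.next()) != null) {
        if (isGarbage(term.utf8ToString())) {
          // deepCopyOf because the enum reuses its BytesRef across next() calls
          writer.deleteDocuments(new Term("body", BytesRef.deepCopyOf(term)));
        }
      }
    }
    reader.close();

    // Deletes are only markers until the segments holding them are merged.
    writer.forceMergeDeletes();
    writer.close();
    dir.close();
  }

  // Placeholder heuristic: decide here what a "useless" term looks like.
  private static boolean isGarbage(String termText) {
    return termText.length() > 50;
  }
}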
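And a similarly rough sketch for the bytes-per-posting estimate: sum the
doc freqs reported by the term dictionary and divide the total size of the
postings files by it (.frq is the extension of the pre-4.1 postings format,
.doc the Lucene41 one; "body" is again a made-up field name). This only
gives an average, and the distribution across terms can be very skewed:

import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class PostingsSizeEstimate {

  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File(args[0]));
    DirectoryReader reader = DirectoryReader.open(dir);

    // Sum of doc freqs over all terms of the field, as recorded by the
    // codec (-1 if the codec does not store it).
    Terms terms = MultiFields.getTerms(reader, "body");
    long sumDocFreq = terms == null ? -1 : terms.getSumDocFreq();

    // Total size of the files holding the doc/freq postings. This assumes
    // a non-compound index; with compound files (.cfs) the postings are
    // not visible as separate files.
    long postingsBytes = 0;
    for (String file : dir.listAll()) {
      if (file.endsWith(".frq") || file.endsWith(".doc")) {
        postingsBytes += dir.fileLength(file);
      }
    }

    System.out.println("sumDocFreq=" + sumDocFreq
        + ", postingsBytes=" + postingsBytes
        + ", ~bytes per posting=" + ((double) postingsBytes / sumDocFreq));

    reader.close();
    dir.close();
  }
}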