On Sat, Apr 27, 2013 at 8:41 PM, Manuel Le Normand <manuel.lenorm...@gmail.com> wrote:
> Hi, real thanks for the previous reply.
> For now I'm not able to make a separation between these useless words,
> whether they contain words or digits.
> I liked the idea of iterating with TermsEnum. Will it also delete the
> occurrences of these terms in the other file formats (termVectors etc.)?
Yes, it will. But since Lucene only marks documents as deleted, you will
need to force a merge in order to expunge the deletes (there is a rough
sketch of this at the end of this mail).

> As I understand, the StrField implementation is a kind of TrieField ordered
> by the leading char (as searches support wildcards); every term in the
> dictionary points to the inverted file (frq) to find the list (not bitmap)
> of the docs containing the term.

These details are codec-specific, but they are correct for the current
postings format. You can have a look at
https://builds.apache.org/job/Lucene-Artifacts-trunk/javadoc/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Termdictionary
for more information.

> Let's say I query for the term "hello" many times within different queries,
> the OS will load into memory the matching 4k chunk from the dictionary and
> frq. If most of my terms are garbage, much of the dictionary chunk will be
> useless, whereas the frq chunk will be more efficiently used as it contains
> all the <termFreq> list. Still I'm not sure a typical <termFreqs,skipData>
> chunk per term gets to 4k.

Postings lists are compressed, and most terms usually occur in only a few
documents, so most postings lists are likely much smaller than 4 KB.

> If my assumption's right, I should lower the memory chunks (through
> the OS) to about the 0.9th percentile of the <termFreq,skipData> chunk for
> a single term in the frq (neglecting for instance the use of prx and
> termVectors). Any cons to the idea? Do you have any estimation of the
> magnitude of a frq chunk for an N-times occurring term, or how can I check
> it on my own?

I have never tuned this myself. I guess the main issue is that it could
increase bookkeeping (to keep track of the pages) and thus CPU usage.
Unfortunately, the size of the postings lists is hard to predict because it
depends on the data: they compress better when they are large and evenly
distributed across all doc IDs. You could try to compare the sum of your
doc freqs with the total byte size of the postings lists to get a rough
estimate (see the second sketch at the end of this mail).

--
Adrien
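For the deletion part, here is a rough, untested sketch against the
Lucene 4.x APIs. The field name "body" and the isGarbage() heuristic are
placeholders for whatever you end up using; note that this deletes every
document containing a garbage term, and that those documents only disappear
from the postings, term vectors etc. once the deletes have been merged away:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class PurgeGarbageTerms {

  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File(args[0]));
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
        Version.LUCENE_42, new StandardAnalyzer(Version.LUCENE_42)));
    DirectoryReader reader = DirectoryReader.open(dir);

    // Walk the term dictionary of the field and mark every document that
    // contains a term we consider garbage as deleted.
    Terms terms = MultiFields.getTerms(reader, "body");
    if (terms != null) {
      TermsEnum termsEnum = terms.iterator(null);
      BytesRef term;
      while ((term = termsEnum.next()) != null) {
        if (isGarbage(term.utf8ToString())) {
          // deepCopyOf because the enum reuses its BytesRef across next() calls
          writer.deleteDocuments(new Term("body", BytesRef.deepCopyOf(term)));
        }
      }
    }
    reader.close();

    // Deletes are only markers until the segments holding them are merged.
    writer.forceMergeDeletes();
    writer.close();
    dir.close();
  }

  // Placeholder heuristic: decide here what a "useless" term looks like.
  private static boolean isGarbage(String termText) {
    return termText.length() > 50;
  }
}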
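And a similarly rough sketch for the bytes-per-posting estimate: sum the
doc freqs reported by the term dictionary and divide the total size of the
postings files by it (.frq is the extension of the pre-4.1 postings format,
.doc the Lucene41 one; "body" is again a made-up field name). This only
gives an average, and the distribution across terms can be very skewed:

import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class PostingsSizeEstimate {

  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File(args[0]));
    DirectoryReader reader = DirectoryReader.open(dir);

    // Sum of doc freqs over all terms of the field, as recorded by the
    // codec (-1 if the codec does not store it).
    Terms terms = MultiFields.getTerms(reader, "body");
    long sumDocFreq = terms == null ? -1 : terms.getSumDocFreq();

    // Total size of the files holding the doc/freq postings. This assumes
    // a non-compound index; with compound files (.cfs) the postings are
    // not visible as separate files.
    long postingsBytes = 0;
    for (String file : dir.listAll()) {
      if (file.endsWith(".frq") || file.endsWith(".doc")) {
        postingsBytes += dir.fileLength(file);
      }
    }

    System.out.println("sumDocFreq=" + sumDocFreq
        + ", postingsBytes=" + postingsBytes
        + ", ~bytes per posting=" + ((double) postingsBytes / sumDocFreq));

    reader.close();
    dir.close();
  }
}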