On Mon, Apr 29, 2013 at 1:22 PM, Adrien Grand <jpou...@gmail.com> wrote:

> On Sat, Apr 27, 2013 at 8:41 PM, Manuel Le Normand
> <manuel.lenorm...@gmail.com> wrote:
> > Hi, real thanks for the previous reply.
> > For now I'm not able to make a separation between these useless words,
> > whether they contain words or digits.
> > I liked the idea of iterating with TermsEnum. Will it also delete the
> > occurrences of these terms in the other file formats (termVectors etc.)?
>
> Yes it will. But since Lucene only marks documents as deleted, you will
> need to force a merge in order to expunge deletes.
>

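Just to check my understanding of the force-merge part, I assume something
along these lines is what's needed afterwards (untested sketch against the
4.x API; the index path and analyzer are placeholders):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ExpungeDeletes {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File(args[0]));        // path to the index
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_41,
        new StandardAnalyzer(Version.LUCENE_41));                // analyzer is irrelevant here
    IndexWriter writer = new IndexWriter(dir, iwc);
    // Deletes only mark documents; this merge rewrites the segments so the
    // deleted docs disappear from the tim/frq/termVectors files as well.
    writer.forceMergeDeletes();
    writer.close();
    dir.close();
  }
}
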
I want to make sure: iterating with the TermsEnum will not delete all the
terms occurring in the same doc that contains the single term, but only the
single term itself, right?
Going through the TermsEnum class I cannot find any "delete" method; how can
I do this?
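
For reference, this is roughly how I'm walking the terms at the moment (4.x
API; the field name "text" is just an example). The enum itself looks
read-only to me, which is why I can't see where a delete would fit:

import java.io.File;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class DumpRareTerms {
  public static void main(String[] args) throws Exception {
    DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File(args[0])));
    Terms terms = MultiFields.getTerms(reader, "text");   // "text" is a placeholder field name
    TermsEnum te = terms.iterator(null);
    BytesRef term;
    while ((term = te.next()) != null) {
      if (te.docFreq() == 1) {             // terms present in a single doc: my "garbage" candidates
        System.out.println(term.utf8ToString());
        // TermsEnum has no delete(); an actual delete would have to go through
        // an IndexWriter, and that works at the document level.
      }
    }
    reader.close();
  }
}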


> > As I understand it, the strField implementation is a kind of TrieField
> > ordered by the leading char (as searches support wildcards), and every
> > term in the dictionary points to the inverted file (frq) to find the
> > list (not a bitmap) of the docs containing the term.
>
> These details are codec-specific, but they are correct for the current
> postings format. You can have a look at
>
> https://builds.apache.org/job/Lucene-Artifacts-trunk/javadoc/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Termdictionary
> for more information.
>
> > Let's say I query for the term "hello" many times within different
> > queries; the OS will load into memory the matching 4k chunk from the
> > dictionary and frq. If most of my terms are garbage, much of the
> > dictionary chunk will be useless, whereas the frq chunk will be used
> > more efficiently as it contains all the <termFreq> lists. Still, I'm
> > not sure a typical <termFreqs,skipData> chunk per term gets to 4k.
>
> Postings lists are compressed and most terms are usually present in
> only a few documents so most postings lists are likely much smaller
> than 4kb.
>
I actually get far smaller entries. Assuming linearity, I get only about 30
bytes per term in the *.tim files and an average of 5 bytes per doc freq
(i.e. per occurrence of a term in a doc), which is surprisingly efficient and
low. In any case, that's nowhere near the order of magnitude of 4k, so I will
not attempt to tune this. Calculations (and assumptions) show that omitting
all the unique terms would reduce the *.tim file by 80-90%, but since these
terms account for only about 10% of the word occurrences, they would give
only about that much reduction in the pos and frq files. I guess the trie
would also be a bit more efficient, but I don't reckon it's worth it.
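
In case it helps, this is more or less the back-of-the-envelope I used: total
file sizes divided by the term and doc-freq counts the reader reports. The
file suffixes are assumptions on my side, since they depend on the codec
(4.1 writes .doc where older formats wrote .frq):

import java.io.File;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class AverageEntrySizes {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File(args[0]));
    DirectoryReader reader = DirectoryReader.open(dir);
    Terms terms = MultiFields.getTerms(reader, "text");    // placeholder field name
    long numTerms = terms.size();                          // -1 if the codec cannot report it
    long sumDocFreq = terms.getSumDocFreq();               // total number of <term,doc> pairs
    long timBytes = 0, postingsBytes = 0;
    for (String file : dir.listAll()) {
      if (file.endsWith(".tim")) timBytes += dir.fileLength(file);
      if (file.endsWith(".frq") || file.endsWith(".doc")) postingsBytes += dir.fileLength(file);
    }
    System.out.printf("~%.1f bytes per term in .tim, ~%.1f bytes per doc-freq entry%n",
        (double) timBytes / numTerms, (double) postingsBytes / sumDocFreq);
    reader.close();
    dir.close();
  }
}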

If someone else has ever tuned this parameter, I'd love to know.


> > If my assumption is right, I should lower the memory chunk size (through
> > the OS) to about the 0.9 quantile of the <termFreq,skipData> chunk size
> > for a single term in the frq (neglecting for instance the use of prx and
> > termVectors). Any cons to the idea? Do you have any estimate of the
> > magnitude of a frq chunk for a term occurring N times, or how can I check
> > it on my own?
>
> I've never tuned this myself. I guess the main issue is that it
> could increase bookkeeping (to keep track of the pages) and thus CPU
> usage.
>
> Unfortunately the size of the postings lists is hard to predict
> because it depends on the data. They compress better when they are
> large and evenly distributed across all doc IDs. You could try to
> compare the sum of your doc freqs with the total byte size of the
> postings list to get a rough estimate.
>
> --
> Adrien
>
