Too many unique terms

Manuel LeNormand Wed, 24 Apr 2013 15:29:48 -0700

Hi there,
Looking at my index (about 1M docs), I see a lot of unique terms, more
than 8M, which is a significant part of my total term count. These are very
likely useless terms, binaries or other meaningless numbers that come with
a few of my docs.
I am totally fine with deleting them, so that these terms would become unsearchable.
Thinking about it, I realize that:
1. It is impossible to know a priori whether a term is unique or not, so I
cannot add them to my stop words.
2. I have a performance decrease because my cached "hot spot" chunks (4 KB)
contain useless data. It's a problem for me as I'm short on memory.
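
To make point 1 concrete, here is a minimal sketch in plain Python (not Lucene or Solr code). It assumes "unique" means a term whose document frequency is 1, and the helper names and sample tokens are made up for illustration. It shows that such terms can only be identified after every document has been seen, which is why a static stop word list cannot cover them.

```python
from collections import Counter

def document_frequencies(docs):
    """Map each term to the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term at most once per document
    return df

def singleton_terms(docs):
    """Return the terms that occur in exactly one document."""
    df = document_frequencies(docs)
    return {term for term, count in df.items() if count == 1}

# Hypothetical tokenized documents, for illustration only.
docs = [
    ["solr", "index", "a9f3c2e1"],
    ["solr", "query", "index"],
    ["query", "0x7fffab12"],
]

print(sorted(singleton_terms(docs)))
```

Adding one more document containing "a9f3c2e1" would remove it from the result, so the set is only stable for a constant index.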


Q:
Assuming a constant index, is there a way of deleting all terms that are
unique from at least the dictionary .tim and .tip files? Do I need to go into
the source code for this, and if yes, what part of it?
Will I get a significant query-time performance increase besides the better
RAM use benefit?
Are there any existing updateProcessor classes that identify non-human-readable
terms?
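
For illustration only, the kind of check I have in mind could look like the sketch below. It is a hypothetical heuristic written in plain Python, not an existing Solr updateProcessor, and the function name, thresholds and vowel rule are all assumptions that would need tuning against real data.

```python
VOWELS = set("aeiouy")

def looks_like_noise(token, max_len=25, max_digit_ratio=0.3):
    """Guess whether a token is unlikely to be a human-readable word.

    All thresholds are illustrative assumptions, not tested defaults.
    """
    if not token or len(token) > max_len:
        return True  # empty or implausibly long tokens
    digits = sum(ch.isdigit() for ch in token)
    letters = sum(ch.isalpha() for ch in token)
    if digits / len(token) > max_digit_ratio:
        return True  # mostly numeric: ids, hashes, hex dumps (also plain numbers)
    if letters >= 5 and not any(ch.lower() in VOWELS for ch in token):
        return True  # long letter run without a single vowel
    return False

for tok in ["index", "solr4", "a9f3c2e1", "xkcdqrtz"]:
    print(tok, looks_like_noise(tok))
```

Note that the digit rule also rejects ordinary numbers such as years, so it only fits if numeric terms really do not need to be searchable.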

Thanks in advance,
Manu
