Hi,

I use Lucene to index users' documents. I potentially have 2 million or more users, so I think a per-user index will not be a scalable solution. All my searches are filtered on a user UID field. As far as I know, the default similarity calculates the inverse document frequency as follows:
(float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0)
where numDocs is the number of documents in the whole index and docFreq is the number of documents in the whole index that contain term t. My problem is that this formula does not seem reliable for my system: numDocs should be the number of documents in the user's collection, and docFreq should be the number of the user's documents that contain term t. Because each term is stored as a single token, I was thinking of concatenating every term with the user's UID to keep them separate, since the term "car" for user1 is not the same as the term "car" for user2. My solution would index "carUSERUID1" and "carUSERUID2".
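
To make the idea concrete, here is a minimal sketch in plain Java (no Lucene calls). The class name PerUserIdf, the helper userTerm, the "_" separator and all the counts are made up for illustration only. It compares the IDF obtained from global statistics with the IDF obtained from per-user statistics. As far as I can tell, suffixing the term only changes docFreq, while numDocs would still be the size of the whole index unless the similarity is customized as well.

```java
public class PerUserIdf {

    // Same formula as quoted above, kept as a double for readability.
    static double idf(long docFreq, long numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Hypothetical helper: builds the per-user token that would be indexed.
    // The "_" separator is an assumption; any character that cannot occur
    // in a term or in a UID would do.
    static String userTerm(String term, String userUid) {
        return term + "_" + userUid;
    }

    public static void main(String[] args) {
        // Made-up numbers, for illustration only.
        long globalNumDocs = 1_000_000; // documents in the whole index
        long globalDocFreq = 50_000;    // documents containing "car", all users
        long userNumDocs = 200;         // documents owned by one user
        long userDocFreq = 3;           // that user's documents containing "car"

        System.out.println("indexed token: " + userTerm("car", "USERUID1"));
        System.out.println("global docFreq, global numDocs: "
                + idf(globalDocFreq, globalNumDocs));
        System.out.println("user docFreq, global numDocs:   "
                + idf(userDocFreq, globalNumDocs));
        System.out.println("user docFreq, user numDocs:     "
                + idf(userDocFreq, userNumDocs));
    }
}
```

The second line printed is what I would expect to get from the concatenated terms alone, and the third is what I actually want.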

What would you suggest?

Regards,

Lionel
