index per-user basis and document frequency

Lionel Duboeuf Mon, 15 Jun 2009 14:07:00 -0700

Hi,

I use Lucene to index users' documents. I potentially have 2 million or more users, so I think a per-user index will not be a scalable solution. All my searches are filtered on a user UID field. As far as I know, the default similarity calculates the inverse document frequency as follows:

Math.log(numDocs/(double)(docFreq+1)) + 1.0

where numDocs stands for the number of documents in the whole collection and docFreq for the number of documents in the whole collection that contain term t.

My problem is that this formula does not seem reliable for my system: numDocs should correspond to the number of documents in the user's collection, and docFreq to the number of documents in the user's collection that contain term t.

Because terms are stored as single tokens, I was thinking of concatenating each term with a UID in order to separate them, because the term "car" for user1 is different from the term "car" for user2. My solution would index "carUSERUID1" and "carUSERUID2".
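To make the concern concrete, here is a minimal sketch in plain Java (no Lucene classes are used). It applies the formula quoted above to three cases: collection-wide counts, per-user counts, and the concatenated-term idea. All counts, the class name PerUserIdfSketch, and the helper userTerm are invented for illustration. The third case assumes that suffixing the term only narrows docFreq, while numDocs still counts every document in the shared index.

```java
public class PerUserIdfSketch {

    // The IDF formula quoted above, with a float result.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Hypothetical helper: builds the per-user token, e.g. "car" + "USERUID1".
    static String userTerm(String term, String uid) {
        return term + uid;
    }

    public static void main(String[] args) {
        // Made-up counts: 1,000,000 documents in the shared index,
        // 50,000 of them contain "car".
        int totalDocs = 1_000_000;
        int globalDocFreq = 50_000;

        // Made-up counts for one user: 20 documents, 15 contain "car".
        int userDocs = 20;
        int userDocFreq = 15;

        // IDF as computed over the whole shared index.
        float global = idf(globalDocFreq, totalDocs);

        // IDF the user's own collection would give.
        float perUser = idf(userDocFreq, userDocs);

        // Concatenated term: docFreq is per-user, but numDocs is
        // assumed to remain the size of the whole index.
        float concatenated = idf(userDocFreq, totalDocs);

        System.out.println("term indexed as:  " + userTerm("car", "USERUID1"));
        System.out.println("global idf:       " + global);
        System.out.println("per-user idf:     " + perUser);
        System.out.println("concatenated idf: " + concatenated);
    }
}
```

Under these assumptions the concatenated term gets a per-user docFreq, but its IDF is still computed against the size of the whole index, so it does not by itself reproduce the per-user value.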


What would you suggest?

Regards,

Lionel
