Hi all,
>>Re scalability of filter construction - the database is likely to hold stable
>>primary keys, not Lucene doc ids, which are unstable in the face of updates.
This is the scalability issue I was concerned about. Assume the database call
efficiently retrieves a sorted array of 50,000 stable primary keys. What is
the best way to convert that list of primary keys to Lucene docIds
efficiently? I was looking at the Lucene in Action example code (which was
not designed for this use case), where the Lucene docId is retrieved by
iteratively calling TermDocs.read(). How expensive is this operation? Would
50,000 calls return in a few seconds or less?
int[] docs = new int[1];   // receives the matching docId
int[] freqs = new int[1];  // receives the term frequency
for (String isbn : isbns) {
    if (isbn != null) {
        TermDocs termDocs = reader.termDocs(new Term("isbn", isbn));
        int count = termDocs.read(docs, freqs);
        if (count == 1) {
            bits.set(docs[0]);
        }
        termDocs.close();
    }
}
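One variation I was wondering about (a rough, untested sketch - it assumes
the isbns array is sorted, and reuses a single unpositioned TermDocs via
seek() instead of opening a new one per key):

int[] docs = new int[1];
int[] freqs = new int[1];
TermDocs termDocs = reader.termDocs();  // unpositioned; reused for every key
try {
    for (String isbn : isbns) {
        if (isbn == null) continue;
        termDocs.seek(new Term("isbn", isbn));  // sorted keys => forward seeks
        int count = termDocs.read(docs, freqs);
        if (count == 1) {
            bits.set(docs[0]);
        }
    }
} finally {
    termDocs.close();
}

Would that meaningfully cut the per-key cost, or is the term dictionary
lookup the dominant expense either way?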
>>That could involve a lot of disk seeks unless you cache a pk->docid lookup
>>in RAM.
That sounds interesting. How would the pk->docid lookup get populated?
Wouldn't a pk->docid cache be invalidated with each commit or merge?
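To make sure I understand the suggestion, here is my guess at how such a
lookup might be built (an untested sketch against the TermEnum/TermDocs API;
imports from java.util and org.apache.lucene.index assumed). I imagine the
map would be keyed by IndexReader instance, so it gets rebuilt whenever the
reader is reopened after a commit or merge:

Map<String, Integer> buildPkToDocId(IndexReader reader) throws IOException {
    Map<String, Integer> lookup = new HashMap<String, Integer>();
    TermEnum terms = reader.terms(new Term("isbn", ""));  // first isbn term
    TermDocs termDocs = reader.termDocs();
    try {
        do {
            Term t = terms.term();
            if (t == null || !"isbn".equals(t.field())) break;  // left field
            termDocs.seek(terms);
            if (termDocs.next()) {
                lookup.put(t.text(), termDocs.doc());  // assumes unique isbn
            }
        } while (terms.next());
    } finally {
        terms.close();
        termDocs.close();
    }
    return lookup;
}

Is that roughly the idea?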
Tom
-----Original Message-----
From: Mark Harwood [mailto:[email protected]]
Sent: Friday, July 23, 2010 2:56 AM
To: [email protected]
Subject: Re: on-the-fly "filters" from docID lists
Re scalability of filter construction - the database is likely to hold stable
primary keys, not Lucene doc ids, which are unstable in the face of updates.
You therefore need a quick way of converting stable database keys read from
the db into current Lucene doc ids to create the filter. That could involve a
lot of disk seeks unless you cache a pk->docid lookup in RAM. You should use
CachingWrapperFilter too, to cache the computed user permissions from one
search to the next.
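E.g. something like this (a sketch - MyPermissionsFilter stands in for
however you build the Filter from the pk lookup):

Filter perms = new CachingWrapperFilter(new MyPermissionsFilter(user));
TopDocs hits = searcher.search(query, perms, 10);

CachingWrapperFilter caches the computed doc id set per IndexReader, so the
permissions are only recomputed when the reader is reopened.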
This can get messy. If the access permissions are centred around
roles/groups, it is normally faster to tag docs with these group names and
query them with the list of roles the user holds.
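Something along these lines (sketch only - the "acl" field name and the role
list are placeholders):

// Index time: tag each doc with the groups allowed to read it.
doc.add(new Field("acl", groupName, Field.Store.NO,
                  Field.Index.NOT_ANALYZED));

// Search time: filter by the roles the current user holds.
BooleanQuery acl = new BooleanQuery();
for (String role : userRoles) {
    acl.add(new TermQuery(new Term("acl", role)), BooleanClause.Occur.SHOULD);
}
Filter aclFilter = new CachingWrapperFilter(new QueryWrapperFilter(acl));
TopDocs hits = searcher.search(query, aclFilter, 10);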
If individual user-doc-level perms are required, you could also consider
dynamically looking up perms for just the top n results being shown, at the
risk of needing to repeat the query with a larger n if insufficient matches
pass the lookup.
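Roughly (sketch - permissionService, user and the pk field name are
placeholders for your own access-check machinery):

int want = 10;   // results the page actually needs
int n = 50;      // initial over-fetch
List<Integer> allowed = new ArrayList<Integer>();
while (true) {
    TopDocs hits = searcher.search(query, n);
    allowed.clear();
    for (ScoreDoc sd : hits.scoreDocs) {
        String pk = searcher.doc(sd.doc).get("isbn");
        if (permissionService.canRead(user, pk)) {  // one lookup per hit
            allowed.add(sd.doc);
            if (allowed.size() == want) break;
        }
    }
    if (allowed.size() == want || hits.totalHits <= n) break;
    n *= 2;  // insufficient matches passed the lookup - widen and repeat
}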
Cheers
Mark