Hi!

In a project where we want to use Lucene, we are running into performance problems with regard to building filter sets.
Let me give you a quick overview of what we need to do:
We are indexing information about users (the index ranges from 2 to 10 million documents). Each of those documents has a unique id (a primary key from a database). Indexing and querying work fine and are really fast, so that is not the issue. We don't need to store much information in the Lucene index; we are just interested in retrieving those unique ids via Lucene queries.
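
For concreteness, a minimal sketch of what our indexing amounts to (the field names "uid" and "profile" and the analyzer are placeholders for our actual setup; written against the 1.9-style Field API):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class UserIndexer {
    // One document per user; only the unique id ("uid") is stored.
    public static void indexUser(IndexWriter writer, String userId, String profileText)
            throws IOException {
        Document doc = new Document();
        // Untokenized so the primary key can be looked up as an exact term.
        doc.add(new Field("uid", userId, Field.Store.YES, Field.Index.UN_TOKENIZED));
        // The searchable content is indexed but not stored.
        doc.add(new Field("profile", profileText, Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
    }
}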

The problem arises with a search option we have. All of the users (the documents we index) are "connected" to certain other users, in a network fashion. We must be able to restrict the query (or filter it after searching the complete index) to certain "levels of connectedness", i.e. you can search within, say, three hops of yourself. We compute a list of user ids which are in the set of the applicable "contacts". This connection information cannot be stored in the index, as it changes often and is expensive to compute in advance.

The problem now is constructing the filter sets, because we cannot cache the document numbers of the documents in the Lucene index. Building BooleanQuery objects to programmatically include in the query is out of the question, since the "filter set" can be enormous (we have instances of the set containing 500,000 to 1 million ids; on average they will be much smaller, but they are not trivially small in any case).
Creating the filter involves querying for each document by our internal unique id, and this is the primary bottleneck: for large sets it takes around 35 seconds to build the bit set, several orders of magnitude longer than the search itself.
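
Roughly, the construction looks like this (a sketch against the 1.9-style Filter API, where bits() hands back a BitSet; "uid" is our id field from above):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

// One termDocs() lookup per contact id; this loop is the bottleneck.
public class ContactFilter extends Filter {
    private final String[] uids;

    public ContactFilter(String[] uids) {
        this.uids = uids;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        for (int i = 0; i < uids.length; i++) {
            // For large sets this adds up to the ~35 seconds mentioned above.
            TermDocs td = reader.termDocs(new Term("uid", uids[i]));
            try {
                while (td.next()) {
                    bits.set(td.doc());
                }
            } finally {
                td.close();
            }
        }
        return bits;
    }
}

The result is then passed along via Searcher.search(query, filter).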

So what we want to do is cache the filters once they are created. Since the document ids will not be the same across the Lucene servers we'll be using, we can only cache the filters per server, which is a big performance loss. We also cannot reasonably control which Lucene server a request will hit (no session persistence). Caching the document ids externally is also out of the question, because they won't be the same across multiple servers and they can change at any time.
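
What we have in mind is something like the following (hypothetical sketch, reusing the ContactFilter from above; the key would be a digest of the sorted uid set, and the whole cache has to be thrown away whenever the reader is reopened, since merges renumber documents):

import java.io.IOException;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;

// Hypothetical per-server filter cache, valid only for one IndexReader.
public class FilterCache {
    private final Map cache = new HashMap(); // String key -> BitSet

    public synchronized BitSet get(String key, IndexReader reader, String[] uids)
            throws IOException {
        BitSet bits = (BitSet) cache.get(key);
        if (bits == null) {
            bits = new ContactFilter(uids).bits(reader); // the expensive step
            cache.put(key, bits);
        }
        return bits;
    }

    // Must be called whenever the reader is reopened: document numbers are
    // not stable across segment merges, so the cached bit sets go stale.
    public synchronized void clear() {
        cache.clear();
    }
}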

That's the gist of the problem. To further complicate matters, we have quite a large number of updates per second on that index, so
I expect segment merges to happen quite often.

We have discussed these potential solutions to this problem:

- Ideally we would use our internal unique ids as document ids. I know that the document number is essentially the offset into the index file. This would potentially waste space; I'm aware of that, but let's just assume that doesn't concern us right now. One problem to solve would be the segment files, and I haven't got a solution for that right now. My question for this solution: where, in general terms, would I have to look to change the meaning of the document id, code-wise? I have identified a couple of spots, but would like to get some further advice from people who are way more familiar with the code ;)

- To keep the document ids from changing we could prevent segment merging. I'm not concerned with optimizing indices; that can be done offline, and I'm prepared to rebuild the caches afterwards. What would be a ballpark figure for the query-time degradation? The code changes seem obvious to me, or are there more places I'd need to touch other than maybeMergeSegments? (I sketch a non-invasive approximation below this list.)

- Somewhat related is the question why a document cannot be updated "in place", i.e. what is the reason for the "delete/insert document" paradigm? I assume it is because the field list isn't necessarily the same length for each document, right? From all I have seen, it stems from the file format used, not from something inherent to the idea behind Lucene?
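
Regarding the second point: before touching maybeMergeSegments itself, a non-invasive first approximation I was considering is simply raising the merge thresholds so high that merges practically never trigger between our offline optimize() runs (a sketch, assuming the 1.9-style setters and untested at our scale):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class NoMergeConfig {
    public static IndexWriter openWriter(String path) throws IOException {
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), false);
        // With 10000 segments per level required before a merge kicks in,
        // maybeMergeSegments practically never fires for our update rates.
        writer.setMergeFactor(10000);
        // Still flush buffered documents to disk regularly.
        writer.setMaxBufferedDocs(1000);
        return writer;
    }
}

The flip side is exactly my question above: this leaves hundreds of small segments on disk, and I have no feel for what that does to query time.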

To round it off, I'm prepared to do some heavy modification of the Lucene codebase to make this work, so even if you have suggestions that
would require that, I'm more than happy to listen ;)

This is probably a hard question to ask, and might not be clear enough, so if I'm not making sense please tell me :)

Thanks in advance,

Kay

--
Kay Röpke
http://classdump.org/




