Hi!

In a project where we want to use Lucene, we are running into performance problems with regard to building filter sets.
Let me give you a quick overview of what we need to do:
We are indexing information about users (the index ranges from 2 to 10 million documents). Each of those documents has a unique id (a primary key from a database). Indexing and querying work fine and are really fast, so that is not the issue. We don't need to store much information in the Lucene index; we are just interested in retrieving those unique ids via Lucene queries.
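
For concreteness, a minimal sketch of what our indexing amounts to (the field names "uid" and "profile" and the analyzer are placeholders for our actual setup; written against the 1.9-style Field API):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class UserIndexer {
    // One document per user; only the unique id ("uid") is stored.
    public static void indexUser(IndexWriter writer, String userId, String profileText)
            throws IOException {
        Document doc = new Document();
        // Untokenized so the primary key can be looked up as an exact term.
        doc.add(new Field("uid", userId, Field.Store.YES, Field.Index.UN_TOKENIZED));
        // The searchable content is indexed but not stored.
        doc.add(new Field("profile", profileText, Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
    }
}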

The problem arises with a search option we have. All of the users (the documents we index) are "connected" to certain other users, in a network fashion. We must be able to restrict the query (or filter it after searching the complete index) to certain "levels of connectedness", i.e. you can search within, say, three hops of yourself. We compute a list of user ids which are in the set of the applicable "contacts". This connection information cannot be stored in the index, as it changes often and is expensive to compute in advance.

The problem now is constructing the filter sets, because we cannot cache the document numbers of the documents in the Lucene index. Building BooleanQuery objects to programmatically include in the query is out of the question, since the "filter set" can be enormous (we have instances of the set containing 500,000 to 1 million ids; on average they will be much smaller, but they are not trivially small in any case).
Creating the filter involves querying for each document by our internal unique id, and this is the primary bottleneck: for large sets it takes around 35 seconds to build the bit set, several orders of magnitude longer than the search itself.
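
Roughly, the construction looks like this (a sketch against the 1.9-style Filter API, where bits() hands back a BitSet; "uid" is our id field from above):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

// One termDocs() lookup per contact id; this loop is the bottleneck.
public class ContactFilter extends Filter {
    private final String[] uids;

    public ContactFilter(String[] uids) {
        this.uids = uids;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        for (int i = 0; i < uids.length; i++) {
            // For large sets this adds up to the ~35 seconds mentioned above.
            TermDocs td = reader.termDocs(new Term("uid", uids[i]));
            try {
                while (td.next()) {
                    bits.set(td.doc());
                }
            } finally {
                td.close();
            }
        }
        return bits;
    }
}

The result is then passed along via Searcher.search(query, filter).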

So what we want to do is cache the filters once they are created. Since the document ids will not be the same across the Lucene servers we'll be using, we can only cache the filters per server, which is a big performance loss. We also cannot reasonably control which Lucene server a request will hit (no session persistence). Caching the document ids externally is also out of the question, because they won't be the same across multiple servers and they can change at any time.
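
What we have in mind is something like the following (hypothetical sketch, reusing the ContactFilter from above; the key would be a digest of the sorted uid set, and the whole cache has to be thrown away whenever the reader is reopened, since merges renumber documents):

import java.io.IOException;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;

// Hypothetical per-server filter cache, valid only for one IndexReader.
public class FilterCache {
    private final Map cache = new HashMap(); // String key -> BitSet

    public synchronized BitSet get(String key, IndexReader reader, String[] uids)
            throws IOException {
        BitSet bits = (BitSet) cache.get(key);
        if (bits == null) {
            bits = new ContactFilter(uids).bits(reader); // the expensive step
            cache.put(key, bits);
        }
        return bits;
    }

    // Must be called whenever the reader is reopened: document numbers are
    // not stable across segment merges, so the cached bit sets go stale.
    public synchronized void clear() {
        cache.clear();
    }
}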

That's the gist of the problem. To further complicate matters, we have quite a large number of updates per second on that index, so
I expect segment merges to happen quite often.

We have discussed these potential solutions to this problem:

- Ideally we would use our internal unique ids as document ids. I know that the document number is essentially the offset into the index file. This would potentially waste space; I'm aware of that, but let's just assume that doesn't concern us right now. One problem to solve would be the segment files, and I haven't got a solution for that right now. My question for this solution: where, in general terms, would I have to look to change the meaning of the document id, code-wise? I have identified a couple of spots, but would like to get some further advice from people who are way more familiar with the code ;)

- To keep the document ids from changing we could prevent segment merging. I'm not concerned with optimizing indices; that can be done offline, and I'm prepared to rebuild the caches afterwards. What would be a ballpark figure for the query-time degradation? The code changes seem obvious to me, or are there more places I'd need to touch other than maybeMergeSegments? (I sketch a non-invasive approximation below this list.)

- Somewhat related is the question why a document cannot be updated "in place", i.e. what is the reason for the "delete/insert document" paradigm? I assume it is because the field list isn't necessarily the same length for each document, right? From all I have seen, it stems from the file format used, not from something inherent to the idea behind Lucene?
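
Regarding the second point: before touching maybeMergeSegments itself, a non-invasive first approximation I was considering is simply raising the merge thresholds so high that merges practically never trigger between our offline optimize() runs (a sketch, assuming the 1.9-style setters and untested at our scale):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class NoMergeConfig {
    public static IndexWriter openWriter(String path) throws IOException {
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), false);
        // With 10000 segments per level required before a merge kicks in,
        // maybeMergeSegments practically never fires for our update rates.
        writer.setMergeFactor(10000);
        // Still flush buffered documents to disk regularly.
        writer.setMaxBufferedDocs(1000);
        return writer;
    }
}

The flip side is exactly my question above: this leaves hundreds of small segments on disk, and I have no feel for what that does to query time.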

To round it off, I'm prepared to do some heavy modification of the Lucene codebase to make this work, so even if you have suggestions that
would require that, I'm more than happy to listen ;)

This is probably a hard question to ask, and might not be clear enough, so if I'm not making sense please tell me :)

Thanks in advance,

Kay

--
Kay Röpke
http://classdump.org/




