: So what we want to do is to cache the filters, once created. Since
: the document ids would not be the same across the Lucene servers
: we'll be using, we can only cache the filters per server, which is
: a big performance loss. We also cannot reasonably control on which
: Lucene server the request will be made (no persistence). Caching
: the document ids externally is also out of the question because
: they won't be the same across multiple servers and they can
: possibly change at any time.
If you're talking about multiple identical servers used for load balancing, then there is no reason why those indexes wouldn't be kept in sync (the merge model is deterministic, so if you apply the same operations to every server in the same order, you'll get identical indexes) ... so you could conceivably use a shared cache -- if you aren't talking about identical copies of the index, then I'm not sure I understand your situation -- since it would seem you'd need something to ensure that the query got sent to the *right* server, and that would imply that subsequent requests would be able to take advantage of the cache.

: That's the gist of the problem. To further complicate matters, we
: have quite a large number of updates per second on that index, so
: I expect segment merges to happen quite often.

Practically speaking, there is a difference between the rate at which updates come in and the rate at which you have to make updates available to your users ... reopening an IndexReader no more frequently than once every X minutes -- and not reopening at all unless the index has changed -- can drastically improve your ability to cache things effectively.

: - Ideally we would use our internal unique ids as document ids. I
: know that the document number essentially is the offset into the
: index file. This would potentially waste space, I'm aware of that,
: but let's just assume that this doesn't

I'm not a fileformat guy, I don't really get into those internals much, but the document number isn't "essentially" the offset -- it *is* the offset -- each segment maintains its own list of documents, numbered from 0 up to the size of the segment, and the document number for a given doc is the number used to identify it in its segment, plus the offset of that segment in the index as a whole -- I don't believe the document number you use when calling reader.document(docId) is ever stored in the index (unless your index happens to have a single segment).

: My question for this solution would be: Where, in general terms,
: would I have to look to change the meaning of the document id,
: codewise?

Don't take this the wrong way, but I read that question the same way I would read "Where, in general terms, would I have to look to change the meaning of integer addition, codewise?" ... there is no DocumentNumberFactory, no getNextDocId() ... the concept of document ids is so ingrained and pervasive I can't even imagine what it would take to change it.

: - To keep the document ids from changing we could prevent segment
: merging - I'm not concerned with optimizing indices, this can be
: done offline, and I'm prepared to build the caches after that.
: What would be the ballpark figure for query time degradation,
: approximately? The code changes are obvious, I think, or are there
: more places I'd need to touch, other than maybeMergeSegments?

As I recall, every time you open a new IndexWriter (which you would need to do frequently, since you have to close your old IndexWriter for your multiple updates per second to be visible) a new segment is created ... if you really manage to completely eliminate segment merges, you are talking about potentially having 100,000 segments after 24 hours ... I think the search performance cost there would pretty significantly outweigh your current Filter building costs ... but that's just speculation, since I've never seen an index with that many segments -- at least not on a machine that was actually *functioning* :)
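To make the "reopen at most once every X minutes" idea concrete, here's a rough sketch of a throttled reader provider -- the class name, the interval, and the refresh policy are all made up for the example; the only Lucene calls it relies on are IndexReader.open() and IndexReader.isCurrent():

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.store.Directory;

  /**
   * Hands out a shared IndexReader, reopening it at most once every
   * REFRESH_INTERVAL_MS, and then only if the index actually changed.
   */
  public class ThrottledReaderProvider {
    private static final long REFRESH_INTERVAL_MS = 5 * 60 * 1000; // "X minutes"

    private final Directory dir;
    private IndexReader reader;
    private long lastCheck = 0;

    public ThrottledReaderProvider(Directory dir) throws IOException {
      this.dir = dir;
      this.reader = IndexReader.open(dir);
    }

    public synchronized IndexReader getReader() throws IOException {
      long now = System.currentTimeMillis();
      if (now - lastCheck >= REFRESH_INTERVAL_MS) {
        lastCheck = now;
        // isCurrent() goes false once a writer has committed changes
        if (!reader.isCurrent()) {
          reader.close(); // NOTE: unsafe if searches are still running;
                          // a real version would reference-count readers
          reader = IndexReader.open(dir);
        }
      }
      return reader;
    }
  }

Every cached Filter result tied to the old reader stays valid until you actually swap readers, which is the whole point.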
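And once reopening is throttled like that, the filter cache itself can be keyed on the reader instance, so cached bits live exactly as long as the reader they were computed against. A minimal sketch in the old Filter.bits() style -- the PerReaderCachingFilter name is mine, but Lucene's own CachingWrapperFilter works along these lines:

  import java.io.IOException;
  import java.util.BitSet;
  import java.util.Map;
  import java.util.WeakHashMap;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.Filter;

  /**
   * Wraps another Filter and caches its BitSet per IndexReader; the
   * WeakHashMap lets entries disappear once a reader is closed and
   * garbage collected.
   */
  public class PerReaderCachingFilter extends Filter {
    private final Filter wrapped;
    private final Map cache = new WeakHashMap(); // IndexReader -> BitSet

    public PerReaderCachingFilter(Filter wrapped) {
      this.wrapped = wrapped;
    }

    public synchronized BitSet bits(IndexReader reader) throws IOException {
      BitSet cached = (BitSet) cache.get(reader);
      if (cached == null) {
        cached = wrapped.bits(reader); // pay the build cost once per reader
        cache.put(reader, cached);
      }
      return cached;
    }
  }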
: - Somewhat related is the question why a document cannot be updated
: "in place", i.e. what is the reason for the "delete/insert document"
: paradigm? I assume it is because the field list isn't necessarily
: the same length for each document, right? From all I have seen, it
: stems from the file format used, not something that is inherent to
: the idea behind Lucene?

It's inherent to the idea of an inverted index -- there is no single document stored anywhere, the indexed terms are stored and they reference the list of docIds (as offsets in the current segment) that they appear in ... deleting a document just adds the docId to a list of documents to be ignored (because they are deleted) when searches are performed ... if you really wanted to update in place, you would need to scan every term, see if it was mapped to the doc, and then zero out that mapping, and then append the list of <term,doc> mappings for the terms that are in the updated doc -- which would take a lot longer than the current delete/insert approach.

-Hoss
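PS: to make the delete/insert point concrete, the usual way to "update" a document is to delete by a unique key term and re-add the new version. A rough sketch, assuming a Lucene version that has IndexWriter.updateDocument (which does exactly that delete+add internally) and using a "uid" field name that's purely made up for the example:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.store.Directory;

  public class Updater {
    // "Updates" the document whose uid field matches: under the hood
    // this deletes the old doc and appends the new one, it does not
    // modify anything in place.
    public static void update(Directory dir, String uid, Document newDoc)
        throws Exception {
      IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
      try {
        writer.updateDocument(new Term("uid", uid), newDoc);
      } finally {
        writer.close();
      }
    }
  }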