: So what we want to do is to cache the filters, once created. Since
: the document ids would not be the same across the Lucene servers
: we'll be using, we can only cache the filters per server, which is
: a big performance loss. We also cannot reasonably control on which
: Lucene server the request will be made (no persistence). Caching
: the document ids externally is also out of the question because
: they won't be the same across multiple servers and they can
: possibly change at any time.
If you're talking about multiple identical servers used for load balancing, then there is no reason why those indexes wouldn't be kept in sync (the merge model is deterministic, so if you apply the same operations to every server in the same order, you'll get identical indexes) ... so you could conceivably use a shared cache -- if you aren't talking about identical copies of the index, then I'm not sure I understand your situation -- since it would seem you'd need something to ensure that the query got sent to the *right* server, and that would imply that subsequent requests would be able to take advantage of the cache.

: That's the gist of the problem. To further complicate matters, we
: have quite a large number of updates per second on that index, so
: I expect segment merges to happen quite often.

Practically speaking, there is a difference between the rate at which updates come in and the rate at which you have to make updates available to your users ... reopening an IndexReader no more frequently than once every X minutes -- and not reopening at all unless the index has changed -- can drastically improve your ability to cache things effectively.

: - Ideally we would use our internal unique ids as document ids. I
: know that the document number essentially is the offset into the
: index file. This would potentially waste space, I'm aware of that,
: but let's just assume that this doesn't

I'm not a fileformat guy, I don't really get into those internals much, but the document number isn't "essentially" the offset -- it *is* the offset -- each segment maintains its own list of documents, numbered from 0 up to the size of the segment, and the document number for a given doc is the number used to identify it in its segment, plus the offset of that segment in the index as a whole -- I don't believe the document number you use when calling reader.document(docId) is ever stored in the index (unless your index happens to have a single segment).

: My question for this solution would be: Where, in general terms,
: would I have to look to change the meaning of the document id,
: codewise?

Don't take this the wrong way, but I read that question the same way I would read "Where, in general terms, would I have to look to change the meaning of integer addition, codewise?" ... there is no DocumentNumberFactory, no getNextDocId() ... the concept of document ids is so ingrained and pervasive I can't even imagine what it would take to change it.

: - To keep the document ids from changing we could prevent segment
: merging - I'm not concerned with optimizing indices, this can be
: done offline, and I'm prepared to build the caches after that.
: What would be the ballpark figure for query time degradation,
: approximately? The code changes are obvious, I think, or are there
: more places I'd need to touch, other than maybeMergeSegments?

As I recall, every time you open a new IndexWriter (which you would need to do frequently, since you have to close your old IndexWriter for your multiple updates per second to be visible) a new segment is created ... if you really manage to completely eliminate segment merges, you are talking about potentially having 100,000 segments after 24 hours ... I think the search performance cost there would pretty significantly outweigh your current Filter building costs ... but that's just speculation, since I've never seen an index with that many segments -- at least not on a machine that was actually *functioning* :)
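To make the "reopen at most once every X minutes" idea concrete, here's a rough sketch of a throttled reader provider -- the class name, the interval, and the refresh policy are all made up for the example; the only Lucene calls it relies on are IndexReader.open() and IndexReader.isCurrent():

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.store.Directory;

  /**
   * Hands out a shared IndexReader, reopening it at most once every
   * REFRESH_INTERVAL_MS, and then only if the index actually changed.
   */
  public class ThrottledReaderProvider {
    private static final long REFRESH_INTERVAL_MS = 5 * 60 * 1000; // "X minutes"

    private final Directory dir;
    private IndexReader reader;
    private long lastCheck = 0;

    public ThrottledReaderProvider(Directory dir) throws IOException {
      this.dir = dir;
      this.reader = IndexReader.open(dir);
    }

    public synchronized IndexReader getReader() throws IOException {
      long now = System.currentTimeMillis();
      if (now - lastCheck >= REFRESH_INTERVAL_MS) {
        lastCheck = now;
        // isCurrent() goes false once a writer has committed changes
        if (!reader.isCurrent()) {
          reader.close(); // NOTE: unsafe if searches are still running;
                          // a real version would reference-count readers
          reader = IndexReader.open(dir);
        }
      }
      return reader;
    }
  }

Every cached Filter result tied to the old reader stays valid until you actually swap readers, which is the whole point.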
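And once reopening is throttled like that, the filter cache itself can be keyed on the reader instance, so cached bits live exactly as long as the reader they were computed against. A minimal sketch in the old Filter.bits() style -- the PerReaderCachingFilter name is mine, but Lucene's own CachingWrapperFilter works along these lines:

  import java.io.IOException;
  import java.util.BitSet;
  import java.util.Map;
  import java.util.WeakHashMap;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.Filter;

  /**
   * Wraps another Filter and caches its BitSet per IndexReader; the
   * WeakHashMap lets entries disappear once a reader is closed and
   * garbage collected.
   */
  public class PerReaderCachingFilter extends Filter {
    private final Filter wrapped;
    private final Map cache = new WeakHashMap(); // IndexReader -> BitSet

    public PerReaderCachingFilter(Filter wrapped) {
      this.wrapped = wrapped;
    }

    public synchronized BitSet bits(IndexReader reader) throws IOException {
      BitSet cached = (BitSet) cache.get(reader);
      if (cached == null) {
        cached = wrapped.bits(reader); // pay the build cost once per reader
        cache.put(reader, cached);
      }
      return cached;
    }
  }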
: - Somewhat related is the question why a document cannot be updated
: "in place", i.e. what is the reason for the "delete/insert document"
: paradigm? I assume it is because the field list isn't necessarily
: the same length for each document, right? From all I have seen, it
: stems from the file format used, not something that is inherent to
: the idea behind Lucene?

It's inherent to the idea of an inverted index -- there is no single document stored anywhere, the indexed terms are stored and they reference the list of docIds (as offsets in the current segment) that they appear in ... deleting a document just adds the docId to a list of documents to be ignored (because they are deleted) when searches are performed ... if you really wanted to update in place, you would need to scan every term, see if it was mapped to the doc, and then zero out that mapping, and then append the list of <term,doc> mappings for the terms that are in the updated doc -- which would take a lot longer than the current delete/insert approach.

-Hoss
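PS: to make the delete/insert point concrete, the usual way to "update" a document is to delete by a unique key term and re-add the new version. A rough sketch, assuming a Lucene version that has IndexWriter.updateDocument (which does exactly that delete+add internally) and using a "uid" field name that's purely made up for the example:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.store.Directory;

  public class Updater {
    // "Updates" the document whose uid field matches: under the hood
    // this deletes the old doc and appends the new one, it does not
    // modify anything in place.
    public static void update(Directory dir, String uid, Document newDoc)
        throws Exception {
      IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
      try {
        writer.updateDocument(new Term("uid", uid), newDoc);
      } finally {
        writer.close();
      }
    }
  }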