On 14. Jan 2007, at 7:10, Chris Hostetter wrote:

if you're talking about multiple identical servers used for load
balancing, then there is no reason why those indexes wouldn't be kept in
sync (the merge model is deterministic, so if you apply the same
operations to every server in the same order, you'll get identical
indexes) ... so you could conceivably use a shared cache -- if you aren't
talking about identical copies of the index, then I'm not sure I
understand your situation -- since it would seem you'd need something to
ensure that the query got sent to the *right* server, and that would imply
that subsequent requests would be able to take advantage of the cache.

In principle the servers would have the exact same index, but it might be
possible for them to be out of sync temporarily, say when one or more
servers are down for one reason or another. They would be restarted and
then would be fed the changes they missed. Thus I could end up with
different indices across the servers. This is pretty much like database
replication.
We have thought about doing some kind of connection persistence like you
suggest, but that has some nasty production-level problems. We could
probably deal with them, but I would rather avoid that.

practically speaking, there is a difference between the rate at which
updates come in, and the rate at which you have to make updates available
to your users ... reopening an IndexReader no more frequently than once
every X minutes -- and not reopening at all unless the index has changed --
can drastically improve your ability to cache things effectively.

mmh, that sounds like something that I have missed ;) I'll have to do some
testing to see what reopening interval I would end up with while still
getting decent cache usage. Thanks for the refresher!
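
Just so I don't lose it: a minimal sketch of what I'm planning to test,
against the Lucene 2.x API (the holder class, its names, and the
five-minute interval are all made up, and real code would need reference
counting before closing the old reader):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.Directory;

    // Hold one shared reader; consider replacing it at most every
    // REOPEN_INTERVAL_MS, and even then only if the index has changed.
    public class ReaderHolder {
        private static final long REOPEN_INTERVAL_MS = 5 * 60 * 1000;

        private final Directory dir;
        private IndexReader reader;
        private long lastCheck = 0;

        public ReaderHolder(Directory dir) throws IOException {
            this.dir = dir;
            this.reader = IndexReader.open(dir);
        }

        public synchronized IndexReader getReader() throws IOException {
            long now = System.currentTimeMillis();
            if (now - lastCheck >= REOPEN_INTERVAL_MS) {
                lastCheck = now;
                if (!reader.isCurrent()) {          // index unchanged -> keep reader and its caches
                    IndexReader old = reader;
                    reader = IndexReader.open(dir); // caches keyed on the old reader expire here
                    old.close();                    // unsafe while searches are in flight; refcounting omitted
                }
            }
            return reader;
        }
    }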

I'm not a file format guy, I don't really get into those internals much,
but the document number isn't "essentially" the offset -- it *is* the
offset -- each segment maintains its own list of documents from 0 to
segment size - 1, and the document number for a given doc is the number
used to identify it in its segment, plus the offset of that segment in the
index as a whole -- I don't believe the document number you use when
calling reader.document(docId) is ever stored in the index (unless your
index happens to have a single segment).

Ok, then I haven't read the code the wrong way :)
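
For concreteness, the arithmetic as I now understand it, with made-up
segment sizes:

    segment 0: 10 docs, base  0  ->  global docIds  0..9
    segment 1: 25 docs, base 10  ->  global docIds 10..34
    segment 2: 40 docs, base 35  ->  global docIds 35..74

So the doc that segment 2 knows internally as 5 is reader.document(35 + 5),
i.e. reader.document(40).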


:    My question for this solution would be: Where, in general terms,
: would I have to look to change the meaning of the document id, codewise?

Don't take this the wrong way, but I read that question the same way I
would read "Where, in general terms, would I have to look to change the
meaning of integer addition, codewise?" ... there is no
DocumentNumberFactory, no getNextDocId() ... the concept of document ids
is so ingrained and pervasive I can't even imagine what it would take to
change it.

I get it ;) Of course I know that there isn't a DocumentNumberFactory, but
at some point it must be decided which document number a document will
have at a given point in time, right? That decision doesn't have to be
spelled out in code, and indeed it is very implicit, but if I were able to
hook into it and record how the changing document numbers relate to our
unique ids, I could look up the docId from that table without enumerating
terms.
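
For reference, the per-document lookup I'd like to avoid looks roughly
like this today (a sketch against the Lucene 2.x API; the "uid" field name
is our own convention, not anything built into Lucene):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    // Map our application-level unique id to the reader-specific docId.
    // Returns -1 if the uid isn't present in this reader.
    static int docIdForUid(IndexReader reader, String uid) throws IOException {
        TermDocs td = reader.termDocs(new Term("uid", uid));
        try {
            return td.next() ? td.doc() : -1;
        } finally {
            td.close();
        }
    }

The hypothetical hook would let me replace that term lookup with a plain
table lookup that gets patched whenever merges renumber documents.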

As I recall, every time you open a new IndexWriter (which you would need
to do frequently, since you have to close your old IndexWriter for your
multiple updates a second to be visible) a new segment is created ... if
you really manage to completely eliminate segment merges you are talking
about potentially having 100,000 segments after 24 hours (a day is 86,400
seconds, so even one new writer per second puts you in that ballpark) ...
I think the search performance cost there would pretty significantly
outweigh your current Filter building costs ... but that's just
speculation, since I've never seen an index with that many segments -- at
least not on a machine that was actually *functioning* :)

Ok, that confirms my suspicions. I was afraid it would seriously affect
query time and general performance (number of open files). So that
solution is out.

it's inherent to the idea of an Inverted Index -- there is no single
document stored anywhere, the indexed terms are stored and they reference
the list of docIds (as offsets in the current segment) that they appear in
... deleting a document just adds the docId to a list of documents to be
ignored (because they are deleted) when searches are performed ... if you
really wanted to update in place, you would need to scan every term, see
if it was mapped to the doc, and then zero out that mapping, and then
append the list of <term,doc> mappings for the terms that are in the
updated doc -- which would take a lot longer than the current
delete/update approach.

Ok, it was just that I couldn't find the explanation in the archives.
I agree, it would probably be a performance nightmare to do updates that way.
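
So we'll stick with the usual two-step update. For the record, roughly
what that looks like against the Lucene 2.0 API (the "uid" field is again
our own unique-id convention, and dir and newDoc are assumed to be in
scope):

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;

    // "Update" = flag the old doc as deleted, then add the new version.
    // The new version gets a brand new docId; nothing is rewritten in place.
    static void updateDocument(Directory dir, String uid, Document newDoc)
            throws IOException {
        IndexReader reader = IndexReader.open(dir);
        try {
            reader.deleteDocuments(new Term("uid", uid)); // only marks the old docId deleted
        } finally {
            reader.close();                               // flushes the deletion
        }
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
        try {
            writer.addDocument(newDoc);                   // re-added doc gets a new docId
        } finally {
            writer.close();
        }
    }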


Thanks very much for your input!

cheers,

-k

--
Kay Röpke
http://classdump.org/




