On 14. Jan 2007, at 7:10, Chris Hostetter wrote:

if you're talking about multiple identical servers used for load
balancing, then there is no reason why those indexes wouldn't be kept in
sync (the merge model is deterministic, so if you apply the same
operations to every server in the same order, you'll get identical
indexes) ... so you could conceivably use a shared cache -- if you aren't
talking about identical copies of the index, then I'm not sure I
understand your situation -- since it would seem you'd need something to
ensure that the query got sent to the *right* server, and that would imply
that subsequent requests would be able to take advantage of the cache.

In principle the servers would have the exact same index, but it might be
possible for them to be out of sync temporarily, say when one or more
servers are down for one reason or another. They would be restarted and
then would be fed the changes they missed. Thus I could end up with
different indices across the servers. This is pretty much like database
replication.
We have thought about doing some kind of connection persistence like you
suggest, but that has some nasty production-level problems. We could
probably deal with them, but I would rather avoid that.

practically speaking, there is a difference between the rate at which
updates come in, and the rate at which you have to make updates available
to your users ... reopening an IndexReader no more frequently than once
every X minutes -- and not reopening at all unless the index has changed --
can drastically improve your ability to cache things effectively.

mmh, that sounds like something that I have missed ;) I'll have to do some
testing to see what reopening interval I would end up with while still
getting decent cache usage. Thanks for the refresher!
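
Just so I don't lose it: a minimal sketch of what I'm planning to test,
against the Lucene 2.x API (the holder class, its names, and the
five-minute interval are all made up, and real code would need reference
counting before closing the old reader):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.Directory;

    // Hold one shared reader; consider replacing it at most every
    // REOPEN_INTERVAL_MS, and even then only if the index has changed.
    public class ReaderHolder {
        private static final long REOPEN_INTERVAL_MS = 5 * 60 * 1000;

        private final Directory dir;
        private IndexReader reader;
        private long lastCheck = 0;

        public ReaderHolder(Directory dir) throws IOException {
            this.dir = dir;
            this.reader = IndexReader.open(dir);
        }

        public synchronized IndexReader getReader() throws IOException {
            long now = System.currentTimeMillis();
            if (now - lastCheck >= REOPEN_INTERVAL_MS) {
                lastCheck = now;
                if (!reader.isCurrent()) {          // index unchanged -> keep reader and its caches
                    IndexReader old = reader;
                    reader = IndexReader.open(dir); // caches keyed on the old reader expire here
                    old.close();                    // unsafe while searches are in flight; refcounting omitted
                }
            }
            return reader;
        }
    }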

I'm not a file format guy, I don't really get into those internals much,
but the document number isn't "essentially" the offset -- it *is* the
offset -- each segment maintains its own list of documents from 0 to
segment size - 1, and the document number for a given doc is the number
used to identify it in its segment, plus the offset of that segment in the
index as a whole -- I don't believe the document number you use when
calling reader.document(docId) is ever stored in the index (unless your
index happens to have a single segment).

Ok, then I haven't read the code the wrong way :)
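
For concreteness, the arithmetic as I now understand it, with made-up
segment sizes:

    segment 0: 10 docs, base  0  ->  global docIds  0..9
    segment 1: 25 docs, base 10  ->  global docIds 10..34
    segment 2: 40 docs, base 35  ->  global docIds 35..74

So the doc that segment 2 knows internally as 5 is reader.document(35 + 5),
i.e. reader.document(40).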


:    My question for this solution would be: Where, in general terms,
: would I have to look to change the meaning of the document id, codewise?

Don't take this the wrong way, but I read that question the same way I
would read "Where, in general terms, would I have to look to change the
meaning of integer addition, codewise?" ... there is no
DocumentNumberFactory, no getNextDocId() ... the concept of document ids
is so ingrained and pervasive I can't even imagine what it would take to
change it.

I get it ;) Of course I know that there isn't a DocumentNumberFactory, but
at some point it must be decided which document number a document will
have at a given point in time, right? That decision doesn't have to be
spelled out in code, and indeed it is very implicit, but if I were able to
hook into it and record how the changing document numbers relate to our
unique ids, I could look up the docId from that table without enumerating
terms.
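
For reference, the per-document lookup I'd like to avoid looks roughly
like this today (a sketch against the Lucene 2.x API; the "uid" field name
is our own convention, not anything built into Lucene):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    // Map our application-level unique id to the reader-specific docId.
    // Returns -1 if the uid isn't present in this reader.
    static int docIdForUid(IndexReader reader, String uid) throws IOException {
        TermDocs td = reader.termDocs(new Term("uid", uid));
        try {
            return td.next() ? td.doc() : -1;
        } finally {
            td.close();
        }
    }

The hypothetical hook would let me replace that term lookup with a plain
table lookup that gets patched whenever merges renumber documents.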

As I recall, every time you open a new IndexWriter (which you would need
to do frequently, since you have to close your old IndexWriter for your
multiple updates a second to be visible) a new segment is created ... if
you really manage to completely eliminate segment merges you are talking
about potentially having 100,000 segments after 24 hours (a day is 86,400
seconds, so even one new writer per second puts you in that ballpark) ...
I think the search performance cost there would pretty significantly
outweigh your current Filter building costs ... but that's just
speculation, since I've never seen an index with that many segments -- at
least not on a machine that was actually *functioning* :)

Ok, that confirms my suspicions. I was afraid it would seriously affect
query time and general performance (number of open files). So that
solution is out.

it's inherent to the idea of an Inverted Index -- there is no single
document stored anywhere, the indexed terms are stored and they reference
the list of docIds (as offsets in the current segment) that they appear in
... deleting a document just adds the docId to a list of documents to be
ignored (because they are deleted) when searches are performed ... if you
really wanted to update in place, you would need to scan every term, see
if it was mapped to the doc, and then zero out that mapping, and then
append the list of <term,doc> mappings for the terms that are in the
updated doc -- which would take a lot longer than the current
delete/update approach.

Ok, it was just that I couldn't find the explanation in the archives.
I agree, it would probably be a performance nightmare to do updates that way.
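
So we'll stick with the usual two-step update. For the record, roughly
what that looks like against the Lucene 2.0 API (the "uid" field is again
our own unique-id convention, and dir and newDoc are assumed to be in
scope):

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;

    // "Update" = flag the old doc as deleted, then add the new version.
    // The new version gets a brand new docId; nothing is rewritten in place.
    static void updateDocument(Directory dir, String uid, Document newDoc)
            throws IOException {
        IndexReader reader = IndexReader.open(dir);
        try {
            reader.deleteDocuments(new Term("uid", uid)); // only marks the old docId deleted
        } finally {
            reader.close();                               // flushes the deletion
        }
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
        try {
            writer.addDocument(newDoc);                   // re-added doc gets a new docId
        } finally {
            writer.close();
        }
    }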


Thanks very much for your input!

cheers,

-k

--
Kay Röpke
http://classdump.org/




