One thing I forgot to mention: in our implementation, the real-time indexing took place with many "folder-based" listeners writing to many tiny in-memory indexes partitioned by "sub-source", with fewer long-term and archive indexes per box. Overall distributed search across the various Lucene-based search services was done using a federator component, very much like shard-based searches are done today (I believe).
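The federator pattern described above, fanning a query out to many small per-partition indexes and merging the ranked results, can be sketched roughly as follows. This is a minimal language-agnostic model, not Lucene's actual MultiSearcher API; all class names, the overlap scoring, and the partition layout are invented for illustration:

```python
# Minimal sketch of a "federator" that fans a query out to several small
# partition indexes and merges results by score. All names are hypothetical;
# this models the control flow only, not Lucene's real API.

class TinyIndex:
    """One small in-memory index for a single 'sub-source' partition."""

    def __init__(self, name):
        self.name = name
        self.docs = {}  # doc_id -> set of terms

    def add(self, doc_id, text):
        self.docs[doc_id] = set(text.lower().split())

    def search(self, query):
        # Score = number of query terms matched (crude overlap scoring).
        terms = set(query.lower().split())
        hits = []
        for doc_id, doc_terms in self.docs.items():
            score = len(terms & doc_terms)
            if score > 0:
                hits.append((score, self.name, doc_id))
        return hits


class Federator:
    """Fans a query out to all partitions and merges the ranked results,
    much like a shard-based distributed search."""

    def __init__(self, partitions):
        self.partitions = partitions

    def search(self, query, top_n=10):
        merged = []
        for index in self.partitions:
            merged.extend(index.search(query))
        merged.sort(key=lambda hit: -hit[0])  # highest score first
        return merged[:top_n]
```

A real deployment would query the partitions concurrently and consult the long-term/archive indexes as well; this only illustrates the fan-out/merge control structure.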
-- Joaquin

On Fri, Dec 26, 2008 at 10:48 AM, J. Delgado <[email protected]> wrote:
> The addition of docs into tiny segments using the current data structures
> seems the right way to go. Some time back, one of my engineers implemented
> pseudo real-time search using MultiSearcher, by having an in-memory
> (RAM-based) "short-term" index that auto-merged into a disk-based
> "long-term" index, which eventually got merged into "archive" indexes.
> Index optimization would take place during these merges. The search we
> required was very time-sensitive (searching last-minute breaking news
> wires). The advantage of having an archive index is that very old
> documents in our application were not usually searched unless archives
> were explicitly selected.
>
> -- Joaquin
>
> On Fri, Dec 26, 2008 at 10:20 AM, Doug Cutting <[email protected]> wrote:
>> Michael McCandless wrote:
>>> So then I think we should start with approach #2 (build real-time on
>>> top of the Lucene core) and iterate from there. Newly added docs go
>>> into tiny segments, which IndexReader.reopen pulls in. Replaced or
>>> deleted docs record the delete against the right SegmentReader (and
>>> LUCENE-1314 lets reopen carry those pending deletes forward, in RAM).
>>>
>>> I would take the simple approach first: use an ordinary SegmentReader
>>> on a RAMDirectory for the tiny segments. If that proves too slow, swap
>>> in Memory/InstantiatedIndex for the tiny segments. If that proves too
>>> slow, build a reader impl that reads from the DocumentsWriter RAM
>>> buffer.
>>
>> +1 This sounds like a good approach to me. I don't see any fundamental
>> reasons why we need different representations, and fewer implementations
>> of IndexWriter and IndexReader is generally better, unless they get way
>> too hairy. Mostly it seems that real-time can be done with our existing
>> toolbox of data structures, but with some slightly different control
>> structures.
>>
>> Once we have the control structure in place, then we should look at
>> optimizing data structures as needed.
>>
>> Doug
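The tiered design described in the quoted messages, where new docs land in a small short-term index that periodically merges into a long-term index, which in turn rolls over into archive indexes searched only on request, can be modeled in a few lines. This is an illustrative sketch only: the thresholds and names are invented, and a real Lucene implementation of that era would have used RAMDirectory/FSDirectory with merges driven by IndexWriter:

```python
# Illustrative sketch of the short-term / long-term / archive tiering
# described in the thread. Thresholds and names are invented for the
# example; index "optimization" would happen during the merge steps
# in the real system.

class TieredIndex:
    def __init__(self, short_term_limit=3, long_term_limit=6):
        self.short_term = []   # tiny in-memory tier, searchable immediately
        self.long_term = []    # disk-based tier
        self.archive = []      # old documents, searched only on request
        self.short_term_limit = short_term_limit
        self.long_term_limit = long_term_limit

    def add(self, doc):
        # New docs always enter the short-term tier (pseudo real-time:
        # they are searchable before any merge happens).
        self.short_term.append(doc)
        if len(self.short_term) >= self.short_term_limit:
            self._merge_short_into_long()

    def _merge_short_into_long(self):
        self.long_term.extend(self.short_term)
        self.short_term = []
        if len(self.long_term) >= self.long_term_limit:
            # Roll the long-term tier into the archive.
            self.archive.extend(self.long_term)
            self.long_term = []

    def search(self, predicate, include_archive=False):
        # Archives are skipped unless explicitly selected, matching the
        # behavior described for the news-wire application.
        tiers = self.short_term + self.long_term
        if include_archive:
            tiers = tiers + self.archive
        return [doc for doc in tiers if predicate(doc)]
```

The key property is that `add` makes a document visible to `search` immediately, while the merge cascade keeps the hot tier tiny, which is the same control structure the thread proposes building from Lucene's existing data structures.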
