Re: Realtime Search

robert engels Wed, 24 Dec 2008 10:38:26 -0800


On Dec 24, 2008, at 12:23 PM, Jason Rutherglen wrote:

> Also, what are the requirements? Must a document be visible to search within 10ms of being added?
0-5ms. Otherwise it's not realtime, it's batch indexing. The realtime system can support small batches by encoding them into RAMDirectories if they are of sufficient size.
> Or must it be visible to search from the time that the call to add it returns?
Most people probably expect the update latency offered by SQL databases.

This is the problem spot. In an SQL database, when an update/add occurs, the same connection/transaction will see the changes when requested IMMEDIATELY - there is 0 latency.

In order to do this you MUST have the concept of transactions and/or connections.

OR you must make it so that every update/add is immediately available - this is probably simpler.

You just need to always search the ram and the disk index. The deletions must be mapped to the disk index, and the "latest" version of the document must be obtained from the ram index (if it is there).

You just need to merge the ram and disk in the background... and continually create new/merged ram disks.
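
To make the ram-plus-disk idea concrete, here is a minimal toy sketch. It assumes plain maps standing in for the two indexes and a substring match standing in for a real query; the class and method names (TwoTierIndex, mergeRamIntoDisk, etc.) are hypothetical and none of this is Lucene API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model only: maps stand in for the ram and disk indexes; this is not Lucene API.
class TwoTierIndex {
    private Map<String, String> ram = new TreeMap<>();        // id -> text, recent adds/updates
    private final Map<String, String> disk = new TreeMap<>(); // id -> text, already merged
    private Set<String> deletedOnDisk = new HashSet<>();      // disk ids masked by a delete/update

    // An add or update is searchable as soon as this method returns.
    synchronized void add(String id, String text) {
        if (disk.containsKey(id)) {
            deletedOnDisk.add(id); // the disk copy is now stale
        }
        ram.put(id, text);
    }

    synchronized void delete(String id) {
        ram.remove(id);
        if (disk.containsKey(id)) {
            deletedOnDisk.add(id);
        }
    }

    // Always search both tiers: ram holds the latest versions,
    // disk results are filtered through the mapped deletions.
    synchronized List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : ram.entrySet()) {
            if (e.getValue().contains(term)) {
                hits.add(e.getKey());
            }
        }
        for (Map.Entry<String, String> e : disk.entrySet()) {
            if (!deletedOnDisk.contains(e.getKey()) && e.getValue().contains(term)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    // What a background merger would do: apply the deletions, fold ram
    // into disk, and start a fresh ram tier.
    synchronized void mergeRamIntoDisk() {
        disk.keySet().removeAll(deletedOnDisk);
        disk.putAll(ram);
        ram = new TreeMap<>();
        deletedOnDisk = new HashSet<>();
    }
}
```

The coarse synchronized methods keep the sketch short; a real implementation would presumably let searches proceed while a merge is running.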

The memory requirements are going to go up, but you can always add a "block" so that if the background merger gets too far behind, the system blocks any current requests (to avoid the system running out of memory).
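
One possible shape for such a "block" is a counting semaphore that caps how many unmerged ram segments may exist at once. This is only a sketch under that assumption; MergeBackpressure and its method names are hypothetical and not part of Lucene.

```java
import java.util.concurrent.Semaphore;

// Hypothetical helper, not Lucene API: caps the number of ram segments awaiting merge.
class MergeBackpressure {
    private final Semaphore slots;

    MergeBackpressure(int maxPendingSegments) {
        slots = new Semaphore(maxPendingSegments);
    }

    // Called by an indexing thread before it creates a new ram segment.
    // Blocks while the merger is maxPendingSegments behind.
    void beforeNewSegment() {
        slots.acquireUninterruptibly();
    }

    // Non-blocking variant: returns false immediately if the merger is too far behind.
    boolean tryBeforeNewSegment() {
        return slots.tryAcquire();
    }

    // Called by the background merger after it has merged one ram segment into the disk index.
    void afterSegmentMerged() {
        slots.release();
    }

    int freeSlots() {
        return slots.availablePermits();
    }
}
```

The cap bounds memory in units of segments rather than bytes, which is a simplification; the limit would have to be chosen with the expected segment size in mind.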

> As a baseline, how fast is it to simply use RAMDirectory?
It depends on how fast searches over the realtime index need to be. The detriment to speed occurs with having many small segments that are continuously decoded (terms, postings, etc). The advantage of MemoryIndex and InstantiatedIndex is an actual increase in search speed compared with RAMDirectory (see the Performance Notes at http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/index/memory/MemoryIndex.html and ) and no need to continuously decode segments that are short lived.
Anecdotal tests indicated the merging overhead of using RAMDirectory as compared with MI or II is significant enough to make it only useful for doing batches in the 1000s which does not seem to be what people expect from realtime search.
On Wed, Dec 24, 2008 at 9:53 AM, Doug Cutting <[email protected]> wrote:
Jason Rutherglen wrote:
2) Implement realtime search by incrementally creating and merging readers in memory. The system would use MemoryIndex or InstantiatedIndex to quickly (more quickly than RAMDirectory) create indexes from added documents.
As a baseline, how fast is it to simply use RAMDirectory? If one, e.g., flushes changes every 10ms or so, and has a background thread that uses IndexReader.reopen() to keep a fresh version for reading?
Also, what are the requirements? Must a document be visible to search within 10ms of being added? Or must it be visible to search from the time that the call to add it returns? In the latter case one might still use an approach like the above. Writing a small new segment to a RAMDirectory and then, with no merging, calling IndexReader.reopen(), should be quite fast. All merging could be done in the background, as should post-merge reopens() that involve large segments.
In short, I wonder if new reader and writer implementations are in fact required or whether, perhaps with a few optimizations, the existing implementations might meet this need.
Doug
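
The flush-and-reopen baseline described in the quoted message can be illustrated with a toy snapshot model. The sketch assumes an in-memory list of document strings and a substring match; SnapshotIndex is a hypothetical name, and flush() is only loosely analogous to writing a small segment and calling IndexReader.reopen(). It is not Lucene code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Toy model, not Lucene API: flush() publishes an immutable snapshot for readers.
class SnapshotIndex {
    private final List<String> pending = new ArrayList<>();
    private final AtomicReference<List<String>> snapshot =
            new AtomicReference<>(Collections.<String>emptyList());

    // Buffered only; not visible to search until the next flush.
    synchronized void add(String doc) {
        pending.add(doc);
    }

    // Intended to be called by a background thread every few milliseconds.
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // Copying the whole list is a simplification; a real index would
        // append a small segment instead of rebuilding everything.
        List<String> next = new ArrayList<>(snapshot.get());
        next.addAll(pending);
        pending.clear();
        snapshot.set(Collections.unmodifiableList(next));
    }

    // Searches never wait on writers; they read whichever snapshot is current.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (String doc : snapshot.get()) {
            if (doc.contains(term)) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

A periodic flush could be driven by something like ScheduledExecutorService.scheduleWithFixedDelay with a 10ms delay. Note that in this model a document is invisible between add() and the next flush(), which is exactly the latency gap the thread is debating.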
