On Dec 24, 2008, at 12:23 PM, Jason Rutherglen wrote:
> Also, what are the requirements? Must a document be visible to
> search within 10ms of being added?
0-5ms. Otherwise it's not realtime, it's batch indexing. The
realtime system can support small batches by encoding them into
RAMDirectories if they are of sufficient size.
> Or must it be visible to search from the time that the call to
> add it returns?
Most people probably expect the update latency offered by SQL
databases.
This is the problem spot. In an SQL database, when an update/add
occurs, the same connection/transaction will see the changes when
requested IMMEDIATELY - there is 0 latency.
In order to do this you MUST have the concept of transactions and/or
connections.
OR you must make it so that every update/add is immediately available
- this is probably simpler.
You just need to always search both the RAM index and the disk index.
The deletions must be mapped to the disk index, and the "latest" version
of the document must be obtained from the RAM index (if it is there).
You just need to merge the RAM and disk indexes in the background... and
continually create new/merged RAM indexes.
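To make the idea above concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class TwoTierIndex and its methods are hypothetical names, and simple maps stand in for the RAM and disk indexes. It only shows the bookkeeping: the RAM tier wins on lookup, deletes mask the disk tier, and a merge step folds everything into a new disk tier.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Toy stand-in for a RAM index layered over a disk index; not Lucene code.
class TwoTierIndex {
    // "Disk" tier: replaced wholesale by merge(), never edited in place.
    private Map<String, String> disk = new HashMap<>();
    // "RAM" tier: holds the latest version of recently added documents.
    private Map<String, String> ram = new HashMap<>();
    // Ids whose disk copy is stale or deleted but not yet merged away.
    private Set<String> diskDeletes = new HashSet<>();

    // Add or update: visible to get() as soon as this call returns.
    synchronized void add(String id, String body) {
        ram.put(id, body);
        diskDeletes.add(id); // mask any older version on disk
    }

    synchronized void delete(String id) {
        ram.remove(id);
        diskDeletes.add(id);
    }

    // Search both tiers: RAM wins, disk hits are filtered by the delete set.
    synchronized Optional<String> get(String id) {
        String fresh = ram.get(id);
        if (fresh != null) return Optional.of(fresh);
        if (diskDeletes.contains(id)) return Optional.empty();
        return Optional.ofNullable(disk.get(id));
    }

    // Merge step (run from a background thread): build a new disk tier
    // from the old one minus deletes plus RAM, then start a fresh RAM tier.
    synchronized void merge() {
        Map<String, String> merged = new HashMap<>(disk);
        merged.keySet().removeAll(diskDeletes);
        merged.putAll(ram);
        disk = merged;
        ram = new HashMap<>();
        diskDeletes = new HashSet<>();
    }

    synchronized int ramSize() {
        return ram.size();
    }
}
```

A real implementation would not hold one lock across the whole merge; the coarse synchronization here is only to keep the sketch short and obviously correct.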
The memory requirements are going to go up, but you can always add a
"block" so that if the background merger gets too far behind, the
system blocks any current requests (to avoid the system running out
of memory).
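The "block" could be sketched with a counting semaphore, as below. This is an assumption about one way to do it, not a description of any existing Lucene mechanism; MergeBackpressure and its method names are hypothetical. Each added document takes a slot, and the background merger hands slots back once documents have been folded into the disk index.

```java
import java.util.concurrent.Semaphore;

// Hypothetical guard that caps how many added documents may await merging.
class MergeBackpressure {
    private final Semaphore slots;

    MergeBackpressure(int maxUnmergedDocs) {
        this.slots = new Semaphore(maxUnmergedDocs);
    }

    // Called on the add path; blocks while the merger is too far behind.
    void beforeAdd() {
        slots.acquireUninterruptibly();
    }

    // Non-blocking variant: returns false instead of waiting for a slot.
    boolean tryBeforeAdd() {
        return slots.tryAcquire();
    }

    // Called by the background merger after it has merged `docs` documents.
    // The caller must not release more slots than were acquired.
    void afterMerge(int docs) {
        slots.release(docs);
    }

    int freeSlots() {
        return slots.availablePermits();
    }
}
```

The limit bounds memory by document count only; a byte-based budget would follow the same pattern with permits standing for bytes.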
> As a baseline, how fast is it to simply use RAMDirectory?
It depends on how fast searches over the realtime index need to
be. The detriment to speed occurs with having many small segments
that are continuously decoded (terms, postings, etc). The
advantage of MemoryIndex and InstantiatedIndex is an actual
increase in search speed compared with RAMDirectory (see the
Performance Notes at
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/index/memory/MemoryIndex.html)
and no need to continuously decode segments that are short lived.
Anecdotal tests indicated that the merging overhead of using
RAMDirectory, as compared with MemoryIndex (MI) or InstantiatedIndex
(II), is significant enough to make it useful only for batches in the
1000s, which does not seem to be what people expect from realtime search.
On Wed, Dec 24, 2008 at 9:53 AM, Doug Cutting <cutt...@apache.org>
wrote:
Jason Rutherglen wrote:
2) Implement realtime search by incrementally creating and merging
readers in memory. The system would use MemoryIndex or
InstantiatedIndex to quickly (more quickly than RAMDirectory)
create indexes from added documents.
As a baseline, how fast is it to simply use RAMDirectory? If one,
e.g., flushes changes every 10ms or so, and has a background thread
that uses IndexReader.reopen() to keep a fresh version for reading?
Also, what are the requirements? Must a document be visible to
search within 10ms of being added? Or must it be visible to search
from the time that the call to add it returns? In the latter case
one might still use an approach like the above. Writing a small
new segment to a RAMDirectory and then, with no merging, calling
IndexReader.reopen(), should be quite fast. All merging could be
done in the background, as should post-merge reopens() that involve
large segments.
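The flush-then-reopen pattern described above can be illustrated with a plain-Java analogy. This is not Lucene code and does not use IndexWriter or IndexReader; SegmentedStore and its methods are hypothetical names. The point it shows is that publishing a small new segment without merging is a cheap pointer swap, that a reader's snapshot is a stable point-in-time view, and that merging can happen separately.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Plain-Java analogy for "flush a small segment, then reopen"; not Lucene code.
class SegmentedStore {
    // Immutable list of immutable segments; readers fetch it without locking.
    private final AtomicReference<List<List<String>>> current =
            new AtomicReference<>(Collections.<List<String>>emptyList());

    // Flush: publish a batch as one new small segment, with no merging.
    synchronized void flush(List<String> batch) {
        List<List<String>> next = new ArrayList<>(current.get());
        next.add(Collections.unmodifiableList(new ArrayList<>(batch)));
        current.set(Collections.unmodifiableList(next));
    }

    // Merge (intended for a background thread): collapse all segments
    // into a single segment and publish the result.
    synchronized void mergeAll() {
        List<String> merged = new ArrayList<>();
        for (List<String> segment : current.get()) {
            merged.addAll(segment);
        }
        current.set(Collections.singletonList(Collections.unmodifiableList(merged)));
    }

    // "Reopen": a point-in-time view that later flushes do not change.
    List<List<String>> snapshot() {
        return current.get();
    }

    // Search a view by scanning every segment in it.
    static boolean contains(List<List<String>> view, String doc) {
        for (List<String> segment : view) {
            if (segment.contains(doc)) return true;
        }
        return false;
    }
}
```

A searcher that took its snapshot before a flush will not see the new documents until it takes a fresh one, which mirrors the gap between flushing and reopening that the 10ms question is about.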
In short, I wonder if new reader and writer implementations are in
fact required or whether, perhaps with a few optimizations, the
existing implementations might meet this need.
Doug