MemoryIndex was designed to maximize performance for a specific use
case: a pure in-memory data structure, at most one document per
MemoryIndex instance, any number of fields, high-frequency reads,
high-frequency index writes, no thread safety required, and optional
support for storing offsets.
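To make the single-document trade-off concrete, here is a hypothetical, much-simplified sketch (not MemoryIndex's actual implementation) of the kind of structure that case allows: one term-to-positions map per field, built once for the single document and then read many times. The class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a single-document, in-memory inverted index.
// Because there is exactly one document, postings reduce to a per-field
// map of term -> token positions, with no doc IDs or merging needed.
class SingleDocIndexSketch {
    private final Map<String, Map<String, List<Integer>>> fields = new HashMap<>();

    void addField(String field, String[] tokens) {
        Map<String, List<Integer>> postings =
            fields.computeIfAbsent(field, f -> new HashMap<>());
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new ArrayList<>()).add(pos);
        }
    }

    // The read path is just two hash lookups -- this is where the
    // single-document restriction buys its speed.
    int freq(String field, String term) {
        Map<String, List<Integer>> postings = fields.get(field);
        if (postings == null) return 0;
        List<Integer> positions = postings.get(term);
        return positions == null ? 0 : positions.size();
    }
}
```

Extending this to many documents is exactly where the structure would have to change, as discussed below.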
I briefly considered extending it to the multi-document case, but
eventually refrained from doing so, because I didn't really need such
functionality myself (no itch). Here are some issues to consider when
attempting such an extension:
- The internal data structure would probably look quite different
- Data structure/algorithmic trade-offs regarding time vs. space, read
vs. write frequency, common vs. less common use cases
- Hence, it may well turn out that there's not much to reuse.
- A priori, it isn't clear whether a new solution would be
significantly faster than normal RAMDirectory usage. Thus...
- Need benchmark suite to evaluate the chosen trade-offs.
- Need tests to ensure correctness (in practice, meaning it behaves
just like the existing alternative).
I'd say it's a non-trivial undertaking. Right now, for example, I
don't have time for such an effort. That doesn't mean it's impossible
or shouldn't be done, of course. If someone would like to run with it,
that would be great, but in light of the above issues, I'd suggest
doing it in a new class (say, MultiMemoryIndex or similar).
I believe Mark has done some initial work in that direction, based on
an independent (and different) implementation strategy.
Wolfgang.
On May 2, 2006, at 12:25 AM, Robert Engels wrote:
Along the lines of Lucene-550, what about having a MemoryIndex that
accepts multiple documents, then writes the index once at the end,
during close, in the Lucene file format (so it could be merged)?
When adding documents using an IndexWriter, a new segment is created
for each document, and then the segments are periodically merged in
memory and/or with disk segments. It seems that when constructing an
index or updating a "lot" of documents in an existing index, the
write, read, merge cycle is inefficient, and if the document/field
information were maintained in order (TreeMaps), greater efficiency
would be realized.
With a memory index, the memory needed during update will increase
dramatically, but this could still be bounded, and a "disk based"
index segment written when too many documents are in the memory index
(max buffered documents).
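As a rough illustration of the idea above, here is a hypothetical sketch (not Lucene code; all names are invented) of buffering postings in a TreeMap so terms stay sorted, and "flushing" a segment once a maxBufferedDocs threshold is reached. The flush here just snapshots the map, standing in for writing a segment in the Lucene file format, which expects terms in sorted order:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: buffer multiple documents' postings in
// memory, keeping terms sorted via TreeMap, and flush a "segment"
// whenever maxBufferedDocs documents have accumulated.
class BufferedSegmentSketch {
    private final int maxBufferedDocs;
    private final TreeMap<String, List<Integer>> postings = new TreeMap<>();
    private final List<Map<String, List<Integer>>> flushedSegments = new ArrayList<>();
    private int bufferedDocs = 0;

    BufferedSegmentSketch(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
    }

    void addDocument(int docId, String[] tokens) {
        for (String term : tokens) {
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        }
        if (++bufferedDocs >= maxBufferedDocs) {
            flush();
        }
    }

    // Stand-in for writing a real on-disk segment: TreeMap iteration
    // yields terms already sorted, so no extra sort pass is needed.
    void flush() {
        if (bufferedDocs == 0) return;
        flushedSegments.add(new TreeMap<>(postings));
        postings.clear();
        bufferedDocs = 0;
    }

    int segmentCount() { return flushedSegments.size(); }
}
```

The TreeMap keeps insertion cost at O(log n) per term but makes the flush a single sorted scan, which is the efficiency argument made above; memory stays bounded by the flush threshold.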
Does this "sound" like an improvement? Has anyone else tried
something like this?