MemoryIndex was designed to maximize performance for a specific use case: a pure in-memory data structure, at most one document per MemoryIndex instance, any number of fields, high-frequency reads, high-frequency index writes, no thread safety required, and optional support for storing offsets.
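For context, typical usage looks roughly like this (a sketch along the lines of the class javadoc; the field contents and query are made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryParser.QueryParser;

public class MemoryIndexDemo {
    public static void main(String[] args) throws Exception {
        // One transient document, indexed and queried entirely in RAM.
        MemoryIndex index = new MemoryIndex();
        index.addField("content", "Readings about Salmons and other select Alaska fishing Manuals",
                new StandardAnalyzer());
        index.addField("author", "Tales of James", new StandardAnalyzer());

        // search() returns a relevance score > 0.0f on a match, 0.0f otherwise.
        float score = index.search(
                new QueryParser("content", new StandardAnalyzer()).parse("+author:james +salmon~"));
        System.out.println("score=" + score);
    }
}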

I briefly considered extending it to the multi-document case, but eventually refrained from doing so, because I didn't really need such functionality myself (no itch). Here are some issues to consider when attempting such an extension:

- The internal data structure would probably look quite different.
- There are data-structure/algorithmic trade-offs regarding time vs. space, read vs. write frequency, and common vs. less common use cases.
- Hence, it may well turn out that there's not much to reuse.
- A priori, it isn't clear whether a new solution would be significantly faster than normal RAMDirectory usage. Thus...
- A benchmark suite would be needed to evaluate the chosen trade-offs.
- Tests would be needed to ensure correctness (in practice, meaning it behaves just like the existing alternative).

I'd say it's a non-trivial undertaking. Right now, for example, I don't have time for such an effort. That doesn't mean it's impossible or shouldn't be done, of course. If someone would like to run with it, that would be great, but in light of the above issues, I'd suggest doing it in a new class (say, MultiMemoryIndex or similar).
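Just to make the suggestion concrete, the public surface of such a class might look something like the following (purely hypothetical; nothing like this exists in Lucene today):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.search.Query;

// Hypothetical sketch of a multi-document counterpart to MemoryIndex.
public interface MultiMemoryIndex {
    /** Adds a field to a new or already-buffered document. */
    void addField(int docId, String fieldName, String text, Analyzer analyzer);

    /** Scores the query against every buffered document; one score per doc. */
    float[] search(Query query);
}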

I believe Mark has done some initial work in that direction, based on an independent (and different) implementation strategy.

Wolfgang.

On May 2, 2006, at 12:25 AM, Robert Engels wrote:

Along the lines of Lucene-550, what about having a MemoryIndex that accepts multiple documents and then writes the index once, at the end, in the Lucene file format (so it could be merged) during close()?

When adding documents using an IndexWriter, a new segment is created for each document, and the segments are then periodically merged in memory and/or with disk segments. It seems that when constructing an index, or when updating a "lot" of documents in an existing index, this write/read/merge cycle is inefficient; if the document/field information were maintained in sorted order (TreeMaps), greater efficiency could be realized.
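To illustrate the idea, a rough sketch of such a sorted in-memory buffer could look like this (all names are made up, not existing Lucene API):

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical multi-document buffer: terms are kept sorted per field as
// they arrive, so a segment could later be written in Lucene's file format
// (which stores terms in lexicographic order) without a separate sort pass.
public class SortedPostingsBuffer {
    // field -> (term -> ids of docs containing that term), both kept sorted.
    private final TreeMap<String, TreeMap<String, List<Integer>>> fields =
            new TreeMap<String, TreeMap<String, List<Integer>>>();

    public void addTerm(String field, String term, int docId) {
        TreeMap<String, List<Integer>> terms = fields.get(field);
        if (terms == null) {
            terms = new TreeMap<String, List<Integer>>();
            fields.put(field, terms);
        }
        List<Integer> postings = terms.get(term);
        if (postings == null) {
            postings = new ArrayList<Integer>();
            terms.put(term, postings);
        }
        postings.add(docId); // doc ids arrive in increasing order
    }
}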

With a memory index, the memory needed during an update will increase dramatically, but it could still be bounded, with a "disk based" index segment being written whenever too many documents accumulate in the memory index (max buffered documents).
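A sketch of that flush policy (again, all names illustrative, not existing Lucene API):

import java.io.IOException;

// Hypothetical bound on memory use: write a disk segment whenever the
// buffer reaches maxBufferedDocs, mirroring IndexWriter's
// maxBufferedDocs/minMergeDocs knob.
public abstract class BoundedMemoryIndexer {
    private final int maxBufferedDocs;
    private int bufferedDocCount = 0;

    protected BoundedMemoryIndexer(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
    }

    public void addDocument() throws IOException {
        bufferedDocCount++;
        if (bufferedDocCount >= maxBufferedDocs) {
            flushSegment();       // write buffered docs as one on-disk segment
            bufferedDocCount = 0; // start a fresh in-memory buffer
        }
    }

    // Writes the buffered documents in Lucene's segment file format.
    protected abstract void flushSegment() throws IOException;
}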

Does this "sound" like an improvement? Has anyone else tried something like this?


