Marvin Humphrey wrote:

I have a bunch of file format changes to push through, and I'm hoping to implement them using pluggable modules. For instance, I'd like to be able to swap out bit-vector-based deletions for tombstone-based deletions, just by
overriding a method or two.

I think Lucene should also aim for this (swappability of index codecs) --
LUCENE-1458 is a step towards that specifically for postings.  The
tombstone approach for deletions sounds compelling too, though first
we need to fix the API to switch to iterator only and stop calling isDeleted
in document(docID).
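
To illustrate what "iterator only" could mean here, a minimal sketch follows. The class `LiveDocs` and its method are hypothetical names invented for this example, not Lucene API, and the sketch assumes the deleted doc IDs arrive sorted in ascending order without duplicates.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class LiveDocs {
    // Walks doc ids 0..maxDoc-1 once while advancing a sorted iterator of
    // deleted ids alongside, so no random-access isDeleted(docID) is needed.
    static List<Integer> liveDocs(int maxDoc, Iterator<Integer> sortedDeleted) {
        List<Integer> live = new ArrayList<>();
        int nextDeleted = sortedDeleted.hasNext() ? sortedDeleted.next() : Integer.MAX_VALUE;
        for (int doc = 0; doc < maxDoc; doc++) {
            if (doc == nextDeleted) {
                // Skip the deleted doc and move to the next deletion, if any.
                nextDeleted = sortedDeleted.hasNext() ? sortedDeleted.next() : Integer.MAX_VALUE;
            } else {
                live.add(doc);
            }
        }
        return live;
    }
}
```

Because the consumer only ever asks for "the next deletion", the backing store could be a bit vector, a tombstone list, or something else entirely.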

PFOR, pulsing are other recent examples where if we had swappability,
people could more easily explore.

Jason Rutherglen:

Decoupling IndexReader for 3.0 would be great. This includes making
SegmentReader and MultiSegmentReader public.

I definitely think that IndexReader can and should be made more pluggable. Is exposing per-segment sub-readers a definite win, though? Does it make sense to leave open the door to index components which don't operate on segments?
Or even to eliminate SegmentReader entirely and have sub-components of
IndexReader manage collation?

I've been thinking about this with regard to tombstone-based deletions, where you can't know everything about a segment unless you've opened up other
segments.
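
A rough sketch of why that is, under an assumed model in which each segment stores tombstones that may point at documents in other segments. `Tombstone`, `Segment`, and `TombstoneResolver` are hypothetical names for illustration only, not KinoSearch, Lucy, or Lucene classes.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical record of one deletion.
final class Tombstone {
    final int targetSegment; // index of the segment holding the deleted doc
    final int docId;         // doc id local to that segment

    Tombstone(int targetSegment, int docId) {
        this.targetSegment = targetSegment;
        this.docId = docId;
    }
}

// Hypothetical segment: a doc count plus the tombstones written with it.
final class Segment {
    final int maxDoc;
    final List<Tombstone> tombstones = new ArrayList<>();

    Segment(int maxDoc) {
        this.maxDoc = maxDoc;
    }
}

final class TombstoneResolver {
    // Collects the deletions for one segment by scanning the tombstones of
    // every segment in the index. This is why a single segment cannot be
    // fully described without opening the others.
    static BitSet deletedDocs(List<Segment> segments, int target) {
        BitSet deleted = new BitSet(segments.get(target).maxDoc);
        for (Segment seg : segments) {
            for (Tombstone t : seg.tombstones) {
                if (t.targetSegment == target) {
                    deleted.set(t.docId);
                }
            }
        }
        return deleted;
    }
}

final class TombstoneDemo {
    public static void main(String[] args) {
        List<Segment> segments = new ArrayList<>();
        segments.add(new Segment(5));
        segments.add(new Segment(3));
        // The second segment records a deletion of doc 2 in the first.
        segments.get(1).tombstones.add(new Tombstone(0, 2));
        System.out.println(TombstoneResolver.deletedDocs(segments, 0));
    }
}
```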

These are good points: it may be exposing too much if we fully expose
SegmentReader now, since some components (deletion tombstones) may
want to skip that API and operate directly on lower level files.
Though, with LUCENE-1483 we are moving to executing scoring &
collection per-segment.

A constructor like new SegmentReader(TermsDictionary termDictionary,
TermPostings termPostings, ColumnStrideFields csd, DocIdBitSet deletedDocs);

You end up with a proliferation of constructors that way. Term vectors?
Arbitrary auxiliary components such as an R-tree component supporting
geographic search?

My original proposal to clean this up involved an "IndexComponent" class. However, when I started implementing it, I ended up with a slew of new classes
with only two factory methods each.

We could possibly move those factory methods up into Schema, but I'm reluctant to dirty it up, since it's a major public class in KS (as I anticipate it will be
in Lucy) and major public classes should be as simple as possible.

So, how about an IndexArchitecture or IndexPlan class?

 class MyArchitecture extends IndexArchitecture {
   public PostingsWriter PostingsWriter() {
     return new PForDeltaPostingsWriter();
   }
   public PostingsReader PostingsReader() {
     return new PForDeltaPostingsReader();
   }
   public DeletionsWriter DeletionsWriter() {
     return new TombstoneWriter();
   }
   public DeletionsReader DeletionsReader() {
     return new TombstoneReader();
   }
 }

Lucene:

 IndexWriter writer = new IndexWriter("/path/to/index",
   new StandardAnalyzer(), new MyArchitecture());

Lucy with Java bindings:

 class MySchema extends Schema {
   public MySchema() {
     initField("title", "text");
     initField("content", "text");
   }
   public IndexArchitecture indexArchitecture() {
     return new MyArchitecture();
   }
   public Analyzer analyzer() {
     return new PolyAnalyzer("en");
   }
 }

 IndexWriter writer = new IndexWriter(MySchema.open("/path/to/index"));

I think this is a reasonable approach.  I might name it IndexCodec(s)
though, and I agree conceptually it's orthogonal to a "schema".
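
For anyone who wants to try the factory idea in isolation, here is a small self-contained sketch. Every type in it (`PostingsWriter`, `DeletionsWriter`, `IndexArchitecture`, `TombstoneArchitecture`, `SketchIndexWriter`) is a stand-in defined for this example, not an existing Lucene or Lucy class, and the `format()` strings are placeholders for real components.

```java
// Illustrative stand-ins only; real components would write index files.
interface PostingsWriter {
    String format();
}

interface DeletionsWriter {
    String format();
}

// Default architecture: each factory method returns the stock component.
class IndexArchitecture {
    public PostingsWriter postingsWriter() {
        return () -> "standard-postings";
    }

    public DeletionsWriter deletionsWriter() {
        return () -> "bit-vector";
    }
}

// Overrides a single factory method; the postings component is inherited.
class TombstoneArchitecture extends IndexArchitecture {
    @Override
    public DeletionsWriter deletionsWriter() {
        return () -> "tombstone";
    }
}

// Hypothetical writer that asks the architecture for its components
// instead of constructing them itself.
final class SketchIndexWriter {
    private final PostingsWriter postings;
    private final DeletionsWriter deletions;

    SketchIndexWriter(IndexArchitecture arch) {
        this.postings = arch.postingsWriter();
        this.deletions = arch.deletionsWriter();
    }

    String describe() {
        return postings.format() + "+" + deletions.format();
    }
}

final class ArchitectureDemo {
    public static void main(String[] args) {
        System.out.println(new SketchIndexWriter(new IndexArchitecture()).describe());
        System.out.println(new SketchIndexWriter(new TombstoneArchitecture()).describe());
    }
}
```

The point of the sketch is that swapping the deletions format takes one overridden method, and the writer's constructor signature does not grow as components are added.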

Decouple rollback, commit, IndexDeletionPolicy from DirectoryIndexReader into a class like SegmentsVersionSystem which could act as the controller for reopen types of methods. There could be a SegmentVersionSystem that
manages the versioning of a single segment.

I like it. :)

Sometimes you want to change up the merge policy for different writers against
the same index.  How does that fit into your plan?

My thought is that merge-policies would be application-specific rather than
index-specific.

This one I'm a little hazy on.  It would be nice to have a single
source for IndexWriter & IndexReader-acting-as-writer to share this
logic, but then we are [very, very slowly] migrating towards
IndexWriter being the only thing that writes to an index so it seems
like eventually it's OK if this logic is managed via the IndexWriter.

Mike
