Re: Fieldable, AbstractField, Field

Michael McCandless Fri, 21 Mar 2008 03:46:16 -0700


Chris Hostetter wrote:

: Wouldn't subclassing ReadOnlyDocument also work in this case, ifyou override: the getField* to do your own new logic if it applies elsefallback to super?
Sure, but how will IndexReader (or really FieldsReader) know which
subclass to instantiate?

IndexReader would not instantiate a subclass in this example. I wasjust saying your wrapping example doesn't require an interface: itworks equally well if ReadOnlyDocument is a concrete class. I thinkmoving away from interfaces for 3.0 makes sense.

I think in LUCENE-778 the notion of passing a
DocumentFactory to IndexReader was brought up; in another reply tothisthread robert suggests having a setter that takes in a Class whichhas a
noarg constructor and using reflect -- both of these could work, but
frankly having a really simple API that allows for objectdecoration seems
cleaner to me.

I think whether IndexReader should be able to instantiate an app-specific subclass for a returned document is a somewhat orthogonalquestion here. I agree it would be nice to be able to do this,though I do wonder if we are "going too far in being flexible".Would we also offer app-specific per-field classes? Lucene is asearch engine and if you want to do neat custom application classesabove it it seems like it should remain above the API withdecoration. Ie I do like the simplicity of Yonik's document()returning Map<String,String[]>.

: approach. People who are careful (store enough fields, don't useboosting or: have separate store for their boosting) could pull Documents froma reader,
: tweak them, and build a new index.
this to me is the strongest reason to have seperate classes, whereneither
is the base class for the other -- we want to discourage the casual
observer from assuming this is simple. we want to make sure thatusers
who want to do this have to go out of their way to have to write the
translation from SearchResultDocument to IndexableDocument so it'sclearto them what they might be losing in the translation. (incidently:this
is another topic that came up in LUCENE-778...
https://issues.apache.org/jira/browse/LUCENE-778?focusedCommentId=12478169#action_12478169
)

Agreed, it's a dangerous trap to make this look simple. EG, I justdiscovered that if you call "isIndexed()" on a Fieldable in thedocument returned by an IndexReader, it can lie, because it fallsback to FieldInfos to check that field. But this is not right sinceon adding the one doc you could have specified NOT to index that onefield and only store it. Also, getBoost() doesn't survive indexing.In short, it's buggy, today, that we try to act like a retrieveddocument has faithfully preserved these detailed index-time settings.

fundementally, i believe
* the API for objects that are passed to "indexing" should onlyrequire
"getters" that are essential for understanding how to indexing
that document. ie: if indexing happens by calling
IndexWriter.addDocument(IndexableDoc) then the IndexableDoc API should
only require that there is a way to "get" IndexableFields and other
options, IndexableField should only describe how to "get" the fieldnameand field value and; the client providing the IndexableDoc shouldbe freeto provide an object that implements those methods any way it seesfit --we will most likely want to provde "simple" concrete subclasses ofthese
APIs that have simple setters to meet the common cases -- but the
IndexWriter methods should know only about the "API" classes.
* the API for objects that are returned from "reading/searching"should
only specify "getters" that are essential for the client to understand
the data the index knows about those documents.  "setters" used by
IndexReader (or it's internals) shouldn't be exposed by this APIunless we
*truely* want to let clients treat those objects as mutabl "Beans" for
their own purposes ... personally, i think it's safer to make it easy
to decorate those objects then to allow arbitrary modifications.


I like this approach.

So the IndexableDocument abstract base class would only definegetters. Document (in oal.index package) would subclass this, addingbasic setters. Likewise for Field, along with the getter/setters forindexing details like different term vector, stored, index settings.

StoredDocument abstract base class would only define getters. Therewould be a (package-)private class in oal.search that subclassesthis and adds private setters. Likewise for Field, or, maybeStoredDocument deals directly in String and doesn't make Fieldclasses (Yonik's idea).

: Actually I think they do share alot more than just name of thefield?: Accessing the "stored" value of a document is exactly whatindexing needs to: do when it indexes the document in the first place? Ie, a"stored" document: "looks alot like" the document at indexing time that had beenstored. And
i think the stored value looks alot like the value stored during index
because of the perspective we look at them from ... but a Document
returned from a search just has Fields with "Values" while adocument that
needs to to be indexed will have Fields with "Values To Store" and
Fields with "Values To Index" and Fields with both ... there'sreally no
reason for a single method in a common base class to describe those
things.

OK I now agree with this. I don't think Lucene should be trying topreserve all the index-time settings through to search-time, becauseit's buggy now and it's not really necessary.

: things like isTokenized, isTermVectorStored,isStoreOffsetWithTermVector,: isBinary are actually preserved in the index and known to thereader, so it's
: worth having these methods available at search time?

I admit, some of those is*() methods might make sense in a common base
class -- but not all of them are specific to each instance ofField. Someare the least common denominator of all Fields indexed with thesame name
added -- it's convinient to return that info as part of each Field
instance so thatthe client doesn't have to go call some method in
IndexReader (ie: reader.isTermVectorStored(fieldName)) butproviding it to
the client using the same method (not just a method with a similar, or
identical name, but the *exact* same method as defined by a common
baseclass) is very confusing when that method returns somethingdifferent
then what is was when the client orriginally indexed that document.

I've flipped here and would be happy not to expose any of this API inthe returned document.


Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Fieldable, AbstractField, Field

Reply via email to