Marvin Humphrey wrote on 07/08/2006 11:13 PM:
>
> On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote:
>
>> Many things would be cleaner in Lucene if fields had a global semantics,
>> i.e., if properties like text vs. binary, Index, Store, TermVector, the
>> appropriate Analyzer, the assignment of Directory in ParallelReader (or
>> ParallelWriter), etc. were a function of just the field name and the
>> index.
>
> In June, Dave Balmain and I discussed the issue extensively on the
> Ferret list. It might have been nice to use the Lucy list, since a
> lot of the discussion was about Lucy, but the Lucy lists didn't exist
> at the time.
>
> http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html
>

I think there are a number of problems with that proposal and hope it was not adopted. As my earlier example showed, there is at least one valid use case where storing a term vector is not an invariant property of a field; specifically, when using term vectors to optimize excerpt generation, it is best to store them only for fields that have long values. This is even a counter-example to Karl's proposal, since a single Document may have multiple fields of the same name, some with long values and others with short values; multiple fields of the same name may legitimately have different TermVector settings even on a single Document.

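A minimal sketch of that per-instance decision, in Java. The types here (TermVectorSetting, SketchField, ExcerptFields) and the 1000-character threshold are assumptions made for illustration; they are not Lucene classes or recommended values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; these are not Lucene classes.
enum TermVectorSetting { NO, WITH_POSITIONS_OFFSETS }

final class SketchField {
    final String name;
    final String value;
    final TermVectorSetting termVector;

    SketchField(String name, String value, TermVectorSetting termVector) {
        this.name = name;
        this.value = value;
        this.termVector = termVector;
    }
}

final class ExcerptFields {
    // Assumed cutoff for a "long" value; a real application would tune this.
    static final int LONG_VALUE_THRESHOLD = 1000;

    // The term vector setting is chosen per field instance, not per field name.
    static SketchField create(String name, String value) {
        TermVectorSetting tv = value.length() >= LONG_VALUE_THRESHOLD
                ? TermVectorSetting.WITH_POSITIONS_OFFSETS
                : TermVectorSetting.NO;
        return new SketchField(name, value, tv);
    }

    // Builds the fields of one document that share a single field name.
    static List<SketchField> document(String name, String... values) {
        List<SketchField> fields = new ArrayList<>();
        for (String v : values) {
            fields.add(create(name, v));
        }
        return fields;
    }
}
```

With this, ExcerptFields.document("body", shortText, longText) yields two fields named "body" on the same document, only one of which carries a term vector.
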
As another counter-example from my own app which I'd forgotten yesterday, an important case where the Analyzer will vary across documents is for i18n, where different languages require different analyzers. Refuting again my own argument about this not being consistent with query parsing, the language of the query is a distinct property from the languages of various documents in the collection. In my app, I let the user specify the language of the query, while the language of each Document is determined automatically. So, analyzers vary for both queries and documents, but independently.

I haven't thought of cases where Index or Store would legitimately vary across Fields or Documents, but am less convinced there aren't important use cases for these as well. Similarly, although it is important to allow term vectors to be on or off at the field level, I don't see any obvious need to vary the type of term vector (positions, offsets or both).

There are significant benefits to global semantics, as evidenced by the fact that several of us independently came to desire this. However, deciding what can be global and what cannot is more subtle.

Perhaps the best thing at the Lucene level is to have a notion of default semantics for a field name. Whenever a Field of that name is constructed, those semantics would be used unless the constructor overrides them. This would allow additional constructors on Field with simpler signatures for the common case of invariant Field properties. It would also allow applications to access the class that holds the default field information for an index. The application will know which properties it can rely on as invariant and whether or not the set of fields is closed. This approach would preserve upward compatibility and provide, I believe, most of the benefits we all seek.

Thoughts?

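One way the default-semantics idea might look, as a rough sketch only. FieldSemantics, FieldDefaults and DefaultedField are hypothetical names, not part of Lucene, and the analyzer and term vector type are left out to keep it short.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical types for illustration; none of these exist in Lucene.
final class FieldSemantics {
    final boolean stored;
    final boolean indexed;
    final boolean termVector;

    FieldSemantics(boolean stored, boolean indexed, boolean termVector) {
        this.stored = stored;
        this.indexed = indexed;
        this.termVector = termVector;
    }
}

// Holds the default semantics per field name for one index.
final class FieldDefaults {
    private final Map<String, FieldSemantics> byName = new HashMap<>();

    void setDefault(String fieldName, FieldSemantics semantics) {
        byName.put(fieldName, semantics);
    }

    // Returns null when no default has been registered for the name.
    FieldSemantics getDefault(String fieldName) {
        return byName.get(fieldName);
    }
}

final class DefaultedField {
    final String name;
    final String value;
    final FieldSemantics semantics;

    // Simple signature for the common case: use the default for this name.
    DefaultedField(FieldDefaults defaults, String name, String value) {
        this(name, value, requireDefault(defaults, name));
    }

    // Full signature: the caller overrides the default for this instance.
    DefaultedField(String name, String value, FieldSemantics semantics) {
        this.name = name;
        this.value = value;
        this.semantics = semantics;
    }

    private static FieldSemantics requireDefault(FieldDefaults defaults, String name) {
        FieldSemantics s = defaults.getDefault(name);
        if (s == null) {
            throw new IllegalArgumentException("no default semantics for field: " + name);
        }
        return s;
    }
}
```

An application could register defaults once per index, use the short constructor for most fields, and fall back to the full constructor for exceptions such as a long-valued field that needs a term vector.
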
Chuck