On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote:

Many things would be cleaner in Lucene if fields had a global semantics, i.e., if properties like text vs. binary, Index, Store, TermVector, the appropriate Analyzer, the assignment of Directory in ParallelReader (or
ParallelWriter), etc. were a function of just the field name and the
index.

This is the direction I would like to go.

This approach would naturally admit a class, say IndexFieldSet,
that would hold global field semantics for an index.
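
To make the idea concrete, here is a minimal sketch of what such a class might look like. Nothing below exists in Lucene: IndexFieldSet, FieldSpec, the two enums, and the analyzerName string (a stand-in for a real Analyzer) are all hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only: none of these types are part of Lucene.
final class IndexFieldSet {
    enum Store { YES, NO }
    enum Index { NO, TOKENIZED, UN_TOKENIZED }

    // The properties that would be fixed per field name for the whole index.
    static final class FieldSpec {
        final Store store;
        final Index index;
        final boolean termVector;
        final String analyzerName; // assumption: stands in for an Analyzer instance

        FieldSpec(Store store, Index index, boolean termVector, String analyzerName) {
            this.store = store;
            this.index = index;
            this.termVector = termVector;
            this.analyzerName = analyzerName;
        }
    }

    private final Map<String, FieldSpec> specs = new LinkedHashMap<>();

    // A field name gets exactly one definition; redefining it is an error.
    void define(String name, FieldSpec spec) {
        FieldSpec previous = specs.putIfAbsent(name, spec);
        if (previous != null) {
            throw new IllegalStateException("field already defined: " + name);
        }
    }

    // Returns the spec for a field name, or null if the name is undefined.
    FieldSpec lookup(String name) {
        return specs.get(name);
    }
}
```

With something like this, both the indexing code and the query parser could ask the same object how a field is tokenized, instead of each Field instance carrying its own answer.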

Lucene today allows many field properties to vary at the Field level.
E.g., the same field name might be tokenized in one Field on a Document
while it is untokenized in another Field on the same or different
Document.  Does anybody know how often this flexibility is used?  Are
there interesting use cases for which it is important?  It seems to me
this functionality is already problematic and not fully supported; e.g.,
indexing can manage tokenization-variant fields, but query parsing
cannot.  Various extensions to Lucene exacerbate this kind of problem.

Perhaps more controversially, the notion of global field semantics would be even stronger if the set of fields is closed. This would allow, for
example, QueryParser to validate field names.  This has a number of
benefits, including for example avoiding false-negative "no results" due
to misspelling a field name.
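
As a rough illustration of that validation step, assuming a closed set of field names is available at parse time: ClosedFieldNames and require() are hypothetical names, not anything QueryParser offers today.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch only: a closed set of field names that a query
// parser could consult before building a query on a field.
final class ClosedFieldNames {
    private final Set<String> names;

    ClosedFieldNames(String... names) {
        this.names = Collections.unmodifiableSet(
            new LinkedHashSet<>(Arrays.asList(names)));
    }

    // Returns the field name unchanged if it is known; otherwise fails
    // loudly instead of letting the query silently match nothing.
    String require(String field) {
        if (!names.contains(field)) {
            throw new IllegalArgumentException(
                "unknown field '" + field + "'; known fields: " + names);
        }
        return field;
    }
}
```

A query on "titel:lucene" would then raise an error naming the known fields rather than returning an empty result set.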

Has this been considered before?

Robert Kirchgessner made some of the same arguments in a January thread. They were compelling then, and they're compelling now.

http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200601.mbox/% [EMAIL PROTECTED]

In June, Dave Balmain and I discussed the issue extensively on the Ferret list. It might have been nice to use the Lucy list, since a lot of the discussion was about Lucy, but the Lucy lists didn't exist at the time.

http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html

Thoughts on the document storage that occurred to me after that discussion: maybe the fdx file should specify two numbers: a file pointer, and an integer which indicates the class of object stored at that position in the fdt file. The registry which maps integers to classes could be stored in some centralized file. Perhaps one of these classes -- a LazyDoc -- could specify that only a few integer file pointers should be read right away, deferring reading of field data until later.
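
A sketch of what one such fdx entry might look like, purely as an assumption about layout: an 8-byte file pointer followed by a 4-byte class id, fixed width so that the entry for a given document number can be found by multiplication. FdxEntry and its method names are hypothetical, and this is not the current file format.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch only: a fixed-width fdx entry holding a pointer
// into the fdt file plus an integer naming the class stored there.
final class FdxEntry {
    // Assumed layout: 8-byte pointer followed by 4-byte class id.
    static final int BYTES = Long.BYTES + Integer.BYTES;

    final long fdtPointer; // where the document's data starts in the fdt file
    final int classId;     // key into the (separately stored) class registry

    FdxEntry(long fdtPointer, int classId) {
        this.fdtPointer = fdtPointer;
        this.classId = classId;
    }

    // Appends this entry at the buffer's current position.
    void writeTo(ByteBuffer out) {
        out.putLong(fdtPointer);
        out.putInt(classId);
    }

    // Reads the entry for a document number using absolute offsets,
    // so the buffer's position is left untouched.
    static FdxEntry forDoc(ByteBuffer fdx, int docNum) {
        int offset = docNum * BYTES;
        return new FdxEntry(fdx.getLong(offset), fdx.getInt(offset + Long.BYTES));
    }
}
```

A reader would look up the entry by document number, resolve classId through the registry, and hand the pointer to whatever class that turns out to be; a LazyDoc could stop there and read field data only on demand.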

Are there good reasons this path has not been followed?

Hoss, that's your cue.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

