On 7/10/06, Chuck Williams <[EMAIL PROTECTED]> wrote:
> Marvin Humphrey wrote on 07/08/2006 11:13 PM:
> >
> > On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote:
> >
> >> Many things would be cleaner in Lucene if fields had a global semantics,
> >> i.e., if properties like text vs. binary, Index, Store, TermVector, the
> >> appropriate Analyzer, the assignment of Directory in ParallelReader (or
> >> ParallelWriter), etc. were a function of just the field name and the
> >> index.
> >
> > In June, Dave Balmain and I discussed the issue extensively on the
> > Ferret list.  It might have been nice to use the Lucy list, since a
> > lot of the discussion was about Lucy, but the Lucy lists didn't exist
> > at the time.
> >
> > http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html
> >
> I think there are a number of problems with that proposal and hope it
> was not adopted.

Hi Chuck,

Actually, it was adopted and I'm quite happy with the solution. I'd be
very interested to hear what those problems are, besides the
example you've already given. Even if you never use Ferret, it can
only help me improve my software.

I'll start by covering your term-vector example. By adding fixed
index-wide field properties to Ferret I was able to obtain a huge
speed improvement during indexing. I believe Marvin has had
similar success using his own merge model and with fixed field
properties in KinoSearch. With the CPU time I gain in Ferret I could
easily re-analyze large fields and build term vectors for them
separately. It's a little more work for less common use cases like
yours, but in the end everyone benefits in terms of performance.

> As my earlier example showed, there is at least one
> valid use case where storing a term vector is not an invariant property
> of a field; specifically, when using term vectors to optimize excerpt
> generation, it is best to store them only for fields that have long
> values.  This is even a counter-example to Karl's proposal, since a
> single Document may have multiple fields of the same name, some with
> long values and others with short values; multiple fields of the same
> name may legitimately have different TermVector settings even on a
> single Document.

I think you'll find if you look at the DocumentWriter#writePostings
method that it's "one in, all in" in terms of storing term vectors for
a field. That is, if you have 5 "content" fields and only one of those
is set to store term vectors, then all of the fields will store term
vectors.
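
To make the "one in, all in" behaviour concrete, here is a minimal
sketch in plain Java. It is not Lucene's actual code: SketchField and
TermVectorMergeSketch are made-up names, and the only assumption is
that the per-instance flags end up OR-ed together per field name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a field instance; not Lucene's Field class.
final class SketchField {
    final String name;
    final boolean storeTermVector;

    SketchField(String name, boolean storeTermVector) {
        this.name = name;
        this.storeTermVector = storeTermVector;
    }
}

final class TermVectorMergeSketch {
    // Collapse the per-instance flags into one flag per field name by
    // OR-ing them: a single "true" switches term vectors on for the name.
    static Map<String, Boolean> effectiveTermVectorFlags(List<SketchField> fields) {
        Map<String, Boolean> flags = new LinkedHashMap<>();
        for (SketchField f : fields) {
            flags.merge(f.name, f.storeTermVector, Boolean::logicalOr);
        }
        return flags;
    }

    public static void main(String[] args) {
        List<SketchField> doc = new ArrayList<>();
        doc.add(new SketchField("content", false)); // short value
        doc.add(new SketchField("content", true));  // long value
        doc.add(new SketchField("title", false));
        System.out.println(effectiveTermVectorFlags(doc));
    }
}
```

With inputs like these, the short "content" value gets term vectors
too, which is why a per-instance setting doesn't buy you much.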

> As another counter-example from my own app which I'd forgotten
> yesterday, an important case where the Analyzer will vary across
> documents is for i18n, where different languages require different
> analyzers.  Refuting again my own argument about this not being
> consistent with query parsing, the language of the query is a distinct
> property from the languages of various documents in the collection.  In
> my app, I let the user specify the language of the query, while the
> language of each Document is determined automatically.  So, analyzers
> vary for both queries and documents, but independently.

Ferret doesn't record any details about analysis in the field
properties. I definitely agree with you here.
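
A rough sketch of what keeping analysis out of the field properties
can look like: the analyzer is looked up by language at the moment
text is analyzed, whether that text is a document value or a query.
Everything here is hypothetical (LanguageAnalyzerSketch, the "en" key,
the whitespace fallback); an "analyzer" is just a function from text
to tokens, not a Lucene or Ferret Analyzer.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class LanguageAnalyzerSketch {
    // Language code -> analyzer; nothing is recorded per field name.
    private final Map<String, Function<String, List<String>>> analyzers = new HashMap<>();

    // Fallback for unregistered languages: split on whitespace only.
    private final Function<String, List<String>> fallback =
        text -> Arrays.asList(text.trim().split("\\s+"));

    void register(String language, Function<String, List<String>> analyzer) {
        analyzers.put(language, analyzer);
    }

    // Used for documents (language detected) and for queries (language
    // chosen by the user); the two lookups are independent.
    List<String> analyze(String language, String text) {
        return analyzers.getOrDefault(language, fallback).apply(text);
    }
}
```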

> I haven't thought of cases where Index or Store would legitimately vary
> across Fields or Documents, but am less convinced there aren't important
> use cases for these as well.  Similarly, although it is important to
> allow term vectors to be on or off at the field level, I don't see any
> obvious need to vary the type of term vector (positions, offsets or both).

I think Store could definitely legitimately vary across Fields or
Documents for the same reason your term vectors do. Perhaps you are
indexing pages from the web and you want to cache only the smaller
pages.

> There are significant benefits to global semantics, as evidenced by the
> fact that several of us independently came to desire this.  However,
> deciding what can be global and what cannot is more subtle.

I agree. I can't see global field semantics making it into Lucene in
the short term. It's a rather large change, particularly if you want
to make full use of the performance benefits it affords.

> Perhaps the best thing at the Lucene level is to have a notion of
> default semantics for a field name.  Whenever a Field of that name is
> constructed, those semantics would be used unless the constructor
> overrides them.  This would allow additional constructors on Field with
> simpler signatures for the common case of invariant Field properties.
> It would also allow applications to access the class that holds the
> default field information for an index.  The application will know which
> properties it can rely on as invariant and whether or not the set of
> fields is closed.
>
> This approach would preserve upward compatibility and provide, I
> believe, most of the benefits we all seek.
>
> Thoughts?

If this is all you are going to add, I don't think you'd need to
change Lucene. You could just implement a DocumentFactory in your own
application. Perhaps something like this could go in the contrib
section of Lucene.
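
Something along these lines is what I have in mind. It's only a
sketch in plain Java with no Lucene dependency: FieldSpec, SpecField
and DocumentFactorySketch are hypothetical names, and a real version
would build Lucene Field objects instead of SpecField.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical bundle of per-field options (loosely Store/Index/TermVector).
final class FieldSpec {
    final boolean stored;
    final boolean indexed;
    final boolean termVector;

    FieldSpec(boolean stored, boolean indexed, boolean termVector) {
        this.stored = stored;
        this.indexed = indexed;
        this.termVector = termVector;
    }
}

// Hypothetical field value paired with the options resolved for it.
final class SpecField {
    final String name;
    final String value;
    final FieldSpec spec;

    SpecField(String name, String value, FieldSpec spec) {
        this.name = name;
        this.value = value;
        this.spec = spec;
    }
}

final class DocumentFactorySketch {
    // Default semantics keyed by field name, held by the application.
    private final Map<String, FieldSpec> defaults = new HashMap<>();

    void setDefault(String fieldName, FieldSpec spec) {
        defaults.put(fieldName, spec);
    }

    // Common case: the field takes the defaults registered for its name.
    SpecField field(String name, String value) {
        FieldSpec spec = defaults.get(name);
        if (spec == null) {
            throw new IllegalArgumentException("No default semantics for field: " + name);
        }
        return new SpecField(name, value, spec);
    }

    // Exceptional case: the caller overrides the defaults for one instance.
    SpecField field(String name, String value, FieldSpec override) {
        return new SpecField(name, value, override);
    }
}
```

Rejecting unknown names in the two-argument method is one way to
treat the set of fields as closed; returning a fallback spec instead
would leave it open.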

Also, you mentioned earlier having a field validating query parser.
You can already use
IndexReader#getFieldNames(IndexReader.FieldOption.INDEXED) to get all
the indexed fields.
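
Once you have that collection of names, the validation itself is
small. A sketch, assuming the indexed field names have already been
copied into a Set<String>; FieldValidationSketch and unknownFields
are made-up names, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;

final class FieldValidationSketch {
    // Returns the field names used in a query that the index does not know,
    // in the order they were given. An empty result means the query is valid.
    static List<String> unknownFields(Collection<String> queryFields,
                                      Set<String> indexedFields) {
        List<String> unknown = new ArrayList<>();
        for (String field : queryFields) {
            if (!indexedFields.contains(field)) {
                unknown.add(field);
            }
        }
        return unknown;
    }
}
```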

Cheers,
Dave
