Re: Global field semantics

David Balmain Sun, 09 Jul 2006 21:58:28 -0700

On 7/10/06, Chuck Williams <[EMAIL PROTECTED]> wrote:

David Balmain wrote on 07/09/2006 06:44 PM:
> On 7/10/06, Chuck Williams <[EMAIL PROTECTED]> wrote:
>> Marvin Humphrey wrote on 07/08/2006 11:13 PM:
>> >
>> > On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote:
>> >
>> >> Many things would be cleaner in Lucene if fields had a global
>> semantics,
>> >> i.e., if properties like text vs. binary, Index, Store,
>> TermVector, the
>> >> appropriate Analyzer, the assignment of Directory in
>> ParallelReader (or
>> >> ParallelWriter), etc. were a function of just the field name and the
>> >> index.
>> >
>> > In June, Dave Balmain and I discussed the issue extensively on the
>> > Ferret list.  It might have been nice to use the Lucy list, since a
>> > lot of the discussion was about Lucy, but the Lucy lists didn't exist
>> > at the time.
>> >
>> > http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html
>> >
>> I think there are a number of problems with that proposal and hope it
>> was not adopted.
>
> Hi Chuck,
>
> Actually, it was adopted and I'm quite happy with the solution. I'd be
> very interested to hear what the number of problems are, besides the
> example you've already given. Even if you never use Ferret, it can
> only help me improve my software.


Hi David,

Thanks for your reply.

I'm not aware of other problems beyond the ones I've already cited.
After thinking of these, my confidence that there were not others waned.

>
> I'll start by covering your term-vector example. By adding fixed
> index-wide field properties to Ferret I was able to obtain a huge
> speed improvement during indexing.

This is very interesting.  Can you say how much?


About a factor of 5. I won't compare it to Lucene's speed, though,
as I know that's asking for trouble. You'll be able to try it yourself
in a week or so when I finally release it.

> With the CPU time I gain in Ferret I could
> easily re-analyze large fields and build term vectors for them
> separately. It's a little more work for less common use cases like
> yours but in the end, everyone benefits in terms of performance.

Does Ferret work this way, or would that be up to the application?


Currently that would be up to the application.

>> As my earlier example showed, there is at least one
>> valid use case where storing a term vector is not an invariant property
>> of a field; specifically, when using term vectors to optimize excerpt
>> generation, it is best to store them only for fields that have long
>> values.  This is even a counter-example to Karl's proposal, since a
>> single Document may have multiple fields of the same name, some with
>> long values and others with short values; multiple fields of the same
>> name may legitimately have different TermVector settings even on a
>> single Document.
>
> I think you'll find if you look at the DocumentWriter#writePostings
> method that it's "one in, all in" in terms of storing term vectors for
> a field. That is, if you have 5 "content" fields and only one of those
> is set to store term vectors, then all of the fields will store term
> vectors.

Right you are, and clearly necessarily so since the values of the
multiple fields are implicitly concatenated (with
positionIncrementGap).  So, Lucene already limits my term vector
optimization to the Document level.  As it happens, I only use it for
large body fields, of which each of my Documents has at most one.
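
The "one in, all in" behaviour described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: FieldInstance and TermVectorFlags are hypothetical names, and the code is not taken from Lucene's DocumentWriter. It simply ORs the term-vector flag across all field instances that share a name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for one field instance on a document. */
final class FieldInstance {
    final String name;
    final String value;
    final boolean storeTermVector;

    FieldInstance(String name, String value, boolean storeTermVector) {
        this.name = name;
        this.value = value;
        this.storeTermVector = storeTermVector;
    }
}

final class TermVectorFlags {
    /** Returns, per field name, whether ANY instance of that name asked for term vectors. */
    static Map<String, Boolean> resolve(List<FieldInstance> fields) {
        Map<String, Boolean> byName = new LinkedHashMap<>();
        for (FieldInstance f : fields) {
            // merge() ORs this flag into whatever earlier instances of the name requested.
            byName.merge(f.name, f.storeTermVector, Boolean::logicalOr);
        }
        return byName;
    }

    public static void main(String[] args) {
        List<FieldInstance> doc = new ArrayList<>();
        doc.add(new FieldInstance("content", "short value", false));
        doc.add(new FieldInstance("content", "a much longer value", true));
        doc.add(new FieldInstance("title", "hello", false));
        System.out.println(resolve(doc));
    }
}
```

Because the values of same-named fields are concatenated, a single instance that requests term vectors switches them on for the whole field in that document.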

>
>> I haven't thought of cases where Index or Store would legitimately vary
>> across Fields or Documents, but am less convinced there aren't important
>> use cases for these as well.  Similarly, although it is important to
>> allow term vectors to be on or off at the field level, I don't see any
>> obvious need to vary the type of term vector (positions, offsets or
>> both).
>
> I think Store could definitely legitimately vary across Fields or
> Documents for the same reason your term vectors do. Perhaps you are
> indexing pages from the web and you want to cache only the smaller
> pages.

That's an interesting example, but not as compelling an objection to me
(and seemingly not to you either!).  The app could always store an empty
string without much consequence in this scenario.

>
>> There are significant benefits to global semantics, as evidenced by the
>> fact that several of us independently came to desire this.  However,
>> deciding what can be global and what cannot is more subtle.
>
> I agree. I can't see global field semantics making it into Lucene in
> the short term. It's a rather large change, particularly if you want
> to make full use of the performance benefits it affords.

Could you summarize where these derive from?


I'm afraid I don't have time to go into detail. The main benefit comes
from having constant field numbers for each field. So when segments
merge I don't need to read in documents and term vectors and then
rewrite them to the new segment. I can just copy the data directly
from the old segment to the new segment. As far as TermInfos go, the
techniques I use in Ferret probably wouldn't translate well into Java.
But the merge model we'll be using for Lucy is Marvin Humphrey's
KinoSearch merge model, which you can read about here:

   http://wiki.apache.org/jakarta-lucene/KinoSearchMergeModel

I think this would work well in Lucene. His results with KinoSearch
are very impressive.
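
To illustrate why constant field numbers make merging cheaper, here is a minimal Java sketch. It is not Ferret's or KinoSearch's actual code: Segment, MergeSketch and the {fieldNumber, payload} record layout are assumptions made up for the example. With per-segment numbering, every record has to be decoded and renumbered; with one index-wide numbering, the records can be copied as they are.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical segment: a field-name table plus records of {fieldNumber, payload}. */
final class Segment {
    final Map<String, Integer> fieldNumbers;
    final List<int[]> records = new ArrayList<>();

    /** Pass a private map for per-segment numbering, or a shared one for index-wide numbering. */
    Segment(Map<String, Integer> fieldNumbers) {
        this.fieldNumbers = fieldNumbers;
    }

    /** Returns the number for a field name, assigning the next free number on first use. */
    int numberOf(String name) {
        Integer n = fieldNumbers.get(name);
        if (n == null) {
            n = fieldNumbers.size();
            fieldNumbers.put(name, n);
        }
        return n;
    }

    void add(String fieldName, int payload) {
        records.add(new int[] { numberOf(fieldName), payload });
    }
}

final class MergeSketch {
    /** Per-segment numbering: each record is decoded and renumbered for the target. */
    static void mergeWithRemap(Segment source, Segment target) {
        String[] names = new String[source.fieldNumbers.size()];
        source.fieldNumbers.forEach((name, number) -> names[number] = name);
        for (int[] record : source.records) {
            target.records.add(new int[] { target.numberOf(names[record[0]]), record[1] });
        }
    }

    /** Index-wide numbering: the numbers already agree, so records are copied untouched. */
    static void mergeByCopy(Segment source, Segment target) {
        target.records.addAll(source.records);
    }

    public static void main(String[] args) {
        Segment a = new Segment(new LinkedHashMap<>());
        a.add("title", 10);
        Segment b = new Segment(new LinkedHashMap<>());
        b.add("body", 20); // "body" is field 0 in b but must become field 1 in a
        mergeWithRemap(b, a);
        System.out.println("body is now field " + a.records.get(1)[0]);
    }
}
```

mergeByCopy is only correct when both segments were built against the same shared table, which is exactly the guarantee that fixed index-wide field properties provide.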

>
>> Perhaps the best thing at the Lucene level is to have a notion of
>> default semantics for a field name.  Whenever a Field of that name is
>> constructed, those semantics would be used unless the constructor
>> overrides them.  This would allow additional constructors on Field with
>> simpler signatures for the common case of invariant Field properties.
>> It would also allow applications to access the class that holds the
>> default field information for an index.  The application will know which
>> properties it can rely on as invariant and whether or not the set of
>> fields is closed.
>>
>> This approach would preserve upward compatibility and provide, I
>> believe, most of the benefits we all seek.
>>
>> Thoughts?
>
> If this is all you are going to add, I don't think you'd need to
> change Lucene. You could just implement a DocumentFactory in your own
> application. Perhaps something like this could go in the contrib
> section of Lucene.

I've already done it in my application (this weekend).  I think Lucene
would be better with a mechanism like this built-in as field semantics
are usually globally invariant.  I'm left wondering whether many of the
performance optimizations you've realized might be preserved in a model
that allowed selected exceptions, such as the term vector example.


Sure they would. As I already mentioned, most of my performance
benefits come from having constant field numbers for fields. I could
easily implement the model you've described in Ferret without a
performance hit, but I'm going to wait and see if "exceptional fields"
is a requested feature before I do.
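
For reference, the "default semantics with exceptions" model discussed above might look roughly like this at the application level. This is a sketch only: FieldSpec and FieldDefaults are hypothetical classes, not part of Lucene or Ferret, and the three boolean properties merely echo the Store, Index and TermVector options.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical bundle of per-field properties (loosely echoing Store, Index, TermVector). */
final class FieldSpec {
    final boolean stored;
    final boolean indexed;
    final boolean termVector;

    FieldSpec(boolean stored, boolean indexed, boolean termVector) {
        this.stored = stored;
        this.indexed = indexed;
        this.termVector = termVector;
    }

    /** Returns a copy with only the term-vector setting changed. */
    FieldSpec withTermVector(boolean value) {
        return new FieldSpec(stored, indexed, value);
    }
}

/** Hypothetical registry of default semantics per field name for one index. */
final class FieldDefaults {
    private final Map<String, FieldSpec> defaults = new HashMap<>();
    private final boolean closed; // true: only defined field names are allowed

    FieldDefaults(boolean closed) {
        this.closed = closed;
    }

    void define(String name, FieldSpec spec) {
        defaults.put(name, spec);
    }

    boolean isDefined(String name) {
        return defaults.containsKey(name);
    }

    /** Returns the override if one is given, otherwise the default for the name. */
    FieldSpec resolve(String name, FieldSpec override) {
        FieldSpec spec = defaults.get(name);
        if (spec == null && closed) {
            throw new IllegalArgumentException("Unknown field: " + name);
        }
        if (override != null) {
            return override;
        }
        if (spec == null) {
            throw new IllegalArgumentException("No default for field: " + name);
        }
        return spec;
    }
}
```

A document factory built on this would call resolve(name, null) in the common case and pass an override such as defaults.resolve("body", null).withTermVector(true) only for the exceptional long values.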

>
> Also, you mentioned earlier having a field validating query parser.
> You can already use
> IndexWriter#getFieldNames(IndexReader.FieldOption.INDEXED) to get all
> the indexed fields.

At least in Lucene, I believe you mean IndexReader.getFieldNames().


Whoops! Yes I did.

However, this is not the same thing.  In fact, I submitted bug fixes to
ParallelReader a while back (now committed) that were in part due to a
similar assumption.  The issue is that this method only finds fields
that have already been indexed.  The model may provide fields that no
document in a specific collection has yet used.  At least in my
application, this distinction is important.  I have a common model used
to build many indexes, with search and indexing performed
simultaneously.  At any point in time in any given collection, a field
available in the model may or may not have occurred.  Queries need to be
validated against the model, not against the specific collection.


Sounds to me like you just need to add a validFieldNames collection in
QueryParser. I'm sure you could easily determine which field names are
valid from your common model without having to have a global field
specification within Lucene itself.
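
A rough sketch of that idea in plain Java, kept outside QueryParser so that it makes no claims about Lucene's API: FieldNameValidator is a hypothetical class, and the regular expression is a crude assumption about query syntax (it would also match a "name:" prefix inside a quoted phrase). The point is only that the set of valid names comes from the application's model rather than from what has been indexed so far.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical validator: checks "field:" prefixes in a query against the model's field names. */
final class FieldNameValidator {
    // Crude assumption: a field reference is an identifier immediately followed by a colon.
    private static final Pattern FIELD_PREFIX = Pattern.compile("([A-Za-z_][A-Za-z0-9_]*):");

    private final Set<String> validFieldNames;

    FieldNameValidator(Set<String> validFieldNames) {
        this.validFieldNames = new LinkedHashSet<>(validFieldNames);
    }

    /** Returns the field names used in the query that the model does not define. */
    Set<String> unknownFields(String query) {
        Set<String> unknown = new LinkedHashSet<>();
        Matcher m = FIELD_PREFIX.matcher(query);
        while (m.find()) {
            String name = m.group(1);
            if (!validFieldNames.contains(name)) {
                unknown.add(name);
            }
        }
        return unknown;
    }

    public static void main(String[] args) {
        Set<String> model = new LinkedHashSet<>(Arrays.asList("title", "body"));
        FieldNameValidator validator = new FieldNameValidator(model);
        System.out.println(validator.unknownFields("title:lucene AND author:chuck"));
    }
}
```

A field that the model defines but no document has used yet still validates, which is the distinction raised above.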

Don't get me wrong. I really like the global field spec with
exceptions idea and I personally think it would be an improvement on the
current Lucene model. That's why I've done something similar in
Ferret. But Ferret is still in an alpha stage so I can afford to break
backwards compatibility a little. I just think that it's a lot of work
for too little benefit and it's going to be difficult to stay
backwards compatible.

Cheers,
Dave
