On 6/7/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
>
> On Jun 6, 2006, at 11:37 AM, Jan Prill wrote:
>
> > this statement tempted me to jump in, even though I'm not using
> > something like dynamic field creation myself __right now__. But I
> > have been badly in need of dynamic fields, especially on CMS-like
> > projects.
> >
> > That something isn't common in SQL doesn't mean that there is no
> > need for this "something". This limitation of SQL is the reason
> > for doing things like storing XML in relational DBs, as well as
> > the reason people use object DBs. I don't know if you have looked
> > at Dabble DB, but imagine something like it built on a relational
> > DBMS. Not fun! That's why they never even considered using SQL
> > for Dabble DB. So maybe it's just me, but the argument "you can't
> > do this in SQL either" doesn't sound too convincing...
>
> Jan, I don't understand the requirement, and I'm not familiar with
> either Dabble DB or Rails, so neither that example nor the "models"
> example Dave cited earlier has spoken to me.  I asked the question
> because I honestly wanted to see a concrete example of an
> application that couldn't be handled within the constraint of
> pre-defined fields.
>
> Behind the scenes in Lucene is an elaborate, expensive apparatus for
> dealing with dynamic fields.  Each document gets turned into its own
> miniature inverted index, complete with its own FieldInfos,
> FieldsWriter, DocumentWriter, TermInfosWriter, and so on.  When these
> mini-indexes get merged, field definitions have to be reconciled.
> This merge stage is one of the bottlenecks which slow down
> interpreted-language ports of Lucene so severely, because there's a
> lot of object creation and destruction and a lot of method calls.

The way I'm dealing with this now is by keeping all the field
definitions in a single file. When a field is defined, it gets
assigned a field number which stays fixed for the life of the index.
Hence, dynamic fields without the expense.
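
To make that concrete, here is a minimal Ruby sketch of the idea. The
FieldRegistry name, the file format, and the method names are mine,
for illustration only; this is not Ferret's actual implementation:

    # Hypothetical field registry: names map to numbers assigned on
    # first use, and an assignment never changes for the life of the
    # index, so there is no per-document reconciliation to do.
    class FieldRegistry
      def initialize(path)
        @path = path
        # Load any existing name => number assignments from disk.
        @fields = File.exist?(path) ?
            File.open(path, 'rb') { |f| Marshal.load(f) } : {}
      end

      # Return the field's number, assigning the next free one if
      # the field has never been seen before.
      def number_for(name)
        unless @fields.key?(name)
          @fields[name] = @fields.size
          save
        end
        @fields[name]
      end

      private

      def save
        File.open(@path, 'wb') { |f| Marshal.dump(@fields, f) }
      end
    end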

> KinoSearch uses a fixed-field-definition model.  Before you add any
> documents to an index, you have to tell the index writer about all
> the possible fields you might use.  When you add the first document,
> it creates the FieldInfos, FieldsWriter, etc., which persist
> throughout the life of the index writer.  Instead of reconciling
> field definitions each time a document gets added, the field defs
> are treated as invariant for that indexing session.  This is much
> faster, because there is far less object creation and destruction,
> and far less disk shuffling as well -- no segment merging, therefore
> no movement of stored fields, term vectors, etc.
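
In Ruby terms, a fixed-field-def writer might look something like
this. The class and method names are hypothetical (KinoSearch's real
interface is Perl and differs); the point is only that every field is
declared before the first document, and an undeclared field is an
error:

    # Hypothetical fixed-field-definition writer: all fields are
    # declared up front; adding a document with an unknown field
    # raises instead of silently extending the schema.
    class FixedFieldWriter
      def initialize(*field_names)
        @fields = field_names
      end

      def add_doc(doc)
        doc.each_key do |name|
          unless @fields.include?(name)
            raise ArgumentError, "unknown field: #{name}"
          end
        end
        # ... analyze the document and write its postings here ...
      end
    end

    writer = FixedFieldWriter.new(:title, :body)
    writer.add_doc(:title => "ok", :body => "fine")
    writer.add_doc(:author => "boom")   # raises ArgumentError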

What happens when there are deletes? Which files should I look in to
see how this works? I really need to get my head around the KinoSearch
merge model.

> There are several possible ways to add dynamic fields back into the
> fixed-field-def model.  My main priority in doing so, if it proves
> to be necessary, is to keep table-alteration logic separate from
> insertion operations.  Having the two conflated introduces needless
> complexity and computational expense at the back end.  It's also
> just plain confusing -- if you accidentally forget to set OMIT_NORMS
> just once, all of a sudden that field is going to have norms for
> ever and ever, amen.  I think the user ought to have absolute
> control over field definitions.  Inserting a field with a
> conflicting definition ought to be an error.

I mostly agree, but I don't think it is too expensive
(computationally or in terms of complexity) to dynamically add
unknown fields with default properties.
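
A minimal sketch of what that middle ground could look like; the
class name and property names here are invented for illustration:

    # Hypothetical: unknown fields are auto-registered with default
    # properties, but a conflicting redefinition raises, so the user
    # keeps absolute control over field definitions.
    class FieldDefs
      DEFAULTS = { :stored => true, :indexed => true,
                   :omit_norms => false }

      def initialize
        @defs = {}
      end

      # Called the first time a field shows up in a document, or
      # explicitly by the user with non-default properties.
      def define(name, props = {})
        props = DEFAULTS.merge(props)
        if @defs.key?(name) && @defs[name] != props
          raise ArgumentError,
                "conflicting definition for field #{name}"
        end
        @defs[name] = props
      end
    end

    defs = FieldDefs.new
    defs.define(:title)                       # added with defaults
    defs.define(:title, :omit_norms => true)  # raises ArgumentError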

> Lucy is going to start with the KinoSearch merge model.  I will do a
> better job of adding dynamic capabilities to it if you or someone
> else can articulate some specific examples of situations where static
> definitions would not suffice.  I can think of a few tasks which
> would be slightly more convenient if new fields could be added on the
> fly, but maybe you can go one better and illustrate why dynamic field
> defs are essential.

Hopefully Lee will be able to describe his needs in a little more
detail. I must admit that in most cases dynamic fields just make
things a little easier, and you could do without them. Having said
that, I don't think Ferret would be a very Ruby-like search library
if it didn't allow dynamic fields. Ruby allows me to add methods not
only to the core classes but also to already-instantiated objects.
Coming from a language that doesn't allow you to do things like
this, you'd probably think this feature is totally unnecessary.
Earlier I said I'd be using Hashes as documents. Here is an example
of how I could add lazy loading to documents in Ferret:

    def get_doc(doc_num)
        doc = {}
        # Reopen the singleton class of this one Hash, so only this
        # object gains the lazy-loading behaviour.
        class << doc
            attr_accessor :ferret_index, :ferret_doc_num
            def [](key)
                # Plain Hash lookup first; on a miss, fetch the field
                # from the index and cache it in the Hash. (Fields
                # whose value is nil or false get re-fetched.)
                if (val = super)
                    return val
                else
                    return self[key] =
                        @ferret_index.get_doc_field(@ferret_doc_num, key)
                end
            end
        end
        doc.ferret_index = self
        doc.ferret_doc_num = doc_num
        return doc
    end
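
For example, assuming index is an object that provides get_doc above
along with get_doc_field:

    doc = index.get_doc(42)
    doc[:title]   # first access fetches the field from the index
    doc[:title]   # second access comes straight out of the Hash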

This example may be difficult to understand coming from Perl.
Basically, get_doc returns an empty Hash object. Then, whenever you
reference a field in that Hash object, for example doc[:title], it
lazily loads that field from the index. All other Hash objects are
unaffected. Perhaps you can do this sort of thing in Perl also, but
I suspect it's a lot more common in Ruby. A language like this
definitely deserves a search library with dynamic fields. Not
necessarily because they solve an otherwise impossible problem, but
because they make other problems much easier to solve.
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
