Re: [Ferret-talk] Proposal of some radical changes to API

Marvin Humphrey Tue, 06 Jun 2006 14:08:25 -0700

On Jun 6, 2006, at 11:37 AM, Jan Prill wrote:

> this statement tempted me to jump in, even without using something  
> like dynamic field creation myself __right now__. But I have been -  
> especially on cms like projects badly in need for dynamic fields.
>
> That something isn't common in sql doesn't mean that there is no  
> need for this "something". This limitation of sql is the reason for  
> doing things like storing xml in relational dbs as well as the  
> reason for people using object dbs. I don't know if you had a look  
> at dabble db, but imagine something like this with a relational  
> dbms. not funny! Because of this they haven't even thought about  
> using sql for dabble db. So maybe it's just me but the argument:  
> you can't do this in sql either doesn't sound too convincing...


Jan, I don't understand the requirement, and I'm not familiar with
either dabble db or Rails, so neither that example nor the
"models" example Dave cited earlier has spoken to me.  I asked the
question because I honestly wanted to see a concrete example of an
application that couldn't be handled within the constraint of
pre-defined fields.

Behind the scenes in Lucene is an elaborate, expensive apparatus for  
dealing with dynamic fields.  Each document gets turned into its own  
miniature inverted index, complete with its own FieldInfos,  
FieldsWriter, DocumentWriter, TermInfosWriter, and so on.  When these  
mini-indexes get merged, field definitions have to be reconciled.   
This merge stage is one of the bottlenecks which slow down  
interpreted-language ports of Lucene so severely, because there's a  
lot of object creation and destruction and a lot of method calls.

KinoSearch uses a fixed-field-definition model.  Before you add any  
documents to an index, you have to tell the index writer about all  
the possible fields you might use.  When you add the first document,
it creates the FieldInfos, FieldsWriter, etc., which persist
throughout the life of the index writer.  Instead of being reconciled
each time a document gets added, the field defs are treated as
invariant for that indexing session.  This is much faster,
because there is far less object creation and destruction, and far
less disk shuffling as well -- no segment merging, therefore no
movement of stored fields, term vectors, etc.
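
To make the fixed-field model concrete, here is a minimal Python sketch.
It is not KinoSearch's actual API: the class FixedFieldWriter and the
methods spec_field and add_doc are hypothetical names chosen for
illustration, and the "index" is just a list in memory.  It only shows
the idea that fields are declared up front and frozen once the first
document arrives.

```python
class FixedFieldWriter:
    """Toy index writer whose field set is frozen at the first add_doc()."""

    def __init__(self):
        self.field_defs = {}   # field name -> dict of options (hypothetical)
        self.frozen = False    # becomes True once indexing starts
        self.docs = []         # stand-in for the real on-disk structures

    def spec_field(self, name, **options):
        # Field definitions may only be declared before any document is added.
        if self.frozen:
            raise RuntimeError("field definitions are fixed once indexing starts")
        self.field_defs[name] = options

    def add_doc(self, doc):
        # The first document freezes the definitions for this session.
        self.frozen = True
        unknown = set(doc) - set(self.field_defs)
        if unknown:
            raise KeyError(f"undeclared fields: {sorted(unknown)}")
        self.docs.append(dict(doc))


writer = FixedFieldWriter()
writer.spec_field("title", stored=True)
writer.spec_field("body", stored=False)
writer.add_doc({"title": "Hello", "body": "First post"})
```

Because nothing about the field set can change after the first add_doc
call, there is nothing to reconcile per document.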

There are several possible ways to add dynamic fields back in to the  
fixed-field-def model.  My main priority in doing so, if it proves to  
be necessary, is to keep table-alteration logic separate from  
insertion operations.  Having the two conflated introduces needless  
complexity and computational expense at the back end.  It's also just  
plain confusing -- if you accidentally forget to set OMIT_NORMS just  
once, all of a sudden that field is going to have norms for ever and  
ever amen.  I think the user ought to have absolute control over  
field definitions.  Inserting a field with a conflicting definition  
ought to be an error.
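
A rough Python sketch of that separation, under stated assumptions: the
names Index, define_field, insert, and FieldConflictError are all
hypothetical, and omit_norms stands in for whatever per-field options a
real library would carry.  Definition is one explicit operation;
insertion only checks against it and never alters it.

```python
class FieldConflictError(ValueError):
    """Raised when a field disagrees with its declared definition."""


class Index:
    def __init__(self):
        self.schema = {}   # field name -> omit_norms flag
        self.docs = []

    def define_field(self, name, omit_norms):
        # Schema alteration is its own explicit operation.
        if name in self.schema and self.schema[name] != omit_norms:
            raise FieldConflictError(f"{name!r} is already defined differently")
        self.schema[name] = omit_norms

    def insert(self, doc):
        # doc maps field name -> (value, omit_norms).
        # Insertion validates every field first and never changes the schema.
        for name, (_value, omit_norms) in doc.items():
            if name not in self.schema:
                raise FieldConflictError(f"{name!r} has not been defined")
            if self.schema[name] != omit_norms:
                raise FieldConflictError(f"{name!r} conflicts with its definition")
        self.docs.append({name: value for name, (value, _flag) in doc.items()})


index = Index()
index.define_field("category", omit_norms=True)
index.insert({"category": ("news", True)})        # matches the definition
try:
    index.insert({"category": ("sports", False)})  # forgot OMIT_NORMS
except FieldConflictError as err:
    print("rejected:", err)
```

In this sketch, forgetting the flag once produces an immediate error
instead of silently changing the field for the rest of the index's life.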

Lucy is going to start with the KinoSearch merge model.  I will do a  
better job of adding dynamic capabilities to it if you or someone  
else can articulate some specific examples of situations where static  
definitions would not suffice.  I can think of a few tasks which  
would be slightly more convenient if new fields could be added on the  
fly, but maybe you can go one better and illustrate why dynamic field  
defs are essential.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
