On Jun 6, 2006, at 4:33 PM, Neville Burnell wrote:

>>> I asked the question because I honestly wanted to see a concrete
>>> example of an application that couldn't be handled within the
>>> constraint of pre-defined fields.
>
> My current application involves writing a web application which can
> search a Ferret index built from a SQL database.
>
> The idea is that the customer supplies SQLs for, say, customers,
> suppliers, sales, purchases, etc. The app then retrieves the rows
> from the datasource and indexes them using Ferret. The app provides
> both an HTML website as an interface to the index, and an XML API
> which can be used by non-browser clients.
>
> The field set is quite different for each SQL [and is essentially  
> out of
> our control].

So at what point does your app learn the structure of the SQL table?   
Would it work if you were to start each session by telling the index  
writer about the fields that were coming?

   def connect(field_names)
     field_names.each do |field_name|
       index.spec_field(field_name)   # use default properties
     end
   end

   def add_to_index(submission)
     index.add_hash_as_doc(submission)
   end
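
For example, the field names could be derived from the customer's SQL result set at connect time. A minimal sketch (the helper name and the hash-shaped rows are my assumptions, not anything in Ferret):

```ruby
require 'set'

# Hypothetical helper: collect the union of column names across the
# rows returned by the customer's SQL, so connect() can declare every
# field up front.  Rows are assumed to be hashes of column => value.
def field_names_from_rows(rows)
  rows.reduce(Set.new) { |names, row| names | row.keys }.to_a
end
```

You'd then call something like connect(field_names_from_rows(rows)) once per session, before adding any documents.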

I can imagine a scenario where that's not possible, and the fields
may change upon each insert.  In that case, under the interface I
envision, you'd have to do something like...

   def add_to_index(submission)
     submission.each do |field_name, value|
       index.spec_field(field_name)   # use default properties
     end
     index.add_hash_as_doc(submission)
   end

FWIW, this stuff is happening anyway, behind the scenes.   
Essentially, every time you add a field to an index, Ferret asks,  
"Say, is this field indexed?  And how about TermVectors, you want  
those?"  The 10_000th time you add the field, Ferret asks, "This  
field wasn't indexed before -- have you changed your mind? OK, I'll  
check back again later."... 1_000_000th doc: "You sure?  How about I  
make it indexed?  Awwwww, c'mon... Hey, could you use some TermVectors?"

When it makes sense, of course you want to simplify the interface and  
hide the complexity inside the library.  However, given that it's not  
possible to make coherent updates to existing data within a
Lucene-esque file format, my argument is that field definitions should never
change.  So the repeated calls to spec_field above would be  
completely redundant -- you'd get an error if you ever tried to  
change the field def.
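
In other words, under this scheme spec_field would behave roughly like the following toy sketch (the class and property names are mine, purely illustrative):

```ruby
# Toy sketch of immutable field definitions: the first call for a name
# fixes its definition, identical repeat calls are harmless no-ops,
# and a conflicting call is an error.  Not the real Ferret API.
class FieldRegistry
  def initialize
    @specs = {}
  end

  def spec_field(name, props = { :indexed => true, :term_vectors => false })
    if @specs.key?(name)
      if @specs[name] != props
        raise ArgumentError, "field definition for #{name} cannot change"
      end
    else
      @specs[name] = props
    end
    props
  end
end
```

The redundant repeat calls cost a hash lookup and a comparison each, which is where the "insignificant to tiny" performance guess above comes from.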

Your app would be a little less elegant, it's true (performance  
impact would be somewhere between insignificant and tiny unless you  
had a zillion very short fields).  However, I think the use case  
where the fields are not known in advance is the exception rather  
than the rule.

It would also be possible to use Dave's polymorphic hash-as-doc  
technique, where if the hash value is a Field object, you spec out  
the field definition using that Field object's properties -- you  
would just use full-on Field objects for each field.  My argument  
would be, again, that the field definitions should not change.  If  
you don't agree with that and the definition has to be modifiable  
(within the current constraints), then that single-method technique  
is probably better.  However, if the definition is not modifiable,  
then I'd argue it's cleaner to separate the two functions.
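
As a rough sketch of what I mean by the polymorphic technique (Field here is a stand-in struct, not Ferret's actual Field class, and the defaults are assumptions):

```ruby
# Stand-in for a Field object carrying its own index properties.
Field = Struct.new(:value, :indexed, :term_vectors)

# If a hash value is a Field object, spec the field definition from
# its properties; otherwise fall back to defaults.  Illustrative only.
def field_props_for(value)
  if value.is_a?(Field)
    { :indexed => value.indexed, :term_vectors => value.term_vectors }
  else
    { :indexed => true, :term_vectors => false }
  end
end
```

A single add_hash_as_doc call could then accept a hash mixing plain strings and full-on Field objects, dispatching on the value's type.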

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
