Re: Configurable indexing of an RDBMS, has it been done before?

Aad Nales Mon, 07 Feb 2005 02:29:54 -0800

Yep,

This is how we do it.

We have a search.xml that maps database fields to search fields and a parameter part that describes the 'click for detailed result url' and the parameter names (based on the search fields). In this xml we also describe how the different fields should be stored we have for instance a number of large text fields for we use the unstored option.

The framework that we have build around has an element that we call detailer. This detailer creates a lucene Document with the fields as specified in the search.xml

To illustrate here is the code that specifies the detailer for a forum.

-------------------------------------------------- XML --------------------- <documenttype id="FORUM" index="general" defaultfield="body"> <fields> <field property="messageid" searchfield="messageid" type="unindexed" key="true"/> <field property="instanceid" searchfield="instanceid" type="unindexed" /> <field property="subject" searchfield="title" type="split" maxwords="8" /> <field property="body" searchfield="default" type="split" maxwords="20" /> <field property="aka_username" searchfield="username" type="keyword" /> <field property="modifiedDateAsDate" searchfield="modifieddate" type="keyword" /> </fields> <action uri="/forum/viewMessage.do" image="/htmlarea/images/cops_insert_threadlink.gif"> <parameter property="messageid" name="messageid"/> <parameter property="instanceid" name="instanceid"/> </action> <analyzer classname="org.apache.lucene.analysis.standard.StandardAnalyzer"/> </documenttype> -------------------------------- END XML -------------------

Please note:

Messageid is the keyfield here when we search the index we use a combined TYPE + KEY id to filter out double hits on the same document (not unusual in for instance a long forum thread).

Per type of document we also specifify what picture to show in the result (image), and we specify in what index the result should be written and what the general search field is (if the user submits a query without search all and without a field specified).

We have added the 'split' key word which makes it possbile to search a long text but only store a bit in the resulting hit.

The reindex is pretty straightforward we build a series of detailers for all possible document types and we run through the database and call the right detailer from a HashMap.

We have not included the JDBC stuff since the application is always running in Tomcat-Struts and since we cache most of the database reads. (a completely differnt story).

Queries on new and changed records seem to only make sense if asked in a context of time. (Right?). We have not needed it yet. The mapping can be query from a singleton java class. (SearchConfiguration).

We are currently adding functionality to store 'user structured data' best imagined as user defined input forms that are described in XML and are then stored as XML in the database. We query these documents using Lucene. These documents end up in the same index but this is quite manageable by using specialized detailers. For these document the type is more important then for the 'normally' stored documents. For this latter situation the search logic assumes that the query is appropriately configured by the application.

I am not sure if this is the kind of solution that you are looking for, but everything we produce is 100% open source.

Cheers,
Aad


David Spencer wrote:

Many times I've written ad-hoc code that pulls in data from an RDBMS and builds a Lucene index. The use case is a typical database-driven dynamic website which would be a hassle to spider (say, due to tricky authentication).

I had a feeling this had been done in a general manner but didn't see any code in the sandbox, nor did any searches turn it up.

I've spent a few mins thinking this thru - what I'd expect is to be able to configure is:

1. JDBC Driver + conn params 2. Query to do a 1 time full index 3. Query to show new records 4. Query to show changed records 5. Query to show deleted records 6. Query columns to Lucene Field name mapping 7. "Type" of each field name (e.g. the equivalent of the args to the Field ctr)
So a simple example, taking item 2 is
query: "select url, name, body from foo"
(now the column to field mapping)
col 1 => url
col 2 => title
col 3 => contents
(now the field types for each named field)
url => Field( ...store=true, index=false)
title => Field( ...store=true, index=true)
contents => Field( ...store=false, index=true)
And voilla, nice, elegant, data driven indexing.
Does it exist?
Should it? :)
PS I know in the more general form, "query" needs to be replaced by "queries" above, and the updated query may need some time stamp variable expansion, and possibly the queries need "paging" to deal w/ lamo DBs like mysql that don't have cursors for large result sets...
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Configurable indexing of an RDBMS, has it been done before?

Reply via email to