MMBase Lucene module

Wouter Heijke Thu, 10 Jun 2004 07:42:40 -0700

Hi All,

After yesterday's presentation at the MMEvent I'd like to present the Lucene
full-text search module for MMBase to all of you.
What is it?
This is a real MMBase module, so to run it you have to install
'lucenemodule.xml' in the modules directory.
What it does is make the content of your cloud searchable.
This is done by indexing your cloud, but only those builders that you
specify in a config file. The fields of these builders that need to be
searched also have to be configured:


<?xml version="1.0" encoding="UTF-8"?>
<lucenemodule>
   <index name="MyNewsIndex">
      <table name="news">
         <field name="title" />
         <field name="subtitle" />
         <field name="intro">introduction</field>
         <field name="body" />
         <related name="attachments">
            <field name="title">rel.title</field>
            <field name="handle" type="binary">rel.body</field>
         </related>
      </table>
      <table name="mags">
         <field name="title" />
         <field name="body" />
      </table>
   </index>
</lucenemodule>

The example from my slides abuses the MyNews example to show how you could
configure the module.
Now when I search the 'MyNewsIndex' for a string in 'title' I'm searching
through both news and mags. You can also restrict the search to mags or
news only if you specifically want that.
So ideally all your searchable content should have the same kind of fields.
If this isn't the case, you can rename them to get uniform naming. In
the example I renamed the 'intro' field so that it is called 'introduction'
in the search index.
Each 'table' mentioned in the config file will result in a 'document' being
created by Lucene in its index. Each of these will have the corresponding
MMBase node number and (builder) name indexed automatically. When you search,
the results will be a list of node numbers.
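
To make the renaming rule concrete, here is a minimal sketch (not the module's own code) that reads a trimmed copy of the config above and lists which index field each builder field ends up under. It assumes, as the example suggests, that the text inside a 'field' element is the name used in the index and that an empty element keeps the builder's field name. The function name index_field_names is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example configuration from this mail.
CONFIG = """<lucenemodule>
   <index name="MyNewsIndex">
      <table name="news">
         <field name="title" />
         <field name="intro">introduction</field>
      </table>
      <table name="mags">
         <field name="title" />
      </table>
   </index>
</lucenemodule>"""


def index_field_names(config_xml):
    """Return {index name: {builder name: {builder field: index field}}}."""
    root = ET.fromstring(config_xml)
    result = {}
    for index in root.findall("index"):
        tables = {}
        for table in index.findall("table"):
            fields = {}
            # findall("field") only sees direct children, so fields nested
            # inside a 'related' element are not picked up here.
            for field in table.findall("field"):
                # Assumption: element text, when present, is the index name;
                # otherwise the builder field name is kept as-is.
                alias = (field.text or "").strip()
                fields[field.get("name")] = alias or field.get("name")
            tables[table.get("name")] = fields
        result[index.get("name")] = tables
    return result


mapping = index_field_names(CONFIG)
```

With this config, 'title' keeps its name in both news and mags, which is what makes a single search on 'title' cover both builders.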

Relations can be indexed too, like attachments in the example; the related
object can be any kind of builder. If you specify type 'binary' on a field,
that field will be treated as a binary file and all text will be extracted
from it and indexed. For now PDF and Word are supported. Related content is
indexed in the Lucene document of the parent of the relation, so you won't
get the node number of the related MMBase object in your results.
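
The effect of folding related content into the parent can be sketched with a toy in-memory index. This is plain Python, not Lucene and not the module's code; the node numbers, the sample texts and the helper names build_document and search are all invented for the illustration, and the attachment text is assumed to have been extracted from the binary already.

```python
def build_document(number, builder, fields, related=()):
    """Build one toy index document for a parent node.

    Text of related nodes is stored in the parent's document, so a hit on
    related content can only ever yield the parent's node number.
    """
    doc = {"number": number, "builder": builder}
    doc.update(fields)
    for rel_fields in related:
        for name, text in rel_fields.items():
            # Append the text of every related node to the same field.
            doc[name] = (doc.get(name, "") + " " + text).strip()
    return doc


def search(docs, field, term, builder=None):
    """Return node numbers of documents whose field contains the word."""
    term = term.lower()
    return [
        d["number"]
        for d in docs
        if (builder is None or d["builder"] == builder)
        and term in d.get(field, "").lower().split()
    ]


# Hypothetical content: one news node with an attachment, one mags node.
docs = [
    build_document(
        101, "news", {"title": "Lucene arrives"},
        related=[{"rel.title": "Slides", "rel.body": "full-text search slides"}],
    ),
    build_document(202, "mags", {"title": "Lucene monthly"}),
]

both = search(docs, "title", "lucene")                    # news and mags
mags_only = search(docs, "title", "lucene", builder="mags")
via_attachment = search(docs, "rel.body", "slides")       # parent number only
```

A search on 'rel.body' returns the news node (101), never a number for the attachment itself, since the attachment has no document of its own in this sketch.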

Lucene creates its own database on the file system. This database is
rebuilt each time the module runs, and when it runs is configurable in the
lucenemodule.xml file. This database or 'index' is named after the name
attribute of the index element in the configuration file. The index is only
used by Lucene for searching; the results of a search are only the node
numbers.

Right now the module is not available for download yet. It needs some work
(the usual: cleaning up, documentation, etc.), but since my presentation came
quite unexpectedly and there seemed to be some demand yesterday, I'm trying to
see how big the demand is for making this available.

Wouter
