Adding XML searching with Lucene

Stefano Mazzocchi Wed, 05 Dec 2001 11:21:50 -0800

I've integrated Bernhard's excellent code into my local copy of Cocoon
to see how it works and, unfortunately, it doesn't :(


Well, it *should* work, since the crawler works and the indexing phase
is being performed (the work/index directory is created), but at the end
of the indexing only one file gets written inside the work/index
directory: it is called "segments" and contains 64 bits set to zero.

It seems that Lucene is not receiving any input to index, but Cocoon
does receive the requests and does emit the responses. Very strange.

Anyway, here are a few comments on Berni's code:

 1) it uses the package "org.apache.cocoon.components.optional.lucene".
I would suggest something like "org.apache.cocoon.components.search" or
anything else that is not directly bound to Lucene. We might never get
multiple implementations of that engine, I know that, but it's good to
keep the behavioral abstraction that Avalon components suggest.

 2) it defines 4 different new components:

    - CocoonCrawler -> performs crawling on a cocoon-hosted site
    - LuceneCocoonIndexer -> performs indexing of a collection of
documents
    - LuceneXMLIndexer -> performs indexing of a single document
    - LuceneCocoonSearcher -> performs searching on a given index

I like your design but I'd love to have better and more abstract names
and implementations for this:

 a) crawling should be a separate component and should provide two
different implementations: internal (directly calling the engine) and
external (using regular http:// requests). The internal crawling will be
used by the CLI and the local indexer, while the external could be
performed on other Cocoon sites (and might be useful to provide a
centralized indexing of a distributed Cocoon federation).

I propose to place this into "org.apache.cocoon.components.crawler",
with Crawler as the behavioral interface and ExternalCrawler and
InternalCrawler as its implementations.
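
A rough Java sketch of that split, just to make the shape concrete. Only
the names Crawler, ExternalCrawler and InternalCrawler come from the
proposal above; AbstractCrawler, linksOf(), the Map standing in for the
engine and the injected fetcher function are assumptions invented for
illustration and are not existing Cocoon code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Behavioral interface: callers only ask for the URIs reachable from a start URI.
interface Crawler {
    List<String> crawl(String startUri);
}

// Hypothetical shared base: breadth-first traversal, subclasses supply the links.
abstract class AbstractCrawler implements Crawler {
    // Assumed hook: return the URIs linked from the given URI.
    protected abstract List<String> linksOf(String uri);

    @Override
    public List<String> crawl(String startUri) {
        Set<String> seen = new LinkedHashSet<>();   // keeps discovery order
        Deque<String> queue = new ArrayDeque<>();
        seen.add(startUri);
        queue.add(startUri);
        while (!queue.isEmpty()) {
            for (String link : linksOf(queue.remove())) {
                if (seen.add(link)) {               // true only for unseen URIs
                    queue.add(link);
                }
            }
        }
        return new ArrayList<>(seen);
    }
}

// Internal flavour: asks the engine directly. A Map stands in for the engine here.
class InternalCrawler extends AbstractCrawler {
    private final Map<String, List<String>> site;

    InternalCrawler(Map<String, List<String>> site) {
        this.site = site;
    }

    @Override
    protected List<String> linksOf(String uri) {
        return site.getOrDefault(uri, Collections.emptyList());
    }
}

// External flavour: fetches pages (e.g. over http://) and scans them for links.
// The fetcher is injected so the sketch runs without a network.
class ExternalCrawler extends AbstractCrawler {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");
    private final Function<String, String> fetcher;

    ExternalCrawler(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    @Override
    protected List<String> linksOf(String uri) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(fetcher.apply(uri));
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```

The CLI and the local indexer would only ever see a Crawler, so swapping
the internal flavour for the external one would not touch their code.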

 b) the "search" package should then contain the components that perform
both Indexing and Searching. The interfaces should not contain
Lucene-specific code even if, admittedly, this would be hard. If this is
not possible, the package should be called "lucene" and be
Lucene-specific.
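
To illustrate what "no Lucene-specific code in the interfaces" could
look like, here is a minimal Java sketch. The interface names Indexer
and Searcher, their method signatures and the InMemorySearchEngine class
are all hypothetical; the toy in-memory class only stands in for a
Lucene-backed implementation and does plain substring matching.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical engine-neutral contract: a document is just field name -> text.
interface Indexer {
    void index(String uri, Map<String, String> fields);
}

// Hypothetical engine-neutral contract: return URIs whose field contains the term.
interface Searcher {
    List<String> search(String field, String term);
}

// Toy stand-in for a real engine; nothing here depends on Lucene classes.
class InMemorySearchEngine implements Indexer, Searcher {
    private final Map<String, Map<String, String>> docs = new LinkedHashMap<>();

    @Override
    public void index(String uri, Map<String, String> fields) {
        docs.put(uri, new HashMap<>(fields));   // defensive copy
    }

    @Override
    public List<String> search(String field, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> doc : docs.entrySet()) {
            String value = doc.getValue().get(field);
            if (value != null && value.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(doc.getKey());
            }
        }
        return hits;
    }
}
```

The hard part mentioned above is still there: query syntax and scoring
are engine-specific, and this sketch simply sidesteps them.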

 c) the XML-2-Lucene indexer is a critical piece of this architecture:
in short, Lucene is a text-based indexing engine and is not structured.
The XML-2-Lucene indexer performs mapping between the tree-shaped XML
document and the map-shaped Lucene document (composed of name:value
pairs like hashtables).

I've taken a pretty serious look at Lucene's internals and it's a very
general engine, since it allows you to add any name:value pairs to your
documents and indicate whether or not they should be indexed. This is
useful for specifying keywords or other metadata, so that you can later
restrict your query to a specific field.

Bernhard created an XML2Lucene mapping by submitting every element and
attribute as name:value pairs of Lucene docs, plus collecting all the
text inside the document and submitting that in the 'body' field (which
is the default field for Lucene queries).

So, this allows you to search for any text inside the document, as well
as searching for a specific text inside an element or attribute.
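
The kind of mapping described above can be sketched in Java with the
standard SAX parser, without touching Lucene at all. This is not
Bernhard's code: the class name XmlFlattener, the "element@attribute"
naming for attribute fields and the choice to store only an element's
direct text under its name are assumptions made for illustration. The
resulting map is what would then be fed to Lucene as fields.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: flattens a tree-shaped XML document into name:value pairs.
class XmlFlattener extends DefaultHandler {
    private final Map<String, List<String>> fields = new LinkedHashMap<>();
    private final StringBuilder body = new StringBuilder();
    private final Deque<StringBuilder> open = new ArrayDeque<>();

    static Map<String, List<String>> flatten(String xml) {
        XmlFlattener handler = new XmlFlattener();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
        // All text of the document goes into the default 'body' field.
        handler.add("body", handler.body.toString().trim());
        return handler.fields;
    }

    private void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(new StringBuilder());
        for (int i = 0; i < atts.getLength(); i++) {
            // Assumed naming convention for attribute fields: element@attribute
            add(qName + "@" + atts.getQName(i), atts.getValue(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!open.isEmpty()) {
            open.peek().append(ch, start, length);  // text of the innermost element
        }
        body.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String text = open.pop().toString().trim();
        if (!text.isEmpty()) {
            add(qName, text);
        }
        body.append(' ');  // keep words of adjacent elements apart in 'body'
    }
}
```

For the input <doc><title lang="en">Cocoon</title><p>Lucene rocks</p></doc>
this yields the fields title@lang, title, p and body, so a query could be
restricted to "title" or left on the default 'body' field.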

I don't find this very useful, but it's a very good first step.

Comments?

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<[EMAIL PROTECTED]>                             Friedrich Nietzsche
--------------------------------------------------------------------


