Re: Adding XML searching with Lucene

Bernhard Huber Wed, 05 Dec 2001 14:32:53 -0800

Hi,

>I've integrated Bernhard's excellent code into my local copy of Cocoon
>to see how it worked and unfortunately it doesn't :(
>
>Well, it *should* work since the crawler works and the indexing phase is
>being performed (the work/index directory is created) but at the end of
>the indexing, only one file get written inside the work/index directory,
>called "segments" which contains 64 bits set to zero.
>
>It seems that Lucene is not receiving any input to index, but Cocoon
>does receive the requests and does emit the responses. Very strange.
>
Well, I have hardcoded quite a few things. createindex.xsp will only
create an index of
http://localhost:8080/cocoon/documents/index.html, and it will always
write the index into {workdir}/index. The crawler will always append
the query string ?cocoon-view=links, expecting content-type
application/x-cocoon-links. SimpleLuceneXMLIndexerImpl will always
append the query string ?cocoon-view=content, and it indexes only the
content-types text/xml and text/xhtml.
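
The hardcoded behaviour described above can be illustrated with a small
sketch. It is not the code under discussion: the class and method names
(ViewRequest, withView, isIndexable) are hypothetical, and handling a URL
that already carries a query string is an added assumption.

```java
import java.util.Locale;
import java.util.Set;

final class ViewRequest {
    // Content types the XML indexer accepts, as listed in the message.
    static final Set<String> INDEXABLE = Set.of("text/xml", "text/xhtml");

    /** Appends cocoon-view=<view>; uses '&' if the URL already has a query string (assumption). */
    public static String withView(String url, String view) {
        String sep = url.indexOf('?') >= 0 ? "&" : "?";
        return url + sep + "cocoon-view=" + view;
    }

    /** True if the Content-Type header, ignoring parameters such as charset, is indexable. */
    public static boolean isIndexable(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return false;
        }
        int semi = contentTypeHeader.indexOf(';');
        String mime = semi >= 0 ? contentTypeHeader.substring(0, semi) : contentTypeHeader;
        return INDEXABLE.contains(mime.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(withView("http://localhost:8080/cocoon/documents/index.html", "content"));
        System.out.println(isIndexable("text/xml; charset=UTF-8"));
        System.out.println(isIndexable("text/html"));
    }
}
```

A response of type text/html is rejected by isIndexable, which matches
the symptom of an index that stays empty.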


I have changed documents/sitemap.xmap as follows:
.....
   <map:match pattern="*.html">
    <map:aggregate element="site">
     <map:part src="cocoon:/book-{1}.xml"/>
     <map:part src="cocoon:/body-{1}.xml" label="content"/>
    </map:aggregate>
.....

If you don't do this, a query will return content-type text/html, which
will not get indexed.
You can check interactively by querying
"http://localhost:8080/cocoon/documents/index.html?cocoon-view=content".
If there are images, especially the top sitemap header of the
documentation, you are getting text/html.
I hope this helps to get createindex.xsp running properly.

>
>
>Anyway, here are a few comments on Berni's code:
>
> 1) it uses the package "org.apache.cocoon.components.optional.lucene",
>I would suggest something like "org.apache.cocoon.components.search" or
>anything else that is not directly bound to Lucene. We might never get
>multiple implementations of that engine, I know that, but it's good to
>keep the behavioral abstraction that Avalon components suggest.
>
okay

>
>
> 1) it defines 4 different new components:
>
>    - CocoonCrawler -> performs crawling on a cocoon-hosted site
>    - LuceneCocoonIndexer -> performs indexing of a collection of
>documents
>    - LuceneXMLIndexer -> performs indexing of a single document
>    - LuceneCocoonSearcher -> performs searching on a given index
>
>I like your design but I'd love to have better and more abstract names
>and implementations for this:
>
> a) crawling should be a separate component and should provide two
>different implementations: internal (directly calling the engine) and
>external (using regular http:// requests). The internal crawling will be
>used by the CLI and the local indexer, while the external could be
>performed on other Cocoon sites (and might be useful to provide a
>centralized indexing of a distributed Cocoon federation).
>
Yes, separating is quite a good idea. It will speed up the indexing of
the local sites deployed in the same servlet engine.
I have even thought that the indexing step could act like the profiler:
instead of collecting profile data about how long something takes, it
would update or create the index information. This way the index is
kept up to date, and no explicit crawling is necessary for the internal
docs.
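
As a rough sketch of the crawler separation: the traversal can stay the
same while only the way links are obtained differs between an internal
and an external implementation. Everything here is hypothetical
(CrawlSketch, crawl, and the link-lookup function standing in for a
request to the links view); it is not the CocoonCrawler code.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

final class CrawlSketch {
    /**
     * Breadth-first crawl starting at 'start'.
     * 'links' is a hypothetical stand-in for "fetch the links of this URL";
     * an internal crawler could call the engine directly, an external one
     * could issue an http:// request.
     * Returns every URL reached, in the order it was first seen.
     */
    public static Set<String> crawl(String start, Function<String, List<String>> links) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String url = queue.remove();
            for (String next : links.apply(url)) {
                // Set.add returns false for URLs already seen, so cycles terminate.
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return seen;
    }
}
```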

>
>I propose to place this into "org.apache.cocoon.components.crawler" with
>the Crawler as behavioral interface. Then having ExternalCrawler and
>InternalCrawler as implementations.
>
> b) the "search" package should then contain the components that perform
>both Indexing and Searching. The interfaces should not contain
>Lucene-specific code even if, admittedly, this would be hard. If this is
>not possible, the package should be called "lucene" and be
>Lucene-specific.
>
I feel that the way you do the indexing has a strong influence on how
you search. Thus I originally merged indexing and searching; I split
them just to see and play. The abstraction is somewhat difficult, as
the Lucene API is not that flexible. The biggest problem was writing
into the index and closing the index: I didn't know when to close the
IndexWriter.

>c) the XML-2-Lucene indexer is a critical piece of this architecture:
>in short, Lucene is a text-based indexing engine and is not structured.
>The XML-2-Lucene indexer performs mapping between the tree-shaped XML
>document and the map-shaped Lucene document (composed of name:value
>pairs like hashtables).
>
Yes, it is critical, as it is very dependent on the XML content you
want to index. Ideally you only have to replace the
LuceneIndexContentHandler to change the way you want to index. I didn't
make this class a component, but I want to make it configurable, as
this ContentHandler is responsible for creating the Lucene document.

>Bernhard created a XML2Lucene mapping by submitting every element and
>attribute as name:value pairs of Lucene docs, plus collecting all the
>text inside the document and submit that in the 'body' field (which is
>the default field for lucene queries).
>
>So, this allows you to search for any text inside the document, as well
>as searching for a specific text inside an element or attribute.
>
>I don't find this very useful, but it's a very good first step.
>
The reason for building the Lucene document this way was more or less
flexibility, as I did not yet know how to index in an optimal way. There
was also a short discussion on the Lucene user mailing list presenting
this indexing schema; not knowing any better way, I implemented it this
way.
Moreover, I thought about indexing different kinds of XML using the same
LuceneIndexContentHandler.

For example:
I want to index Dublin Core XML content. The XML content of
cocoon/documents is not Dublin Core, though, but apache-xml documents.
But I don't want to write much new Java. Hence I want to keep the Java
code and change the sitemap, adding another view that applies an
apache-xml-document-to-Dublin-Core stylesheet, like:
<map:views>
  <map:view name="dublin-core-content" from-label="content">
   <map:transform src="xml2dc.xsl"/>
   <map:serialize type="xml"/>
  </map:view>
</map:views>

Thus the XML content of this view should look like:


    <?xml version="1.0"?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description rdf:about="/cocoon/docs/userdocs/index.html">
        <dc:creator>Smith John</dc:creator>
        <dc:title>Cocoon2 User Documentation</dc:title>
        <dc:description>Describes Cocoon2 components actions, generators, matchers,
          selectors, serializers and transformers.
        </dc:description>
        <dc:date>2001-01-20</dc:date>
        <dc:language>en</dc:language>
        <dc:identifier>/cocoon/docs/userdocs/index.html</dc:identifier>
      </rdf:Description>
    </rdf:RDF> 
  

I must confess that I'm no Dublin Core expert, but I think the more or
less general indexing schema will help to avoid writing a new
ContentHandler for each new kind of XML content.
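
The general indexing schema (every element and attribute as a
name:value pair, plus all text collected in a 'body' field) can be
sketched with a plain SAX handler. This is only an illustration, not
the LuceneIndexContentHandler itself: the class name FlatFieldHandler,
the "element@attribute" naming for attribute fields, and the Map that
stands in for a Lucene document are all assumptions.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class FlatFieldHandler extends DefaultHandler {
    // Field name -> values; stands in for the name:value pairs of a Lucene document.
    final Map<String, List<String>> fields = new LinkedHashMap<>();
    private final StringBuilder body = new StringBuilder();
    // One text buffer per open element, so each element gets only its own direct text.
    private final Deque<StringBuilder> text = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        text.push(new StringBuilder());
        for (int i = 0; i < atts.getLength(); i++) {
            add(qName + "@" + atts.getQName(i), atts.getValue(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!text.isEmpty()) {
            text.peek().append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        String value = text.pop().toString().trim().replaceAll("\\s+", " ");
        if (!value.isEmpty()) {
            add(qName, value);
            body.append(value).append(' '); // collected in end-tag order
        }
    }

    @Override
    public void endDocument() {
        add("body", body.toString().trim());
    }

    private void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Parses the XML string and returns the flattened fields. */
    public static Map<String, List<String>> parse(String xml) {
        FlatFieldHandler handler = new FlatFieldHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
        return handler.fields;
    }
}
```

Fed with the Dublin Core view above, such a handler would yield fields
like dc:title and dc:language without any Dublin Core specific Java
code, which is the point of keeping the schema general.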

Some more comments:
The index-update mechanism is not implemented in the code yet. But it
is crucial for regenerating the index entries of documents which have
changed. I have stolen the idea of the uid index field from the HTML
samples of Lucene, but I haven't implemented it yet.
Moreover, I'm not happy about the Cocoon integration, as I found no
generator/transformer/serializer pattern for the indexing/searching.
I thought about the indexing as a transformer copying the XML content
and writing the index, but I had problems knowing when to close the
index writer; perhaps the index transformer is only okay for updating
an index, if at all.
The searcher might be a generator, generating the results of the search
as an XML document.
But perhaps all this trying to fit into the
generator/transformer/serializer pattern is not really necessary.
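
A minimal sketch of how a uid field could drive index updates, assuming
the uid combines the document URL with its last-modified time. That
composition, and the names UidIndexPlan, uid, stale and toAdd, are
assumptions made for illustration; none of this exists in the code
under discussion, and no Lucene API is used.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class UidIndexPlan {
    private static final char SEP = '\0'; // separator that cannot occur in a URL

    /** Builds a uid from a document URL and its last-modified time in milliseconds. */
    public static String uid(String url, long lastModified) {
        return url + SEP + Long.toString(lastModified, Character.MAX_RADIX);
    }

    /** uids of all current documents (URL -> last-modified time). */
    private static Set<String> currentUids(Map<String, Long> current) {
        Set<String> uids = new HashSet<>();
        for (Map.Entry<String, Long> e : current.entrySet()) {
            uids.add(uid(e.getKey(), e.getValue()));
        }
        return uids;
    }

    /** Indexed uids that match no current document: changed or removed, so delete them. */
    public static Set<String> stale(Set<String> indexed, Map<String, Long> current) {
        Set<String> out = new HashSet<>(indexed);
        out.removeAll(currentUids(current));
        return out;
    }

    /** Current uids missing from the index: new or changed documents, so index them. */
    public static Set<String> toAdd(Set<String> indexed, Map<String, Long> current) {
        Set<String> out = currentUids(current);
        out.removeAll(indexed);
        return out;
    }
}
```

A changed document shows up in both sets: its old uid is stale and its
new uid has to be added, so unchanged documents are never re-indexed.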

Well, that's all. I hope that with these comments it will be possible
to make the indexer work. I can also send the Lucene code as a zip
file, if that is helpful.

bye bernhard.




