Re: Adding XML searching with Lucene

Stefano Mazzocchi Thu, 06 Dec 2001 03:25:26 -0800

Bernhard Huber wrote:
> 
> Hi,
> 
> >I've integrated Bernhard's excellent code into my local copy of Cocoon
> >to see how it worked and unfortunately it doesn't :(
> >
> >Well, it *should* work since the crawler works and the indexing phase is
> >being performed (the work/index directory is created) but at the end of
> >the indexing, only one file gets written inside the work/index directory,
> >called "segments", which contains 64 bits set to zero.
> >
> >It seems that Lucene is not receiving any input to index, but Cocoon
> >does receive the requests and does emit the responses. Very strange.
> >
> Well, I have hardcoded quite a few things. createindex.xsp will only
> create an index of
> http://localhost:8080/cocoon/documents/index.html


Yes, I changed that.

>, it will always write
> the index into
> {workdir}/index. 

got that also.

> The crawler will always append the query-string
> ?cocoon-view=links, expecting
> content-type application/x-cocoon-links. 

This is a good thing.

> SimpleLuceneXMLIndexerImpl will
> always append
> the query ?cocoon-view=content, and index only the content-types text/xml
> and text/xhtml.

Yep, got that.
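
For reference, the rule just described (append the view parameter, then accept only text/xml and text/xhtml) can be sketched in a few lines of Java. This is a hypothetical helper, not Bernhard's code: the class and method names are invented, and handling of an existing query string and of charset parameters is an assumption on top of what the message says.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper; not part of SimpleLuceneXMLIndexerImpl. */
class ContentViewFilter {
    // Content types the indexer accepts, as listed in the message above.
    private static final Set<String> INDEXABLE =
        new HashSet<>(Arrays.asList("text/xml", "text/xhtml"));

    /** Appends cocoon-view=<view>, using '&' if the URL already has a query. */
    static String withView(String url, String view) {
        String separator = url.indexOf('?') >= 0 ? "&" : "?";
        return url + separator + "cocoon-view=" + view;
    }

    /** True if the Content-Type header (parameters ignored) is indexable. */
    static boolean isIndexable(String contentType) {
        if (contentType == null) {
            return false;
        }
        int semicolon = contentType.indexOf(';');
        String mime = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return INDEXABLE.contains(mime.trim().toLowerCase(Locale.ROOT));
    }
}
```

With this, a response of type text/html (what the unpatched sitemap returns) is rejected, which matches the empty index I was seeing.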
 
> I have changed the documents/sitemap.xmap changing:
> .....
>    <map:match pattern="*.html">
>     <map:aggregate element="site">
>      <map:part src="cocoon:/book-{1}.xml"/>
>      <map:part src="cocoon:/body-{1}.xml" label="content"/>

Oh, damn, that's the missing part!!!

>     </map:aggregate>
> .....
> 
> If you don't do this, a query will return content-type text/html, which
> will not get indexed.
> You can check interactively by querying
> "http://localhost:8080/cocoon/documents/index.html?cocoon-view=content".
> If there are some images, especially the top sitemap header of the
> documentation, you are getting text/html.
> I hope this helps to get createindex.xsp running properly.

It does!!! Way cool, I'll start working on it right away!

> >
> >
> >Anyway, here are a few comments on Berni's code:
> >
> > 1) it uses the package "org.apache.cocoon.components.optional.lucene",
> >I would suggest something like "org.apache.cocoon.components.search" or
> >anything else that is not directly bound to Lucene. We might never get
> >multiple implementations of that engine, I know that, but it's good to
> >keep the behavioral abstraction that Avalon components suggest.
> >
> okay
> 
> >
> >
> > 2) it defines 4 different new components:
> >
> >    - CocoonCrawler -> performs crawling on a cocoon-hosted site
> >    - LuceneCocoonIndexer -> performs indexing of a collection of
> >documents
> >    - LuceneXMLIndexer -> performs indexing of a single document
> >    - LuceneCocoonSearcher -> performs searching on a given index
> >
> >I like your design but I'd love to have better and more abstract names
> >and implementations for this:
> >
> > a) crawling should be a separate component and should provide two
> >different implementations: internal (directly calling the engine) and
> >external (using regular http:// requests). The internal crawling will be
> >used by the CLI and the local indexer, while the external could be
> >performed on other Cocoon sites (and might be useful to provide a
> >centralized indexing of a distributed Cocoon federation).
> >
> Yes, separating is quite a good idea. It will speed up the indexing of
> the local sites deployed in
> the same servlet engine.

same cocoon, you mean.
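
To make the internal/external split concrete, here is a rough Java sketch. Only the names Crawler, InternalCrawler and ExternalCrawler come from my proposal; AbstractCrawler, linksOf and InMemoryCrawler are made up for illustration, and the in-memory link table is a stand-in for either calling the engine directly or fetching the links view over http.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Behavioral interface, named as in the proposal. */
interface Crawler {
    /** Returns every URI reachable from startUri, each exactly once. */
    List<String> crawl(String startUri);
}

/** Shared breadth-first traversal; subclasses only decide how links are fetched. */
abstract class AbstractCrawler implements Crawler {
    /** Links of one page, e.g. what the "links" view would report. */
    protected abstract List<String> linksOf(String uri);

    @Override
    public List<String> crawl(String startUri) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUri);
        while (!queue.isEmpty()) {
            String uri = queue.remove();
            if (!seen.add(uri)) {
                continue; // already visited, guards against link cycles
            }
            queue.addAll(linksOf(uri));
        }
        return new ArrayList<>(seen);
    }
}

/** Stand-in for InternalCrawler: a link table replaces the engine call. */
class InMemoryCrawler extends AbstractCrawler {
    private final Map<String, List<String>> links;

    InMemoryCrawler(Map<String, List<String>> links) {
        this.links = links;
    }

    @Override
    protected List<String> linksOf(String uri) {
        return links.getOrDefault(uri, Collections.emptyList());
    }
}
```

An ExternalCrawler would then differ only in linksOf, requesting the page with ?cocoon-view=links over http, so the indexer never needs to know which of the two it was handed.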

> I have even thought that the indexing step could act like the
> profiler: instead of collecting profile data about how long something
> takes, it would create or update the index information. This way the
> index is kept up-to-date, and no explicit crawling is necessary for the
> internal docs.

sorry but I didn't get it.
 
> >
> >I propose to place this into "org.apache.cocoon.components.crawler" with
> >the Crawler as behavioral interface. Then having ExternalCrawler and
> >InternalCrawler as implementations.
> >
> > b) the "search" package should then contain the components that perform
> >both Indexing and Searching. The interfaces should not contain
> >Lucene-specific code even if, admittedly, this would be hard. If this is
> >not possible, the package should be called "lucene" and be
> >Lucene-specific.
> >
> I feel that the way you do the indexing has a strong influence on how
> you search. 

I have the same feeling.

> Thus
> I once merged indexing and searching; I split them just to see and
> play. The abstraction is somewhat
> difficult as the lucene API is not that flexible. The biggest problem
> was writing into the index, and closing
> the index. I didn't know when to close the IndexWriter.

I'll take a look at it.
 
> >c) the XML-2-Lucene indexer is a critical piece of this architecture:
> >in short, Lucene is a text-based indexing engine and is not structured.
> >The XML-2-Lucene indexer performs mapping between the tree-shaped XML
> >document and the map-shaped Lucene document (composed of name:value
> >pairs like hashtables).
> >
> Yes it is critical, as it is very dependent on the xml content you
> want to index. Ideally you only have to replace the
> LuceneIndexContentHandler to change the way you want to index. I didn't
> make this class a component, but I
> want to make it configurable, as this ContentHandler is responsible for
> creating the lucene document.

yes, or at least, pluggable.
 
> >Bernhard created a XML2Lucene mapping by submitting every element and
> >attribute as name:value pairs of Lucene docs, plus collecting all the
> >text inside the document and submitting that in the 'body' field (which
> >is the default field for lucene queries).
> >
> >So, this allows you to search for any text inside the document, as well
> >as searching for a specific text inside an element or attribute.
> >
> >I don't find this very useful, but it's a very good first step.
> >
> The reason for building the lucene document this way was more or less
> flexibility, as I did not yet know how to index in an optimal way. 

I have some ideas on this that I can share, but let's do something that
works first.
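
Just to pin down the mapping being discussed (elements and attributes as name:value pairs, plus all text in a 'body' field), here is a minimal SAX sketch. It is not LuceneIndexContentHandler: the class name, the "element@attribute" field naming, and the choice to record only an element's direct text are assumptions, and a plain Map stands in for the Lucene document.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Flattens an XML tree into name:value fields plus a catch-all "body". */
class FlatFieldHandler extends DefaultHandler {
    private final Map<String, StringBuilder> raw = new LinkedHashMap<>();
    private final Deque<String> open = new ArrayDeque<>();

    private void add(String field, String text) {
        raw.computeIfAbsent(field, k -> new StringBuilder()).append(text);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        open.push(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            // Assumed naming scheme for attribute fields: element@attribute
            add(qName + "@" + atts.getQName(i), atts.getValue(i) + " ");
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length);
        if (!open.isEmpty()) {
            add(open.peek(), text); // direct text of the innermost open element
        }
        add("body", text);          // all text, the default query field
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        open.pop();
        add(qName, " ");  // separate repeated elements of the same name
        add("body", " ");
    }

    /** Fields with whitespace collapsed; empty fields are dropped. */
    Map<String, String> fields() {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : raw.entrySet()) {
            String value = e.getValue().toString().trim().replaceAll("\\s+", " ");
            if (!value.isEmpty()) {
                out.put(e.getKey(), value);
            }
        }
        return out;
    }

    static Map<String, String> parse(String xml) {
        FlatFieldHandler handler = new FlatFieldHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return handler.fields();
    }
}
```

Each entry of the resulting map would become one field of the Lucene document, which is why swapping this handler is enough to change the indexing schema.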

> And there was a short discussion on the lucene user
> mailing list presenting this indexing schema; not knowing any better
> way, I implemented it this way.
> Moreover I thought about indexing different kinds of xml using the same
> LuceneIndexContentHandler.
> 
> For example:
> I want to index DublinCore xml content. Now the xml content of
> cocoon/document is not DublinCore, but apache-xml documents.
> But I don't want to write much new java. Hence I want to keep the
> java-code, and
> change the sitemap, adding another view, and adding some
> apache-xml-document2dublin-core xml, like:
> <map:views>
>   <map:view name="dublin-core-content" from-label="content">
>    <map:transform src="xml2dc.xsl"/>
>    <map:serialize type="xml"/>
>   </map:view>
> </map:views>
> 
> Thus the xml-content of this view should look like:
> 
>     <?xml version="1.0"?>
>     <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>              xmlns:dc="http://purl.org/dc/elements/1.1/">
>       <rdf:Description rdf:about="/cocoon/docs/userdocs/index.html">
>         <dc:creator>Smith John</dc:creator>
>         <dc:title>Cocoon2 User Documentation</dc:title>
>         <dc:description>Describes Cocoon2 components actions, generators, matchers,
>           selectors, serializers and transformers.
>         </dc:description>
>         <dc:date>2001-01-20</dc:date>
>         <dc:language>en</dc:language>
>         <dc:identifier>/cocoon/docs/userdocs/index.html</dc:identifier>
>       </rdf:Description>
>     </rdf:RDF>
> 
> 
> I must confess that I'm no dublin-core expert, but I think the more or
> less general indexing schema will help
> to avoid writing a new ContentHandler for each new kind of xml-content.

I absolutely agree with you!
 
> Some more comments:
> The index-update mechanism is not implemented yet in the code. But this
> is crucial for re-generating the index for
> documents which have changed. I have stolen the idea of the uid index
> field from the html-samples of lucene,
> but I haven't implemented it yet.
> Moreover I'm not happy about the cocoon integration, as I found no
> generator/transformer/serializer pattern for the indexing/searching.

Yeah, I was thinking about a SearchGenerator, but still have no idea on
when to perform the indexing part :/
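
On the update mechanism: as far as I recall, the uid field in the Lucene html samples combines the document path with its last-modified date, so a changed document yields a new uid. Assuming that, the bookkeeping for an incremental update can be sketched as below. UidPlanner and its methods are invented names, and plain sets stand in for the uids actually stored in the index.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical planner for incremental re-indexing based on uid values. */
class UidPlanner {
    /** uid = URI + NUL separator + last-modified timestamp (assumed layout). */
    static String uid(String uri, long lastModified) {
        return uri + '\u0000' + lastModified;
    }

    /** uids in the index that no longer match a current document: delete these. */
    static Set<String> stale(Set<String> indexed, Map<String, Long> current) {
        Set<String> out = new LinkedHashSet<>(indexed);
        out.removeAll(currentUids(current));
        return out;
    }

    /** Current uids missing from the index: these documents need (re)indexing. */
    static Set<String> missing(Set<String> indexed, Map<String, Long> current) {
        Set<String> out = currentUids(current);
        out.removeAll(indexed);
        return out;
    }

    private static Set<String> currentUids(Map<String, Long> current) {
        Set<String> uids = new LinkedHashSet<>();
        for (Map.Entry<String, Long> e : current.entrySet()) {
            uids.add(uid(e.getKey(), e.getValue()));
        }
        return uids;
    }
}
```

A modified document then shows up twice: its old uid lands in stale() and its new uid in missing(), so an update is simply a delete followed by an add. That still leaves open when this pass should run.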

> I thought about the indexing as a transformer copying the xml-content,
> and writing the index, but I had problems knowing when to close the
> index-writer; perhaps the index-transformer is only okay for updating an
> index, if at all.

hmmm, maybe we should make the indexer a component on its own and have
some time-driven events in Cocoon that trigger its execution. Just
random thoughts, as usual.

> The searcher might be a generator, generating the results of the search
> as an xml-document.

Yes, that's what I'd like to have.

> But perhaps all this trying to fit into the
> generator/transformer/serializer pattern is not really necessary.

I don't mind your XSP at all, even if the search part screams for a
generator, IMO. I think that any index-accessing code (such as the
statistics) is better off as XSP (so you can tune the result as you
like) while the search part should come up with a strongly-typed
search-result markup, with the skinning performed at stylesheet level.
 
> Well, that's all. I hope with the comments it will be possible to make
> the indexer work. I might send the lucene
> as a zip file, too, if it is helpful.

It worked. I'll play with it tomorrow.

Thanks for this, it's a great toy :)

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<[EMAIL PROTECTED]>                             Friedrich Nietzsche
--------------------------------------------------------------------


