Hi,

>I've integrated Bernhard's excellent code into my local copy of Cocoon
>to see how it worked and unfortunately it doesn't :(
>
>Well, it *should* work since the crawler works and the indexing phase is
>being performed (the work/index directory is created) but at the end of
>the indexing, only one file get written inside the work/index directory,
>called "segments" which contains 64 bits set to zero.
>
>It seems that Lucene is not receiving any input to index, but Cocoon
>does receive the requests and does emit the responses. Very strange.
>

Well, I have hardcoded quite a few things. createindex.xsp will only
create an index of http://localhost:8080/cocoon/documents/index.html,
and it will always write the index into {workdir}/index. The crawler
will always append the query string ?cocoon-view=links and expects
content-type application/x-cocoon-links. SimpleLuceneXMLIndexerImpl will
always append the query string ?cocoon-view=content and indexes only
the content-types text/xml and text/xhtml.
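To make the two hardcoded rules concrete, here is a minimal sketch (not the actual code from the patch). The class and method names ViewRequest, withView and isIndexable are made up for illustration, and the handling of a URL that already has a query string is my assumption; the real code always appends "?cocoon-view=...".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper illustrating the hardcoded view and content-type rules. */
public class ViewRequest {

    // Content types the indexer accepts, as described above.
    private static final Set<String> INDEXABLE =
            new HashSet<>(Arrays.asList("text/xml", "text/xhtml"));

    /**
     * Appends cocoon-view=<view> to the URL.
     * Assumption: use '&' if the URL already carries a query string.
     */
    public static String withView(String url, String view) {
        char separator = url.indexOf('?') >= 0 ? '&' : '?';
        return url + separator + "cocoon-view=" + view;
    }

    /** Returns true if the content type, ignoring parameters such as charset, is indexable. */
    public static boolean isIndexable(String contentType) {
        if (contentType == null) {
            return false;
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return INDEXABLE.contains(base.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String url = withView("http://localhost:8080/cocoon/documents/index.html", "content");
        System.out.println(url);
        System.out.println(isIndexable("text/xml; charset=UTF-8"));
        System.out.println(isIndexable("text/html"));
    }
}
```

A response of type text/html is rejected by isIndexable, which is the situation in which nothing gets written to the index.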

I have changed documents/sitemap.xmap as follows:

.....
<map:match pattern="*.html">
  <map:aggregate element="site">
    <map:part src="cocoon:/book-{1}.xml"/>
    <map:part src="cocoon:/body-{1}.xml" label="content"/>
  </map:aggregate>
.....

If you don't do this, a query will return content-type text/html, which
will not get indexed. You can check interactively by requesting
"http://localhost:8080/cocoon/documents/index.html?cocoon-view=content".
If there are some images, especially the top sitemap header of the
documentation, you are getting text/html.

I hope this helps to get createindex.xsp running properly.

>
>
>Anyway, here are a few comments on Berni's code:
>
> 1) it uses the package "org.apache.cocoon.components.optional.lucene",
>I would suggest something like "org.apache.cocoon.components.search" or
>anything else that is not directly bound to Lucene. We might never get
>multiple implementations of that engine, I know that, but it's good to
>keep the behavioral abstraction that Avalon components suggest.
>

okay

>
>
> 1) it defines 4 different new components:
>
> - CocoonCrawler -> performs crawling on a cocoon-hosted site
> - LuceneCocoonIndexer -> performs indexing of a collection of
>documents
> - LuceneXMLIndexer -> performs indexing of a single document
> - LuceneCocoonSearcher -> performs searching on a given index
>
>I like your design but I'd love to have better and more abstract names
>and implementations for this:
>
> a) crawling should be a separate component and should provide two
>different implementations: internal (directly calling the engine) and
>external (using regular http:// requests). The internal crawling will be
>used by the CLI and the local indexer, while the external could be
>performed on other Cocoon sites (and might be useful to provide a
>centralized indexing of a distributed Cocoon federation).
>

Yes, separating them is quite a good idea. It will speed up the indexing
of the local sites deployed in the same servlet engine.
I have even thought that the indexing step might act like the profiler:
instead of collecting profile data about how long something takes, it
would update or create the index information. This way the index is kept
up to date, and no explicit crawling is necessary for the internal docs.

>
>I propose to place this into "org.apache.cocoon.components.crawler" with
>the Crawler as behavioral interface. Then having ExternalCrawler and
>InternalCrawler as implementations.
>
> b) the "search" package should then contain the components that perform
>both Indexing and Searching. The interfaces should not contain
>Lucene-specific code even if, admittedly, this would be hard. If this is
>not possible, the package should be called "lucene" and be
>Lucene-specific.
>

I feel that the way you do the indexing has a strong influence on how
you search. That is why I originally merged indexing and searching; I
split them just to see how it looks and to play with it. The abstraction
is somewhat difficult, as the Lucene API is not that flexible. The
biggest problem was writing into the index and closing the index: I
didn't know when to close the IndexWriter.

>c) the XML-2-Lucene indexer is a critical piece of this architecture:
>in short, Lucene is a text-based indexing engine and is not structured.
>The XML-2-Lucene indexer performs mapping between the tree-shaped XML
>document and the map-shaped Lucene document (composed of name:value
>pairs like hashtables).
>

Yes, it is critical, as it depends heavily on the XML content you want
to index. Ideally you only have to replace the LuceneIndexContentHandler
to change the way you index. I didn't make this class a component, but I
want to make it configurable, as this ContentHandler is responsible for
creating the Lucene document.
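
To illustrate the kind of tree-to-map translation such a ContentHandler performs, here is a rough sketch that uses only the JDK SAX classes and a plain Map in place of a real Lucene document. It is not the LuceneIndexContentHandler itself. The class name FieldCollector, the "element@attribute" naming of attribute fields and the whitespace handling are assumptions made for this example; the 'body' field name follows the schema discussed in this thread.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: flattens an XML tree into name:value pairs. */
public class FieldCollector extends DefaultHandler {

    private final Map<String, List<String>> fields = new LinkedHashMap<>();
    // All character data of the document, destined for the 'body' field.
    private final StringBuilder body = new StringBuilder();
    // One text buffer per currently open element (direct text only).
    private final List<StringBuilder> open = new ArrayList<>();

    private void add(String name, String value) {
        String trimmed = value.trim();
        if (!trimmed.isEmpty()) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(trimmed);
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.add(new StringBuilder());
        for (int i = 0; i < atts.getLength(); i++) {
            // Assumed naming scheme for attribute fields: element@attribute
            add(qName + "@" + atts.getQName(i), atts.getValue(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!open.isEmpty()) {
            open.get(open.size() - 1).append(ch, start, length);
        }
        body.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        add(qName, open.remove(open.size() - 1).toString());
        body.append(' '); // keep text of adjacent elements apart
    }

    @Override
    public void endDocument() {
        add("body", body.toString());
    }

    /** Parses the XML string and returns the collected fields. */
    public static Map<String, List<String>> parse(String xml) throws Exception {
        FieldCollector handler = new FieldCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.fields;
    }
}
```

Swapping in a different handler class would change the field layout without touching the crawler or the index writer, which is the point of keeping this piece replaceable.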

>Bernhard created a XML2Lucene mapping by submitting every element and
>attribute as name:value pairs of Lucene docs, plus collecting all the
>text inside the document and submit that in the 'body' field (which is
>the default field for lucene queries).
>
>So, this allows you to search for any text inside the document, as well
>as searching for a specific text inside an element or attribute.
>
>I don't find this very useful, but it's a very good first step.
>

The reason for building the Lucene document this way was mainly
flexibility, since I don't yet know how to index in an optimal way.
There was also a short discussion on the Lucene user mailing list
presenting this indexing schema, and not knowing any better way, I
implemented it this way.

Moreover, I thought about indexing different kinds of XML using the same
LuceneIndexContentHandler. For example, I want to index Dublin Core XML
content. Now, the XML content of cocoon/document consists of apache-xml
documents, not Dublin Core documents. But I don't want to write much new
Java. Hence I want to keep the Java code and change the sitemap, adding
another view with an apache-xml-document-to-Dublin-Core transformation,
like:

<map:views>
  <map:view name="dublin-core-content" from-label="content">
    <map:transform src="xml2dc.xsl"/>
    <map:serialize type="xml"/>
  </map:view>
</map:views>

Thus the XML content of this view should look like:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="/cocoon/docs/userdocs/index.html">
    <dc:creator>Smith John</dc:creator>
    <dc:title>Cocoon2 User Documentation</dc:title>
    <dc:description>Describes Cocoon2 components actions, generators,
      matchers, selectors, serializers and transformers.
    </dc:description>
    <dc:date>2001-01-20</dc:date>
    <dc:language>en</dc:language>
    <dc:identifier>/cocoon/docs/userdocs/index.html</dc:identifier>
  </rdf:Description>
</rdf:RDF>

I must confess that I'm no Dublin Core expert, but I think a more or
less general indexing schema will reduce the need to write a new
ContentHandler for each new kind of XML content.

Some more comments:

The index-update mechanism is not implemented in the code yet, but it is
crucial for regenerating the index for documents which have changed. I
have taken the idea of the uid index field from the HTML samples of
Lucene, but I haven't implemented it yet.

Moreover, I'm not happy with the Cocoon integration, as I found no
generator/transformer/serializer pattern for the indexing/searching. I
thought about the indexing as a transformer that copies the XML content
and writes the index, but I had problems knowing when to close the
index writer; perhaps the index transformer is only okay for updating an
index, if at all. The searcher might be a generator, generating the
results of the search as an XML document. But perhaps all this trying to
fit into the generator/transformer/serializer pattern is not really
necessary.

Well, that's all. I hope that with these comments it will be possible to
make the indexer work. I might send Lucene as a zip file, too, if it is
helpful.

bye bernhard.