I've integrated Bernhard's excellent code into my local copy of Cocoon to see how it worked and unfortunately it doesn't :(
Well, it *should* work since the crawler works and the indexing phase is being performed (the work/index directory is created), but at the end of the indexing only one file gets written inside the work/index directory: it is called "segments" and contains 64 bits set to zero. It seems that Lucene is not receiving any input to index, even though Cocoon does receive the requests and does emit the responses. Very strange.

Anyway, here are a few comments on Berni's code:

1) it uses the package "org.apache.cocoon.components.optional.lucene". I would suggest something like "org.apache.cocoon.components.search" or anything else that is not directly bound to Lucene. We might never get multiple implementations of that engine, I know that, but it's good to keep the behavioral abstraction that Avalon components suggest.

2) it defines 4 different new components:

 - CocoonCrawler -> performs crawling on a cocoon-hosted site
 - LuceneCocoonIndexer -> performs indexing of a collection of documents
 - LuceneXMLIndexer -> performs indexing of a single document
 - LuceneCocoonSearcher -> performs searching on a given index

I like your design but I'd love to have better and more abstract names and implementations for this:

a) crawling should be a separate component and should provide two different implementations: internal (directly calling the engine) and external (using regular http:// requests). The internal crawling will be used by the CLI and the local indexer, while the external one could be performed on other Cocoon sites (and might be useful to provide a centralized indexing of a distributed Cocoon federation). I propose to place this into "org.apache.cocoon.components.crawler", with Crawler as the behavioral interface and ExternalCrawler and InternalCrawler as implementations.

b) the "search" package should then contain the components that perform both Indexing and Searching. The interfaces should not contain Lucene-specific code even if, admittedly, this would be hard.
If this is not possible, the package should be called "lucene" and be Lucene-specific.

c) the XML-2-Lucene indexer is a critical piece of this architecture: in short, Lucene is a text-based indexing engine and is not structured. The XML-2-Lucene indexer performs the mapping between the tree-shaped XML document and the map-shaped Lucene document (composed of name:value pairs, like a hashtable).

I've taken a pretty serious look at Lucene's internals and it's a very general engine, since it allows you to add any name:value pairs to your documents and indicate whether or not they should be indexed. This is useful for specifying keywords or other metadata, so that you can later restrict your query to a specific area.

Bernhard created an XML2Lucene mapping by submitting every element and attribute as name:value pairs of Lucene docs, plus collecting all the text inside the document and submitting that in the 'body' field (which is the default field for Lucene queries).

So, this allows you to search for any text inside the document, as well as for a specific text inside an element or attribute.

I don't find this very useful, but it's a very good first step.

Comments?

-- 
Stefano Mazzocchi <[EMAIL PROTECTED]>
One must still have chaos in oneself to be able
to give birth to a dancing star. (Friedrich Nietzsche)
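
P.S. To make the mapping described in c) a bit more concrete, here is a minimal sketch of how I understand it. It is not Bernhard's code and it deliberately does not touch the Lucene API: it uses only the JDK's SAX parser and collects the name:value pairs into a plain map, where each entry would then become a field of a Lucene document. The class name XmlToFields, the "element@attribute" naming of attribute fields, and the choice to credit text only to the innermost element are all assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: flattens an XML tree into name:value fields plus a "body" field. */
final class XmlToFields extends DefaultHandler {
    private final Map<String, List<String>> fields = new LinkedHashMap<>();
    private final StringBuilder body = new StringBuilder();      // all text in the document
    private final List<StringBuilder> open = new ArrayList<>();  // text buffers of open elements

    /** Records one name:value pair, skipping values that are only whitespace. */
    private void add(String name, String value) {
        String v = value.trim();
        if (!v.isEmpty()) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(v);
        }
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        open.add(new StringBuilder());
        for (int i = 0; i < atts.getLength(); i++) {
            // Attribute field names use an "element@attribute" convention (an assumption).
            add(qName + "@" + atts.getQName(i), atts.getValue(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text is credited to the innermost open element only (an assumption).
        if (!open.isEmpty()) {
            open.get(open.size() - 1).append(ch, start, length);
        }
        body.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        add(qName, open.remove(open.size() - 1).toString());
        body.append(' ');  // keep the words of adjacent elements apart
    }

    /** Parses the XML string and returns the collected fields in document order. */
    static Map<String, List<String>> map(String xml) throws Exception {
        XmlToFields handler = new XmlToFields();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        handler.add("body", handler.body.toString().replaceAll("\\s+", " "));
        return handler.fields;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc id=\"7\"><title>Hello</title><p>big world</p></doc>";
        map(xml).forEach((name, values) -> System.out.println(name + " = " + values));
    }
}
```

With a map like this, a query restricted to "title" would only see the text of title elements, while a query on "body" would see all the text of the document. Whether one field per element name is the right granularity is exactly the part I'm not convinced about.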