A document has been updated: http://cocoon.zones.apache.org/daisy/documentation/1246.html
Document ID: 1246 Branch: main Language: default Name: Introduction (unchanged) Document Type: Cocoon Document (unchanged) Updated on: 7/10/07 1:07:25 PM Updated by: Dominique De Munck A new version has been created, state: draft Parts ===== Content ------- This part has been updated. Mime type: text/xml (unchanged) File name: (unchanged) Size: 10767 bytes (previous version: 37 bytes) Content diff: --- <html><body><p>TODO</p></body></html> +++ <html> +++ <body> +++ +++ <h4 id="head-0b39056584778d584af2f2cdd81c6998caa13ba5">LuceneIndexTransformer is +++ a component that creates or updates Lucene indexes.</h4> +++ +++ <p>This component only writes the index: to search the index, use the +++ SearchGenerator component.</p> +++ +++ <h3 id="head-9b35088110dfcf121e63a9a2b67ec652d667a784">Why use it?</h3> +++ +++ <p>Instead of using LuceneIndexTransformer, you could generate an index by +++ crawling your website. However, the LuceneIndexTransformer is <em>much, +++ much</em> faster than crawling.</p> +++ +++ <p>The big differences for the developer are:</p> +++ +++ <ul> +++ <li> +++ <p>Using the LuceneIndexTransformer requires you to write a pipeline that can +++ generate a <tt>lucene:index</tt> document describing your searchable URI space, +++ so it's necessary to have a well-defined URI space. For a site with a consistent +++ structure this should not be too hard. This pipeline can use aggregation and +++ inclusion mechanisms to produce a full list of the pages you want to search. In +++ this way it's also possible to generate an index for websites with forms which +++ are not crawlable.</p> +++ </li> +++ <li> +++ <p>On the other hand the crawler is a more generic solution, though far less +++ efficient. It doesn't require a pipeline to "document" the entire searchable URI +++ space. Instead, you must create a <tt>content</tt> view and a <tt>links</tt> +++ view for each of the searchable pipelines. 
The URI space is then defined by +++ crawling the <tt>links</tt> view.</p> +++ </li> +++ </ul> +++ +++ <h3 id="head-953c351734de75a525b9777e976c0812a5618736">Declaring the +++ LuceneIndexTransformer</h3> +++ +++ <p>The transformer must be declared in the <tt><transformers></tt> +++ section of your sitemap:</p> +++ +++ <pre><map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/1.0"> +++ +++ <map:components> +++ ... +++ <map:transformers default="xslt"> +++ <map:transformer name="index" +++ logger="sitemap.transformer.luceneindextransformer" +++ src="org.apache.cocoon.transformation.LuceneIndexTransformer"/> +++ </map:transformers> +++ ... +++ </map:components> +++ ... +++ </map:sitemap> +++ </pre> +++ +++ <h3 id="head-cea5eb78d3cf27bf4fdf96d1049365b4fa984307">Input document for the +++ LuceneIndexTransformer</h3> +++ +++ <p>This is a sample of the kind of document that the transformer expects. NB In +++ this example, I've chosen a couple of simple XHTML documents as the content to +++ be indexed. 
This is only because everyone knows XHTML - in practice you should +++ typically generate the index from an early stage in the pipeline; indexing +++ DocBook, TEI, etc, rather than a presentation format like HTML.</p> +++ +++ <pre><lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0" +++ analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer" +++ directory="index" +++ create="false" +++ merge-factor="20"> +++ +++ <lucene:document url="http://localhost/sample.html"> +++ <!-- here is some sample content --> +++ <html> +++ <head> +++ <title lucene:store="true">Sample</title> +++ </head> +++ <body> +++ <h1>Blah</h1> +++ <a href="blah.jpg" title="download blah image" +++ lucene:text-attr="title"> +++ <img src="blah-small.jpg" alt="Blah" +++ lucene:text-attr="alt"/> +++ </a> +++ </body> +++ </html> +++ </lucene:document> +++ +++ <lucene:document url="http://localhost/sample-2.html"> +++ <!-- Another sample doc --> +++ <html> +++ <head> +++ <title lucene:store="true">Second Sample</title> +++ </head> +++ <body> +++ <h1>Foo</h1> +++ <p>Lorem ipsum dolor sit amet, +++ consectetuer adipiscing elit. </p> +++ </body> +++ </html> +++ </lucene:document> +++ +++ </lucene:index> +++ </pre> +++ +++ <h3 id="head-97d27647f366081a18adc8469538e908e6354ed4">What the lucene:index +++ document means</h3> +++ +++ <h4 id="head-9e412039c4f6090a2aaac081c56f522ac97b8985">The lucene:index element +++ </h4> +++ +++ <p>The root element is <tt>lucene:index</tt>. 
The attributes of the +++ <tt>lucene:index</tt> in the sample above are shown with their default values, +++ so the effect is the same as if they were not specified at all.</p> +++ +++ <h4 id="head-40afef17a5a56ab2e729d18163f1bc960a8ce2cc">The merge-factor and +++ analyzer attributes</h4> +++ +++ <p>See +++ <a href="http://jakarta.apache.org/lucene/docs/index.html"><img width="11" height="11" src="http://wiki.apache.org/wiki/modern/img/moin-www.png"/> +++ the Lucene documentation</a> for explanations of what they mean.</p> +++ +++ <h4 id="head-84967edae247fc0739e57bc3af497f832b880582">The optimize-frequency +++ attribute (since version 2.2)</h4> +++ +++ <p>Determines how often the Lucene index will be optimized. When you have thousands +++ of documents, optimizing the index can become quite slow (e.g. 7 seconds for 9000 +++ small documents on a P4).</p> +++ +++ <ul> +++ <li> +++ <p>1: always optimize (default)</p> +++ </li> +++ <li> +++ <p>0: never optimize</p> +++ </li> +++ <li> +++ <p>x: optimize on average once every x times. You can use any number; a random +++ generator decides each time whether to optimize or not.</p> +++ </li> +++ </ul> +++ +++ <p>For example, you can create one pipeline that does not optimize and use it to +++ index your document every time it is modified. You can then create a second +++ pipeline that does optimize and call it manually. For more information, see the +++ Lucene FAQ entry "What is index optimization and when should I use it?":</p> +++ +++ <p> +++ <a href="http://wiki.apache.org/lucene-java/LuceneFAQ#head-fd848c31f4dc7b91727be6f40a7f5fbe2c66cfb8"><img width="11" height="11" src="http://wiki.apache.org/wiki/modern/img/moin-www.png"/> +++ http://wiki.apache.org/lucene-java/LuceneFAQ#head-fd848c31f4dc7b91727be6f40a7f5fbe2c66cfb8</a> +++ </p> +++ +++ <h4 id="head-51123b488fc39c0a36b69c0e24608052fd45a86d">The directory attribute +++ </h4> +++ +++ <p>This attribute controls where the index files are stored. 
The path is +++ relative to the Cocoon <tt>work</tt> directory.</p> +++ +++ <h4 id="head-9b03e7cb891515af05d6a3bde919087262b146aa">The create attribute</h4> +++ +++ <p>This attribute controls whether the index is recreated.</p> +++ +++ <ul> +++ <li> +++ <p>If <tt>create = "false"</tt> and the index already exists, the index will be +++ updated: documents that are already indexed are removed from the index and +++ reinserted.</p> +++ </li> +++ <li> +++ <p>If the index does not exist, it will be created even if +++ <tt>create = "false"</tt>.</p> +++ </li> +++ <li> +++ <p>If <tt>create = "true"</tt>, any existing index will be destroyed and a +++ new index created. If you are rebuilding your entire index, use +++ <tt>create = "true"</tt>: the indexer then doesn't need to remove old +++ documents from the index, so it will be faster.</p> +++ </li> +++ </ul> +++ +++ <h4 id="head-9585e2ebba0108dc71917a21a4d9ed1edca00732">The lucene:document +++ element</h4> +++ +++ <p>Lucene will index the content of each <tt>lucene:document</tt>, which may +++ contain any XML content. The indexed content is associated with the URL specified +++ by the <tt>url</tt> attribute, and this URL is what a search returns as its result. +++ </p> +++ +++ <h4 id="head-5f2ae3b3aceb65a1fd0cb0942a0385fa7c4a4e2e">The lucene:text-attr +++ attribute</h4> +++ +++ <p>Normally Lucene will only index the text content of elements, not attribute +++ values. To index the attributes of an element as well, give it an attribute +++ called <tt>lucene:text-attr</tt>, containing a list of the names of the +++ attributes you want indexed. 
For example, to index the value of the <tt>alt</tt> +++ attribute of an <tt>img</tt> element, in <tt>html</tt>:</p> +++ +++ <pre><img src="blah-small.jpg" alt="Blah" lucene:text-attr="alt"/> +++ </pre> +++ +++ <p>This would index the text "Blah".</p> +++ +++ <h4 id="head-b85bebbca6ee9807e0a7165b5208677c1616aca7">The lucene:store +++ attribute</h4> +++ +++ <p>Normally Lucene will only index the text of an element, not store it. To +++ store the text of an element in Lucene's index, add a +++ <tt>lucene:store="true"</tt> attribute to the element. It's a good idea to store +++ the title of a document in Lucene, so that your search results can show a +++ document title as well as a URL.</p> +++ +++ <h3 id="head-c55afa96d19d0ca7161da59bedf6409cbbfd78c2">The transformation</h3> +++ +++ <p>The transformer copies the source document to the output, except for the +++ content of the <tt>lucene:document</tt> elements.</p> +++ +++ <p>The transformer also adds an <tt>elapsed-time</tt> attribute to the output +++ <tt>lucene:document</tt> elements, showing the time (in milliseconds) taken to +++ index that document. 
You can use XSLT to transform the results into a report on +++ the indexing operation.</p> +++ +++ <h4 id="head-c9a731f4df69c482e3c1d40fcc39e94b3fb16307">Sample output</h4> +++ +++ <pre><?xml version="1.0" encoding="UTF-8"?> +++ <lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0" +++ merge-factor="20" +++ create="false" +++ directory="index" +++ analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer"> +++ <lucene:document url="JCB-001/full.html" elapsed-time="3846"/> +++ <lucene:document url="JCB-001/_div1-N1017B.html" elapsed-time="3735"/> +++ <lucene:document url="JCB-002/full.html" elapsed-time="361"/> +++ <lucene:document url="JCB-002/_div1-N10190.html" elapsed-time="1302"/> +++ <lucene:document url="JCB-003/full.html" elapsed-time="300"/> +++ <lucene:document url="JCB-003/_div1-N10188.html" elapsed-time="1352"/> +++ </lucene:index> +++ </pre> +++ +++ <h5 id="head-24e83f0c8063ca175d6e8a1a80e51e1ed9fbc20b">Note to users of Mac OS X +++ </h5> +++ +++ <p>By default, Java cannot open more than 256 files at a time, so you may get an +++ error like the following:</p> +++ +++ <pre>Description: org.apache.cocoon.ProcessingException: +++ Failed to execute pipeline.: java.lang.RuntimeException: +++ java.io.FileNotFoundException: +++ /usr/local/tomcat-4/work/Standalone/localhost/_/cocoon-files/index/_15.f86 +++ (Too many open files) +++ </pre> +++ +++ <p>To avoid this error, you should set your ulimit in the shell script that +++ starts Tomcat. 
My line reads as follows:</p> +++ +++ <pre>ulimit -S -n 1000 +++ </pre> +++ +++ <p>Read more about this here: +++ <a href="http://www.amug.org/%7Eglguerin/howto/More-open-files.html"><img width="11" height="11" src="http://wiki.apache.org/wiki/modern/img/moin-www.png"/> +++ http://www.amug.org/~glguerin/howto/More-open-files.html</a></p> +++ +++ <h5 id="head-f9fcf2cc3f693a586067cd49d3cbe85a6297d60e">Note to users of Red Hat +++ Linux</h5> +++ +++ <p>If you get the error (Empty StackException) while creating the +++ index with the LuceneIndexTransformer, try lowering your merge-factor +++ (the default should be 10). See the +++ <a href="http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#mergeFactor"><img width="11" height="11" src="http://wiki.apache.org/wiki/modern/img/moin-www.png"/> +++ Lucene documentation</a> for more information.</p> +++ +++ </body> +++ </html>