A document has been updated: http://cocoon.zones.apache.org/daisy/documentation/1246.html
Document ID: 1246
Branch: main
Language: default
Name: Introduction (unchanged)
Document Type: Cocoon Document (unchanged)
Updated on: 7/13/07 10:19:59 PM
Updated by: Grzegorz Kossakowski

A new version has been created, state: publish

Parts
=====

Content
-------
This part has been updated.
Mime type: text/xml (unchanged)
File name: (unchanged)
Size: 43 bytes (previous version: 10767 bytes)
Content diff:
    <html>
    <body>
--- <h4 id="head-0b39056584778d584af2f2cdd81c6998caa13ba5">LuceneIndexTransformer is
--- a component that creates or updates Lucene indexes.</h4>
+++ <p>TODO</p>
--- <p>This component only writes the index: to search the index, use the
--- SearchGenerator component.</p>
---
--- <h3 id="head-9b35088110dfcf121e63a9a2b67ec652d667a784">Why use it?</h3>
---
--- <p>Instead of using LuceneIndexTransformer, you could generate an index by
--- crawling your website. However, the LuceneIndexTransformer is <em>much,
--- much</em> faster than crawling.</p>
---
--- <p>The big differences for the developer are:</p>
---
--- <ul>
--- <li>
--- <p>Using the LuceneIndexTransformer requires you to write a pipeline that can
--- generate a <tt>lucene:index</tt> document describing your searchable URI space,
--- so it's necessary to have a well-defined URI space. For a site with a consistent
--- structure this should not be too hard. This pipeline can use aggregation and
--- inclusion mechanisms to produce a full list of the pages you want to search. In
--- this way it's also possible to generate an index for websites with forms which
--- are not crawlable.</p>
--- </li>
--- <li>
--- <p>On the other hand the crawler is a more generic solution, though far less
--- efficient. It doesn't require a pipeline to "document" the entire searchable URI
--- space. Instead, you must create a <tt>content</tt> view and a <tt>links</tt>
--- view for each of the searchable pipelines.
The URI space is then defined by
--- crawling the <tt>links</tt> view.</p>
--- </li>
--- </ul>
---
--- <h3 id="head-953c351734de75a525b9777e976c0812a5618736">Declaring the
--- LuceneIndexTransformer</h3>
---
--- <p>The transformer must be declared in the <tt><map:transformers></tt>
--- section of your sitemap:</p>
---
--- <pre><map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/1.0">
---
---   <map:components>
---     ...
---     <map:transformers default="xslt">
---       <map:transformer name="index"
---         logger="sitemap.transformer.luceneindextransformer"
---         src="org.apache.cocoon.transformation.LuceneIndexTransformer"/>
---     </map:transformers>
---     ...
---   </map:components>
---   ...
--- </map:sitemap>
--- </pre>
---
--- <h3 id="head-cea5eb78d3cf27bf4fdf96d1049365b4fa984307">Input document for the
--- LuceneIndexTransformer</h3>
---
--- <p>This is a sample of the kind of document that the transformer expects. NB: In
--- this example, I've chosen a couple of simple XHTML documents as the content to
--- be indexed.
This is only because everyone knows XHTML - in practice you should
--- typically generate the index from an early stage in the pipeline; indexing
--- DocBook, TEI, etc, rather than a presentation format like HTML.</p>
---
--- <pre><lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
---   analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer"
---   directory="index"
---   create="false"
---   merge-factor="20">
---
---   <lucene:document url="http://localhost/sample.html">
---     <!-- here is some sample content -->
---     <html>
---       <head>
---         <title lucene:store="true">Sample</title>
---       </head>
---       <body>
---         <h1>Blah</h1>
---         <a href="blah.jpg" title="download blah image"
---           lucene:text-attr="title">
---           <img src="blah-small.jpg" alt="Blah"
---             lucene:text-attr="alt"/>
---         </a>
---       </body>
---     </html>
---   </lucene:document>
---
---   <lucene:document url="http://localhost/sample-2.html">
---     <!-- Another sample doc -->
---     <html>
---       <head>
---         <title lucene:store="true">Second Sample</title>
---       </head>
---       <body>
---         <h1>Foo</h1>
---         <p>Lorem ipsum dolor sit amet,
---         consectetuer adipiscing elit. </p>
---       </body>
---     </html>
---   </lucene:document>
---
--- </lucene:index>
--- </pre>
---
--- <h3 id="head-97d27647f366081a18adc8469538e908e6354ed4">What the lucene:index
--- document means</h3>
---
--- <h4 id="head-9e412039c4f6090a2aaac081c56f522ac97b8985">The lucene:index element
--- </h4>
---
--- <p>The root element is <tt>lucene:index</tt>.
The attributes of the
--- <tt>lucene:index</tt> in the sample above are shown with their default values,
--- so the effect is as if they were not specified at all.</p>
---
--- <h4 id="head-40afef17a5a56ab2e729d18163f1bc960a8ce2cc">The merge-factor and
--- analyzer attributes</h4>
---
--- <p>See
--- <a href="http://jakarta.apache.org/lucene/docs/index.html"><img width="11" height="11" src="http://wiki.apache.org/wiki/modern/img/moin-www.png"/>
--- the Lucene documentation</a> for explanations of what they mean.</p>
---
--- <h4 id="head-84967edae247fc0739e57bc3af497f832b880582">The optimize-frequency
--- attribute (since version 2.2)</h4>
---
--- <p>Determines how often the Lucene index will be optimized. When you have
--- thousands of documents, optimizing the index can become quite slow (e.g. 7
--- seconds for 9000 small docs on a P4).</p>
---
--- <ul>
--- <li>
--- <p>1: always optimize (default)</p>
--- </li>
--- <li>
--- <p>0: never optimize</p>
--- </li>
--- <li>
--- <p>x: optimize roughly every x times. You can use any number; a random
--- generator determines whether to optimize or not.</p>
--- </li>
--- </ul>
---
--- <p>For example, you can create a pipeline that does not optimize and use it to
--- index your document every time it is modified. You can then create another
--- pipeline that optimizes and call it manually. For more info see the Lucene FAQ,
--- "What is index optimization and when should I use it?":</p>
---
--- <p>
--- <a href="http://wiki.apache.org/lucene-java/LuceneFAQ#head-fd848c31f4dc7b91727be6f40a7f5fbe2c66cfb8"><img width="11" height="11" src="http://wiki.apache.org/wiki/modern/img/moin-www.png"/>
--- http://wiki.apache.org/lucene-java/LuceneFAQ#head-fd848c31f4dc7b91727be6f40a7f5fbe2c66cfb8</a>
--- </p>
---
--- <h4 id="head-51123b488fc39c0a36b69c0e24608052fd45a86d">The directory attribute
--- </h4>
---
--- <p>This attribute controls where the index files are stored.
The path is
--- relative to the Cocoon <tt>work</tt> directory.</p>
---
--- <h4 id="head-9b03e7cb891515af05d6a3bde919087262b146aa">The create attribute</h4>
---
--- <p>This attribute controls whether the index is recreated.</p>
---
--- <ul>
--- <li>
--- <p>If <tt>create = "false"</tt> and the index already exists then the index
--- will be updated. Documents which are already indexed will be removed from the
--- index and reinserted.</p>
--- </li>
--- <li>
--- <p>If the index does not exist then it will be created even if
--- <tt>create = "false"</tt>.</p>
--- </li>
--- <li>
--- <p>If <tt>create = "true"</tt> then any existing index will be destroyed and a
--- new index created. If you are rebuilding your entire index then you should use
--- <tt>create = "true"</tt> because the indexer doesn't need to remove old
--- documents from the index, so it will be faster.</p>
--- </li>
--- </ul>
---
--- <h4 id="head-9585e2ebba0108dc71917a21a4d9ed1edca00732">The lucene:document
--- element</h4>
---
--- <p>Lucene will index the content of each <tt>lucene:document</tt>, which may
--- contain any XML content. The index entry is associated with the URL specified
--- by the <tt>url</tt> attribute, so this URL is what a search returns as its
--- result.</p>
---
--- <h4 id="head-5f2ae3b3aceb65a1fd0cb0942a0385fa7c4a4e2e">The lucene:text-attr
--- attribute</h4>
---
--- <p>Normally Lucene will only index the content of these elements, not attribute
--- values. To index the attributes of an element as well, give it an attribute
--- called <tt>lucene:text-attr</tt>, containing a list of the names of the
--- attributes you want indexed.
For example, to index the value of the <tt>alt</tt>
--- attribute of an <tt>img</tt> element, in <tt>html</tt>:</p>
---
--- <pre><img src="blah-small.jpg" alt="Blah" lucene:text-attr="alt"/>
--- </pre>
---
--- <p>This would index the text "Blah".</p>
---
--- <h4 id="head-b85bebbca6ee9807e0a7165b5208677c1616aca7">The lucene:store
--- attribute</h4>
---
--- <p>Normally Lucene will only index the text of an element, not store it. To
--- store the text of an element in Lucene's index, add a
--- <tt>lucene:store="true"</tt> attribute to the element. It's a good idea to store
--- the title of a document in Lucene, so that your search results can show a
--- document title as well as a URL.</p>
---
--- <h3 id="head-c55afa96d19d0ca7161da59bedf6409cbbfd78c2">The transformation</h3>
---
--- <p>The transformer copies the source document to the output, except for the
--- content of the <tt>lucene:document</tt> elements.</p>
---
--- <p>The transformer also adds an <tt>elapsed-time</tt> attribute to the output
--- <tt>lucene:document</tt> elements, showing the time (in milliseconds) taken to
--- index that document.
You can use XSLT to transform the results into a report on
--- the indexing operation.</p>
---
--- <h4 id="head-c9a731f4df69c482e3c1d40fcc39e94b3fb16307">Sample output</h4>
---
--- <pre><?xml version="1.0" encoding="UTF-8"?>
--- <lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
---   merge-factor="20"
---   create="false"
---   directory="index"
---   analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer">
---   <lucene:document url="JCB-001/full.html" elapsed-time="3846"/>
---   <lucene:document url="JCB-001/_div1-N1017B.html" elapsed-time="3735"/>
---   <lucene:document url="JCB-002/full.html" elapsed-time="361"/>
---   <lucene:document url="JCB-002/_div1-N10190.html" elapsed-time="1302"/>
---   <lucene:document url="JCB-003/full.html" elapsed-time="300"/>
---   <lucene:document url="JCB-003/_div1-N10188.html" elapsed-time="1352"/>
--- </lucene:index>
--- </pre>
---
--- <h5 id="head-24e83f0c8063ca175d6e8a1a80e51e1ed9fbc20b">Note to users of Mac OS X
--- </h5>
---
--- <p>By default, Java cannot open more than 256 files at a time, so you may get
--- an error like the following:</p>
---
--- <pre>Description: org.apache.cocoon.ProcessingException:
--- Failed to execute pipeline.: java.lang.RuntimeException:
--- java.io.FileNotFoundException:
--- /usr/local/tomcat-4/work/Standalone/localhost/_/cocoon-files/index/_15.f86
--- (Too many open files)
--- </pre>
---
--- <p>To avoid this error, you should set your ulimit in the shell script that
--- starts Tomcat.
My line reads as follows:</p>
---
--- <pre>ulimit -S -n 1000
--- </pre>
---
--- <p>Read more about this here:
--- <a href="http://www.amug.org/%7Eglguerin/howto/More-open-files.html"><img width="11" height="11" src="http://wiki.apache.org/wiki/modern/img/moin-www.png"/>
--- http://www.amug.org/~glguerin/howto/More-open-files.html</a></p>
---
--- <h5 id="head-f9fcf2cc3f693a586067cd49d3cbe85a6297d60e">Note to users of Red Hat
--- Linux</h5>
---
--- <p>If you get an (Empty StackException) error while creating the index with
--- the LuceneIndexTransformer, try lowering your merge-factor (default should be
--- 10). Look at the
--- <a href="http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#mergeFactor"><img width="11" height="11" src="http://wiki.apache.org/wiki/modern/img/moin-www.png"/>
--- Lucene documentation</a> for more information.</p>
---
    </body>
    </html>