[DAISY] Updated: Introduction

daisy Tue, 10 Jul 2007 06:07:49 -0700

A document has been updated:

http://cocoon.zones.apache.org/daisy/documentation/1246.html


Document ID: 1246
Branch: main
Language: default
Name: Introduction (unchanged)
Document Type: Cocoon Document (unchanged)
Updated on: 7/10/07 1:07:25 PM
Updated by: Dominique De Munck

A new version has been created, state: draft

Parts
=====

Content
-------
This part has been updated.
Mime type: text/xml (unchanged)
File name:  (unchanged)
Size: 10767 bytes (previous version: 37 bytes)
Content diff:
--- <html><body><p>TODO</p></body></html>
+++ <html>
+++ <body>
+++ 
+++ <h4 id="head-0b39056584778d584af2f2cdd81c6998caa13ba5">LuceneIndexTransformer is
+++ a component that creates or updates Lucene indexes.</h4>
+++ 
+++ <p>This component only writes the index: to search the index, use the
+++ SearchGenerator component.</p>
+++ 
+++ <h3 id="head-9b35088110dfcf121e63a9a2b67ec652d667a784">Why use it?</h3>
+++ 
+++ <p>Instead of using LuceneIndexTransformer, you could generate an index by
+++ crawling your website. However, the LuceneIndexTransformer is <em>much,
+++ much</em> faster than crawling.</p>
+++ 
+++ <p>The big differences for the developer are:</p>
+++ 
+++ <ul>
+++ <li>
+++ <p>Using the LuceneIndexTransformer requires you to write a pipeline that
+++ generates a <tt>lucene:index</tt> document describing your searchable URI
+++ space, so you need a well-defined URI space. For a site with a consistent
+++ structure this should not be too hard. The pipeline can use aggregation and
+++ inclusion mechanisms to produce a full list of the pages you want to
+++ search. In this way it's also possible to generate an index for websites
+++ that contain forms, which a crawler cannot follow.</p>
+++ </li>
+++ <li>
+++ <p>The crawler, on the other hand, is a more generic solution, though far less
+++ efficient. It doesn't require a pipeline to "document" the entire searchable
+++ URI space. Instead, you must create a <tt>content</tt> view and a
+++ <tt>links</tt> view for each of the searchable pipelines. The URI space is
+++ then defined by crawling the <tt>links</tt> view.</p>
+++ </li>
+++ </ul>
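+++ 
+++ <p>As a rough sketch of the inclusion approach, the source of such a pipeline
+++ could be a hand-written list of pages whose content is pulled in by Cocoon's
+++ CInclude transformer. The page URLs and the <tt>cocoon:/</tt> sources below are
+++ hypothetical, and the sketch assumes that a <tt>cinclude</tt> transformer step
+++ runs before the LuceneIndexTransformer:</p>
+++ 
+++ <pre>&lt;lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
+++    xmlns:cinclude="http://apache.org/cocoon/include/1.0"&gt;
+++ 
+++    &lt;!-- one lucene:document per searchable page (hypothetical URLs) --&gt;
+++    &lt;lucene:document url="docs/intro.html"&gt;
+++       &lt;cinclude:include src="cocoon:/docs/intro.xml"/&gt;
+++    &lt;/lucene:document&gt;
+++ 
+++    &lt;lucene:document url="docs/install.html"&gt;
+++       &lt;cinclude:include src="cocoon:/docs/install.xml"/&gt;
+++    &lt;/lucene:document&gt;
+++ 
+++ &lt;/lucene:index&gt;
+++ </pre>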
+++ 
+++ <h3 id="head-953c351734de75a525b9777e976c0812a5618736">Declaring the
+++ LuceneIndexTransformer</h3>
+++ 
+++ <p>The transformer must be declared in the <tt>&lt;transformers&gt;</tt>
+++ section of your sitemap:</p>
+++ 
+++ <pre>&lt;map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/1.0"&gt;
+++ 
+++    &lt;map:components&gt;
+++       ...
+++       &lt;map:transformers default="xslt"&gt;
+++          &lt;map:transformer name="index" 
+++             logger="sitemap.transformer.luceneindextransformer" 
+++             src="org.apache.cocoon.transformation.LuceneIndexTransformer"/&gt;
+++       &lt;/map:transformers&gt;
+++       ...
+++    &lt;/map:components&gt;
+++    ...
+++ &lt;/map:sitemap&gt;
+++ </pre>
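+++ 
+++ <p>Once declared, the transformer can be used in a pipeline like any other. The
+++ following is only a sketch: the match pattern and the source file name are
+++ made up for illustration, and it assumes that the generator produces a
+++ <tt>lucene:index</tt> document of the kind described in the next section.</p>
+++ 
+++ <pre>&lt;map:pipeline&gt;
+++    &lt;!-- hypothetical URI that triggers indexing --&gt;
+++    &lt;map:match pattern="update-index"&gt;
+++       &lt;!-- hypothetical source producing a lucene:index document --&gt;
+++       &lt;map:generate src="index-source.xml"/&gt;
+++       &lt;!-- "index" is the name given in the declaration above --&gt;
+++       &lt;map:transform type="index"/&gt;
+++       &lt;!-- the output is the indexing report, serialized as XML --&gt;
+++       &lt;map:serialize type="xml"/&gt;
+++    &lt;/map:match&gt;
+++ &lt;/map:pipeline&gt;
+++ </pre>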
+++ 
+++ <h3 id="head-cea5eb78d3cf27bf4fdf96d1049365b4fa984307">Input document for the
+++ LuceneIndexTransformer</h3>
+++ 
+++ <p>This is a sample of the kind of document that the transformer expects.
+++ Note that in this example I've chosen a couple of simple XHTML documents as
+++ the content to be indexed. This is only because everyone knows XHTML. In
+++ practice you should typically generate the index from an early stage in the
+++ pipeline, indexing DocBook, TEI, etc., rather than a presentation format like
+++ HTML.</p>
+++ 
+++ <pre>&lt;lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
+++    analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer" 
+++    directory="index" 
+++    create="false" 
+++    merge-factor="20"&gt;
+++ 
+++    &lt;lucene:document url="http://localhost/sample.html"&gt;
+++       &lt;!-- here is some sample content --&gt;
+++       &lt;html&gt;
+++          &lt;head&gt;
+++             &lt;title lucene:store="true"&gt;Sample&lt;/title&gt;
+++          &lt;/head&gt;
+++          &lt;body&gt;
+++             &lt;h1&gt;Blah&lt;/h1&gt;
+++             &lt;a href="blah.jpg" title="download blah image"
+++                lucene:text-attr="title"&gt;
+++                &lt;img src="blah-small.jpg" alt="Blah"
+++                   lucene:text-attr="alt"/&gt;
+++             &lt;/a&gt;
+++          &lt;/body&gt;
+++       &lt;/html&gt;
+++    &lt;/lucene:document&gt;
+++ 
+++    &lt;lucene:document url="http://localhost/sample-2.html"&gt;
+++       &lt;!-- Another sample doc --&gt;
+++       &lt;html&gt;
+++          &lt;head&gt;
+++             &lt;title lucene:store="true"&gt;Second Sample&lt;/title&gt;
+++          &lt;/head&gt;
+++          &lt;body&gt;
+++             &lt;h1&gt;Foo&lt;/h1&gt;
+++             &lt;p&gt;Lorem ipsum dolor sit amet, 
+++             consectetuer adipiscing elit. &lt;/p&gt;
+++          &lt;/body&gt;
+++       &lt;/html&gt;
+++    &lt;/lucene:document&gt;
+++ 
+++ &lt;/lucene:index&gt;
+++ </pre>
+++ 
+++ <h3 id="head-97d27647f366081a18adc8469538e908e6354ed4">What the lucene:index
+++ document means</h3>
+++ 
+++ <h4 id="head-9e412039c4f6090a2aaac081c56f522ac97b8985">The lucene:index
+++ element</h4>
+++ 
+++ <p>The root element is <tt>lucene:index</tt>. The attributes of the
+++ <tt>lucene:index</tt> element in the sample above are shown with their default
+++ values, so the effect is the same as if they were not specified at all.</p>
+++ 
+++ <h4 id="head-40afef17a5a56ab2e729d18163f1bc960a8ce2cc">The merge-factor and
+++ analyzer attributes</h4>
+++ 
+++ <p>See
+++ <a href="http://jakarta.apache.org/lucene/docs/index.html"><img width="11"
+++ height="11" src="http://wiki.apache.org/wiki/modern/img/moin-www.png"/>
+++ the Lucene documentation</a> for explanations of what they mean.</p>
+++ 
+++ <h4 id="head-84967edae247fc0739e57bc3af497f832b880582">The optimize-frequency
+++ attribute (since version 2.2)</h4>
+++ 
+++ <p>Determines how often the Lucene index will be optimized. When you have
+++ thousands of documents, optimizing the index can become quite slow (e.g. 7
+++ seconds for 9000 small docs on a P4).</p>
+++ 
+++ <ul>
+++ <li>
+++ <p>1: always optimize (default)</p>
+++ </li>
+++ <li>
+++ <p>0: never optimize</p>
+++ </li>
+++ <li>
+++ <p>x: optimize roughly once every x updates. You can use any number; a random
+++ generator decides on each update whether or not to optimize.</p>
+++ </li>
+++ </ul>
+++ 
+++ <p>For example, you can create one pipeline that never optimizes and use it to
+++ index your document every time it is modified. You can then create a second
+++ pipeline that does optimize and is called manually. For more information, see
+++ "What is index optimization and when should I use it?" in the Lucene FAQ:</p>
+++ 
+++ <p>
+++ <a href="http://wiki.apache.org/lucene-java/LuceneFAQ#head-fd848c31f4dc7b91727be6f40a7f5fbe2c66cfb8"><img
+++ width="11" height="11" src="http://wiki.apache.org/wiki/modern/img/moin-www.png"/>
+++ http://wiki.apache.org/lucene-java/LuceneFAQ#head-fd848c31f4dc7b91727be6f40a7f5fbe2c66cfb8</a>
+++ </p>
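+++ 
+++ <p>To illustrate the two-pipeline idea, the index documents produced by the two
+++ pipelines might differ only in their <tt>optimize-frequency</tt> value. This is
+++ a sketch rather than a tested setup; the URL is invented and the document
+++ content is elided:</p>
+++ 
+++ <pre>&lt;!-- produced by the pipeline called on every modification: never optimizes --&gt;
+++ &lt;lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
+++    directory="index"
+++    create="false"
+++    optimize-frequency="0"&gt;
+++    &lt;lucene:document url="docs/changed-page.html"&gt;
+++       ...
+++    &lt;/lucene:document&gt;
+++ &lt;/lucene:index&gt;
+++ 
+++ &lt;!-- produced by the pipeline called manually: always optimizes --&gt;
+++ &lt;lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
+++    directory="index"
+++    create="false"
+++    optimize-frequency="1"&gt;
+++    &lt;lucene:document url="docs/changed-page.html"&gt;
+++       ...
+++    &lt;/lucene:document&gt;
+++ &lt;/lucene:index&gt;
+++ </pre>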
+++ 
+++ <h4 id="head-51123b488fc39c0a36b69c0e24608052fd45a86d">The directory
+++ attribute</h4>
+++ 
+++ <p>This attribute controls where the index files are stored. The path is
+++ relative to the Cocoon <tt>work</tt> directory.</p>
+++ 
+++ <h4 id="head-9b03e7cb891515af05d6a3bde919087262b146aa">The create attribute</h4>
+++ 
+++ <p>This attribute controls whether the index is recreated.</p>
+++ 
+++ <ul>
+++ <li>
+++ <p>If <tt>create = "false"</tt> and the index already exists then the index will
+++ be updated. Documents which are already indexed will be removed from the index
+++ and reinserted.</p>
+++ </li>
+++ <li>
+++ <p>If the index does not exist then it will be created even if
+++ <tt>create = "false"</tt>.</p>
+++ </li>
+++ <li>
+++ <p>If <tt>create = "true"</tt> then any existing index will be destroyed and a
+++ new index created. If you are rebuilding your entire index then you should use
+++ <tt>create = "true"</tt>: the indexer then doesn't need to remove old
+++ documents from the index, so it will be faster.</p>
+++ </li>
+++ </ul>
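+++ 
+++ <p>For instance, a full rebuild and an incremental update could use index
+++ documents that differ only in this attribute. The sketch below is purely
+++ illustrative; the URL is invented and the document content is elided:</p>
+++ 
+++ <pre>&lt;!-- full rebuild: any existing index is destroyed first --&gt;
+++ &lt;lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
+++    directory="index"
+++    create="true"&gt;
+++    &lt;!-- one lucene:document for every page of the site --&gt;
+++    ...
+++ &lt;/lucene:index&gt;
+++ 
+++ &lt;!-- incremental update: only the listed document is removed and reinserted --&gt;
+++ &lt;lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
+++    directory="index"
+++    create="false"&gt;
+++    &lt;lucene:document url="docs/changed-page.html"&gt;
+++       ...
+++    &lt;/lucene:document&gt;
+++ &lt;/lucene:index&gt;
+++ </pre>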
+++ 
+++ <h4 id="head-9585e2ebba0108dc71917a21a4d9ed1edca00732">The lucene:document
+++ element</h4>
+++ 
+++ <p>Lucene will index the content of each <tt>lucene:document</tt>, which may
+++ contain any XML content. The indexed content is associated with the URL given
+++ in the <tt>url</tt> attribute, so this URL is what a search returns as its
+++ result.</p>
+++ 
+++ <h4 id="head-5f2ae3b3aceb65a1fd0cb0942a0385fa7c4a4e2e">The lucene:text-attr
+++ attribute</h4>
+++ 
+++ <p>Normally Lucene will only index the text content of elements, not attribute
+++ values. To index the attributes of an element as well, give it an attribute
+++ called <tt>lucene:text-attr</tt>, containing a list of the names of the
+++ attributes you want indexed. For example, to index the value of the
+++ <tt>alt</tt> attribute of an <tt>img</tt> element in <tt>html</tt>:</p>
+++ 
+++ <pre>&lt;img src="blah-small.jpg" alt="Blah" lucene:text-attr="alt"/&gt;
+++ </pre>
+++ 
+++ <p>This would index the text "Blah".</p>
+++ 
+++ <h4 id="head-b85bebbca6ee9807e0a7165b5208677c1616aca7">The lucene:store
+++ attribute</h4>
+++ 
+++ <p>Normally Lucene will only index the text of an element, not store it. To
+++ store the text of an element in Lucene's index, add a
+++ <tt>lucene:store="true"</tt> attribute to the element. It's a good idea to store
+++ the title of a document in Lucene, so that your search results can show a
+++ document title as well as a URL.</p>
+++ 
+++ <h3 id="head-c55afa96d19d0ca7161da59bedf6409cbbfd78c2">The transformation</h3>
+++ 
+++ <p>The transformer copies the source document to the output, except for the
+++ content of the <tt>lucene:document</tt> elements.</p>
+++ 
+++ <p>The transformer also adds an <tt>elapsed-time</tt> attribute to the output
+++ <tt>lucene:document</tt> elements, showing the time (in milliseconds) taken to
+++ index that document. You can use XSLT to transform the results into a report on
+++ the indexing operation.</p>
+++ 
+++ <h4 id="head-c9a731f4df69c482e3c1d40fcc39e94b3fb16307">Sample output</h4>
+++ 
+++ <pre>&lt;?xml version="1.0" encoding="UTF-8"?&gt;
+++ &lt;lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
+++         merge-factor="20"
+++         create="false"
+++         directory="index"
+++         analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer"&gt;
+++         &lt;lucene:document url="JCB-001/full.html" elapsed-time="3846"/&gt;
+++         &lt;lucene:document url="JCB-001/_div1-N1017B.html" elapsed-time="3735"/&gt;
+++         &lt;lucene:document url="JCB-002/full.html" elapsed-time="361"/&gt;
+++         &lt;lucene:document url="JCB-002/_div1-N10190.html" elapsed-time="1302"/&gt;
+++         &lt;lucene:document url="JCB-003/full.html" elapsed-time="300"/&gt;
+++         &lt;lucene:document url="JCB-003/_div1-N10188.html" elapsed-time="1352"/&gt;
+++ &lt;/lucene:index&gt;
+++ </pre>
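+++ 
+++ <p>A report stylesheet for output like this could look roughly as follows. It
+++ is a minimal XSLT 1.0 sketch, not part of Cocoon: the HTML layout is invented,
+++ and it relies only on the <tt>url</tt> and <tt>elapsed-time</tt> attributes
+++ shown above.</p>
+++ 
+++ <pre>&lt;xsl:stylesheet version="1.0"
+++    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
+++    xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
+++    exclude-result-prefixes="lucene"&gt;
+++ 
+++    &lt;xsl:template match="/lucene:index"&gt;
+++       &lt;html&gt;
+++          &lt;body&gt;
+++             &lt;h1&gt;Indexing report&lt;/h1&gt;
+++             &lt;!-- totals over all indexed documents --&gt;
+++             &lt;p&gt;
+++                &lt;xsl:value-of select="count(lucene:document)"/&gt;
+++                &lt;xsl:text&gt; documents indexed in &lt;/xsl:text&gt;
+++                &lt;xsl:value-of select="sum(lucene:document/@elapsed-time)"/&gt;
+++                &lt;xsl:text&gt; ms&lt;/xsl:text&gt;
+++             &lt;/p&gt;
+++             &lt;table&gt;
+++                &lt;tr&gt;&lt;th&gt;URL&lt;/th&gt;&lt;th&gt;Time (ms)&lt;/th&gt;&lt;/tr&gt;
+++                &lt;!-- slowest documents first --&gt;
+++                &lt;xsl:for-each select="lucene:document"&gt;
+++                   &lt;xsl:sort select="@elapsed-time" data-type="number"
+++                      order="descending"/&gt;
+++                   &lt;tr&gt;
+++                      &lt;td&gt;&lt;xsl:value-of select="@url"/&gt;&lt;/td&gt;
+++                      &lt;td&gt;&lt;xsl:value-of select="@elapsed-time"/&gt;&lt;/td&gt;
+++                   &lt;/tr&gt;
+++                &lt;/xsl:for-each&gt;
+++             &lt;/table&gt;
+++          &lt;/body&gt;
+++       &lt;/html&gt;
+++    &lt;/xsl:template&gt;
+++ 
+++ &lt;/xsl:stylesheet&gt;
+++ </pre>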
+++ 
+++ <h5 id="head-24e83f0c8063ca175d6e8a1a80e51e1ed9fbc20b">Note to users of Mac OS X</h5>
+++ 
+++ <p>By default, a process on Mac OS X (including the Java VM) cannot have more
+++ than 256 files open at a time, so you may get an error like the following:</p>
+++ 
+++ <pre>Description: org.apache.cocoon.ProcessingException: 
+++ Failed to execute pipeline.: java.lang.RuntimeException: 
+++ java.io.FileNotFoundException:  
+++ /usr/local/tomcat-4/work/Standalone/localhost/_/cocoon-files/index/_15.f86 
+++ (Too many open files)
+++ </pre>
+++ 
+++ <p>To avoid this error, you should set your ulimit in the shell script that
+++ starts Tomcat. My line reads as follows:</p>
+++ 
+++ <pre>ulimit -S -n 1000
+++ </pre>
+++ 
+++ <p>Read more about this here:
+++ <a href="http://www.amug.org/%7Eglguerin/howto/More-open-files.html"><img
+++ width="11" height="11" src="http://wiki.apache.org/wiki/modern/img/moin-www.png"/>
+++ http://www.amug.org/~glguerin/howto/More-open-files.html</a></p>
+++ 
+++ <h5 id="head-f9fcf2cc3f693a586067cd49d3cbe85a6297d60e">Note to users of Redhat
+++ Linux</h5>
+++ 
+++ <p>If you get the error "Empty StackException" while creating the index with
+++ the LuceneIndexTransformer, try lowering your merge-factor (the sample above
+++ uses 20; Lucene's own default is 10). Look at the
+++ <a href="http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#mergeFactor"><img
+++ width="11" height="11" src="http://wiki.apache.org/wiki/modern/img/moin-www.png"/>
+++ Lucene documentation</a> for more information.</p>
+++ 
+++ </body>
+++ </html>
