[DAISY] Updated: Introduction

daisy Fri, 13 Jul 2007 15:20:34 -0700

A document has been updated:

http://cocoon.zones.apache.org/daisy/documentation/1246.html


Document ID: 1246
Branch: main
Language: default
Name: Introduction (unchanged)
Document Type: Cocoon Document (unchanged)
Updated on: 7/13/07 10:19:59 PM
Updated by: Grzegorz Kossakowski

A new version has been created, state: publish

Parts
=====

Content
-------
This part has been updated.
Mime type: text/xml (unchanged)
File name:  (unchanged)
Size: 43 bytes (previous version: 10767 bytes)
Content diff:
    <html>
    <body>
    
--- <h4 id="head-0b39056584778d584af2f2cdd81c6998caa13ba5">LuceneIndexTransformer is
--- a component that creates or updates Lucene indexes.</h4>
+++ <p>TODO</p>
    
--- <p>This component only writes the index: to search the index, use the
--- SearchGenerator component.</p>
--- 
--- <h3 id="head-9b35088110dfcf121e63a9a2b67ec652d667a784">Why use it?</h3>
--- 
--- <p>Instead of using LuceneIndexTransformer, you could generate an index by
--- crawling your website. However, the LuceneIndexTransformer is <em>much,
--- much</em> faster than crawling.</p>
--- 
--- <p>The big differences for the developer are:</p>
--- 
--- <ul>
--- <li>
--- <p>Using the LuceneIndexTransformer requires you to write a pipeline that can
--- generate a <tt>lucene:index</tt> document describing your searchable URI space,
--- so it's necessary to have a well-defined URI space. For a site with a consistent
--- structure this should not be too hard. This pipeline can use aggregation and
--- inclusion mechanisms to produce a full list of the pages you want to search. In
--- this way it's also possible to generate an index for websites with forms which
--- are not crawlable.</p>
--- </li>
--- <li>
--- <p>On the other hand the crawler is a more generic solution, though far less
--- efficient. It doesn't require a pipeline to "document" the entire searchable URI
--- space. Instead, you must create a <tt>content</tt> view and a <tt>links</tt>
--- view for each of the searchable pipelines. The URI space is then defined by
--- crawling the <tt>links</tt> view.</p>
--- </li>
--- </ul>
--- 
--- <h3 id="head-953c351734de75a525b9777e976c0812a5618736">Declaring the
--- LuceneIndexTransformer</h3>
--- 
--- <p>The transformer must be declared in the <tt>&lt;transformers&gt;</tt>
--- section of your sitemap:</p>
--- 
--- <pre>&lt;map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/1.0"&gt;
--- 
---    &lt;map:components&gt;
---       ...
---       &lt;map:transformers default="xslt"&gt;
---          &lt;map:transformer name="index"
---             logger="sitemap.transformer.luceneindextransformer"
---             src="org.apache.cocoon.transformation.LuceneIndexTransformer"/&gt;
---       &lt;/map:transformers&gt;
---       ...
---    &lt;/map:components&gt;
---    ...
--- &lt;/map:sitemap&gt;
--- </pre>
--- 
--- <h3 id="head-cea5eb78d3cf27bf4fdf96d1049365b4fa984307">Input document for the
--- LuceneIndexTransformer</h3>
--- 
--- <p>This is a sample of the kind of document that the transformer expects. Note
--- that this example uses a couple of simple XHTML documents as the content to
--- be indexed. This is only because everyone knows XHTML; in practice you should
--- typically generate the index from an early stage in the pipeline, indexing
--- DocBook, TEI, etc., rather than a presentation format like HTML.</p>
--- 
--- <pre>&lt;lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
---    analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer" 
---    directory="index" 
---    create="false" 
---    merge-factor="20"&gt;
--- 
---    &lt;lucene:document url="http://localhost/sample.html"&gt;
---       &lt;!-- here is some sample content --&gt;
---       &lt;html&gt;
---          &lt;head&gt;
---             &lt;title lucene:store="true"&gt;Sample&lt;/title&gt;
---          &lt;/head&gt;
---          &lt;body&gt;
---             &lt;h1&gt;Blah&lt;/h1&gt;
---             &lt;a href="blah.jpg" title="download blah image"
---                lucene:text-attr="title"&gt;
---                &lt;img src="blah-small.jpg" alt="Blah"
---                   lucene:text-attr="alt"/&gt;
---             &lt;/a&gt;
---          &lt;/body&gt;
---       &lt;/html&gt;
---    &lt;/lucene:document&gt;
--- 
---    &lt;lucene:document url="http://localhost/sample-2.html"&gt;
---       &lt;!-- Another sample doc --&gt;
---       &lt;html&gt;
---          &lt;head&gt;
---             &lt;title lucene:store="true"&gt;Second Sample&lt;/title&gt;
---          &lt;/head&gt;
---          &lt;body&gt;
---             &lt;h1&gt;Foo&lt;/h1&gt;
---             &lt;p&gt;Lorem ipsum dolor sit amet, 
---             consectetuer adipiscing elit. &lt;/p&gt;
---          &lt;/body&gt;
---       &lt;/html&gt;
---    &lt;/lucene:document&gt;
--- 
--- &lt;/lucene:index&gt;
--- </pre>
--- 
--- <h3 id="head-97d27647f366081a18adc8469538e908e6354ed4">What the lucene:index
--- document means</h3>
--- 
--- <h4 id="head-9e412039c4f6090a2aaac081c56f522ac97b8985">The lucene:index element
--- </h4>
--- 
--- <p>The root element is <tt>lucene:index</tt>. The attributes of the
--- <tt>lucene:index</tt> in the sample above are shown with their default values -
--- so the effect is as if they were not specified at all.</p>
--- 
--- <h4 id="head-40afef17a5a56ab2e729d18163f1bc960a8ce2cc">The merge-factor and
--- analyzer attributes</h4>
--- 
--- <p>See
--- <a href="http://jakarta.apache.org/lucene/docs/index.html"><img width="11" height="11" src="http://wiki.apache.org/wiki/modern/img/moin-www.png"/>
--- the Lucene documentation</a> for explanations of what they mean.</p>
--- 
--- <h4 id="head-84967edae247fc0739e57bc3af497f832b880582">The optimize-frequency
--- attribute (since version 2.2)</h4>
--- 
--- <p>Determines how often the Lucene index will be optimized. When you have thousands
--- of documents, optimizing the index can become quite slow (e.g. 7 seconds for 9000
--- small docs, P4).</p>
--- 
--- <ul>
--- <li>
--- <p>1: always optimize (default)</p>
--- </li>
--- <li>
--- <p>0: never optimize</p>
--- </li>
--- <li>
--- <p>x: optimize roughly once every x updates. You can use any number; a random
--- generator determines whether or not to optimize on a given run.</p>
--- </li>
--- </ul>
--- 
--- <p>For example, you can create one pipeline without optimization, which is used
--- to index your document every time it is modified, and another pipeline that
--- optimizes the index, which is called manually. For more information, see the
--- Lucene FAQ entry "What is index optimization and when should I use it?":</p>
--- 
--- <p>
--- <a href="http://wiki.apache.org/lucene-java/LuceneFAQ#head-fd848c31f4dc7b91727be6f40a7f5fbe2c66cfb8"><img width="11" height="11" src="http://wiki.apache.org/wiki/modern/img/moin-www.png"/>
--- http://wiki.apache.org/lucene-java/LuceneFAQ#head-fd848c31f4dc7b91727be6f40a7f5fbe2c66cfb8</a>
--- </p>
--- 
--- <h4 id="head-51123b488fc39c0a36b69c0e24608052fd45a86d">The directory attribute
--- </h4>
--- 
--- <p>This attribute controls where the index files are stored. The path is
--- relative to the Cocoon <tt>work</tt> directory.</p>
--- 
--- <h4 id="head-9b03e7cb891515af05d6a3bde919087262b146aa">The create attribute</h4>
--- 
--- <p>This attribute controls whether the index is recreated.</p>
--- 
--- <ul>
--- <li>
--- <p>If create = "false" and the index already exists then the index will be
--- updated. Documents which are already indexed will be removed from the index and
--- reinserted.</p>
--- </li>
--- <li>
--- <p>If the index does not exist then it will be created even if
--- <tt>create = "false"</tt>.</p>
--- </li>
--- <li>
--- <p>If <tt>create = "true"</tt> then any existing index will be destroyed and a
--- new index created. If you are rebuilding your entire index then you should use
--- <tt>create = "true"</tt> because the indexer doesn't need to remove old
--- documents from the index, so it will be faster.</p>
--- </li>
--- </ul>
--- 
--- <h4 id="head-9585e2ebba0108dc71917a21a4d9ed1edca00732">The lucene:document
--- element</h4>
--- 
--- <p>Lucene will index the content of each <tt>lucene:document</tt>, which may
--- contain any XML content. The indexed content is associated with the URL specified
--- by the <tt>url</tt> attribute, so this URL is what a search returns in its results.
--- </p>
--- 
--- <h4 id="head-5f2ae3b3aceb65a1fd0cb0942a0385fa7c4a4e2e">The lucene:text-attr
--- attribute</h4>
--- 
--- <p>Normally Lucene will only index the content of these elements, not attribute
--- values. To index the attributes of an element as well, give it an attribute
--- called <tt>lucene:text-attr</tt>, containing a list of the names of the
--- attributes you want indexed. For example, to index the value of the <tt>alt</tt>
--- attribute of an <tt>img</tt> element, in <tt>html</tt>:</p>
--- 
--- <pre>&lt;img src="blah-small.jpg" alt="Blah" lucene:text-attr="alt"/&gt;
--- </pre>
--- 
--- <p>This would index the text "Blah".</p>
--- 
--- <h4 id="head-b85bebbca6ee9807e0a7165b5208677c1616aca7">The lucene:store
--- attribute</h4>
--- 
--- <p>Normally Lucene will only index the text of an element, not store it. To
--- store the text of an element in Lucene's index, add a
--- <tt>lucene:store="true"</tt> attribute to the element. It's a good idea to store
--- the title of a document in Lucene, so that your search results can show a
--- document title as well as a URL.</p>
--- 
--- <h3 id="head-c55afa96d19d0ca7161da59bedf6409cbbfd78c2">The transformation</h3>
--- 
--- <p>The transformer copies the source document to the output, except for the
--- content of the <tt>lucene:document</tt> elements.</p>
--- 
--- <p>The transformer also adds an <tt>elapsed-time</tt> attribute to the output
--- <tt>lucene:document</tt> elements, showing the time (in milliseconds) taken to
--- index that document. You can use XSLT to transform the results into a report on
--- the indexing operation.</p>
--- 
--- <h4 id="head-c9a731f4df69c482e3c1d40fcc39e94b3fb16307">Sample output</h4>
--- 
--- <pre>&lt;?xml version="1.0" encoding="UTF-8"?&gt;
--- &lt;lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
---         merge-factor="20"
---         create="false"
---         directory="index"
---         analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer"&gt;
---         &lt;lucene:document url="JCB-001/full.html" elapsed-time="3846"/&gt;
---         &lt;lucene:document url="JCB-001/_div1-N1017B.html" elapsed-time="3735"/&gt;
---         &lt;lucene:document url="JCB-002/full.html" elapsed-time="361"/&gt;
---         &lt;lucene:document url="JCB-002/_div1-N10190.html" elapsed-time="1302"/&gt;
---         &lt;lucene:document url="JCB-003/full.html" elapsed-time="300"/&gt;
---         &lt;lucene:document url="JCB-003/_div1-N10188.html" elapsed-time="1352"/&gt;
--- &lt;/lucene:index&gt;
--- </pre>
--- 
--- <h5 id="head-24e83f0c8063ca175d6e8a1a80e51e1ed9fbc20b">Note to users of Mac OS X
--- </h5>
--- 
--- <p>Java cannot open more than 256 files at a time by default, so you may get an
--- error like the following:</p>
--- 
--- <pre>Description: org.apache.cocoon.ProcessingException: 
--- Failed to execute pipeline.: java.lang.RuntimeException: 
--- java.io.FileNotFoundException:  
--- /usr/local/tomcat-4/work/Standalone/localhost/_/cocoon-files/index/_15.f86 
--- (Too many open files)
--- </pre>
--- 
--- <p>To avoid this error, you should set your ulimit in the shell script that
--- starts Tomcat. My line reads as follows:</p>
--- 
--- <pre>ulimit -S -n 1000
--- </pre>
--- 
--- <p>Read more about this here:
--- <a href="http://www.amug.org/%7Eglguerin/howto/More-open-files.html"><img width="11" height="11" src="http://wiki.apache.org/wiki/modern/img/moin-www.png"/>
--- http://www.amug.org/~glguerin/howto/More-open-files.html</a></p>
--- 
--- <h5 id="head-f9fcf2cc3f693a586067cd49d3cbe85a6297d60e">Note to users of Redhat
--- Linux</h5>
--- 
--- <p>If you get the following error: (Empty StackException) while creating the
--- index with the LuceneIndexTransformer, try lowering your merge-factor
--- (default should be 10). Look at the
--- <a href="http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#mergeFactor"><img width="11" height="11" src="http://wiki.apache.org/wiki/modern/img/moin-www.png"/>
--- Lucene documentation</a> for more information.</p>
--- 
    </body>
    </html>
