concepts xmlsearching.xml

huber Sun, 20 Jan 2002 01:41:16 -0800

huber       02/01/20 01:44:18

  Added:       src/documentation/xdocs/userdocs/concepts xmlsearching.xml
  Log:
  Documentation about crawler, and search Cocoon components
  
  Revision  Changes    Path
  1.1                  xml-cocoon2/src/documentation/xdocs/userdocs/concepts/xmlsearching.xml
  
  Index: xmlsearching.xml
  ===================================================================
  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.0//EN" "../../dtd/document-v10.dtd">
  
  <document>
    <header>
      <title>XML Searching</title>
      <subtitle>Search resources in Apache Cocoon</subtitle>
      <version>1.0</version> 
      <type>Technical document</type> 
      <authors>
        <person name="Bernhard Huber" email="[EMAIL PROTECTED]"/>
      </authors>
    </header>
  
    <body>
      <s1 title="Introduction">
        <p>
          This document describes indexing and searching XML documents
          in Apache Cocoon.
        </p>
        <p>
          Indexing is the process of fetching XML documents from an Apache Cocoon
          instance and building an index file from them.
          Searching is the process of querying the index once it has been built.
        </p>
      </s1>
   
      <s1 title="Decomposition of XMLSearching">
        <p>
          The indexing process is split up into crawling, fetching URL resources,
          and generating the index.
        </p>
        <p>
          The searching process is split up into searching
          and feeding the search result into the
          Apache Cocoon pipeline.
        </p>
        <s2 title="Crawling">
          <p>
            The crawling process is specified by:
          </p>
          <ol>
            <li>Base URL to start crawling from</li>
            <li>Included and excluded URLs</li>
            <li>Cocoon view to use for requesting links from an XML resource</li>
          </ol>
          <p>
            Specifying the base URL determines the protocol for fetching XML resources.
            The implementation accepts <code>http:</code> URLs for
            crawling an Apache Cocoon instance deployed in a servlet engine.
            Alternatively, you may specify a URI, e.g. <code>/documents/index.html</code>,
            to crawl the local Apache Cocoon instance only, whether it is
            servlet-deployed or running in command-line mode.
          </p>
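          <p>
            The include and exclude rules can be pictured as a simple URL filter.
            The following Java sketch is not the Cocoon crawler API: the class
            <code>UrlFilterSketch</code> and its methods are hypothetical. It assumes
            that the rules are regular expressions matched against the whole URL, that an
            empty include list admits every URL, and that an exclude rule always wins.
          </p>
          <source><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of include/exclude filtering; not the Cocoon crawler API. */
final class UrlFilterSketch {
    private final List<Pattern> includes;
    private final List<Pattern> excludes;

    UrlFilterSketch(List<String> includeRegexes, List<String> excludeRegexes) {
        this.includes = compileAll(includeRegexes);
        this.excludes = compileAll(excludeRegexes);
    }

    /** Assumption: a URL is crawled if it is included and not excluded. */
    boolean accepts(String url) {
        boolean included = includes.isEmpty() || matchesAny(includes, url);
        return included && !matchesAny(excludes, url);
    }

    private static List<Pattern> compileAll(List<String> regexes) {
        List<Pattern> patterns = new ArrayList<>();
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
        return patterns;
    }

    private static boolean matchesAny(List<Pattern> patterns, String url) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch(
                List.of(".*\\.html"), List.of(".*/private/.*"));
        System.out.println(filter.accepts("/documents/index.html"));
        System.out.println(filter.accepts("/documents/private/notes.html"));
    }
}
```
]]></source>
          <p>
            A crawler built this way would apply <code>accepts</code> to every link
            it discovers before queuing that link for fetching.
          </p>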
        </s2>
        <s2 title="Fetching URL resource">
          <p>
            This processing step fetches a URL resource from Apache Cocoon.
          </p>
          <p>
            Apache Cocoon offers views, which expose a resource at an intermediate
            stage of its pipeline. This feature is used to fetch the 'bare' content of a URL.
          </p>
          <p>
            This processing step uses the crawling component described above
            to retrieve the links of an XML document.
            Each link is augmented with a Cocoon view name before the XML resource is fetched.
          </p>
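          <p>
            Augmenting a link with a view name can be sketched as follows.
            The sketch assumes that the view is selected with the
            <code>cocoon-view</code> request parameter and that the view name is a
            plain token needing no URL encoding. The helper <code>ViewUrlSketch</code>
            and the view names <code>links</code> and <code>content</code> are
            illustrative; the views actually available depend on the sitemap.
          </p>
          <source><![CDATA[
```java
/** Hypothetical helper; it is not part of the Cocoon crawler or indexer. */
final class ViewUrlSketch {

    /** Appends the view parameter, respecting an existing query string. */
    static String withView(String url, String viewName) {
        String separator = url.contains("?") ? "&" : "?";
        return url + separator + "cocoon-view=" + viewName;
    }

    public static void main(String[] args) {
        // Assumed view names: one view for link extraction, one for bare content.
        System.out.println(withView("/documents/index.html", "links"));
        System.out.println(withView("/documents/index.html", "content"));
    }
}
```
]]></source>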
          <p>
            The Avalon component <code>CocoonCrawler</code> defines the interface
            of a crawler.
          </p>
        </s2>
        <s2 title="Generating index">
          <p>
            An XML resource is fed into an indexing engine.
            Generating an index specifies which elements of an XML resource
            are indexed and how the elements are stored in the index.
            Moreover, the physical file location of the index is specified by
            this processing step.
          </p>
          <p>
            The current implementation splits up an XML resource as follows:
          </p>
          <ul>
            <li>A Lucene Analyzer is used for splitting up text.</li>
            <li>Each XML element is indexed using its name as Lucene field name.</li>
            <li>Each XML attribute is indexed using its element name and the attribute name
              as field name. An attribute has the following field name:
              <code>{element-name}@{attribute-name}</code>.
            </li>
          </ul>
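          <p>
            The field naming rules above can be illustrated with a plain SAX handler.
            This is a minimal sketch and not the <code>LuceneIndexContentHandler</code>
            implementation: the class <code>FieldLayoutSketch</code> is hypothetical,
            it collects <code>name=value</code> strings instead of Lucene fields,
            it does not run an Analyzer, and it assumes that text is credited only to
            the innermost enclosing element.
          </p>
          <source><![CDATA[
```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: maps SAX events to "fieldName=value" strings. */
final class FieldLayoutSketch extends DefaultHandler {
    final List<String> fields = new ArrayList<>();
    private final Deque<StringBuilder> openText = new ArrayDeque<>();
    private final StringBuilder body = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Each attribute becomes a field named {element-name}@{attribute-name}.
        for (int i = 0; i < atts.getLength(); i++) {
            fields.add(qName + "@" + atts.getQName(i) + "=" + atts.getValue(i));
        }
        openText.push(new StringBuilder());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Assumption: text belongs to the innermost open element only.
        if (!openText.isEmpty()) {
            openText.peek().append(ch, start, length);
        }
        body.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Each element with text becomes a field named after the element.
        String text = openText.pop().toString().trim();
        if (!text.isEmpty()) {
            fields.add(qName + "=" + text);
        }
    }

    @Override
    public void endDocument() {
        // The whole text of the document goes into the "body" field.
        fields.add("body=" + body.toString().trim().replaceAll("\\s+", " "));
    }

    static List<String> fieldsOf(String xml) {
        FieldLayoutSketch handler = new FieldLayoutSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("XML could not be parsed", e);
        }
        return handler.fields;
    }

    public static void main(String[] args) {
        String xml = "<document><s1 title=\"Cocoon search\">"
                + "<abstract>Indexing XML</abstract></s1></document>";
        fieldsOf(xml).forEach(System.out::println);
    }
}
```
]]></source>
          <p>
            For the sample input in <code>main</code>, the sketch yields the fields
            <code>s1@title</code>, <code>abstract</code>, and <code>body</code>.
          </p>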
          <p>
            The Avalon component <code>LuceneCocoonIndexer</code> defines the interface
            of an indexer.
          </p>
          <p>
            The Avalon component <code>LuceneXMLIndexer</code> defines an interface
            for building a Lucene index from an XML document. It uses a SAX content handler
            for parsing an XML document and generating Lucene fields. The current
            index layout is implemented by <code>SimpleLuceneXMLIndexerImpl</code>
            and <code>LuceneIndexContentHandler</code>.
          </p>
        </s2>
  
        <s2 title="Searching">
          <p>
            This process uses a search engine for querying the index.
            The input of this process is a search query string; the output is the
            list of hits returned by the search engine.
          </p>
          <p>
            The Avalon component <code>LuceneCocoonSearcher</code> defines an interface
            for searching a Lucene index.
          </p>
        </s2>
        <s2 title="Feeding Search Results">
          <p>
            This is the final step for presenting information stored in the index.
            The result of the search engine is fed into the Cocoon processing pipeline.
          </p>
          <p>
            A GUI for the searching process may be developed using any
            Java-enabled scripting language, such as JSP or XSP.
            Moreover, a sitemap generator component, <code>SearchGenerator</code>,
            is provided, which transforms the search result to XML and feeds it
            into the Cocoon processing pipeline.
          </p>
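          <p>
            Turning a list of hits into XML can be pictured with the sketch below.
            It does not reproduce the output format of <code>SearchGenerator</code>:
            the class <code>HitsXmlSketch</code>, the element names <code>hits</code>
            and <code>hit</code>, and the attributes are all hypothetical, and the
            result is built as a string rather than as SAX events.
          </p>
          <source><![CDATA[
```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: serializes search hits to XML, best score first. */
final class HitsXmlSketch {

    /** A hit as assumed here: the document URL and its relevance score. */
    static final class Hit {
        final String url;
        final float score;

        Hit(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    static String toXml(List<Hit> hits) {
        // Sort a copy so that the caller's list is left untouched.
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());

        StringBuilder xml = new StringBuilder("<hits>");
        int rank = 0;
        for (Hit hit : sorted) {
            xml.append("<hit rank=\"").append(rank++)
               .append("\" score=\"").append(hit.score)
               .append("\" url=\"").append(escape(hit.url))
               .append("\"/>");
        }
        return xml.append("</hits>").toString();
    }

    /** Escapes the characters that are not allowed in attribute values. */
    static String escape(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;")
                    .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
                new Hit("/documents/a.html", 0.25f),
                new Hit("/documents/b.html?x=1&y=2", 0.75f));
        System.out.println(toXml(hits));
    }
}
```
]]></source>
          <p>
            An XSLT stylesheet later in the pipeline could then render such a
            document as an HTML table of hits.
          </p>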
        </s2>
      </s1>
      
      <s1 title="Interdependencies">
        <p>
          As both Avalon components, <code>LuceneXMLIndexer</code> and
          <code>LuceneCocoonSearcher</code>, may use the same Lucene index, you must
          keep the Lucene index structure consistent in both components.
        </p>
        <p>
          The current implementation uses the following Lucene index layout:
        </p>
        <ul>
          <li>The Lucene field <code>body</code> indexes the plain text of an XML document.
            The <code>body</code> field is the default field for searching. Thus,
            the query strings <code>foo</code> and <code>body:foo</code> are equivalent.
          </li>
          <li>Each XML element generates a Lucene field having the same name as the XML element.
            For example, to search for occurrences of <code>Cocoon</code> inside an
            XML <code>abstract</code> element, use the query string <code>abstract:Cocoon</code>.
          </li>
          <li>Each XML attribute generates a Lucene field having the name
            <code>{element-name}@{attribute-name}</code>.
            For example, to search for occurrences of <code>Cocoon</code> inside the
            <code>title</code> attribute of an <code>s1</code> element, use the
            query string <code>s1@title:Cocoon</code>.
          </li>
          <li>
            The Lucene field <code>url</code> stores the URI of the indexed document.
            As all fields described above are indexed only, and no XML document
            is stored inside the Lucene index, this field is the only reference to the
            XML document resource.
          </li>
          <li>
            The Lucene field <code>uid</code> stores a unique id that is used for updating
            the index. This field is used for checking whether the XML resource is newer than
            the information stored in the Lucene index.
          </li>
        </ul>
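        <p>
          The role of the default field can be illustrated at the string level.
          The sketch below is not how Lucene parses queries: the class
          <code>QueryFieldSketch</code> is hypothetical, and it assumes a query that
          consists of a single term with at most one <code>field:</code> prefix.
        </p>
        <source><![CDATA[
```java
/** Hypothetical sketch: makes the implicit default field of a term explicit. */
final class QueryFieldSketch {
    static final String DEFAULT_FIELD = "body";

    /** Returns the term unchanged if it already names a field. */
    static String qualify(String term) {
        return term.contains(":") ? term : DEFAULT_FIELD + ":" + term;
    }

    public static void main(String[] args) {
        System.out.println(qualify("foo"));              // body:foo
        System.out.println(qualify("abstract:Cocoon"));  // abstract:Cocoon
        System.out.println(qualify("s1@title:Cocoon"));  // s1@title:Cocoon
    }
}
```
]]></source>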
      </s1>
        
      <s1 title="Configuration">
        <p>
          The indexing and searching Avalon components are configured
          in the <code>cocoon.xconf</code> file.
        </p>
        <p>
          The sitemap component <code>SearchGenerator</code> is set up in the
          <code>sitemap.xmap</code> file.
        </p>
      </s1>
      
      <s1 title="Implementation notes">
        <p>
          The package <code>org.apache.cocoon.components.search</code> holds
          all search-related components.
          The current implementation uses
          <link href="http://jakarta.apache.org/lucene">Jakarta Lucene</link>
          as its indexing and searching engine.
        </p>
        <p>
          <code>SearchGenerator</code> is a sitemap generator and is available in
          the package <code>org.apache.cocoon.generation</code>.
        </p>
        <p>
          The package <code>org.apache.cocoon.components.crawler</code> holds
          all crawling-related sources.
        </p>
      </s1>
      
      <s1 title="WebApp Sample usage">
        <p>
          The Cocoon sample web application has a link for generating
          an index of the Cocoon documentation and for searching the
          Cocoon documentation.
        </p>
        <p>
          The following list describes step by step how to use
          the webapp sample page:
        </p>
        <ol>
          <li>Go to the page "Search the docs".
          </li>
          <li>Create an index by following the link "create".
            Creating an index may take some time, as the implementation
            accesses the XML resources via the http: protocol.
          </li>
          <li>Next, you may query the index by following the
            link "XSP" or "Cocoon Generators". Typing in a query
            produces a table of hits ordered by relevance.
          </li>
        </ol>
        <p>
          As a result of the creation step, there should be a
          Lucene index in the directory <code>index</code>
          below the temporary working directory of the servlet engine.
        </p>
        <p>
          The "XSP" link for searching shows an XSP implementation that
          invokes the Avalon component <code>CocoonSearch</code>.
          This approach gives fine-grained control
          over the searching process.
        </p>
        <p>
          The "Cocoon Generator" link leads to a pipeline that is defined in the sitemap,
          uses the <code>SearchGenerator</code>, and transforms the XML search result to HTML.
          This approach tries to minimize the effort of using searching,
          as you only need to adapt the XSLT transformation step to your
          needs.
        </p>
      </s1>
      
      <s1 title="Summary">
        <p>
          This document gives an overview of the components for
          using an indexing and searching engine in Cocoon.
          It describes the component decomposition of the Cocoon
          XMLSearch subsystem.
        </p>
      </s1>
    </body>
  </document>

