Added: websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/FileFormat.html ============================================================================== --- websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/FileFormat.html (added) +++ websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/FileFormat.html Wed Sep 28 12:07:48 2016 @@ -0,0 +1,260 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> +<html lang="en"> + <head> + <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"> + <title>Lucy::Docs::FileFormat</title> + <link rel="stylesheet" type="text/css" media="screen" href="/css/lucy.css"> + </head> + + <body> + + <div id="lucy-rigid_wrapper"> + + <div id="lucy-top" class="container_16 lucy-white_box_3d"> + + <div id="lucy-logo_box" class="grid_8"> + <a href="/"><img src="/images/lucy_logo_150x100.png" alt="Apache Lucy™"></a> + </div> <!-- lucy-logo_box --> + + <div #id="lucy-top_nav_box" class="grid_8"> + <div id="lucy-top_nav_bar" class="container_8"> + <ul> + <li><a href="http://www.apache.org/" title="Apache Software Foundation">Apache Software Foundation</a></li> + <li><a href="http://www.apache.org/licenses/" title="License">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html" title="Sponsorship">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html" title="Thanks">Thanks</a></li> + <li><a href="http://www.apache.org/security/ " title="Security">Security</a></li> + </ul> + </div> <!-- lucy-top_nav_bar --> + <p><a href="http://www.apache.org/">Apache</a> » <a href="/">Lucy</a> » <a href="/docs/">Docs</a> » <a href="/docs/0.5.0/">0.5.0</a> » <a href="/docs/0.5.0/c/">C</a> » <a href="/docs/0.5.0/c/Lucy/">Lucy</a> » <a href="/docs/0.5.0/c/Lucy/Docs/">Docs</a></p> + <form name="lucy-top_search_box" id="lucy-top_search_box" action="http://www.google.com/search" method="get"> + <input value="*.apache.org" name="sitesearch" type="hidden"/> + <input 
type="text" name="q" id="query" style="width:85%"> + <input type="submit" id="submit" value="Search"> + </form> + </div> <!-- lucy-top_nav_box --> + + <div class="clear"></div> + + </div> <!-- lucy-top --> + + <div id="lucy-main_content" class="container_16 lucy-white_box_3d"> + + <div class="grid_4" id="lucy-left_nav_box"> + <h6>About</h6> + <ul> + <li><a href="/">Welcome</a></li> + <li><a href="/clownfish.html">Clownfish</a></li> + <li><a href="/faq.html">FAQ</a></li> + <li><a href="/people.html">People</a></li> + </ul> + <h6>Resources</h6> + <ul> + <li><a href="/download.html">Download</a></li> + <li><a href="/mailing_lists.html">Mailing Lists</a></li> + <li><a href="/docs/">Documentation</a></li> + <li><a href="http://wiki.apache.org/lucy/">Wiki</a></li> + <li><a href="https://issues.apache.org/jira/browse/LUCY">Issue Tracker</a></li> + <li><a href="/version_control.html">Version Control</a></li> + </ul> + <h6>Related Projects</h6> + <ul> + <li><a href="http://lucene.apache.org/core/">Lucene</a></li> + <li><a href="http://dezi.org/">Dezi</a></li> + <li><a href="http://lucene.apache.org/solr/">Solr</a></li> + <li><a href="http://lucenenet.apache.org/">Lucene.NET</a></li> + <li><a href="http://lucene.apache.org/pylucene/">PyLucene</a></li> + </ul> + </div> <!-- lucy-left_nav_box --> + + <div id="lucy-main_content_box" class="grid_9"> + <div class="c-api"> +<h2>Overview of index file format</h2> +<p>It is not necessary to understand the current implementation details of the +index file format in order to use Apache Lucy effectively, but it may be +helpful if you are interested in tweaking for high performance, exotic usage, +or debugging and development.</p> +<p>On a file system, an index is a directory. 
The files inside have a +hierarchical relationship: an index is made up of “segments”, each of which is +an independent inverted index with its own subdirectory; each segment is made +up of several component parts.</p> +<pre><code>[index]--| + |--snapshot_XXX.json + |--schema_XXX.json + |--write.lock + | + |--seg_1--| + | |--segmeta.json + | |--cfmeta.json + | |--cf.dat-------| + | |--[lexicon] + | |--[postings] + | |--[documents] + | |--[highlight] + | |--[deletions] + | + |--seg_2--| + | |--segmeta.json + | |--cfmeta.json + | |--cf.dat-------| + | |--[lexicon] + | |--[postings] + | |--[documents] + | |--[highlight] + | |--[deletions] + | + |--[...]--| +</code></pre> +<h3>Write-once philosophy</h3> +<p>All segment directory names consist of the string “seg_” followed by a number +in base 36: seg_1, seg_5m, seg_p9s2 and so on, with higher numbers indicating +more recent segments. Once a segment is finished and committed, its name is +never re-used and its files are never modified.</p> +<p>Old segments become obsolete and can be removed when their data has been +consolidated into new segments during the process of segment merging and +optimization. A fully optimized index has only one segment.</p> +<h3>Top-level entries</h3> +<p>There are a handful of “top-level” files and directories which belong to the +entire index rather than to a particular segment.</p> +<h4>snapshot_XXX.json</h4> +<p>A “snapshot” file, e.g. <code>snapshot_m7p.json</code>, is a list of index files and +directories. Because index files, once written, are never modified, the list +of entries in a snapshot defines a point-in-time view of the data in an index.</p> +<p>Like segment directories, snapshot files also utilize the +unique-base-36-number naming convention; the higher the number, the more +recent the file. The appearance of a new snapshot file within the index +directory constitutes an index update. 
While a new segment is being written, +new files may be added to the index directory, but until a new snapshot file +gets written, a Searcher opening the index for reading won’t know about them.</p> +<h4>schema_XXX.json</h4> +<p>The schema file is a Schema object describing the index’s format, serialized +as JSON. It, too, is versioned, and a given snapshot file will reference one +and only one schema file.</p> +<h4>locks</h4> +<p>By default, only one indexing process may safely modify the index at any given +time. Processes reserve an index by laying claim to the <code>write.lock</code> file +within the <code>locks/</code> directory. A smattering of other lock files may be used +from time to time, as well.</p> +<h3>A segment’s component parts</h3> +<p>By default, each segment has up to five logical components: lexicon, postings, +document storage, highlight data, and deletions. Binary data from these +components gets stored in virtual files within the “cf.dat” compound file; +metadata is stored in a shared “segmeta.json” file.</p> +<h4>segmeta.json</h4> +<p>The segmeta.json file is a central repository for segment metadata. In +addition to information such as document counts and field numbers, it also +warehouses arbitrary metadata on behalf of individual index components.</p> +<h4>Lexicon</h4> +<p>Each indexed field gets its own lexicon in each segment. The exact files +involved depend on the field’s type, but generally speaking there will be two +parts. First, there’s a primary <code>lexicon-XXX.dat</code> file which houses a +complete term list associating terms with corpus frequency statistics, +postings file locations, etc. 
Second, one or more “lexicon index” files may +be present which contain periodic samples from the primary lexicon file to +facilitate fast lookups.</p> +<h4>Postings</h4> +<p>“Posting” is a technical term from the field of +<a href="../../Lucy/Docs/IRTheory.html">information retrieval</a>, defined as a single +instance of one term indexing one document. If you are looking at the index +in the back of a book, and you see that “freedom” is referenced on pages 8, +86, and 240, that would be three postings, which taken together form a +“posting list”. The same terminology applies to an index in electronic form.</p> +<p>Each segment has one postings file per indexed field. When a search is +performed for a single term, first that term is looked up in the lexicon. If +the term exists in the segment, the record in the lexicon will contain +information about which postings file to look at and where to look.</p> +<p>The first thing any posting record tells you is a document id. By iterating +over all the postings associated with a term, you can find all the documents +that match that term, a process which is analogous to looking up page numbers +in a book’s index. However, each posting record typically contains other +information in addition to document id, e.g. 
the positions at which the term +occurs within the field.</p> +<h4>Documents</h4> +<p>The document storage section is a simple database, organized into two files:</p> +<ul> +<li> +<p><strong>documents.dat</strong> - Serialized documents.</p> +</li> +<li> +<p><strong>documents.ix</strong> - Document storage index, a solid array of 64-bit integers +where each integer location corresponds to a document id, and the value at +that location points at a file position in the documents.dat file.</p> +</li> +</ul> +<h4>Highlight data</h4> +<p>The files which store data used for excerpting and highlighting are organized +similarly to the files used to store documents.</p> +<ul> +<li> +<p><strong>highlight.dat</strong> - Chunks of serialized highlight data, one per doc id.</p> +</li> +<li> +<p><strong>highlight.ix</strong> - Highlight data index: as with the <code>documents.ix</code> file, a +solid array of 64-bit file pointers.</p> +</li> +</ul> +<h4>Deletions</h4> +<p>When a document is “deleted” from a segment, it is not actually purged right +away; it is merely marked as “deleted” via a deletions file. Deletions files +contain bit vectors with one bit for each document in the segment; if bit +#254 is set then document 254 is deleted, and if that document turns up in a +search it will be masked out.</p> +<p>It is only when a segment’s contents are rewritten to a new segment during the +segment-merging process that deleted documents truly go away.</p> +<h3>Compound Files</h3> +<p>If you peer inside an index directory, you won’t actually find any files named +“documents.dat”, “highlight.ix”, etc. unless there is an indexing process +underway. What you will find instead is one “cf.dat” and one “cfmeta.json” +file per segment.</p> +<p>To minimize the need for file descriptors at search-time, all per-segment +binary data files are concatenated together in “cf.dat” at the close of each +indexing session. 
Information about where each file begins and ends is stored +in <code>cfmeta.json</code>. When the segment is opened for reading, a single file +descriptor per “cf.dat” file can be shared among several readers.</p> +<h3>A Typical Search</h3> +<p>Here’s a simplified narrative, dramatizing how a search for “freedom” against +a given segment plays out:</p> +<ol> +<li> +<p>The searcher asks the relevant Lexicon Index, “Do you know anything about +‘freedom’?” Lexicon Index replies, “Can’t say for sure, but if the main +Lexicon file does, ‘freedom’ is probably somewhere around byte 21008”.</p> +</li> +<li> +<p>The main Lexicon tells the searcher “One moment, let me scan our records… +Yes, we have 2 documents which contain ‘freedom’. You’ll find them in +seg_6/postings-4.dat starting at byte 66991.”</p> +</li> +<li> +<p>The Postings file says “Yep, we have ‘freedom’, all right! Document id 40 +has 1 ‘freedom’, and document 44 has 8. If you need to know more, like if any +‘freedom’ is part of the phrase ‘freedom of speech’, ask me about positions!”</p> +</li> +<li> +<p>If the searcher is only looking for “freedom” in isolation, that’s where it +stops. It now knows enough to assign the documents scores against “freedom”, +with the 8-freedom document likely ranking higher than the single-freedom +document.</p> +</li> +</ol> +</div> + + </div> <!-- lucy-main_content_box --> + <div class="clear"></div> + + </div> <!-- lucy-main_content --> + + <div id="lucy-copyright" class="container_16"> + <p>Copyright © 2010-2015 The Apache Software Foundation, Licensed under the + <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. + <br/> + Apache Lucy, Lucy, Apache, the Apache feather logo, and the Apache Lucy project logo are trademarks of The + Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their + respective owners. 
+ </p> + </div> <!-- lucy-copyright --> + + </div> <!-- lucy-rigid_wrapper --> + + </body> +</html>
Added: websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/FileLocking.html ============================================================================== --- websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/FileLocking.html (added) +++ websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/FileLocking.html Wed Sep 28 12:07:48 2016 @@ -0,0 +1,144 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> +<html lang="en"> + <head> + <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"> + <title>Lucy::Docs::FileLocking</title> + <link rel="stylesheet" type="text/css" media="screen" href="/css/lucy.css"> + </head> + + <body> + + <div id="lucy-rigid_wrapper"> + + <div id="lucy-top" class="container_16 lucy-white_box_3d"> + + <div id="lucy-logo_box" class="grid_8"> + <a href="/"><img src="/images/lucy_logo_150x100.png" alt="Apache Lucy™"></a> + </div> <!-- lucy-logo_box --> + + <div #id="lucy-top_nav_box" class="grid_8"> + <div id="lucy-top_nav_bar" class="container_8"> + <ul> + <li><a href="http://www.apache.org/" title="Apache Software Foundation">Apache Software Foundation</a></li> + <li><a href="http://www.apache.org/licenses/" title="License">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html" title="Sponsorship">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html" title="Thanks">Thanks</a></li> + <li><a href="http://www.apache.org/security/ " title="Security">Security</a></li> + </ul> + </div> <!-- lucy-top_nav_bar --> + <p><a href="http://www.apache.org/">Apache</a> » <a href="/">Lucy</a> » <a href="/docs/">Docs</a> » <a href="/docs/0.5.0/">0.5.0</a> » <a href="/docs/0.5.0/c/">C</a> » <a href="/docs/0.5.0/c/Lucy/">Lucy</a> » <a href="/docs/0.5.0/c/Lucy/Docs/">Docs</a></p> + <form name="lucy-top_search_box" id="lucy-top_search_box" action="http://www.google.com/search" method="get"> + <input value="*.apache.org" name="sitesearch" type="hidden"/> + <input 
type="text" name="q" id="query" style="width:85%"> + <input type="submit" id="submit" value="Search"> + </form> + </div> <!-- lucy-top_nav_box --> + + <div class="clear"></div> + + </div> <!-- lucy-top --> + + <div id="lucy-main_content" class="container_16 lucy-white_box_3d"> + + <div class="grid_4" id="lucy-left_nav_box"> + <h6>About</h6> + <ul> + <li><a href="/">Welcome</a></li> + <li><a href="/clownfish.html">Clownfish</a></li> + <li><a href="/faq.html">FAQ</a></li> + <li><a href="/people.html">People</a></li> + </ul> + <h6>Resources</h6> + <ul> + <li><a href="/download.html">Download</a></li> + <li><a href="/mailing_lists.html">Mailing Lists</a></li> + <li><a href="/docs/">Documentation</a></li> + <li><a href="http://wiki.apache.org/lucy/">Wiki</a></li> + <li><a href="https://issues.apache.org/jira/browse/LUCY">Issue Tracker</a></li> + <li><a href="/version_control.html">Version Control</a></li> + </ul> + <h6>Related Projects</h6> + <ul> + <li><a href="http://lucene.apache.org/core/">Lucene</a></li> + <li><a href="http://dezi.org/">Dezi</a></li> + <li><a href="http://lucene.apache.org/solr/">Solr</a></li> + <li><a href="http://lucenenet.apache.org/">Lucene.NET</a></li> + <li><a href="http://lucene.apache.org/pylucene/">PyLucene</a></li> + </ul> + </div> <!-- lucy-left_nav_box --> + + <div id="lucy-main_content_box" class="grid_9"> + <div class="c-api"> +<h2>Manage indexes on shared volumes.</h2> +<p>Normally, index locking is an invisible process. Exclusive write access is +controlled via lockfiles within the index directory and problems only arise +if multiple processes attempt to acquire the write lock simultaneously; +search-time processes do not ordinarily require locking at all.</p> +<p>On shared volumes, however, the default locking mechanism fails, and manual +intervention becomes necessary.</p> +<p>Both read and write applications accessing an index on a shared volume need +to identify themselves with a unique <code>host</code> id, e.g. 
hostname or +IP address. Knowing the host id makes it possible to tell which lockfiles +belong to other machines and therefore must not be removed when the +lockfile’s pid number appears not to correspond to an active process.</p> +<p>At index-time, the danger is that multiple indexing processes from +different machines which fail to specify a unique <code>host</code> id can +delete each other’s lockfiles and then attempt to modify the index at the +same time, causing index corruption. The search-time problem is more +complex.</p> +<p>Once an index file is no longer listed in the most recent snapshot, Indexer +attempts to delete it as part of a post-<a href="lucy:Indexer.Commit">Commit</a> cleanup routine. It is +possible that at the moment an Indexer is deleting files which it believes +are no longer needed, a Searcher referencing an earlier snapshot is in fact +using them. The more often that an index is either updated or searched, +the more likely it is that this conflict will arise from time to time.</p> +<p>Ordinarily, the deletion attempts are not a problem. On a typical unix +volume, the files will be deleted in name only: any process which holds an +open filehandle against a given file will continue to have access, and the +file won’t actually get vaporized until the last filehandle is cleared. +Thanks to “delete on last close semantics”, an Indexer can’t truly delete +the file out from underneath an active Searcher. On Windows, where file +deletion fails whenever any process holds an open handle, the situation is +different but still workable: Indexer just keeps retrying after each commit +until deletion finally succeeds.</p> +<p>On NFS, however, the system breaks, because NFS allows files to be deleted +out from underneath active processes. 
Should this happen, the unlucky read +process will crash with a “Stale NFS filehandle” exception.</p> +<p>Under normal circumstances, it is neither necessary nor desirable for +IndexReaders to secure read locks against an index, but for NFS we have to +make an exception. LockFactory’s <a href="lucy:LockFactory.Make_Shared_Lock">Make_Shared_Lock</a> method exists for this +reason; supplying an IndexManager instance to IndexReader’s constructor +activates an internal locking mechanism using <a href="lucy:LockFactory.Make_Shared_Lock">Make_Shared_Lock</a> which +prevents concurrent indexing processes from deleting files that are needed +by active readers.</p> +<pre><code>Code example for C is missing</code></pre> +<p>Since shared locks are implemented using lockfiles located in the index +directory (as are exclusive locks), reader applications must have write +access for read locking to work. Stale lock files from crashed processes +are ordinarily cleared away the next time the same machine, as identified +by the <code>host</code> parameter, opens another IndexReader. (The +classic technique of timing out lock files is not feasible because search +processes may lie dormant indefinitely.) However, please be aware that if +the last thing a given machine does is crash, lock files belonging to it +may persist, preventing deletion of obsolete index data.</p> +</div> + + </div> <!-- lucy-main_content_box --> + <div class="clear"></div> + + </div> <!-- lucy-main_content --> + + <div id="lucy-copyright" class="container_16"> + <p>Copyright © 2010-2015 The Apache Software Foundation, Licensed under the + <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. + <br/> + Apache Lucy, Lucy, Apache, the Apache feather logo, and the Apache Lucy project logo are trademarks of The + Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their + respective owners. 
+ </p> + </div> <!-- lucy-copyright --> + + </div> <!-- lucy-rigid_wrapper --> + + </body> +</html> Added: websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/IRTheory.html ============================================================================== --- websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/IRTheory.html (added) +++ websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/IRTheory.html Wed Sep 28 12:07:48 2016 @@ -0,0 +1,133 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> +<html lang="en"> + <head> + <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"> + <title>Lucy::Docs::IRTheory</title> + <link rel="stylesheet" type="text/css" media="screen" href="/css/lucy.css"> + </head> + + <body> + + <div id="lucy-rigid_wrapper"> + + <div id="lucy-top" class="container_16 lucy-white_box_3d"> + + <div id="lucy-logo_box" class="grid_8"> + <a href="/"><img src="/images/lucy_logo_150x100.png" alt="Apache Lucy™"></a> + </div> <!-- lucy-logo_box --> + + <div #id="lucy-top_nav_box" class="grid_8"> + <div id="lucy-top_nav_bar" class="container_8"> + <ul> + <li><a href="http://www.apache.org/" title="Apache Software Foundation">Apache Software Foundation</a></li> + <li><a href="http://www.apache.org/licenses/" title="License">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html" title="Sponsorship">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html" title="Thanks">Thanks</a></li> + <li><a href="http://www.apache.org/security/ " title="Security">Security</a></li> + </ul> + </div> <!-- lucy-top_nav_bar --> + <p><a href="http://www.apache.org/">Apache</a> » <a href="/">Lucy</a> » <a href="/docs/">Docs</a> » <a href="/docs/0.5.0/">0.5.0</a> » <a href="/docs/0.5.0/c/">C</a> » <a href="/docs/0.5.0/c/Lucy/">Lucy</a> » <a href="/docs/0.5.0/c/Lucy/Docs/">Docs</a></p> + <form name="lucy-top_search_box" id="lucy-top_search_box" action="http://www.google.com/search" 
method="get"> + <input value="*.apache.org" name="sitesearch" type="hidden"/> + <input type="text" name="q" id="query" style="width:85%"> + <input type="submit" id="submit" value="Search"> + </form> + </div> <!-- lucy-top_nav_box --> + + <div class="clear"></div> + + </div> <!-- lucy-top --> + + <div id="lucy-main_content" class="container_16 lucy-white_box_3d"> + + <div class="grid_4" id="lucy-left_nav_box"> + <h6>About</h6> + <ul> + <li><a href="/">Welcome</a></li> + <li><a href="/clownfish.html">Clownfish</a></li> + <li><a href="/faq.html">FAQ</a></li> + <li><a href="/people.html">People</a></li> + </ul> + <h6>Resources</h6> + <ul> + <li><a href="/download.html">Download</a></li> + <li><a href="/mailing_lists.html">Mailing Lists</a></li> + <li><a href="/docs/">Documentation</a></li> + <li><a href="http://wiki.apache.org/lucy/">Wiki</a></li> + <li><a href="https://issues.apache.org/jira/browse/LUCY">Issue Tracker</a></li> + <li><a href="/version_control.html">Version Control</a></li> + </ul> + <h6>Related Projects</h6> + <ul> + <li><a href="http://lucene.apache.org/core/">Lucene</a></li> + <li><a href="http://dezi.org/">Dezi</a></li> + <li><a href="http://lucene.apache.org/solr/">Solr</a></li> + <li><a href="http://lucenenet.apache.org/">Lucene.NET</a></li> + <li><a href="http://lucene.apache.org/pylucene/">PyLucene</a></li> + </ul> + </div> <!-- lucy-left_nav_box --> + + <div id="lucy-main_content_box" class="grid_9"> + <div class="c-api"> +<h2>Crash course in information retrieval</h2> +<p>Just enough Information Retrieval theory to find your way around Apache Lucy.</p> +<h3>Terminology</h3> +<p>Lucy uses some terminology from the field of information retrieval which +may be unfamiliar to many users. 
“Document” and “term” mean pretty much what +you’d expect them to, but others such as “posting” and “inverted index” need a +formal introduction:</p> +<ul> +<li><em>document</em> - An atomic unit of retrieval.</li> +<li><em>term</em> - An attribute which describes a document.</li> +<li><em>posting</em> - One term indexing one document.</li> +<li><em>term list</em> - The complete list of terms which describe a document.</li> +<li><em>posting list</em> - The complete list of documents which a term indexes.</li> +<li><em>inverted index</em> - A data structure which maps from terms to documents.</li> +</ul> +<p>Since Lucy is a practical implementation of IR theory, it loads these +abstract, distilled definitions down with useful traits. For instance, a +“posting” in its most rarefied form is simply a term-document pairing; in +Lucy, the class MatchPosting fills this +role. However, by associating additional information with a posting like the +number of times the term occurs in the document, we can turn it into a +ScorePosting, making it possible +to rank documents by relevance rather than just list documents which happen to +match in no particular order.</p> +<h3>TF/IDF ranking algorithm</h3> +<p>Lucy uses a variant of the well-established “Term Frequency / Inverse +Document Frequency” weighting scheme. 
A thorough treatment of TF/IDF is too +ambitious for our present purposes, but in a nutshell, it means that…</p> +<ul> +<li> +<p>in a search for <code>skate park</code>, documents which score well for the +comparatively rare term <code>skate</code> will rank higher than documents which score +well for the more common term <code>park</code>.</p> +</li> +<li> +<p>a 10-word text which has one occurrence each of both <code>skate</code> and <code>park</code> will +rank higher than a 1000-word text which also contains one occurrence of each.</p> +</li> +</ul> +<p>A web search for “tf idf” will turn up many excellent explanations of the +algorithm.</p> +</div> + + </div> <!-- lucy-main_content_box --> + <div class="clear"></div> + + </div> <!-- lucy-main_content --> + + <div id="lucy-copyright" class="container_16"> + <p>Copyright © 2010-2015 The Apache Software Foundation, Licensed under the + <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. + <br/> + Apache Lucy, Lucy, Apache, the Apache feather logo, and the Apache Lucy project logo are trademarks of The + Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their + respective owners. 
+ </p> + </div> <!-- lucy-copyright --> + + </div> <!-- lucy-rigid_wrapper --> + + </body> +</html> Added: websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/Tutorial.html ============================================================================== --- websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/Tutorial.html (added) +++ websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/Tutorial.html Wed Sep 28 12:07:48 2016 @@ -0,0 +1,142 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> +<html lang="en"> + <head> + <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"> + <title>Lucy::Docs::Tutorial</title> + <link rel="stylesheet" type="text/css" media="screen" href="/css/lucy.css"> + </head> + + <body> + + <div id="lucy-rigid_wrapper"> + + <div id="lucy-top" class="container_16 lucy-white_box_3d"> + + <div id="lucy-logo_box" class="grid_8"> + <a href="/"><img src="/images/lucy_logo_150x100.png" alt="Apache Lucy™"></a> + </div> <!-- lucy-logo_box --> + + <div #id="lucy-top_nav_box" class="grid_8"> + <div id="lucy-top_nav_bar" class="container_8"> + <ul> + <li><a href="http://www.apache.org/" title="Apache Software Foundation">Apache Software Foundation</a></li> + <li><a href="http://www.apache.org/licenses/" title="License">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html" title="Sponsorship">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html" title="Thanks">Thanks</a></li> + <li><a href="http://www.apache.org/security/ " title="Security">Security</a></li> + </ul> + </div> <!-- lucy-top_nav_bar --> + <p><a href="http://www.apache.org/">Apache</a> » <a href="/">Lucy</a> » <a href="/docs/">Docs</a> » <a href="/docs/0.5.0/">0.5.0</a> » <a href="/docs/0.5.0/c/">C</a> » <a href="/docs/0.5.0/c/Lucy/">Lucy</a> » <a href="/docs/0.5.0/c/Lucy/Docs/">Docs</a></p> + <form name="lucy-top_search_box" id="lucy-top_search_box" action="http://www.google.com/search" 
method="get"> + <input value="*.apache.org" name="sitesearch" type="hidden"/> + <input type="text" name="q" id="query" style="width:85%"> + <input type="submit" id="submit" value="Search"> + </form> + </div> <!-- lucy-top_nav_box --> + + <div class="clear"></div> + + </div> <!-- lucy-top --> + + <div id="lucy-main_content" class="container_16 lucy-white_box_3d"> + + <div class="grid_4" id="lucy-left_nav_box"> + <h6>About</h6> + <ul> + <li><a href="/">Welcome</a></li> + <li><a href="/clownfish.html">Clownfish</a></li> + <li><a href="/faq.html">FAQ</a></li> + <li><a href="/people.html">People</a></li> + </ul> + <h6>Resources</h6> + <ul> + <li><a href="/download.html">Download</a></li> + <li><a href="/mailing_lists.html">Mailing Lists</a></li> + <li><a href="/docs/">Documentation</a></li> + <li><a href="http://wiki.apache.org/lucy/">Wiki</a></li> + <li><a href="https://issues.apache.org/jira/browse/LUCY">Issue Tracker</a></li> + <li><a href="/version_control.html">Version Control</a></li> + </ul> + <h6>Related Projects</h6> + <ul> + <li><a href="http://lucene.apache.org/core/">Lucene</a></li> + <li><a href="http://dezi.org/">Dezi</a></li> + <li><a href="http://lucene.apache.org/solr/">Solr</a></li> + <li><a href="http://lucenenet.apache.org/">Lucene.NET</a></li> + <li><a href="http://lucene.apache.org/pylucene/">PyLucene</a></li> + </ul> + </div> <!-- lucy-left_nav_box --> + + <div id="lucy-main_content_box" class="grid_9"> + <div class="c-api"> +<h2>Step-by-step introduction to Apache Lucy.</h2> +<p>Explore Apache Lucy’s basic functionality by starting with a minimalist CGI +search app based on Lucy::Simple and transforming it, step by step, +into an “advanced search” interface utilizing more flexible core modules like +<a href="../../Lucy/Index/Indexer.html">Indexer</a> and <a href="../../Lucy/Search/IndexSearcher.html">IndexSearcher</a>.</p> +<h3>Chapters</h3> +<ul> +<li> +<p><a href="../../Lucy/Docs/Tutorial/SimpleTutorial.html">SimpleTutorial</a> - Build a 
bare-bones search app using +Lucy::Simple.</p> +</li> +<li> +<p><a href="../../Lucy/Docs/Tutorial/BeyondSimpleTutorial.html">BeyondSimpleTutorial</a> - Rebuild the app using core +classes like <a href="../../Lucy/Index/Indexer.html">Indexer</a> and +<a href="../../Lucy/Search/IndexSearcher.html">IndexSearcher</a> in place of Lucy::Simple.</p> +</li> +<li> +<p><a href="../../Lucy/Docs/Tutorial/FieldTypeTutorial.html">FieldTypeTutorial</a> - Experiment with different field +characteristics using subclasses of <a href="../../Lucy/Plan/FieldType.html">FieldType</a>.</p> +</li> +<li> +<p><a href="../../Lucy/Docs/Tutorial/AnalysisTutorial.html">AnalysisTutorial</a> - Examine how the choice of +<a href="../../Lucy/Analysis/Analyzer.html">Analyzer</a> subclass affects search results.</p> +</li> +<li> +<p><a href="../../Lucy/Docs/Tutorial/HighlighterTutorial.html">HighlighterTutorial</a> - Augment search results with +highlighted excerpts.</p> +</li> +<li> +<p><a href="../../Lucy/Docs/Tutorial/QueryObjectsTutorial.html">QueryObjectsTutorial</a> - Unlock advanced search features +by using Query objects instead of query strings.</p> +</li> +</ul> +<h3>Source materials</h3> +<p>The source material used by the tutorial app, a multi-text-file presentation +of the United States Constitution, can be found in the <code>sample</code> directory +at the root of the Lucy distribution, along with finished indexing and search +apps.</p> +<pre><code class="language-c">sample/indexer_simple.c # simple indexing executable +sample/search_simple.c # simple search executable +sample/indexer.c # indexing executable +sample/search.c # search executable +sample/us_constitution # corpus +</code></pre> +<h3>Conventions</h3> +<p>The user is expected to be familiar with OO Perl and basic CGI programming.</p> +<p>The code in this tutorial assumes a Unix-flavored operating system and the +Apache webserver, but will work with minor modifications on other setups.</p> +<h3>See also</h3> +<p>More 
advanced and esoteric subjects are covered in <a href="../../Lucy/Docs/Cookbook.html">Cookbook</a>.</p> +</div> + + </div> <!-- lucy-main_content_box --> + <div class="clear"></div> + + </div> <!-- lucy-main_content --> + + <div id="lucy-copyright" class="container_16"> + <p>Copyright © 2010-2015 The Apache Software Foundation, Licensed under the + <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. + <br/> + Apache Lucy, Lucy, Apache, the Apache feather logo, and the Apache Lucy project logo are trademarks of The + Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their + respective owners. + </p> + </div> <!-- lucy-copyright --> + + </div> <!-- lucy-rigid_wrapper --> + + </body> +</html> Added: websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/Tutorial/AnalysisTutorial.html ============================================================================== --- websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/Tutorial/AnalysisTutorial.html (added) +++ websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/Tutorial/AnalysisTutorial.html Wed Sep 28 12:07:48 2016 @@ -0,0 +1,152 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> +<html lang="en"> + <head> + <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"> + <title>Lucy::Docs::Tutorial::AnalysisTutorial</title> + <link rel="stylesheet" type="text/css" media="screen" href="/css/lucy.css"> + </head> + + <body> + + <div id="lucy-rigid_wrapper"> + + <div id="lucy-top" class="container_16 lucy-white_box_3d"> + + <div id="lucy-logo_box" class="grid_8"> + <a href="/"><img src="/images/lucy_logo_150x100.png" alt="Apache Lucy™"></a> + </div> <!-- lucy-logo_box --> + + <div #id="lucy-top_nav_box" class="grid_8"> + <div id="lucy-top_nav_bar" class="container_8"> + <ul> + <li><a href="http://www.apache.org/" title="Apache Software Foundation">Apache Software Foundation</a></li> + <li><a 
href="http://www.apache.org/licenses/" title="License">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html" title="Sponsorship">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html" title="Thanks">Thanks</a></li> + <li><a href="http://www.apache.org/security/ " title="Security">Security</a></li> + </ul> + </div> <!-- lucy-top_nav_bar --> + <p><a href="http://www.apache.org/">Apache</a> » <a href="/">Lucy</a> » <a href="/docs/">Docs</a> » <a href="/docs/0.5.0/">0.5.0</a> » <a href="/docs/0.5.0/c/">C</a> » <a href="/docs/0.5.0/c/Lucy/">Lucy</a> » <a href="/docs/0.5.0/c/Lucy/Docs/">Docs</a> » <a href="/docs/0.5.0/c/Lucy/Docs/Tutorial/">Tutorial</a></p> + <form name="lucy-top_search_box" id="lucy-top_search_box" action="http://www.google.com/search" method="get"> + <input value="*.apache.org" name="sitesearch" type="hidden"/> + <input type="text" name="q" id="query" style="width:85%"> + <input type="submit" id="submit" value="Search"> + </form> + </div> <!-- lucy-top_nav_box --> + + <div class="clear"></div> + + </div> <!-- lucy-top --> + + <div id="lucy-main_content" class="container_16 lucy-white_box_3d"> + + <div class="grid_4" id="lucy-left_nav_box"> + <h6>About</h6> + <ul> + <li><a href="/">Welcome</a></li> + <li><a href="/clownfish.html">Clownfish</a></li> + <li><a href="/faq.html">FAQ</a></li> + <li><a href="/people.html">People</a></li> + </ul> + <h6>Resources</h6> + <ul> + <li><a href="/download.html">Download</a></li> + <li><a href="/mailing_lists.html">Mailing Lists</a></li> + <li><a href="/docs/">Documentation</a></li> + <li><a href="http://wiki.apache.org/lucy/">Wiki</a></li> + <li><a href="https://issues.apache.org/jira/browse/LUCY">Issue Tracker</a></li> + <li><a href="/version_control.html">Version Control</a></li> + </ul> + <h6>Related Projects</h6> + <ul> + <li><a href="http://lucene.apache.org/core/">Lucene</a></li> + <li><a href="http://dezi.org/">Dezi</a></li> + <li><a 
href="http://lucene.apache.org/solr/">Solr</a></li> + <li><a href="http://lucenenet.apache.org/">Lucene.NET</a></li> + <li><a href="http://lucene.apache.org/pylucene/">PyLucene</a></li> + </ul> + </div> <!-- lucy-left_nav_box --> + + <div id="lucy-main_content_box" class="grid_9"> + <div class="c-api"> +<h2>How to choose and use Analyzers.</h2> +<p>Try swapping out the EasyAnalyzer in our Schema for a +<a href="../../../Lucy/Analysis/StandardTokenizer.html">StandardTokenizer</a>:</p> +<pre><code class="language-c"> StandardTokenizer *tokenizer = StandardTokenizer_new(); + FullTextType *type = FullTextType_new((Analyzer*)tokenizer); +</code></pre> +<p>Search for <code>senate</code>, <code>Senate</code>, and <code>Senator</code> before and after making the +change and re-indexing.</p> +<p>Under EasyAnalyzer, the results are identical for all three searches, but +under StandardTokenizer, searches are case-sensitive, and the result sets for +<code>Senate</code> and <code>Senator</code> are distinct.</p> +<h3>EasyAnalyzer</h3> +<p>What’s happening is that <a href="../../../Lucy/Analysis/EasyAnalyzer.html">EasyAnalyzer</a> is performing more aggressive +processing than StandardTokenizer. In addition to tokenizing, it’s also +converting all text to lower case so that searches are case-insensitive, and +using a “stemming” algorithm to reduce related words to a common stem (<code>senat</code>, +in this case).</p> +<p>EasyAnalyzer is actually multiple Analyzers wrapped up in a single package. 
+In this case, it’s three-in-one, since creating an EasyAnalyzer for the +language <code>en</code> is equivalent to this snippet, which creates a +<a href="../../../Lucy/Analysis/PolyAnalyzer.html">PolyAnalyzer</a>:</p> +<pre><code class="language-c"> Vector *analyzers = Vec_new(3); + Vec_Push(analyzers, (Analyzer*)StandardTokenizer_new()); + Vec_Push(analyzers, (Analyzer*)Normalizer_new(NULL, true, false)); + Vec_Push(analyzers, (Analyzer*)SnowStemmer_new(language)); + + PolyAnalyzer *analyzer = PolyAnalyzer_new(NULL, analyzers); + DECREF(analyzers); +</code></pre> +<p>You can add or subtract Analyzers from there if you like. Try adding a fourth +Analyzer, a SnowballStopFilter for suppressing “stopwords” like “the”, “if”, +and “maybe”.</p> +<pre><code class="language-c"> Vec_Push(analyzers, (Analyzer*)StandardTokenizer_new()); + Vec_Push(analyzers, (Analyzer*)Normalizer_new(NULL, true, false)); + Vec_Push(analyzers, (Analyzer*)SnowStemmer_new(language)); + Vec_Push(analyzers, (Analyzer*)SnowStop_new(language, NULL)); +</code></pre> +<p>Also, try removing the SnowballStemmer.</p> +<pre><code class="language-c"> Vec_Push(analyzers, (Analyzer*)StandardTokenizer_new()); + Vec_Push(analyzers, (Analyzer*)Normalizer_new(NULL, true, false)); +</code></pre> +<p>The original choice of a stock English EasyAnalyzer probably still yields the +best results for this document collection, but you get the idea: sometimes you +want a different Analyzer.</p> +<h3>When the best Analyzer is no Analyzer</h3> +<p>Sometimes you don’t want an Analyzer at all. That was true for our “url” +field because we didn’t need it to be searchable, but it’s also true for +certain types of searchable fields. 
For instance, “category” fields are often +set up to match exactly or not at all, as are fields like “last_name” (because +you may not want to conflate results for “Humphrey” and “Humphries”).</p> +<p>To specify that there should be no analysis performed at all, use StringType:</p> +<pre><code class="language-c"> String *name = Str_newf("category"); + StringType *type = StringType_new(); + Schema_Spec_Field(schema, name, (FieldType*)type); + DECREF(type); + DECREF(name); +</code></pre> +<h3>Highlighting up next</h3> +<p>In our next tutorial chapter, <a href="../../../Lucy/Docs/Tutorial/HighlighterTutorial.html">HighlighterTutorial</a>, +we’ll add highlighted excerpts from the “content” field to our search results.</p> +</div> + + </div> <!-- lucy-main_content_box --> + <div class="clear"></div> + + </div> <!-- lucy-main_content --> + + <div id="lucy-copyright" class="container_16"> + <p>Copyright © 2010-2015 The Apache Software Foundation, Licensed under the + <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. + <br/> + Apache Lucy, Lucy, Apache, the Apache feather logo, and the Apache Lucy project logo are trademarks of The + Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their + respective owners. 
+ </p> + </div> <!-- lucy-copyright --> + + </div> <!-- lucy-rigid_wrapper --> + + </body> +</html> Added: websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/Tutorial/BeyondSimpleTutorial.html ============================================================================== --- websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/Tutorial/BeyondSimpleTutorial.html (added) +++ websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/Tutorial/BeyondSimpleTutorial.html Wed Sep 28 12:07:48 2016 @@ -0,0 +1,296 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> +<html lang="en"> + <head> + <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"> + <title>Lucy::Docs::Tutorial::BeyondSimpleTutorial</title> + <link rel="stylesheet" type="text/css" media="screen" href="/css/lucy.css"> + </head> + + <body> + + <div id="lucy-rigid_wrapper"> + + <div id="lucy-top" class="container_16 lucy-white_box_3d"> + + <div id="lucy-logo_box" class="grid_8"> + <a href="/"><img src="/images/lucy_logo_150x100.png" alt="Apache Lucy™"></a> + </div> <!-- lucy-logo_box --> + + <div #id="lucy-top_nav_box" class="grid_8"> + <div id="lucy-top_nav_bar" class="container_8"> + <ul> + <li><a href="http://www.apache.org/" title="Apache Software Foundation">Apache Software Foundation</a></li> + <li><a href="http://www.apache.org/licenses/" title="License">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html" title="Sponsorship">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html" title="Thanks">Thanks</a></li> + <li><a href="http://www.apache.org/security/ " title="Security">Security</a></li> + </ul> + </div> <!-- lucy-top_nav_bar --> + <p><a href="http://www.apache.org/">Apache</a> » <a href="/">Lucy</a> » <a href="/docs/">Docs</a> » <a href="/docs/0.5.0/">0.5.0</a> » <a href="/docs/0.5.0/c/">C</a> » <a href="/docs/0.5.0/c/Lucy/">Lucy</a> » <a href="/docs/0.5.0/c/Lucy/Docs/">Docs</a> » <a 
href="/docs/0.5.0/c/Lucy/Docs/Tutorial/">Tutorial</a></p> + <form name="lucy-top_search_box" id="lucy-top_search_box" action="http://www.google.com/search" method="get"> + <input value="*.apache.org" name="sitesearch" type="hidden"/> + <input type="text" name="q" id="query" style="width:85%"> + <input type="submit" id="submit" value="Search"> + </form> + </div> <!-- lucy-top_nav_box --> + + <div class="clear"></div> + + </div> <!-- lucy-top --> + + <div id="lucy-main_content" class="container_16 lucy-white_box_3d"> + + <div class="grid_4" id="lucy-left_nav_box"> + <h6>About</h6> + <ul> + <li><a href="/">Welcome</a></li> + <li><a href="/clownfish.html">Clownfish</a></li> + <li><a href="/faq.html">FAQ</a></li> + <li><a href="/people.html">People</a></li> + </ul> + <h6>Resources</h6> + <ul> + <li><a href="/download.html">Download</a></li> + <li><a href="/mailing_lists.html">Mailing Lists</a></li> + <li><a href="/docs/">Documentation</a></li> + <li><a href="http://wiki.apache.org/lucy/">Wiki</a></li> + <li><a href="https://issues.apache.org/jira/browse/LUCY">Issue Tracker</a></li> + <li><a href="/version_control.html">Version Control</a></li> + </ul> + <h6>Related Projects</h6> + <ul> + <li><a href="http://lucene.apache.org/core/">Lucene</a></li> + <li><a href="http://dezi.org/">Dezi</a></li> + <li><a href="http://lucene.apache.org/solr/">Solr</a></li> + <li><a href="http://lucenenet.apache.org/">Lucene.NET</a></li> + <li><a href="http://lucene.apache.org/pylucene/">PyLucene</a></li> + </ul> + </div> <!-- lucy-left_nav_box --> + + <div id="lucy-main_content_box" class="grid_9"> + <div class="c-api"> +<h2>A more flexible app structure.</h2> +<h3>Goal</h3> +<p>In this tutorial chapter, we’ll refactor the apps we built in +<a href="../../../Lucy/Docs/Tutorial/SimpleTutorial.html">SimpleTutorial</a> so that they look exactly the same from +the end user’s point of view, but offer the developer greater possibilities for +expansion.</p> +<p>To achieve this, we’ll ditch 
Lucy::Simple and replace it with the +classes that it uses internally:</p> +<ul> +<li><a href="../../../Lucy/Plan/Schema.html">Schema</a> - Plan out your index.</li> +<li><a href="../../../Lucy/Plan/FullTextType.html">FullTextType</a> - Field type for full text search.</li> +<li><a href="../../../Lucy/Analysis/EasyAnalyzer.html">EasyAnalyzer</a> - A one-size-fits-all parser/tokenizer.</li> +<li><a href="../../../Lucy/Index/Indexer.html">Indexer</a> - Manipulate index content.</li> +<li><a href="../../../Lucy/Search/IndexSearcher.html">IndexSearcher</a> - Search an index.</li> +<li><a href="../../../Lucy/Search/Hits.html">Hits</a> - Iterate over hits returned by a Searcher.</li> +</ul> +<h3>Adaptations to indexer.c</h3> +<p>After we include our headers…</p> +<pre><code class="language-c">#include &lt;dirent.h&gt; +#include &lt;stdio.h&gt; +#include &lt;stdlib.h&gt; +#include &lt;string.h&gt; + +#define CFISH_USE_SHORT_NAMES +#define LUCY_USE_SHORT_NAMES +#include "Clownfish/String.h" +#include "Lucy/Analysis/EasyAnalyzer.h" +#include "Lucy/Document/Doc.h" +#include "Lucy/Index/Indexer.h" +#include "Lucy/Plan/FullTextType.h" +#include "Lucy/Plan/StringType.h" +#include "Lucy/Plan/Schema.h" + +const char path_to_index[] = "/path/to/index"; +const char uscon_source[] = "/usr/local/apache2/htdocs/us_constitution"; +</code></pre> +<p>… the first item we’re going to need is a <a href="../../../Lucy/Plan/Schema.html">Schema</a>.</p> +<p>The primary job of a Schema is to specify what fields are available and how +they’re defined. We’ll start off with three fields: title, content and url.</p> +<pre><code class="language-c">static Schema* +S_create_schema() { + // Create a new schema. + Schema *schema = Schema_new(); + + // Create an analyzer. + String *language = Str_newf("en"); + EasyAnalyzer *analyzer = EasyAnalyzer_new(language); + + // Specify fields. 
+ + FullTextType *type = FullTextType_new((Analyzer*)analyzer); + + { + String *field_str = Str_newf("title"); + Schema_Spec_Field(schema, field_str, (FieldType*)type); + DECREF(field_str); + } + + { + String *field_str = Str_newf("content"); + Schema_Spec_Field(schema, field_str, (FieldType*)type); + DECREF(field_str); + } + + { + String *field_str = Str_newf("url"); + Schema_Spec_Field(schema, field_str, (FieldType*)type); + DECREF(field_str); + } + + DECREF(type); + DECREF(analyzer); + DECREF(language); + return schema; +} +</code></pre> +<p>All of the fields are spec’d out using the <a href="../../../Lucy/Plan/FullTextType.html">FullTextType</a> FieldType, +indicating that they will be searchable as “full text”, which means that +they can be searched for individual words. The “analyzer”, which is unique to +FullTextType fields, is what breaks up the text into searchable tokens.</p> +<p>Next, we’ll swap our Lucy::Simple object out for an <a href="../../../Lucy/Index/Indexer.html">Indexer</a>. +The substitution will be straightforward because Simple has merely been +serving as a thin wrapper around an inner Indexer, and we’ll just be peeling +away the wrapper.</p> +<p>First, replace the constructor:</p> +<pre><code class="language-c">int +main() { + // Initialize the library. 
+ lucy_bootstrap_parcel(); + + Schema *schema = S_create_schema(); + String *folder = Str_newf("%s", path_to_index); + + Indexer *indexer = Indexer_new(schema, (Obj*)folder, NULL, + Indexer_CREATE | Indexer_TRUNCATE); + +</code></pre> +<p>Next, have the <code>indexer</code> object <a href="../../../Lucy/Index/Indexer.html#func_Add_Doc">Add_Doc()</a> where we +previously had the <code>lucy</code> object add the document:</p> +<pre><code class="language-c"> DIR *dir = opendir(uscon_source); + if (dir == NULL) { + perror(uscon_source); + return 1; + } + + for (struct dirent *entry = readdir(dir); + entry; + entry = readdir(dir)) { + + if (S_ends_with(entry->d_name, ".txt")) { + Doc *doc = S_parse_file(entry->d_name); + Indexer_Add_Doc(indexer, doc, 1.0); + DECREF(doc); + } + } + + closedir(dir); +</code></pre> +<p>There’s only one extra step required: at the end of the app, you must call +<code>Indexer_Commit()</code> explicitly to close the indexing session and commit your changes. +(Lucy::Simple hides this detail, committing implicitly when it needs to).</p> +<pre><code class="language-c"> Indexer_Commit(indexer); + + DECREF(indexer); + DECREF(folder); + DECREF(schema); + return 0; +} +</code></pre> +<h3>Adaptations to search.c</h3> +<p>In our search app as in our indexing app, Lucy::Simple has served as a +thin wrapper, this time around <a href="../../../Lucy/Search/IndexSearcher.html">IndexSearcher</a> and +<a href="../../../Lucy/Search/Hits.html">Hits</a>. Swapping out Simple for these two classes is +also straightforward:</p> +<pre><code class="language-c">#include &lt;stdio.h&gt; +#include &lt;stdlib.h&gt; +#include &lt;string.h&gt; + +#define CFISH_USE_SHORT_NAMES +#define LUCY_USE_SHORT_NAMES +#include "Clownfish/String.h" +#include "Lucy/Document/HitDoc.h" +#include "Lucy/Search/Hits.h" +#include "Lucy/Search/IndexSearcher.h" + +const char path_to_index[] = "/path/to/index"; + +int +main(int argc, char *argv[]) { + // Initialize the library. 
+ lucy_bootstrap_parcel(); + + if (argc < 2) { + printf("Usage: %s &lt;querystring&gt;\n", argv[0]); + return 0; + } + + const char *query_c = argv[1]; + + printf("Searching for: %s\n\n", query_c); + + String *folder = Str_newf("%s", path_to_index); + IndexSearcher *searcher = IxSearcher_new((Obj*)folder); + + String *query_str = Str_newf("%s", query_c); + Hits *hits = IxSearcher_Hits(searcher, (Obj*)query_str, 0, 10, NULL); + + String *title_str = Str_newf("title"); + String *url_str = Str_newf("url"); + HitDoc *hit; + int i = 1; + + // Loop over search results. + while (NULL != (hit = Hits_Next(hits))) { + String *title = (String*)HitDoc_Extract(hit, title_str); + char *title_c = Str_To_Utf8(title); + + String *url = (String*)HitDoc_Extract(hit, url_str); + char *url_c = Str_To_Utf8(url); + + printf("Result %d: %s (%s)\n", i, title_c, url_c); + + free(url_c); + free(title_c); + DECREF(url); + DECREF(title); + DECREF(hit); + i++; + } + + DECREF(url_str); + DECREF(title_str); + DECREF(hits); + DECREF(query_str); + DECREF(searcher); + DECREF(folder); + return 0; +} +</code></pre> +<h3>Hooray!</h3> +<p>Congratulations! Your apps do the same thing as before… but now they’ll be +easier to customize.</p> +<p>In our next chapter, <a href="../../../Lucy/Docs/Tutorial/FieldTypeTutorial.html">FieldTypeTutorial</a>, we’ll explore +how to assign different behaviors to different fields.</p> +</div> + + </div> <!-- lucy-main_content_box --> + <div class="clear"></div> + + </div> <!-- lucy-main_content --> + + <div id="lucy-copyright" class="container_16"> + <p>Copyright © 2010-2015 The Apache Software Foundation, Licensed under the + <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. + <br/> + Apache Lucy, Lucy, Apache, the Apache feather logo, and the Apache Lucy project logo are trademarks of The + Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their + respective owners. 
+ </p> + </div> <!-- lucy-copyright --> + + </div> <!-- lucy-rigid_wrapper --> + + </body> +</html> Added: websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/Tutorial/FieldTypeTutorial.html ============================================================================== --- websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/Tutorial/FieldTypeTutorial.html (added) +++ websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/Tutorial/FieldTypeTutorial.html Wed Sep 28 12:07:48 2016 @@ -0,0 +1,151 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> +<html lang="en"> + <head> + <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"> + <title>Lucy::Docs::Tutorial::FieldTypeTutorial</title> + <link rel="stylesheet" type="text/css" media="screen" href="/css/lucy.css"> + </head> + + <body> + + <div id="lucy-rigid_wrapper"> + + <div id="lucy-top" class="container_16 lucy-white_box_3d"> + + <div id="lucy-logo_box" class="grid_8"> + <a href="/"><img src="/images/lucy_logo_150x100.png" alt="Apache Lucy™"></a> + </div> <!-- lucy-logo_box --> + + <div #id="lucy-top_nav_box" class="grid_8"> + <div id="lucy-top_nav_bar" class="container_8"> + <ul> + <li><a href="http://www.apache.org/" title="Apache Software Foundation">Apache Software Foundation</a></li> + <li><a href="http://www.apache.org/licenses/" title="License">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html" title="Sponsorship">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html" title="Thanks">Thanks</a></li> + <li><a href="http://www.apache.org/security/ " title="Security">Security</a></li> + </ul> + </div> <!-- lucy-top_nav_bar --> + <p><a href="http://www.apache.org/">Apache</a> » <a href="/">Lucy</a> » <a href="/docs/">Docs</a> » <a href="/docs/0.5.0/">0.5.0</a> » <a href="/docs/0.5.0/c/">C</a> » <a href="/docs/0.5.0/c/Lucy/">Lucy</a> » <a href="/docs/0.5.0/c/Lucy/Docs/">Docs</a> » <a 
href="/docs/0.5.0/c/Lucy/Docs/Tutorial/">Tutorial</a></p> + <form name="lucy-top_search_box" id="lucy-top_search_box" action="http://www.google.com/search" method="get"> + <input value="*.apache.org" name="sitesearch" type="hidden"/> + <input type="text" name="q" id="query" style="width:85%"> + <input type="submit" id="submit" value="Search"> + </form> + </div> <!-- lucy-top_nav_box --> + + <div class="clear"></div> + + </div> <!-- lucy-top --> + + <div id="lucy-main_content" class="container_16 lucy-white_box_3d"> + + <div class="grid_4" id="lucy-left_nav_box"> + <h6>About</h6> + <ul> + <li><a href="/">Welcome</a></li> + <li><a href="/clownfish.html">Clownfish</a></li> + <li><a href="/faq.html">FAQ</a></li> + <li><a href="/people.html">People</a></li> + </ul> + <h6>Resources</h6> + <ul> + <li><a href="/download.html">Download</a></li> + <li><a href="/mailing_lists.html">Mailing Lists</a></li> + <li><a href="/docs/">Documentation</a></li> + <li><a href="http://wiki.apache.org/lucy/">Wiki</a></li> + <li><a href="https://issues.apache.org/jira/browse/LUCY">Issue Tracker</a></li> + <li><a href="/version_control.html">Version Control</a></li> + </ul> + <h6>Related Projects</h6> + <ul> + <li><a href="http://lucene.apache.org/core/">Lucene</a></li> + <li><a href="http://dezi.org/">Dezi</a></li> + <li><a href="http://lucene.apache.org/solr/">Solr</a></li> + <li><a href="http://lucenenet.apache.org/">Lucene.NET</a></li> + <li><a href="http://lucene.apache.org/pylucene/">PyLucene</a></li> + </ul> + </div> <!-- lucy-left_nav_box --> + + <div id="lucy-main_content_box" class="grid_9"> + <div class="c-api"> +<h2>Specify per-field properties and behaviors.</h2> +<p>The Schema we used in the last chapter specifies three fields:</p> +<pre><code class="language-c"> FullTextType *type = FullTextType_new((Analyzer*)analyzer); + + { + String *field_str = Str_newf("title"); + Schema_Spec_Field(schema, field_str, (FieldType*)type); + DECREF(field_str); + } + + { + String *field_str = 
Str_newf("content"); + Schema_Spec_Field(schema, field_str, (FieldType*)type); + DECREF(field_str); + } + + { + String *field_str = Str_newf("url"); + Schema_Spec_Field(schema, field_str, (FieldType*)type); + DECREF(field_str); + } + +</code></pre> +<p>Since they are all defined as “full text” fields, they are all searchable, +including the <code>url</code> field, a dubious choice. Some URLs contain meaningful +information, but these don’t, really:</p> +<pre><code>http://example.com/us_constitution/amend1.txt +</code></pre> +<p>We may as well not bother indexing the URL content. To achieve that we need +to assign the <code>url</code> field to a different FieldType.</p> +<h3>StringType</h3> +<p>Instead of FullTextType, we’ll use a +<a href="../../../Lucy/Plan/StringType.html">StringType</a>, which doesn’t use an +Analyzer to break up text into individual tokens. Furthermore, we’ll mark +this StringType as unindexed, so that its content won’t be searchable at all.</p> +<pre><code class="language-c"> { + String *field_str = Str_newf("url"); + StringType *type = StringType_new(); + StringType_Set_Indexed(type, false); + Schema_Spec_Field(schema, field_str, (FieldType*)type); + DECREF(type); + DECREF(field_str); + } +</code></pre> +<p>To observe the change in behavior, try searching for <code>us_constitution</code> both +before and after changing the Schema and re-indexing.</p> +<h3>Toggling “stored”</h3> +<p>For a taste of other FieldType possibilities, try turning off <code>stored</code> for +one or more fields.</p> +<pre><code class="language-c"> FullTextType *content_type = FullTextType_new((Analyzer*)analyzer); + FullTextType_Set_Stored(content_type, false); +</code></pre> +<p>Turning off <code>stored</code> for either <code>title</code> or <code>url</code> mangles our results page, +but since we’re not displaying <code>content</code>, turning it off for <code>content</code> has +no effect, except on index size.</p> +<h3>Analyzers up next</h3> +<p>Analyzers play 
a crucial role in the behavior of FullTextType fields. In our +next tutorial chapter, <a href="../../../Lucy/Docs/Tutorial/AnalysisTutorial.html">AnalysisTutorial</a>, we’ll see how +changing up the Analyzer changes search results.</p> +</div> + + </div> <!-- lucy-main_content_box --> + <div class="clear"></div> + + </div> <!-- lucy-main_content --> + + <div id="lucy-copyright" class="container_16"> + <p>Copyright © 2010-2015 The Apache Software Foundation, Licensed under the + <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. + <br/> + Apache Lucy, Lucy, Apache, the Apache feather logo, and the Apache Lucy project logo are trademarks of The + Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their + respective owners. + </p> + </div> <!-- lucy-copyright --> + + </div> <!-- lucy-rigid_wrapper --> + + </body> +</html> Added: websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/Tutorial/HighlighterTutorial.html ============================================================================== --- websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/Tutorial/HighlighterTutorial.html (added) +++ websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/Tutorial/HighlighterTutorial.html Wed Sep 28 12:07:48 2016 @@ -0,0 +1,160 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> +<html lang="en"> + <head> + <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"> + <title>Lucy::Docs::Tutorial::HighlighterTutorial</title> + <link rel="stylesheet" type="text/css" media="screen" href="/css/lucy.css"> + </head> + + <body> + + <div id="lucy-rigid_wrapper"> + + <div id="lucy-top" class="container_16 lucy-white_box_3d"> + + <div id="lucy-logo_box" class="grid_8"> + <a href="/"><img src="/images/lucy_logo_150x100.png" alt="Apache Lucy™"></a> + </div> <!-- lucy-logo_box --> + + <div #id="lucy-top_nav_box" class="grid_8"> + <div id="lucy-top_nav_bar" 
class="container_8"> + <ul> + <li><a href="http://www.apache.org/" title="Apache Software Foundation">Apache Software Foundation</a></li> + <li><a href="http://www.apache.org/licenses/" title="License">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html" title="Sponsorship">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html" title="Thanks">Thanks</a></li> + <li><a href="http://www.apache.org/security/ " title="Security">Security</a></li> + </ul> + </div> <!-- lucy-top_nav_bar --> + <p><a href="http://www.apache.org/">Apache</a> » <a href="/">Lucy</a> » <a href="/docs/">Docs</a> » <a href="/docs/0.5.0/">0.5.0</a> » <a href="/docs/0.5.0/c/">C</a> » <a href="/docs/0.5.0/c/Lucy/">Lucy</a> » <a href="/docs/0.5.0/c/Lucy/Docs/">Docs</a> » <a href="/docs/0.5.0/c/Lucy/Docs/Tutorial/">Tutorial</a></p> + <form name="lucy-top_search_box" id="lucy-top_search_box" action="http://www.google.com/search" method="get"> + <input value="*.apache.org" name="sitesearch" type="hidden"/> + <input type="text" name="q" id="query" style="width:85%"> + <input type="submit" id="submit" value="Search"> + </form> + </div> <!-- lucy-top_nav_box --> + + <div class="clear"></div> + + </div> <!-- lucy-top --> + + <div id="lucy-main_content" class="container_16 lucy-white_box_3d"> + + <div class="grid_4" id="lucy-left_nav_box"> + <h6>About</h6> + <ul> + <li><a href="/">Welcome</a></li> + <li><a href="/clownfish.html">Clownfish</a></li> + <li><a href="/faq.html">FAQ</a></li> + <li><a href="/people.html">People</a></li> + </ul> + <h6>Resources</h6> + <ul> + <li><a href="/download.html">Download</a></li> + <li><a href="/mailing_lists.html">Mailing Lists</a></li> + <li><a href="/docs/">Documentation</a></li> + <li><a href="http://wiki.apache.org/lucy/">Wiki</a></li> + <li><a href="https://issues.apache.org/jira/browse/LUCY">Issue Tracker</a></li> + <li><a href="/version_control.html">Version Control</a></li> + </ul> + <h6>Related Projects</h6> + 
<ul> + <li><a href="http://lucene.apache.org/core/">Lucene</a></li> + <li><a href="http://dezi.org/">Dezi</a></li> + <li><a href="http://lucene.apache.org/solr/">Solr</a></li> + <li><a href="http://lucenenet.apache.org/">Lucene.NET</a></li> + <li><a href="http://lucene.apache.org/pylucene/">PyLucene</a></li> + </ul> + </div> <!-- lucy-left_nav_box --> + + <div id="lucy-main_content_box" class="grid_9"> + <div class="c-api"> +<h2>Augment search results with highlighted excerpts.</h2> +<p>Adding relevant excerpts with highlighted search terms to your search results +display makes it much easier for end users to scan the page and assess which +hits look promising, dramatically improving their search experience.</p> +<h3>Adaptations to indexer.c</h3> +<p><a href="../../../Lucy/Highlight/Highlighter.html">Highlighter</a> uses information generated at index +time. To save resources, highlighting is disabled by default and must be +turned on for individual fields.</p> +<pre><code class="language-c"> { + String *field_str = Str_newf("content"); + FullTextType *type = FullTextType_new((Analyzer*)analyzer); + FullTextType_Set_Highlightable(type, true); + Schema_Spec_Field(schema, field_str, (FieldType*)type); + DECREF(type); + DECREF(field_str); + } +</code></pre> +<h3>Adaptations to search.c</h3> +<p>To add highlighting and excerpting to the search.c sample app, create a +<code>highlighter</code> object outside the loop that iterates over hits…</p> +<pre><code class="language-c"> String *content_str = Str_newf("content"); + Highlighter *highlighter + = Highlighter_new((Searcher*)searcher, (Obj*)query, + content_str, 200); +</code></pre> +<p>… then modify the loop and the per-hit display to generate and include the +excerpt.</p> +<pre><code class="language-c"> String *title_str = Str_newf("title"); + String *url_str = Str_newf("url"); + HitDoc *hit; + i = 1; + + // Loop over search results. 
+ while (NULL != (hit = Hits_Next(hits))) { + String *title = (String*)HitDoc_Extract(hit, title_str); + char *title_c = Str_To_Utf8(title); + + String *url = (String*)HitDoc_Extract(hit, url_str); + char *url_c = Str_To_Utf8(url); + + String *excerpt = Highlighter_Create_Excerpt(highlighter, hit); + char *excerpt_c = Str_To_Utf8(excerpt); + + printf("Result %d: %s (%s)\n%s\n\n", i, title_c, url_c, excerpt_c); + + free(excerpt_c); + free(url_c); + free(title_c); + DECREF(excerpt); + DECREF(url); + DECREF(title); + DECREF(hit); + i++; + } + + DECREF(url_str); + DECREF(title_str); + DECREF(hits); + DECREF(query_str); + DECREF(highlighter); + DECREF(content_str); + DECREF(searcher); + DECREF(folder); +</code></pre> +<h3>Next chapter: Query objects</h3> +<p>Our next tutorial chapter, <a href="../../../Lucy/Docs/Tutorial/QueryObjectsTutorial.html">QueryObjectsTutorial</a>, +illustrates how to build an “advanced search” interface using +<a href="../../../Lucy/Search/Query.html">Query</a> objects instead of query strings.</p> +</div> + + </div> <!-- lucy-main_content_box --> + <div class="clear"></div> + + </div> <!-- lucy-main_content --> + + <div id="lucy-copyright" class="container_16"> + <p>Copyright © 2010-2015 The Apache Software Foundation, Licensed under the + <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. + <br/> + Apache Lucy, Lucy, Apache, the Apache feather logo, and the Apache Lucy project logo are trademarks of The + Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their + respective owners. 
+ </p> + </div> <!-- lucy-copyright --> + + </div> <!-- lucy-rigid_wrapper --> + + </body> +</html> Added: websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/Tutorial/QueryObjectsTutorial.html ============================================================================== --- websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/Tutorial/QueryObjectsTutorial.html (added) +++ websites/staging/lucy/trunk/content/docs/0.5.0/c/Lucy/Docs/Tutorial/QueryObjectsTutorial.html Wed Sep 28 12:07:48 2016 @@ -0,0 +1,269 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> +<html lang="en"> + <head> + <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"> + <title>Lucy::Docs::Tutorial::QueryObjectsTutorial</title> + <link rel="stylesheet" type="text/css" media="screen" href="/css/lucy.css"> + </head> + + <body> + + <div id="lucy-rigid_wrapper"> + + <div id="lucy-top" class="container_16 lucy-white_box_3d"> + + <div id="lucy-logo_box" class="grid_8"> + <a href="/"><img src="/images/lucy_logo_150x100.png" alt="Apache Lucy™"></a> + </div> <!-- lucy-logo_box --> + + <div #id="lucy-top_nav_box" class="grid_8"> + <div id="lucy-top_nav_bar" class="container_8"> + <ul> + <li><a href="http://www.apache.org/" title="Apache Software Foundation">Apache Software Foundation</a></li> + <li><a href="http://www.apache.org/licenses/" title="License">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html" title="Sponsorship">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html" title="Thanks">Thanks</a></li> + <li><a href="http://www.apache.org/security/ " title="Security">Security</a></li> + </ul> + </div> <!-- lucy-top_nav_bar --> + <p><a href="http://www.apache.org/">Apache</a> » <a href="/">Lucy</a> » <a href="/docs/">Docs</a> » <a href="/docs/0.5.0/">0.5.0</a> » <a href="/docs/0.5.0/c/">C</a> » <a href="/docs/0.5.0/c/Lucy/">Lucy</a> » <a href="/docs/0.5.0/c/Lucy/Docs/">Docs</a> » <a 
href="/docs/0.5.0/c/Lucy/Docs/Tutorial/">Tutorial</a></p> + <form name="lucy-top_search_box" id="lucy-top_search_box" action="http://www.google.com/search" method="get"> + <input value="*.apache.org" name="sitesearch" type="hidden"/> + <input type="text" name="q" id="query" style="width:85%"> + <input type="submit" id="submit" value="Search"> + </form> + </div> <!-- lucy-top_nav_box --> + + <div class="clear"></div> + + </div> <!-- lucy-top --> + + <div id="lucy-main_content" class="container_16 lucy-white_box_3d"> + + <div class="grid_4" id="lucy-left_nav_box"> + <h6>About</h6> + <ul> + <li><a href="/">Welcome</a></li> + <li><a href="/clownfish.html">Clownfish</a></li> + <li><a href="/faq.html">FAQ</a></li> + <li><a href="/people.html">People</a></li> + </ul> + <h6>Resources</h6> + <ul> + <li><a href="/download.html">Download</a></li> + <li><a href="/mailing_lists.html">Mailing Lists</a></li> + <li><a href="/docs/">Documentation</a></li> + <li><a href="http://wiki.apache.org/lucy/">Wiki</a></li> + <li><a href="https://issues.apache.org/jira/browse/LUCY">Issue Tracker</a></li> + <li><a href="/version_control.html">Version Control</a></li> + </ul> + <h6>Related Projects</h6> + <ul> + <li><a href="http://lucene.apache.org/core/">Lucene</a></li> + <li><a href="http://dezi.org/">Dezi</a></li> + <li><a href="http://lucene.apache.org/solr/">Solr</a></li> + <li><a href="http://lucenenet.apache.org/">Lucene.NET</a></li> + <li><a href="http://lucene.apache.org/pylucene/">PyLucene</a></li> + </ul> + </div> <!-- lucy-left_nav_box --> + + <div id="lucy-main_content_box" class="grid_9"> + <div class="c-api"> +<h2>Use Query objects instead of query strings.</h2> +<p>Until now, our search app has had only a single search box. In this tutorial +chapter, we’ll move towards an “advanced search” interface, by adding a +“category” drop-down menu. 
Three new classes will be required:</p> +<ul> +<li> +<p><a href="../../../Lucy/Search/QueryParser.html">QueryParser</a> - Turn a query string into a +<a href="../../../Lucy/Search/Query.html">Query</a> object.</p> +</li> +<li> +<p><a href="../../../Lucy/Search/TermQuery.html">TermQuery</a> - Query for a specific term within +a specific field.</p> +</li> +<li> +<p><a href="../../../Lucy/Search/ANDQuery.html">ANDQuery</a> - “AND” together multiple Query +objects to produce an intersected result set.</p> +</li> +</ul> +<h3>Adaptations to indexer.pl</h3> +<p>Our new “category” field will be a StringType field rather than a FullTextType +field, because we will only be looking for exact matches. It needs to be +indexed, but since we won’t display its value, it doesn’t need to be stored.</p> +<pre><code class="language-c"> { + String *field_str = Str_newf("category"); + StringType *type = StringType_new(); + StringType_Set_Stored(type, false); + Schema_Spec_Field(schema, field_str, (FieldType*)type); + DECREF(type); + DECREF(field_str); + } +</code></pre> +<p>There will be three possible values: “article”, “amendment”, and “preamble”, +which we’ll derive from the source file’s name in our <code>parse_file</code> +subroutine:</p> +<pre><code class="language-c"> const char *category = NULL; + if (S_starts_with(filename, "art")) { + category = "article"; + } + else if (S_starts_with(filename, "amend")) { + category = "amendment"; + } + else if (S_starts_with(filename, "preamble")) { + category = "preamble"; + } + else { + fprintf(stderr, "Can't derive category for %s\n", filename); + exit(1); + } + + ... 
+ + { + // Store 'category' field + String *field = Str_newf("category"); + String *value = Str_new_from_utf8(category, strlen(category)); + Doc_Store(doc, field, (Obj*)value); + DECREF(field); + DECREF(value); + } +</code></pre> +<h3>Adaptations to search.cgi</h3> +<p>The “category” constraint will be supplied to our search app through an +optional <code>-c</code> command-line argument, which the usage routine +documents:</p> +<pre><code class="language-c">static void +S_usage_and_exit(const char *arg0) { + printf("Usage: %s [-c &lt;category&gt;] &lt;querystring&gt;\n", arg0); + exit(1); +} +</code></pre> +<p>We’ll start off by parsing the command line and extracting the new category +argument.</p> +<pre><code class="language-c"> const char *category = NULL; + int i = 1; + + while (i &lt; argc - 1) { + if (strcmp(argv[i], "-c") == 0) { + if (i + 1 &gt;= argc) { + S_usage_and_exit(argv[0]); + } + i += 1; + category = argv[i]; + } + else { + S_usage_and_exit(argv[0]); + } + + i += 1; + } + + if (i + 1 != argc) { + S_usage_and_exit(argv[0]); + } + + const char *query_c = argv[i]; +</code></pre> +<p>QueryParser’s constructor requires a “schema” argument. We can get that from +our IndexSearcher:</p> +<pre><code class="language-c"> IndexSearcher *searcher = IxSearcher_new((Obj*)folder); + Schema *schema = IxSearcher_Get_Schema(searcher); + QueryParser *qparser = QParser_new(schema, NULL, NULL, NULL); +</code></pre> +<p>Previously, we have been handing raw query strings to IndexSearcher. Behind +the scenes, IndexSearcher has been using a QueryParser to turn those query +strings into Query objects. 
Now, we will bring QueryParser into the +foreground and parse the strings explicitly.</p> +<pre><code class="language-c"> Query *query = QParser_Parse(qparser, query_str); +</code></pre> +<p>If the user has specified a category, we’ll use an ANDQuery to join our parsed +query together with a TermQuery representing the category.</p> +<pre><code class="language-c"> if (category) { + String *category_name = Str_newf("category"); + String *category_str = Str_newf("%s", category); + TermQuery *category_query + = TermQuery_new(category_name, (Obj*)category_str); + + Vector *children = Vec_new(2); + Vec_Push(children, (Obj*)query); + Vec_Push(children, (Obj*)category_query); + query = (Query*)ANDQuery_new(children); + + DECREF(children); + DECREF(category_str); + DECREF(category_name); + } +</code></pre> +<p>Now when we execute the query…</p> +<pre><code class="language-c"> Hits *hits = IxSearcher_Hits(searcher, (Obj*)query, 0, 10, NULL); +</code></pre> +<p>… we’ll get a result set which is the intersection of the parsed query and +the category query.</p> +<h3>Using TermQuery with full text fields</h3> +<p>When querying full text fields, the easiest way is to create query objects +using QueryParser. But sometimes you want to create a TermQuery for a single +term in a FullTextType field directly. In this case, we have to run the +search term through the field’s analyzer to make sure it gets normalized in +the same way as the field’s content.</p> +<pre><code class="language-c">Query* +make_term_query(Schema *schema, String *field, String *term) { + FieldType *type = Schema_Fetch_Type(schema, field); + String *token = NULL; + + if (FieldType_is_a(type, FULLTEXTTYPE)) { + // Run the term through the full text analysis chain. 
+ Analyzer *analyzer = FullTextType_Get_Analyzer((FullTextType*)type); + Vector *tokens = Analyzer_Split(analyzer, term); + + if (Vec_Get_Size(tokens) != 1) { + // If the term expands to more than one token, or no + // tokens at all, it will never match a single token in + // the full text field. + DECREF(tokens); + return (Query*)NoMatchQuery_new(); + } + + token = (String*)Vec_Delete(tokens, 0); + DECREF(tokens); + } + else { + // Exact match for other types. + token = (String*)INCREF(term); + } + + TermQuery *term_query = TermQuery_new(field, (Obj*)token); + + DECREF(token); + return (Query*)term_query; +} +</code></pre> +<h3>Congratulations!</h3> +<p>You’ve made it to the end of the tutorial.</p> +<h3>See Also</h3> +<p>For additional thematic documentation, see the Apache Lucy +<a href="../../../Lucy/Docs/Cookbook.html">Cookbook</a>.</p> +<p>ANDQuery has a companion class, <a href="../../../Lucy/Search/ORQuery.html">ORQuery</a>, and a +close relative, <a href="../../../Lucy/Search/RequiredOptionalQuery.html">RequiredOptionalQuery</a>.</p> +</div> + + </div> <!-- lucy-main_content_box --> + <div class="clear"></div> + + </div> <!-- lucy-main_content --> + + <div id="lucy-copyright" class="container_16"> + <p>Copyright © 2010-2015 The Apache Software Foundation, Licensed under the + <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. + <br/> + Apache Lucy, Lucy, Apache, the Apache feather logo, and the Apache Lucy project logo are trademarks of The + Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their + respective owners. + </p> + </div> <!-- lucy-copyright --> + + </div> <!-- lucy-rigid_wrapper --> + + </body> +</html>
