Convert POD to Markdown

Project: http://git-wip-us.apache.org/repos/asf/lucy/repo
Commit: http://git-wip-us.apache.org/repos/asf/lucy/commit/5618020f
Tree: http://git-wip-us.apache.org/repos/asf/lucy/tree/5618020f
Diff: http://git-wip-us.apache.org/repos/asf/lucy/diff/5618020f

Branch: refs/heads/master
Commit: 5618020ff61ba7dac4b7132b5977ad4119e2c220
Parents: c2363da
Author: Nick Wellnhofer <[email protected]>
Authored: Wed Jul 8 12:57:18 2015 +0200
Committer: Nick Wellnhofer <[email protected]>
Committed: Sat Jul 11 15:03:10 2015 +0200

----------------------------------------------------------------------
 core/Lucy/Docs/Cookbook.md                      |  33 ++
 core/Lucy/Docs/Cookbook/CustomQuery.md          | 321 +++++++++++++++++++
 core/Lucy/Docs/Cookbook/CustomQueryParser.md    | 231 +++++++++++++
 core/Lucy/Docs/Cookbook/FastUpdates.md          | 140 ++++++++
 core/Lucy/Docs/DocIDs.md                        |  28 ++
 core/Lucy/Docs/FileFormat.md                    | 191 +++++++++++
 core/Lucy/Docs/IRTheory.md                      |  44 +++
 core/Lucy/Docs/Tutorial.md                      |  53 +++
 core/Lucy/Docs/Tutorial/AnalysisTutorial.md     |  85 +++++
 core/Lucy/Docs/Tutorial/BeyondSimpleTutorial.md | 125 ++++++++
 core/Lucy/Docs/Tutorial/FieldTypeTutorial.md    |  60 ++++
 core/Lucy/Docs/Tutorial/HighlighterTutorial.md  |  62 ++++
 core/Lucy/Docs/Tutorial/QueryObjectsTutorial.md | 185 +++++++++++
 core/Lucy/Docs/Tutorial/SimpleTutorial.md       | 298 +++++++++++++++++
 perl/lib/Lucy/Docs/Cookbook.pod                 |  61 ----
 perl/lib/Lucy/Docs/Cookbook/CustomQuery.pod     | 320 ------------------
 .../Lucy/Docs/Cookbook/CustomQueryParser.pod    | 236 --------------
 perl/lib/Lucy/Docs/Cookbook/FastUpdates.pod     | 153 ---------
 perl/lib/Lucy/Docs/DocIDs.pod                   |  47 ---
 perl/lib/Lucy/Docs/FileFormat.pod               | 239 --------------
 perl/lib/Lucy/Docs/IRTheory.pod                 |  94 ------
 perl/lib/Lucy/Docs/Tutorial.pod                 |  89 -----
 perl/lib/Lucy/Docs/Tutorial/Analysis.pod        |  94 ------
 perl/lib/Lucy/Docs/Tutorial/BeyondSimple.pod    | 153 ---------
 perl/lib/Lucy/Docs/Tutorial/FieldType.pod       |  74 -----
 perl/lib/Lucy/Docs/Tutorial/Highlighter.pod     |  76 -----
 perl/lib/Lucy/Docs/Tutorial/QueryObjects.pod    | 198 ------------
 perl/lib/Lucy/Docs/Tutorial/Simple.pod          | 298 -----------------
 28 files changed, 1856 insertions(+), 2132 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Cookbook.md
----------------------------------------------------------------------
diff --git a/core/Lucy/Docs/Cookbook.md b/core/Lucy/Docs/Cookbook.md
new file mode 100644
index 0000000..ec6994f
--- /dev/null
+++ b/core/Lucy/Docs/Cookbook.md
@@ -0,0 +1,33 @@
+# Apache Lucy recipes
+
+The Cookbook provides thematic documentation covering some of Apache Lucy's
+more sophisticated features.  For a step-by-step introduction to Lucy,
+see [](cfish:Tutorial).
+
+## Chapters
+
+* [](cfish:FastUpdates) - While index updates are fast on
+  average, worst-case update performance may be significantly slower. To make
+  index updates consistently quick, we must manually intervene to control the
+  process of index segment consolidation.
+
+* [](cfish:CustomQuery) - Explore Lucy's support for
+  custom query types by creating a "PrefixQuery" class to handle trailing
+  wildcards.
+
+* [](cfish:CustomQueryParser) - Define your own custom
+  search query syntax using [](cfish:lucy.QueryParser) and
+  Parse::RecDescent.
+
+## Materials
+
+Some of the recipes in the Cookbook reference the completed
+[](cfish:Tutorial) application.  These materials can be
+found in the `sample` directory at the root of the Lucy distribution:
+
+~~~ perl
+sample/indexer.pl        # indexing app
+sample/search.cgi        # search app
+sample/us_constitution   # corpus
+~~~
+

http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Cookbook/CustomQuery.md
----------------------------------------------------------------------
diff --git a/core/Lucy/Docs/Cookbook/CustomQuery.md b/core/Lucy/Docs/Cookbook/CustomQuery.md
new file mode 100644
index 0000000..d135c8b
--- /dev/null
+++ b/core/Lucy/Docs/Cookbook/CustomQuery.md
@@ -0,0 +1,321 @@
+# Sample subclass of Query
+
+Explore Apache Lucy's support for custom query types by creating a
+"PrefixQuery" class to handle trailing wildcards.
+
+~~~ perl
+my $prefix_query = PrefixQuery->new(
+    field        => 'content',
+    query_string => 'foo*',
+);
+my $hits = $searcher->hits( query => $prefix_query );
+...
+~~~
+
+## Query, Compiler, and Matcher 
+
+To add support for a new query type, we need three classes: a Query, a
+Compiler, and a Matcher.  
+
+* PrefixQuery - a subclass of [](cfish:lucy.Query), and the only class
+  that client code will deal with directly.
+
+* PrefixCompiler - a subclass of [](cfish:lucy.Compiler), whose primary 
+  role is to compile a PrefixQuery to a PrefixMatcher.
+
+* PrefixMatcher - a subclass of [](cfish:lucy.Matcher), which does the
+  heavy lifting: it applies the query to individual documents and assigns a
+  score to each match.
+
+The PrefixQuery class on its own isn't enough because a Query object's role is
+limited to expressing an abstract specification for the search.  A Query is
+basically nothing but metadata; execution is left to the Query's companion
+Compiler and Matcher.
+
+Here's a simplified sketch illustrating how a Searcher's hits() method ties
+together the three classes.
+
+~~~ perl
+sub hits {
+    my ( $self, $query ) = @_;
+    my $compiler = $query->make_compiler(
+        searcher => $self,
+        boost    => $query->get_boost,
+    );
+    my $matcher = $compiler->make_matcher(
+        reader     => $self->get_reader,
+        need_score => 1,
+    );
+    my @hits = $matcher->capture_hits;
+    return \@hits;
+}
+~~~
+
+### PrefixQuery
+
+Our PrefixQuery class will have two attributes: a query string and a field
+name.
+
+~~~ perl
+package PrefixQuery;
+use base qw( Lucy::Search::Query );
+use Carp;
+use Scalar::Util qw( blessed );
+
+# Inside-out member vars and hand-rolled accessors.
+my %query_string;
+my %field;
+sub get_query_string { my $self = shift; return $query_string{$$self} }
+sub get_field        { my $self = shift; return $field{$$self} }
+~~~
+
+PrefixQuery's constructor collects and validates the attributes.
+
+~~~ perl
+sub new {
+    my ( $class, %args ) = @_;
+    my $query_string = delete $args{query_string};
+    my $field        = delete $args{field};
+    my $self         = $class->SUPER::new(%args);
+    confess("'query_string' param is required")
+        unless defined $query_string;
+    confess("Invalid query_string: '$query_string'")
+        unless $query_string =~ /\*\s*$/;
+    confess("'field' param is required")
+        unless defined $field;
+    $query_string{$$self} = $query_string;
+    $field{$$self}        = $field;
+    return $self;
+}
+~~~
+
+Since this is an inside-out class, we'll need a destructor:
+
+~~~ perl
+sub DESTROY {
+    my $self = shift;
+    delete $query_string{$$self};
+    delete $field{$$self};
+    $self->SUPER::DESTROY;
+}
+~~~
+
+The equals() method determines whether two Queries are logically equivalent:
+
+~~~ perl
+sub equals {
+    my ( $self, $other ) = @_;
+    return 0 unless blessed($other);
+    return 0 unless $other->isa("PrefixQuery");
+    return 0 unless $field{$$self} eq $field{$$other};
+    return 0 unless $query_string{$$self} eq $query_string{$$other};
+    return 1;
+}
+~~~
+
+The last thing we'll need is a make_compiler() factory method which kicks out
+a subclass of [](cfish:lucy.Compiler).
+
+~~~ perl
+sub make_compiler {
+    my ( $self, %args ) = @_;
+    my $subordinate = delete $args{subordinate};
+    my $compiler = PrefixCompiler->new( %args, parent => $self );
+    $compiler->normalize unless $subordinate;
+    return $compiler;
+}
+~~~
+
+### PrefixCompiler
+
+PrefixQuery's make_compiler() method will be called internally at search-time
+by objects which subclass [](cfish:lucy.Searcher) -- such as
+[IndexSearchers](cfish:lucy.IndexSearcher).
+
+A Searcher is associated with a particular collection of documents.  These
+documents may all reside in one index, as with IndexSearcher, or they may be
+spread out across multiple indexes on one or more machines, as with
+[](cfish:ClusterSearcher).
+
+Searcher objects have access to certain statistical information about the
+collections they represent; for instance, a Searcher can tell you how many
+documents are in the collection...
+
+~~~ perl
+my $maximum_number_of_docs_in_collection = $searcher->doc_max;
+~~~
+
+... or how many documents a specific term appears in:
+
+~~~ perl
+my $term_appears_in_this_many_docs = $searcher->doc_freq(
+    field => 'content',
+    term  => 'foo',
+);
+~~~
+
+Such information can be used by sophisticated Compiler implementations to
+assign more or less heft to individual queries or sub-queries.  However, we're
+not going to bother with weighting for this demo; we'll just assign a fixed
+score of 1.0 to each matching document.
+
+We don't need to write a constructor, as it will suffice to inherit new() from
+Lucy::Search::Compiler.  The only method we need to implement for
+PrefixCompiler is make_matcher().
+
+~~~ perl
+package PrefixCompiler;
+use base qw( Lucy::Search::Compiler );
+
+sub make_matcher {
+    my ( $self, %args ) = @_;
+    my $seg_reader = $args{reader};
+
+    # Retrieve low-level components LexiconReader and PostingListReader.
+    my $lex_reader
+        = $seg_reader->obtain("Lucy::Index::LexiconReader");
+    my $plist_reader
+        = $seg_reader->obtain("Lucy::Index::PostingListReader");
+    
+    # Acquire a Lexicon and seek it to our query string.
+    my $substring = $self->get_parent->get_query_string;
+    $substring =~ s/\*\s*$//;
+    my $field = $self->get_parent->get_field;
+    my $lexicon = $lex_reader->lexicon( field => $field );
+    return unless $lexicon;
+    $lexicon->seek($substring);
+    
+    # Accumulate PostingLists for each matching term.
+    my @posting_lists;
+    while ( defined( my $term = $lexicon->get_term ) ) {
+        last unless $term =~ /^\Q$substring/;
+        my $posting_list = $plist_reader->posting_list(
+            field => $field,
+            term  => $term,
+        );
+        if ($posting_list) {
+            push @posting_lists, $posting_list;
+        }
+        last unless $lexicon->next;
+    }
+    return unless @posting_lists;
+    
+    return PrefixMatcher->new( posting_lists => \@posting_lists );
+}
+~~~
+
+PrefixCompiler gets access to a [](cfish:lucy.SegReader)
+object when make_matcher() gets called.  From the SegReader and its
+sub-components [](cfish:lucy.LexiconReader) and
+[](cfish:lucy.PostingListReader), we acquire a
+[](cfish:lucy.Lexicon), scan through the Lexicon's unique
+terms, and acquire a [](cfish:lucy.PostingList) for each
+term that matches our prefix.
+
+Each of these PostingList objects represents a set of documents which match
+the query.
+
+### PrefixMatcher
+
+The Matcher subclass is the most involved.  
+
+~~~ perl
+package PrefixMatcher;
+use base qw( Lucy::Search::Matcher );
+
+# Inside-out member vars.
+my %doc_ids;
+my %tick;
+
+sub new {
+    my ( $class, %args ) = @_;
+    my $posting_lists = delete $args{posting_lists};
+    my $self          = $class->SUPER::new(%args);
+    
+    # Cheesy but simple way of interleaving PostingList doc sets.
+    my %all_doc_ids;
+    for my $posting_list (@$posting_lists) {
+        while ( my $doc_id = $posting_list->next ) {
+            $all_doc_ids{$doc_id} = undef;
+        }
+    }
+    my @doc_ids = sort { $a <=> $b } keys %all_doc_ids;
+    $doc_ids{$$self} = \@doc_ids;
+    
+    # Track our position within the array of doc ids.
+    $tick{$$self} = -1;
+    
+    return $self;
+}
+
+sub DESTROY {
+    my $self = shift;
+    delete $doc_ids{$$self};
+    delete $tick{$$self};
+    $self->SUPER::DESTROY;
+}
+~~~
+
+The doc ids must be in order, or some will be ignored; hence the `sort`
+above.
+
+In addition to the constructor and destructor, there are three methods that
+must be overridden.
+
+next() advances the Matcher to the next valid matching doc.  
+
+~~~ perl
+sub next {
+    my $self    = shift;
+    my $doc_ids = $doc_ids{$$self};
+    my $tick    = ++$tick{$$self};
+    return 0 if $tick >= scalar @$doc_ids;
+    return $doc_ids->[$tick];
+}
+~~~
+
+get_doc_id() returns the current document id, or 0 if the Matcher is
+exhausted.  ([Document numbers](cfish:DocIDs) start at 1, so 0 is
+a sentinel.)
+
+~~~ perl
+sub get_doc_id {
+    my $self    = shift;
+    my $tick    = $tick{$$self};
+    my $doc_ids = $doc_ids{$$self};
+    return $tick < scalar @$doc_ids ? $doc_ids->[$tick] : 0;
+}
+~~~
+
+score() conveys the relevance score of the current match.  We'll just return a
+fixed score of 1.0:
+
+~~~ perl
+sub score { 1.0 }
+~~~
+
+## Usage 
+
+To get a basic feel for PrefixQuery, insert the FlatQueryParser module
+described in [](cfish:CustomQueryParser) (which supports
+PrefixQuery) into the search.cgi sample app.
+
+~~~ perl
+my $parser = FlatQueryParser->new( schema => $searcher->get_schema );
+my $query  = $parser->parse($q);
+~~~
+
+If you're planning on using PrefixQuery in earnest, though, you may want to
+change up analyzers to avoid stemming, because stemming -- another approach to
+prefix conflation -- is not perfectly compatible with prefix searches.
+
+~~~ perl
+# PolyAnalyzer with no SnowballStemmer.
+my $analyzer = Lucy::Analysis::PolyAnalyzer->new(
+    analyzers => [
+        Lucy::Analysis::StandardTokenizer->new,
+        Lucy::Analysis::Normalizer->new,
+    ],
+);
+~~~
+

http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Cookbook/CustomQueryParser.md
----------------------------------------------------------------------
diff --git a/core/Lucy/Docs/Cookbook/CustomQueryParser.md b/core/Lucy/Docs/Cookbook/CustomQueryParser.md
new file mode 100644
index 0000000..39b1167
--- /dev/null
+++ b/core/Lucy/Docs/Cookbook/CustomQueryParser.md
@@ -0,0 +1,231 @@
+# Sample subclass of QueryParser.
+
+Implement a custom search query language using a subclass of
+[](cfish:lucy.QueryParser).
+
+## The language
+
+At first, our query language will support only simple term queries and phrases
+delimited by double quotes.  For simplicity's sake, it will not support
+parenthetical groupings, boolean operators, or prepended plus/minus.  The
+results for all subqueries will be unioned together -- i.e. joined using an OR
+-- which is usually the best approach for small-to-medium-sized document
+collections.
+
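+For example, under this grammar the query...
+
+    free "speech or debate"
+
+... is interpreted as a term query for `free` OR'd together with a phrase
+query for `speech or debate`.
+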
+Later, we'll add support for trailing wildcards.
+
+## Single-field parser
+
+Our initial parser implementation will generate queries against a single fixed
+field, "content", and it will analyze text using a fixed choice of English
+EasyAnalyzer.  We won't subclass Lucy::Search::QueryParser just yet.
+
+~~~ perl
+package FlatQueryParser;
+use Lucy::Search::TermQuery;
+use Lucy::Search::PhraseQuery;
+use Lucy::Search::ORQuery;
+use Carp;
+
+sub new { 
+    my $analyzer = Lucy::Analysis::EasyAnalyzer->new(
+        language => 'en',
+    );
+    return bless { 
+        field    => 'content',
+        analyzer => $analyzer,
+    }, __PACKAGE__;
+}
+~~~
+
+Some private helper subs for creating TermQuery and PhraseQuery objects will
+help keep the size of our main parse() subroutine down:
+
+~~~ perl
+sub _make_term_query {
+    my ( $self, $term ) = @_;
+    return Lucy::Search::TermQuery->new(
+        field => $self->{field},
+        term  => $term,
+    );
+}
+
+sub _make_phrase_query {
+    my ( $self, $terms ) = @_;
+    return Lucy::Search::PhraseQuery->new(
+        field => $self->{field},
+        terms => $terms,
+    );
+}
+~~~
+
+Our private \_tokenize() method treats double-quote delimited material as a
+single token and splits on whitespace everywhere else.
+
+~~~ perl
+sub _tokenize {
+    my ( $self, $query_string ) = @_;
+    my @tokens;
+    while ( length $query_string ) {
+        if ( $query_string =~ s/^\s+// ) {
+            next;    # skip whitespace
+        }
+        elsif ( $query_string =~ s/^("[^"]*(?:"|$))// ) {
+            push @tokens, $1;    # double-quoted phrase
+        }
+        else {
+            $query_string =~ s/(\S+)//;
+            push @tokens, $1;    # single word
+        }
+    }
+    return \@tokens;
+}
+~~~
+
+The main parsing routine creates an array of tokens by calling \_tokenize(),
+runs the tokens through the EasyAnalyzer, creates TermQuery or
+PhraseQuery objects according to how many tokens emerge from the
+EasyAnalyzer's split() method, and adds each of the sub-queries to the primary
+ORQuery.
+
+~~~ perl
+sub parse {
+    my ( $self, $query_string ) = @_;
+    my $tokens   = $self->_tokenize($query_string);
+    my $analyzer = $self->{analyzer};
+    my $or_query = Lucy::Search::ORQuery->new;
+
+    for my $token (@$tokens) {
+        if ( $token =~ s/^"// ) {
+            $token =~ s/"$//;
+            my $terms = $analyzer->split($token);
+            my $query = $self->_make_phrase_query($terms);
+            $or_query->add_child($query);
+        }
+        else {
+            my $terms = $analyzer->split($token);
+            if ( @$terms == 1 ) {
+                my $query = $self->_make_term_query( $terms->[0] );
+                $or_query->add_child($query);
+            }
+            elsif ( @$terms > 1 ) {
+                my $query = $self->_make_phrase_query($terms);
+                $or_query->add_child($query);
+            }
+        }
+    }
+
+    return $or_query;
+}
+~~~
+
+## Multi-field parser
+
+Most often, the end user will want their search query to match not only a
+single 'content' field, but also 'title' and so on.  To make that happen, we
+have to turn queries such as this...
+
+    foo AND NOT bar
+
+... into the logical equivalent of this:
+
+    (title:foo OR content:foo) AND NOT (title:bar OR content:bar)
+
+Rather than continue with our own from-scratch parser class and write the
+routines to accomplish that expansion, we're now going to subclass
+Lucy::Search::QueryParser and take advantage of some of its existing methods.
+
+Our first parser implementation had the "content" field name and the choice of
+English EasyAnalyzer hard-coded for simplicity, but we don't need to do that
+once we subclass Lucy::Search::QueryParser.  QueryParser's constructor --
+which we will inherit, allowing us to eliminate our own constructor --
+requires a Schema which conveys field
+and Analyzer information, so we can just defer to that.
+
+~~~ perl
+package FlatQueryParser;
+use base qw( Lucy::Search::QueryParser );
+use Lucy::Search::TermQuery;
+use Lucy::Search::PhraseQuery;
+use Lucy::Search::ORQuery;
+use PrefixQuery;
+use Carp;
+
+# Inherit new()
+~~~
+
+We're also going to jettison our \_make_term_query() and \_make_phrase_query()
+helper subs and chop our parse() subroutine way down.  Our revised parse()
+routine will generate Lucy::Search::LeafQuery objects instead of TermQueries
+and PhraseQueries:
+
+~~~ perl
+sub parse {
+    my ( $self, $query_string ) = @_;
+    my $tokens = $self->_tokenize($query_string);
+    my $or_query = Lucy::Search::ORQuery->new;
+    for my $token (@$tokens) {
+        my $leaf_query = Lucy::Search::LeafQuery->new( text => $token );
+        $or_query->add_child($leaf_query);
+    }
+    return $self->expand($or_query);
+}
+~~~
+
+The magic happens in QueryParser's expand() method, which walks the ORQuery
+object we supply to it looking for LeafQuery objects, and calls expand_leaf()
+for each one it finds.  expand_leaf() performs field-specific analysis,
+decides whether each query should be a TermQuery or a PhraseQuery, and if
+multiple fields are required, creates an ORQuery which multiplies out e.g. `foo`
+into `(title:foo OR content:foo)`.
+
+## Extending the query language
+
+To add support for trailing wildcards to our query language, we need to
+override expand_leaf() to accommodate PrefixQuery, while deferring to the
+parent class implementation on TermQuery and PhraseQuery.
+
+~~~ perl
+sub expand_leaf {
+    my ( $self, $leaf_query ) = @_;
+    my $text = $leaf_query->get_text;
+    if ( $text =~ /\*$/ ) {
+        my $or_query = Lucy::Search::ORQuery->new;
+        for my $field ( @{ $self->get_fields } ) {
+            my $prefix_query = PrefixQuery->new(
+                field        => $field,
+                query_string => $text,
+            );
+            $or_query->add_child($prefix_query);
+        }
+        return $or_query;
+    }
+    else {
+        return $self->SUPER::expand_leaf($leaf_query);
+    }
+}
+~~~
+
+Ordinarily, those asterisks would have been stripped when running tokens
+through the EasyAnalyzer -- query strings containing "foo\*" would produce
+TermQueries for the term "foo".  Our override intercepts tokens with trailing
+asterisks and processes them as PrefixQueries before `SUPER::expand_leaf` can
+discard them, so that a search for "foo\*" can match "food", "foosball", and so
+on.
+
+## Usage
+
+Insert our custom parser into the search.cgi sample app to get a feel for how
+it behaves:
+
+~~~ perl
+my $parser = FlatQueryParser->new( schema => $searcher->get_schema );
+my $query  = $parser->parse( decode( 'UTF-8', $cgi->param('q') || '' ) );
+my $hits   = $searcher->hits(
+    query      => $query,
+    offset     => $offset,
+    num_wanted => $page_size,
+);
+...
+~~~
+

http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Cookbook/FastUpdates.md
----------------------------------------------------------------------
diff --git a/core/Lucy/Docs/Cookbook/FastUpdates.md b/core/Lucy/Docs/Cookbook/FastUpdates.md
new file mode 100644
index 0000000..511310a
--- /dev/null
+++ b/core/Lucy/Docs/Cookbook/FastUpdates.md
@@ -0,0 +1,140 @@
+# Near real-time index updates
+
+While index updates are fast on average, worst-case update performance may be
+significantly slower.  To make index updates consistently quick, we must
+manually intervene to control the process of index segment consolidation.
+
+## The problem
+
+Ordinarily, modifying an index is cheap. New data is added to new segments,
+and the time to write a new segment scales more or less linearly with the
+number of documents added during the indexing session.  
+
+Deletions are also cheap most of the time, because we don't remove documents
+immediately but instead mark them as deleted, and adding the deletion mark is
+cheap.
+
+However, as new segments are added and the deletion rate for existing segments
+increases, search-time performance slowly begins to degrade.  At some point,
+it becomes necessary to consolidate existing segments, rewriting their data
+into a new segment.  
+
+If the recycled segments are small, the time it takes to rewrite them may not
+be significant.  Every once in a while, though, a large amount of data must be
+rewritten.
+
+## Procrastinating and playing catch-up
+
+The simplest way to force fast index updates is to avoid rewriting anything.
+
+Indexer relies upon [](cfish:lucy.IndexManager)'s
+recycle() method to tell it which segments should be consolidated.  If we
+subclass IndexManager and override recycle() so that it always returns an
+empty array, we get consistently quick performance:
+
+~~~ perl
+package NoMergeManager;
+use base qw( Lucy::Index::IndexManager );
+sub recycle { [] }
+
+package main;
+my $indexer = Lucy::Index::Indexer->new(
+    index => '/path/to/index',
+    manager => NoMergeManager->new,
+);
+...
+$indexer->commit;
+~~~
+
+However, we can't procrastinate forever.  Eventually, we'll have to run an
+ordinary, uncontrolled indexing session, potentially triggering a large
+rewrite of lots of small and/or degraded segments:
+
+~~~ perl
+my $indexer = Lucy::Index::Indexer->new( 
+    index => '/path/to/index', 
+    # manager => NoMergeManager->new,
+);
+...
+$indexer->commit;
+~~~
+
+## Acceptable worst-case update time, slower degradation
+
+Never merging anything at all in the main indexing process is probably
+overkill.  Small segments are relatively cheap to merge; we just need to guard
+against the big rewrites.  
+
+Setting a ceiling on the number of documents in the segments to be recycled
+allows us to avoid a mass proliferation of tiny, single-document segments,
+while still offering decent worst-case update speed:
+
+~~~ perl
+package LightMergeManager;
+use base qw( Lucy::Index::IndexManager );
+
+sub recycle {
+    my $self = shift;
+    my $seg_readers = $self->SUPER::recycle(@_);
+    @$seg_readers = grep { $_->doc_max < 10 } @$seg_readers;
+    return $seg_readers;
+}
+~~~
+
+However, we still have to consolidate every once in a while, and while that
+happens content updates will be locked out.
+
+## Background merging
+
+If it's not acceptable to lock out updates while the index consolidation
+process runs, the alternative is to move the consolidation process out of
+band, using Lucy::Index::BackgroundMerger.  
+
+It's never safe to have more than one Indexer attempting to modify the content
+of an index at the same time, but a BackgroundMerger and an Indexer can
+operate simultaneously:
+
+~~~ perl
+# Indexing process.
+use Scalar::Util qw( blessed );
+my $retries = 0;
+while (1) {
+    eval {
+        my $indexer = Lucy::Index::Indexer->new(
+            index   => '/path/to/index',
+            manager => LightMergeManager->new,
+        );
+        $indexer->add_doc($doc);
+        $indexer->commit;
+    };
+    last unless $@;
+    if ( blessed($@) and $@->isa("Lucy::Store::LockErr") ) {
+        # Catch LockErr.
+        warn "Couldn't get lock ($retries retries)";
+        $retries++;
+    }
+    else {
+        die "Write failed: $@";
+    }
+}
+
+# Background merge process.
+my $manager = Lucy::Index::IndexManager->new;
+$manager->set_write_lock_timeout(60_000);
+my $bg_merger = Lucy::Index::BackgroundMerger->new(
+    index   => '/path/to/index',
+    manager => $manager,
+);
+$bg_merger->commit;
+~~~
+
+The exception handling code becomes useful once you have more than one index
+modification process happening simultaneously.  By default, Indexer tries
+several times to acquire a write lock over the span of one second, then holds
+it until commit() completes.  BackgroundMerger handles most of its work
+without the write lock, but it does need it briefly once at the beginning and
+once again near the end.  Under normal loads, the internal retry logic will
+resolve conflicts, but if it's not acceptable to miss an insert, you probably
+want to catch LockErr exceptions thrown by Indexer.  In contrast, a LockErr
+from BackgroundMerger probably just needs to be logged.
+

http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/DocIDs.md
----------------------------------------------------------------------
diff --git a/core/Lucy/Docs/DocIDs.md b/core/Lucy/Docs/DocIDs.md
new file mode 100644
index 0000000..af696b2
--- /dev/null
+++ b/core/Lucy/Docs/DocIDs.md
@@ -0,0 +1,28 @@
+# Characteristics of Apache Lucy document ids.
+
+## Document ids are signed 32-bit integers
+
+Document ids in Apache Lucy start at 1.  Because 0 is never a valid doc id, we
+can use it as a sentinel value:
+
+~~~ perl
+while ( my $doc_id = $posting_list->next ) {
+    ...
+}
+~~~
+
+## Document ids are ephemeral
+
+The document ids used by Lucy are associated with a single index
+snapshot.  The moment an index is updated, the mapping of document ids to
+documents is subject to change.
+
+Since IndexReader objects represent a point-in-time view of an index, document
+ids are guaranteed to remain static for the life of the reader.  However,
+because they are not permanent, Lucy document ids cannot be used as
+foreign keys to locate records in external data sources.  If you truly need a
+primary key field, you must define it and populate it yourself.
+
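+As an illustration, one way to handle this is to store your own primary key
+with each document and read it back from hits, ignoring the ephemeral doc id
+entirely.  The "id" field name here is hypothetical, and `$schema`, `$indexer`,
+and `$hits` are assumed to come from an app like the [](cfish:Tutorial) one:
+
+~~~ perl
+# At index time: define and populate a primary key field yourself.
+$schema->spec_field(
+    name => 'id',
+    type => Lucy::Plan::StringType->new,
+);
+$indexer->add_doc( { id => 'doc-42', content => $content } );
+
+# At search time: rely on the stored primary key, not the doc id.
+while ( my $hit = $hits->next ) {
+    my $primary_key = $hit->{id};
+    ...
+}
+~~~
+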
+Furthermore, the order of document ids does not tell you anything about the
+sequence in which documents were added to the index.
+

http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/FileFormat.md
----------------------------------------------------------------------
diff --git a/core/Lucy/Docs/FileFormat.md b/core/Lucy/Docs/FileFormat.md
new file mode 100644
index 0000000..c5f606c
--- /dev/null
+++ b/core/Lucy/Docs/FileFormat.md
@@ -0,0 +1,191 @@
+# Overview of index file format
+
+It is not necessary to understand the current implementation details of the
+index file format in order to use Apache Lucy effectively, but it may be
+helpful if you are interested in tweaking for high performance, exotic usage,
+or debugging and development.  
+
+On a file system, an index is a directory.  The files inside have a
+hierarchical relationship: an index is made up of "segments", each of which is
+an independent inverted index with its own subdirectory; each segment is made
+up of several component parts.
+
+    [index]--|
+             |--snapshot_XXX.json
+             |--schema_XXX.json
+             |--write.lock
+             |
+             |--seg_1--|
+             |         |--segmeta.json
+             |         |--cfmeta.json
+             |         |--cf.dat-------|
+             |                         |--[lexicon]
+             |                         |--[postings]
+             |                         |--[documents]
+             |                         |--[highlight]
+             |                         |--[deletions]
+             |
+             |--seg_2--|
+             |         |--segmeta.json
+             |         |--cfmeta.json
+             |         |--cf.dat-------|
+             |                         |--[lexicon]
+             |                         |--[postings]
+             |                         |--[documents]
+             |                         |--[highlight]
+             |                         |--[deletions]
+             |
+             |--[...]--| 
+
+## Write-once philosophy
+
+All segment directory names consist of the string "seg\_" followed by a number
+in base 36: seg_1, seg_5m, seg_p9s2 and so on, with higher numbers indicating
+more recent segments.  Once a segment is finished and committed, its name is
+never re-used and its files are never modified.
+
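+As a worked example of the naming convention, here is how a segment number
+could be decoded by hand (illustration only; Lucy handles this internally):
+
+~~~ perl
+# "seg_5m" => 5 * 36 + 22 = 202
+my ($b36) = 'seg_5m' =~ /^seg_(\w+)$/;
+my $seg_num = 0;
+for my $digit ( split //, $b36 ) {
+    $seg_num *= 36;
+    $seg_num += $digit =~ /\d/ ? $digit : ord($digit) - ord('a') + 10;
+}
+# $seg_num is now 202; higher numbers mean more recent segments.
+~~~
+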
+Old segments become obsolete and can be removed when their data has been
+consolidated into new segments during the process of segment merging and
+optimization.  A fully-optimized index has only one segment.
+
+## Top-level entries
+
+There are a handful of "top-level" files and directories which belong to the
+entire index rather than to a particular segment.
+
+### snapshot_XXX.json
+
+A "snapshot" file, e.g. `snapshot_m7p.json`, is list of index files and
+directories.  Because index files, once written, are never modified, the list
+of entries in a snapshot defines a point-in-time view of the data in an index.
+
+Like segment directories, snapshot files also utilize the
+unique-base-36-number naming convention; the higher the number, the more
+recent the file.  The appearance of a new snapshot file within the index
+directory constitutes an index update.  While a new segment is being written
+new files may be added to the index directory, but until a new snapshot file
+gets written, a Searcher opening the index for reading won't know about them.
+
+### schema_XXX.json
+
+The schema file is a Schema object describing the index's format, serialized
+as JSON.  It, too, is versioned, and a given snapshot file will reference one
+and only one schema file.
+
+### locks 
+
+By default, only one indexing process may safely modify the index at any given
+time.  Processes reserve an index by laying claim to the `write.lock` file
+within the `locks/` directory.  A smattering of other lock files may be used
+from time to time, as well.
+
+## A segment's component parts
+
+By default, each segment has up to five logical components: lexicon, postings,
+document storage, highlight data, and deletions.  Binary data from these
+components gets stored in virtual files within the "cf.dat" compound file;
+metadata is stored in a shared "segmeta.json" file.
+
+### segmeta.json
+
+The segmeta.json file is a central repository for segment metadata.  In
+addition to information such as document counts and field numbers, it also
+warehouses arbitrary metadata on behalf of individual index components.
+
+### Lexicon 
+
+Each indexed field gets its own lexicon in each segment.  The exact files
+involved depend on the field's type, but generally speaking there will be two
+parts.  First, there's a primary `lexicon-XXX.dat` file which houses a
+complete term list associating terms with corpus frequency statistics,
+postings file locations, etc.  Second, one or more "lexicon index" files may
+be present which contain periodic samples from the primary lexicon file to
+facilitate fast lookups.
+
+### Postings
+
+"Posting" is a technical term from the field of 
+[information retrieval](cfish:IRTheory), defined as a single
+instance of one term indexing one document.  If you are looking at the index
+in the back of a book, and you see that "freedom" is referenced on pages 8,
+86, and 240, that would be three postings, which taken together form a
+"posting list".  The same terminology applies to an index in electronic form.
+
+Each segment has one postings file per indexed field.  When a search is
+performed for a single term, first that term is looked up in the lexicon.  If
+the term exists in the segment, the record in the lexicon will contain
+information about which postings file to look at and where to look.
+
+The first thing any posting record tells you is a document id.  By iterating
+over all the postings associated with a term, you can find all the documents
+that match that term, a process which is analogous to looking up page numbers
+in a book's index.  However, each posting record typically contains other
+information in addition to document id, e.g. the positions at which the term
+occurs within the field.
+
+### Documents
+
+The document storage section is a simple database, organized into two files:
+
+* __documents.dat__ - Serialized documents.
+
+* __documents.ix__ - Document storage index, a solid array of 64-bit integers
+  where each integer location corresponds to a document id, and the value at
+  that location points at a file position in the documents.dat file (see the
+  sketch below).
+
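+As a concrete sketch of how such an index can be consulted: assuming entries
+are indexed directly by doc id and stored as big-endian 64-bit integers --
+both illustrative assumptions, not a specification of Lucy's on-disk
+encoding -- the lookup is a seek and an unpack:
+
+~~~ perl
+# Hypothetical: find where a document's record starts in documents.dat.
+sub doc_file_position {
+    my ( $ix_fh, $doc_id ) = @_;
+    seek( $ix_fh, $doc_id * 8, 0 ) or die "seek failed: $!";  # 8 bytes/entry
+    read( $ix_fh, my $packed, 8 ) == 8 or die "short read";
+    return unpack( 'Q>', $packed );    # assumed big-endian 64-bit integer
+}
+~~~
+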
+### Highlight data 
+
+The files which store data used for excerpting and highlighting are organized
+similarly to the files used to store documents.
+
+* __highlight.dat__ - Chunks of serialized highlight data, one per doc id.
+
+* __highlight.ix__ - Highlight data index -- as with the `documents.ix` file, a
+  solid array of 64-bit file pointers.
+
+### Deletions
+
+When a document is "deleted" from a segment, it is not actually purged right
+away; it is merely marked as "deleted" via a deletions file.  Deletions files
+contain bit vectors with one bit for each document in the segment; if bit
+\#254 is set then document 254 is deleted, and if that document turns up in a
+search it will be masked out.
+
+It is only when a segment's contents are rewritten to a new segment during the
+segment-merging process that deleted documents truly go away.
+
+## Compound Files
+
+If you peer inside an index directory, you won't actually find any files named
+"documents.dat", "highlight.ix", etc. unless there is an indexing process
+underway.  What you will find instead is one "cf.dat" and one "cfmeta.json"
+file per segment.
+
+To minimize the need for file descriptors at search-time, all per-segment
+binary data files are concatenated together in "cf.dat" at the close of each
+indexing session.  Information about where each file begins and ends is stored
+in `cfmeta.json`.  When the segment is opened for reading, a single file
+descriptor per "cf.dat" file can be shared among several readers.
+
+## A Typical Search
+
+Here's a simplified narrative, dramatizing how a search for "freedom" against
+a given segment plays out:
+
+1. The searcher asks the relevant Lexicon Index, "Do you know anything about
+   'freedom'?"  Lexicon Index replies, "Can't say for sure, but if the main
+   Lexicon file does, 'freedom' is probably somewhere around byte 21008".  
+
+2. The main Lexicon tells the searcher "One moment, let me scan our records...
+   Yes, we have 2 documents which contain 'freedom'.  You'll find them in
+   seg_6/postings-4.dat starting at byte 66991."
+
+3. The Postings file says "Yep, we have 'freedom', all right!  Document id 40
+   has 1 'freedom', and document 44 has 8.  If you need to know more, like if
+   any 'freedom' is part of the phrase 'freedom of speech', ask me about
+   positions!"
+
+4. If the searcher is only looking for 'freedom' in isolation, that's where it
+   stops.  It now knows enough to assign the documents scores against
+   "freedom", with the 8-freedom document likely ranking higher than the
+   single-freedom document.
+

http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/IRTheory.md
----------------------------------------------------------------------
diff --git a/core/Lucy/Docs/IRTheory.md b/core/Lucy/Docs/IRTheory.md
new file mode 100644
index 0000000..a9af4ed
--- /dev/null
+++ b/core/Lucy/Docs/IRTheory.md
@@ -0,0 +1,44 @@
+# Crash course in information retrieval
+
+Just enough Information Retrieval theory to find your way around Apache Lucy.
+
+## Terminology
+
+Lucy uses some terminology from the field of information retrieval which
+may be unfamiliar to many users.  "Document" and "term" mean pretty much what
+you'd expect them to, but others such as "posting" and "inverted index" need a
+formal introduction:
+
+* _document_ - An atomic unit of retrieval.
+* _term_ - An attribute which describes a document.
+* _posting_ - One term indexing one document.
+* _term list_ - The complete list of terms which describe a document.
+* _posting list_ - The complete list of documents which a term indexes.
+* _inverted index_ - A data structure which maps from terms to documents.
+
+Since Lucy is a practical implementation of IR theory, it loads these
+abstract, distilled definitions down with useful traits.  For instance, a
+"posting" in its most rarefied form is simply a term-document pairing; in
+Lucy, the class [](cfish:lucy.MatchPosting) fills this
+role.  However, by associating additional information with a posting like the
+number of times the term occurs in the document, we can turn it into a
+[](cfish:lucy.ScorePosting), making it possible
+to rank documents by relevance rather than just list documents which happen to
+match in no particular order.
+
+## TF/IDF ranking algorithm
+
+Lucy uses a variant of the well-established "Term Frequency / Inverse
+Document Frequency" weighting scheme.  A thorough treatment of TF/IDF is too
+ambitious for our present purposes, but in a nutshell, it means that...
+
+* in a search for `skate park`, documents which score well for the
+  comparatively rare term `skate` will rank higher than documents which score
+  well for the more common term `park`.
+
+* a 10-word text which has one occurrence each of both `skate` and `park` will
+  rank higher than a 1000-word text which also contains one occurrence of each.
+
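+To make the intuition concrete, here is the classic textbook TF/IDF weight in
+Perl -- the generic formulation, not Lucy's exact internal variant:
+
+~~~ perl
+# Classic TF/IDF weight of one term within one document.
+#   $tf = occurrences of the term in the document
+#   $df = number of documents containing the term
+#   $n  = total number of documents in the collection
+sub tf_idf {
+    my ( $tf, $df, $n ) = @_;
+    return 0 unless $df;
+    return $tf * log( $n / $df );
+}
+
+# The rare term "skate" outweighs the common term "park":
+# tf_idf( 1, 10, 1000 ) > tf_idf( 1, 500, 1000 )
+~~~
+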
+A web search for "tf idf" will turn up many excellent explanations of the
+algorithm.
+

http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Tutorial.md
----------------------------------------------------------------------
diff --git a/core/Lucy/Docs/Tutorial.md b/core/Lucy/Docs/Tutorial.md
new file mode 100644
index 0000000..57c66b2
--- /dev/null
+++ b/core/Lucy/Docs/Tutorial.md
@@ -0,0 +1,53 @@
+# Step-by-step introduction to Apache Lucy.
+
+Explore Apache Lucy's basic functionality by starting with a minimalist CGI
+search app based on Lucy::Simple and transforming it, step by step,
+into an "advanced search" interface utilizing more flexible core modules like
+[](cfish:lucy.Indexer) and [](cfish:lucy.IndexSearcher).
+
+## Chapters
+
+* [](cfish:SimpleTutorial) - Build a bare-bones search app using
+  Lucy::Simple.
+
+* [](cfish:BeyondSimpleTutorial) - Rebuild the app using core
+  classes like [](cfish:lucy.Indexer) and
+  [](cfish:lucy.IndexSearcher) in place of Lucy::Simple.
+
+* [](cfish:FieldTypeTutorial) - Experiment with different field
+  characteristics using subclasses of [](cfish:lucy.FieldType).
+
+* [](cfish:AnalysisTutorial) - Examine how the choice of
+  [](cfish:lucy.Analyzer) subclass affects search results.
+
+* [](cfish:HighlighterTutorial) - Augment search results with
+  highlighted excerpts.
+
+* [](cfish:QueryObjectsTutorial) - Unlock advanced search features
+  by using Query objects instead of query strings.
+
+## Source materials
+
+The source material used by the tutorial app -- a multi-text-file presentation
+of the United States constitution -- can be found in the `sample` directory
+at the root of the Lucy distribution, along with finished indexing and search
+apps.
+
+~~~ perl
+sample/indexer.pl        # indexing app
+sample/search.cgi        # search app
+sample/us_constitution   # corpus
+~~~
+
+## Conventions
+
+The user is expected to be familiar with OO Perl and basic CGI programming.
+
+The code in this tutorial assumes a Unix-flavored operating system and the
+Apache webserver, but will work with minor modifications on other setups.
+
+## See also
+
+More advanced and esoteric subjects are covered in [](cfish:Cookbook).
+
+

http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Tutorial/AnalysisTutorial.md
----------------------------------------------------------------------
diff --git a/core/Lucy/Docs/Tutorial/AnalysisTutorial.md b/core/Lucy/Docs/Tutorial/AnalysisTutorial.md
new file mode 100644
index 0000000..a55dd09
--- /dev/null
+++ b/core/Lucy/Docs/Tutorial/AnalysisTutorial.md
@@ -0,0 +1,85 @@
+# How to choose and use Analyzers.
+
+Try swapping out the EasyAnalyzer in our Schema for a StandardTokenizer:
+
+~~~ perl
+my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
+my $type = Lucy::Plan::FullTextType->new(
+    analyzer => $tokenizer,
+);
+~~~
+
+Search for `senate`, `Senate`, and `Senator` before and after making the
+change and re-indexing.
+
+Under EasyAnalyzer, the results are identical for all three searches, but
+under StandardTokenizer, searches are case-sensitive, and the result sets for
+`Senate` and `Senator` are distinct.
+
+## EasyAnalyzer
+
+What's happening is that EasyAnalyzer is performing more aggressive processing
+than StandardTokenizer.  In addition to tokenizing, it's also converting all
+text to lower case so that searches are case-insensitive, and using a
+"stemming" algorithm to reduce related words to a common stem (`senat`, in
+this case).
+
+EasyAnalyzer is actually multiple Analyzers wrapped up in a single package.
+In this case, it's three-in-one, since specifying an EasyAnalyzer with
+`language => 'en'` is equivalent to this snippet:
+
+~~~ perl
+my $tokenizer    = Lucy::Analysis::StandardTokenizer->new;
+my $normalizer   = Lucy::Analysis::Normalizer->new;
+my $stemmer      = Lucy::Analysis::SnowballStemmer->new( language => 'en' );
+my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
+    analyzers => [ $tokenizer, $normalizer, $stemmer ],
+);
+~~~
+
+You can add or subtract Analyzers from there if you like.  Try adding a fourth
+Analyzer, a SnowballStopFilter for suppressing "stopwords" like "the", "if",
+and "maybe".
+
+~~~ perl
+my $stopfilter = Lucy::Analysis::SnowballStopFilter->new( 
+    language => 'en',
+);
+my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
+    analyzers => [ $tokenizer, $normalizer, $stopfilter, $stemmer ],
+);
+~~~
+
+Also, try removing the SnowballStemmer.
+
+~~~ perl
+my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
+    analyzers => [ $tokenizer, $normalizer ],
+);
+~~~
+
+The original choice of a stock English EasyAnalyzer probably still yields the
+best results for this document collection, but you get the idea: sometimes you
+want a different Analyzer.
+
+## When the best Analyzer is no Analyzer
+
+Sometimes you don't want an Analyzer at all.  That was true for our "url"
+field because we didn't need it to be searchable, but it's also true for
+certain types of searchable fields.  For instance, "category" fields are often
+set up to match exactly or not at all, as are fields like "last_name" (because
+you may not want to conflate results for "Humphrey" and "Humphries").
+
+To specify that there should be no analysis performed at all, use StringType:
+
+~~~ perl
+my $type = Lucy::Plan::StringType->new;
+$schema->spec_field( name => 'category', type => $type );
+~~~
+
+## Highlighting up next
+
+In our next tutorial chapter, [](cfish:HighlighterTutorial),
+we'll add highlighted excerpts from the "content" field to our search results.
+
+

http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Tutorial/BeyondSimpleTutorial.md
----------------------------------------------------------------------
diff --git a/core/Lucy/Docs/Tutorial/BeyondSimpleTutorial.md b/core/Lucy/Docs/Tutorial/BeyondSimpleTutorial.md
new file mode 100644
index 0000000..00c8e71
--- /dev/null
+++ b/core/Lucy/Docs/Tutorial/BeyondSimpleTutorial.md
@@ -0,0 +1,125 @@
+# A more flexible app structure.
+
+## Goal
+
+In this tutorial chapter, we'll refactor the apps we built in
+[](cfish:SimpleTutorial) so that they look exactly the same from
+the end user's point of view, but offer the developer greater possibilities for
+expansion.  
+
+To achieve this, we'll ditch Lucy::Simple and replace it with the
+classes that it uses internally:
+
+* [](cfish:lucy.Schema) - Plan out your index.
+* [](cfish:lucy.FullTextType) - Field type for full text search.
+* [](cfish:lucy.EasyAnalyzer) - A one-size-fits-all parser/tokenizer.
+* [](cfish:lucy.Indexer) - Manipulate index content.
+* [](cfish:lucy.IndexSearcher) - Search an index.
+* [](cfish:lucy.Hits) - Iterate over hits returned by a Searcher.
+
+## Adaptations to indexer.pl
+
+After we load our modules...
+
+~~~ perl
+use Lucy::Plan::Schema;
+use Lucy::Plan::FullTextType;
+use Lucy::Analysis::EasyAnalyzer;
+use Lucy::Index::Indexer;
+~~~
+
+... the first item we're going to need is a [](cfish:lucy.Schema).
+
+The primary job of a Schema is to specify what fields are available and how
+they're defined.  We'll start off with three fields: title, content and url.
+
+~~~ perl
+# Create Schema.
+my $schema = Lucy::Plan::Schema->new;
+my $easyanalyzer = Lucy::Analysis::EasyAnalyzer->new(
+    language => 'en',
+);
+my $type = Lucy::Plan::FullTextType->new(
+    analyzer => $easyanalyzer,
+);
+$schema->spec_field( name => 'title',   type => $type );
+$schema->spec_field( name => 'content', type => $type );
+$schema->spec_field( name => 'url',     type => $type );
+~~~
+
+All of the fields are spec'd out using the "FullTextType" FieldType,
+indicating that they will be searchable as "full text" -- which means that
+they can be searched for individual words.  The "analyzer", which is unique to
+FullTextType fields, is what breaks up the text into searchable tokens.
+
+Next, we'll swap our Lucy::Simple object out for a Lucy::Index::Indexer.
+The substitution will be straightforward because Simple has merely been
+serving as a thin wrapper around an inner Indexer, and we'll just be peeling
+away the wrapper.
+
+First, replace the constructor:
+
+~~~ perl
+# Create Indexer.
+my $indexer = Lucy::Index::Indexer->new(
+    index    => $path_to_index,
+    schema   => $schema,
+    create   => 1,
+    truncate => 1,
+);
+~~~
+
+Next, have the `$indexer` object `add_doc` where we were having the
+`$lucy` object `add_doc` before:
+
+~~~ perl
+foreach my $filename (@filenames) {
+    my $doc = parse_file($filename);
+    $indexer->add_doc($doc);
+}
+~~~
+
+There's only one extra step required: at the end of the app, you must call
+commit() explicitly to close the indexing session and commit your changes.
+(Lucy::Simple hides this detail, calling commit() implicitly when it needs to).
+
+~~~ perl
+$indexer->commit;
+~~~
+
+## Adaptations to search.cgi
+
+In our search app as in our indexing app, Lucy::Simple has served as a
+thin wrapper -- this time around [](cfish:lucy.IndexSearcher) and
+[](cfish:lucy.Hits).  Swapping out Simple for these two classes is
+also straightforward:
+
+~~~ perl
+use Lucy::Search::IndexSearcher;
+
+my $searcher = Lucy::Search::IndexSearcher->new( 
+    index => $path_to_index,
+);
+my $hits = $searcher->hits(    # returns a Hits object, not a hit count
+    query      => $q,
+    offset     => $offset,
+    num_wanted => $page_size,
+);
+my $hit_count = $hits->total_hits;  # get the hit count here
+
+...
+
+while ( my $hit = $hits->next ) {
+    ...
+}
+~~~
+
+## Hooray!
+
+Congratulations!  Your apps do the same thing as before... but now they'll be
+easier to customize.  
+
+In our next chapter, [](cfish:FieldTypeTutorial), we'll explore
+how to assign different behaviors to different fields.
+
+

http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Tutorial/FieldTypeTutorial.md
----------------------------------------------------------------------
diff --git a/core/Lucy/Docs/Tutorial/FieldTypeTutorial.md b/core/Lucy/Docs/Tutorial/FieldTypeTutorial.md
new file mode 100644
index 0000000..fe6885a
--- /dev/null
+++ b/core/Lucy/Docs/Tutorial/FieldTypeTutorial.md
@@ -0,0 +1,60 @@
+# Specify per-field properties and behaviors.
+
+The Schema we used in the last chapter specifies three fields: 
+
+~~~ perl
+my $type = Lucy::Plan::FullTextType->new(
+    analyzer => $polyanalyzer,
+);
+$schema->spec_field( name => 'title',   type => $type );
+$schema->spec_field( name => 'content', type => $type );
+$schema->spec_field( name => 'url',     type => $type );
+~~~
+
+Since they are all defined as "full text" fields, they are all searchable --
+including the `url` field, a dubious choice.  Some URLs contain meaningful
+information, but these don't, really:
+
+    http://example.com/us_constitution/amend1.txt
+
+We may as well not bother indexing the URL content.  To achieve that we need
+to assign the `url` field to a different FieldType.  
+
+## StringType
+
+Instead of FullTextType, we'll use a
+[](cfish:lucy.StringType), which doesn't use an
+Analyzer to break up text into individual tokens.  Furthermore, we'll mark
+this StringType as unindexed, so that its content won't be searchable at all.
+
+~~~ perl
+my $url_type = Lucy::Plan::StringType->new( indexed => 0 );
+$schema->spec_field( name => 'url', type => $url_type );
+~~~
+
+To observe the change in behavior, try searching for `us_constitution` both
+before and after changing the Schema and re-indexing.
+
+## Toggling 'stored'
+
+For a taste of other FieldType possibilities, try turning off `stored` for
+one or more fields.
+
+~~~ perl
+my $content_type = Lucy::Plan::FullTextType->new(
+    analyzer => $polyanalyzer,
+    stored   => 0,
+);
+~~~
+
+Turning off `stored` for either `title` or `url` mangles our results page,
+but since we're not displaying `content`, turning it off for `content` has
+no effect -- except on index size.
+
+## Analyzers up next
+
+Analyzers play a crucial role in the behavior of FullTextType fields.  In our
+next tutorial chapter, [](cfish:AnalysisTutorial), we'll see how
+changing up the Analyzer changes search results.
+
+

http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Tutorial/HighlighterTutorial.md
----------------------------------------------------------------------
diff --git a/core/Lucy/Docs/Tutorial/HighlighterTutorial.md b/core/Lucy/Docs/Tutorial/HighlighterTutorial.md
new file mode 100644
index 0000000..857ee01
--- /dev/null
+++ b/core/Lucy/Docs/Tutorial/HighlighterTutorial.md
@@ -0,0 +1,62 @@
+# Augment search results with highlighted excerpts.
+
+Adding relevant excerpts with highlighted search terms to your search results
+display makes it much easier for end users to scan the page and assess which
+hits look promising, dramatically improving their search experience.
+
+## Adaptations to indexer.pl
+
+[](cfish:lucy.Highlighter) uses information generated at index
+time.  To save resources, highlighting is disabled by default and must be
+turned on for individual fields.
+
+~~~ perl
+my $highlightable = Lucy::Plan::FullTextType->new(
+    analyzer      => $polyanalyzer,
+    highlightable => 1,
+);
+$schema->spec_field( name => 'content', type => $highlightable );
+~~~
+
+## Adaptations to search.cgi
+
+To add highlighting and excerpting to the search.cgi sample app, create a
+`$highlighter` object outside the hits iterating loop...
+
+~~~ perl
+my $highlighter = Lucy::Highlight::Highlighter->new(
+    searcher => $searcher,
+    query    => $q,
+    field    => 'content'
+);
+~~~
+
+... then modify the loop and the per-hit display to generate and include the
+excerpt.
+
+~~~ perl
+# Create result list.
+my $report = '';
+while ( my $hit = $hits->next ) {
+    my $score   = sprintf( "%0.3f", $hit->get_score );
+    my $excerpt = $highlighter->create_excerpt($hit);
+    $report .= qq|
+        <p>
+          <a href="$hit->{url}"><strong>$hit->{title}</strong></a>
+          <em>$score</em>
+          <br />
+          $excerpt
+          <br />
+          <span class="excerptURL">$hit->{url}</span>
+        </p>
+    |;
+}
+~~~
+
+## Next chapter: Query objects
+
+Our next tutorial chapter, [](cfish:QueryObjectsTutorial),
+illustrates how to build an "advanced search" interface using
+[](cfish:lucy.Query) objects instead of query strings.
+
+

http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Tutorial/QueryObjectsTutorial.md
----------------------------------------------------------------------
diff --git a/core/Lucy/Docs/Tutorial/QueryObjectsTutorial.md b/core/Lucy/Docs/Tutorial/QueryObjectsTutorial.md
new file mode 100644
index 0000000..53d4cea
--- /dev/null
+++ b/core/Lucy/Docs/Tutorial/QueryObjectsTutorial.md
@@ -0,0 +1,185 @@
+# Use Query objects instead of query strings.
+
+Until now, our search app has had only a single search box.  In this tutorial
+chapter, we'll move towards an "advanced search" interface, by adding a
+"category" drop-down menu.  Three new classes will be required:
+
+* [](cfish:lucy.QueryParser) - Turn a query string into a
+  [](cfish:lucy.Query) object.
+
+* [](cfish:lucy.TermQuery) - Query for a specific term within
+  a specific field.
+
+* [](cfish:lucy.ANDQuery) - "AND" together multiple Query
+  objects to produce an intersected result set.
+
+## Adaptations to indexer.pl
+
+Our new "category" field will be a StringType field rather than a FullTextType
+field, because we will only be looking for exact matches.  It needs to be
+indexed, but since we won't display its value, it doesn't need to be stored.
+
+~~~ perl
+my $cat_type = Lucy::Plan::StringType->new( stored => 0 );
+$schema->spec_field( name => 'category', type => $cat_type );
+~~~
+
+There will be three possible values: "article", "amendment", and "preamble",
+which we'll hack out of the source file's name during our `parse_file`
+subroutine:
+
+~~~ perl
+my $category
+    = $filename =~ /art/      ? 'article'
+    : $filename =~ /amend/    ? 'amendment'
+    : $filename =~ /preamble/ ? 'preamble'
+    :                           die "Can't derive category for $filename";
+return {
+    title    => $title,
+    content  => $bodytext,
+    url      => "/us_constitution/$filename",
+    category => $category,
+};
+~~~
+
+## Adaptations to search.cgi
+
+The "category" constraint will be added to our search interface using an HTML
+"select" element (this routine will need to be integrated into the HTML
+generation section of search.cgi):
+
+~~~ perl
+# Build up the HTML "select" object for the "category" field.
+sub generate_category_select {
+    my $cat = shift;
+    my $select = qq|
+      <select name="category">
+        <option value="">All Sections</option>
+        <option value="article">Articles</option>
+        <option value="amendment">Amendments</option>
+      </select>|;
+    if ($cat) {
+        $select =~ s/"$cat"/"$cat" selected/;
+    }
+    return $select;
+}
+~~~
+
+We'll start off by loading our new modules and extracting our new CGI
+parameter.
+
+~~~ perl
+use Lucy::Search::QueryParser;
+use Lucy::Search::TermQuery;
+use Lucy::Search::ANDQuery;
+
+... 
+
+my $category = decode( "UTF-8", $cgi->param('category') || '' );
+~~~
+
+QueryParser's constructor requires a "schema" argument.  We can get that from
+our IndexSearcher:
+
+~~~ perl
+# Create an IndexSearcher and a QueryParser.
+my $searcher = Lucy::Search::IndexSearcher->new( 
+    index => $path_to_index, 
+);
+my $qparser  = Lucy::Search::QueryParser->new( 
+    schema => $searcher->get_schema,
+);
+~~~
+
+Previously, we have been handing raw query strings to IndexSearcher.  Behind
+the scenes, IndexSearcher has been using a QueryParser to turn those query
+strings into Query objects.  Now, we will bring QueryParser into the
+foreground and parse the strings explicitly.
+
+~~~ perl
+my $query = $qparser->parse($q);
+~~~
+
+If the user has specified a category, we'll use an ANDQuery to join our parsed
+query together with a TermQuery representing the category.
+
+~~~ perl
+if ($category) {
+    my $category_query = Lucy::Search::TermQuery->new(
+        field => 'category', 
+        term  => $category,
+    );
+    $query = Lucy::Search::ANDQuery->new(
+        children => [ $query, $category_query ]
+    );
+}
+~~~
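+
+To inspect the combined query, the to_string() method inherited from
+[](cfish:lucy.Query) can help -- a hypothetical debugging line, with the
+printed form shown here being illustrative only:
+
+~~~ perl
+# Hypothetical debug aid; the exact output format may differ.
+warn $query->to_string;    # e.g. (senate AND category:amendment)
+~~~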
+
+Now when we execute the query...
+
+~~~ perl
+# Execute the Query and get a Hits object.
+my $hits = $searcher->hits(
+    query      => $query,
+    offset     => $offset,
+    num_wanted => $page_size,
+);
+~~~
+
+... we'll get a result set which is the intersection of the parsed query and
+the category query.
+
+## Using TermQuery with full text fields
+
+When querying full text fields, the easiest approach is to create query
+objects with QueryParser.  But sometimes you want to create a TermQuery for a
+single term in a FullTextType field directly.  In that case, the search term
+must be run through the field's analyzer so that it gets normalized the same
+way as the field's content.
+
+~~~ perl
+sub make_term_query {
+    my ($field, $term) = @_;
+
+    my $token;
+    my $type = $schema->fetch_type($field);
+
+    if ( $type->isa('Lucy::Plan::FullTextType') ) {
+        # Run the term through the full text analysis chain.
+        my $analyzer = $type->get_analyzer;
+        my $tokens   = $analyzer->split($term);
+
+        if ( @$tokens != 1 ) {
+            # If the term expands to more than one token, or no
+            # tokens at all, it will never match a token in the
+            # full text field.
+            return Lucy::Search::NoMatchQuery->new;
+        }
+
+        $token = $tokens->[0];
+    }
+    else {
+        # Exact match for other types.
+        $token = $term;
+    }
+
+    return Lucy::Search::TermQuery->new(
+        field => $field,
+        term  => $token,
+    );
+}
+~~~
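+
+To see the helper in action -- a hypothetical one-liner, assuming
+`my $schema = $searcher->get_schema;` has been run so that the closed-over
+`$schema` above is in scope -- pass its result straight to the searcher:
+
+~~~ perl
+# Hypothetical usage of make_term_query().
+my $content_query = make_term_query( 'content', 'senate' );
+my $hits = $searcher->hits( query => $content_query );
+~~~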
+
+## Congratulations!
+
+You've made it to the end of the tutorial.
+
+## See Also
+
+For additional thematic documentation, see the Apache Lucy
+[](cfish:Cookbook).
+
+ANDQuery has a companion class, [](cfish:lucy.ORQuery), and a
+close relative, [](cfish:lucy.RequiredOptionalQuery).
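+
+A minimal sketch of each, reusing the `$query` and `$category_query` objects
+from this chapter (constructor parameters as per the class documentation):
+
+~~~ perl
+# Match documents satisfying either sub-query.
+my $or_query = Lucy::Search::ORQuery->new(
+    children => [ $query, $category_query ],
+);
+
+# Require the first sub-query; the second only influences scoring.
+my $req_opt_query = Lucy::Search::RequiredOptionalQuery->new(
+    required_query => $query,
+    optional_query => $category_query,
+);
+~~~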
+
+

http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Tutorial/SimpleTutorial.md
----------------------------------------------------------------------
diff --git a/core/Lucy/Docs/Tutorial/SimpleTutorial.md b/core/Lucy/Docs/Tutorial/SimpleTutorial.md
new file mode 100644
index 0000000..83883e7
--- /dev/null
+++ b/core/Lucy/Docs/Tutorial/SimpleTutorial.md
@@ -0,0 +1,298 @@
+# Bare-bones search app.
+
+## Setup
+
+Copy the text presentation of the US Constitution from the `sample` directory
+of the Apache Lucy distribution to the base level of your web server's
+`htdocs` directory.
+
+    $ cp -R sample/us_constitution /usr/local/apache2/htdocs/
+
+## Indexing: indexer.pl
+
+Our first task will be to create an application called `indexer.pl` which
+builds a searchable "inverted index" from a collection of documents.  
+
+After we specify some configuration variables and load all necessary
+modules...
+
+~~~ perl
+#!/usr/local/bin/perl
+use strict;
+use warnings;
+
+# (Change configuration variables as needed.)
+my $path_to_index = '/path/to/index';
+my $uscon_source  = '/usr/local/apache2/htdocs/us_constitution';
+
+use Lucy::Simple;
+use File::Spec::Functions qw( catfile );
+~~~
+
+... we'll start by creating a Lucy::Simple object, telling it where we'd
+like the index to be located and the language of the source material.
+
+~~~ perl
+my $lucy = Lucy::Simple->new(
+    path     => $path_to_index,
+    language => 'en',
+);
+~~~
+
+Next, we'll add a subroutine which parses our sample documents.
+
+~~~ perl
+# Parse a file from our US Constitution collection and return a hashref with
+# the fields title, body, and url.
+sub parse_file {
+    my $filename = shift;
+    my $filepath = catfile( $uscon_source, $filename );
+    open( my $fh, '<', $filepath ) or die "Can't open '$filepath': $!";
+    my $text = do { local $/; <$fh> };    # slurp file content
+    $text =~ /\A(.+?)^\s+(.*)/ms
+        or die "Can't extract title/bodytext from '$filepath'";
+    my $title    = $1;
+    my $bodytext = $2;
+    return {
+        title    => $title,
+        content  => $bodytext,
+        url      => "/us_constitution/$filename",
+    };
+}
+~~~
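+
+As a quick sanity check -- a hypothetical one-off, where the filename and the
+printed title are illustrative guesses rather than verified output -- the
+subroutine returns a plain hashref:
+
+~~~ perl
+# Hypothetical spot check of parse_file().
+my $doc = parse_file('amend1.txt');
+print "$doc->{title}\n";    # the first line of the file
+print "$doc->{url}\n";      # "/us_constitution/amend1.txt"
+~~~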
+
+Add some elementary directory reading code...
+
+~~~ perl
+# Collect names of source files.
+opendir( my $dh, $uscon_source )
+    or die "Couldn't opendir '$uscon_source': $!";
+my @filenames = grep { $_ =~ /\.txt$/ } readdir $dh;
+~~~
+
+... and now we're ready for the meat of indexer.pl -- which occupies exactly
+one line of code.
+
+~~~ perl
+foreach my $filename (@filenames) {
+    my $doc = parse_file($filename);
+    $lucy->add_doc($doc);  # ta-da!
+}
+~~~
+
+## Search: search.cgi
+
+As with our indexing app, the bulk of the code in our search script won't be
+Lucy-specific.  
+
+The beginning is dedicated to CGI processing and configuration.
+
+~~~ perl
+#!/usr/local/bin/perl -T
+use strict;
+use warnings;
+
+# (Change configuration variables as needed.)
+my $path_to_index = '/path/to/index';
+
+use CGI;
+use List::Util qw( max min );
+use POSIX qw( ceil );
+use Encode qw( decode );
+use Lucy::Simple;
+
+my $cgi       = CGI->new;
+my $q         = decode( "UTF-8", $cgi->param('q') || '' );
+my $offset    = decode( "UTF-8", $cgi->param('offset') || 0 );
+my $page_size = 10;
+~~~
+
+Once that's out of the way, we create our Lucy::Simple object and feed
+it a query string.
+
+~~~ perl
+my $lucy = Lucy::Simple->new(
+    path     => $path_to_index,
+    language => 'en',
+);
+my $hit_count = $lucy->search(
+    query      => $q,
+    offset     => $offset,
+    num_wanted => $page_size,
+);
+~~~
+
+The value returned by search() is the total number of documents in the
+collection which matched the query.  We'll show this hit count to the user,
+and also use it in conjunction with the parameters `offset` and `num_wanted`
+to break up results into "pages" of manageable size.
+
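+The paging arithmetic itself is simple.  Here's a small sketch, separate
+from search.cgi proper, with `$page_number` as a hypothetical 1-based input:
+
+~~~ perl
+use POSIX qw( ceil );
+
+# Map a 1-based page number to an offset, and count the pages.
+my $page_number = 3;
+my $offset      = ( $page_number - 1 ) * $page_size;    # 20
+my $num_pages   = ceil( $hit_count / $page_size );
+~~~
+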
+Calling search() on our Simple object turns it into an iterator. Invoking
+next() now returns hits one at a time as [](cfish:lucy.HitDoc)
+objects, starting with the most relevant.
+
+~~~ perl
+# Create result list.
+my $report = '';
+while ( my $hit = $lucy->next ) {
+    my $score = sprintf( "%0.3f", $hit->get_score );
+    $report .= qq|
+        <p>
+          <a href="$hit->{url}"><strong>$hit->{title}</strong></a>
+          <em>$score</em>
+          <br>
+          <span class="excerptURL">$hit->{url}</span>
+        </p>
+        |;
+}
+~~~
+
+The rest of the script is just text wrangling. 
+
+~~~ perl
+#---------------------------------------------------------------#
+# No tutorial material below this point - just html generation. #
+#---------------------------------------------------------------#
+
+# Generate paging links and hit count, print and exit.
+my $paging_links = generate_paging_info( $q, $hit_count );
+blast_out_content( $q, $report, $paging_links );
+
+# Create html fragment with links for paging through results n-at-a-time.
+sub generate_paging_info {
+    my ( $query_string, $total_hits ) = @_;
+    my $escaped_q = CGI::escapeHTML($query_string);
+    my $paging_info;
+    if ( !length $query_string ) {
+        # No query?  No display.
+        $paging_info = '';
+    }
+    elsif ( $total_hits == 0 ) {
+        # Alert the user that their search failed.
+        $paging_info
+            = qq|<p>No matches for <strong>$escaped_q</strong></p>|;
+    }
+    else {
+        # Calculate the nums for the first and last hit to display.
+        my $last_result = min( ( $offset + $page_size ), $total_hits );
+        my $first_result = min( ( $offset + 1 ), $last_result );
+
+        # Display the result nums, start paging info.
+        $paging_info = qq|
+            <p>
+                Results <strong>$first_result-$last_result</strong> 
+                of <strong>$total_hits</strong> 
+                for <strong>$escaped_q</strong>.
+            </p>
+            <p>
+                Results Page:
+            |;
+
+        # Calculate the first and last result pages to display / link to.
+        my $current_page = int( $first_result / $page_size ) + 1;
+        my $last_page    = ceil( $total_hits / $page_size );
+        my $first_page   = max( 1, ( $current_page - 9 ) );
+        $last_page = min( $last_page, ( $current_page + 10 ) );
+
+        # Create a url for use in paging links.
+        my $href = $cgi->url( -relative => 1 );
+        $href .= "?q=" . CGI::escape($query_string);
+        $href .= ";offset=" . CGI::escape($offset);
+
+        # Generate the "Prev" link.
+        if ( $current_page > 1 ) {
+            my $new_offset = ( $current_page - 2 ) * $page_size;
+            $href =~ s/(?<=offset=)\d+/$new_offset/;
+            $paging_info .= qq|<a href="$href">&lt;= Prev</a>\n|;
+        }
+
+        # Generate paging links.
+        for my $page_num ( $first_page .. $last_page ) {
+            if ( $page_num == $current_page ) {
+                $paging_info .= qq|$page_num \n|;
+            }
+            else {
+                my $new_offset = ( $page_num - 1 ) * $page_size;
+                $href =~ s/(?<=offset=)\d+/$new_offset/;
+                $paging_info .= qq|<a href="$href">$page_num</a>\n|;
+            }
+        }
+
+        # Generate the "Next" link.
+        if ( $current_page != $last_page ) {
+            my $new_offset = $current_page * $page_size;
+            $href =~ s/(?<=offset=)\d+/$new_offset/;
+            $paging_info .= qq|<a href="$href">Next =&gt;</a>\n|;
+        }
+
+        # Close tag.
+        $paging_info .= "</p>\n";
+    }
+
+    return $paging_info;
+}
+
+# Print content to output.
+sub blast_out_content {
+    my ( $query_string, $hit_list, $paging_info ) = @_;
+    my $escaped_q = CGI::escapeHTML($query_string);
+    binmode( STDOUT, ":encoding(UTF-8)" );
+    print qq|Content-type: text/html; charset=UTF-8\n\n|;
+    print qq|
+<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
+    "http://www.w3.org/TR/html4/loose.dtd";>
+<html>
+<head>
+  <meta http-equiv="Content-type" 
+    content="text/html;charset=UTF-8">
+  <link rel="stylesheet" type="text/css" 
+    href="/us_constitution/uscon.css">
+  <title>Lucy: $escaped_q</title>
+</head>
+
+<body>
+
+  <div id="navigation">
+    <form id="usconSearch" action="">
+      <strong>
+        Search the 
+        <a href="/us_constitution/index.html">US Constitution</a>:
+      </strong>
+      <input type="text" name="q" id="q" value="$escaped_q">
+      <input type="submit" value="=&gt;">
+    </form>
+  </div><!--navigation-->
+
+  <div id="bodytext">
+
+  $hit_list
+
+  $paging_info
+
+    <p style="font-size: smaller; color: #666">
+      <em>
+        Powered by <a href="http://lucy.apache.org/"
+        >Apache Lucy<small><sup>TM</sup></small></a>
+      </em>
+    </p>
+  </div><!--bodytext-->
+
+</body>
+
+</html>
+|;
+}
+~~~
+
+## OK... now what?
+
+Lucy::Simple is perfectly adequate for some tasks, but it's not very
+flexible.  Many people find that it lacks at least one or two features they
+can't live without.
+
+In our next tutorial chapter,
+[](cfish:BeyondSimpleTutorial), we'll rewrite our
+indexing and search scripts using the classes that Lucy::Simple hides
+from view, opening up the possibilities for expansion; then, we'll spend the
+rest of the tutorial chapters exploring these possibilities.
+

http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/perl/lib/Lucy/Docs/Cookbook.pod
----------------------------------------------------------------------
diff --git a/perl/lib/Lucy/Docs/Cookbook.pod b/perl/lib/Lucy/Docs/Cookbook.pod
deleted file mode 100644
index 6726db9..0000000
--- a/perl/lib/Lucy/Docs/Cookbook.pod
+++ /dev/null
@@ -1,61 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements.  See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License.  You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-=head1 NAME
-
-Lucy::Docs::Cookbook - Apache Lucy recipes.
-
-=head1 DESCRIPTION
-
-The Cookbook provides thematic documentation covering some of Apache Lucy's
-more sophisticated features.  For a step-by-step introduction to Lucy,
-see L<Lucy::Docs::Tutorial>.
-
-=head2 Chapters
-
-=over
-
-=item *
-
-L<Lucy::Docs::Cookbook::FastUpdates> - While index updates are fast on
-average, worst-case update performance may be significantly slower. To make
-index updates consistently quick, we must manually intervene to control the
-process of index segment consolidation.
-
-=item *
-
-L<Lucy::Docs::Cookbook::CustomQuery> - Explore Lucy's support for
-custom query types by creating a "PrefixQuery" class to handle trailing
-wildcards.
-
-=item *
-
-L<Lucy::Docs::Cookbook::CustomQueryParser> - Define your own custom
-search query syntax using Lucy::Search::QueryParser and
-L<Parse::RecDescent>.
-
-=back
-
-=head2 Materials
-
-Some of the recipes in the Cookbook reference the completed
-L<Tutorial|Lucy::Docs::Tutorial> application.  These materials can be
-found in the C<sample> directory at the root of the Lucy distribution:
-
-    sample/indexer.pl        # indexing app
-    sample/search.cgi        # search app
-    sample/us_constitution   # corpus
-
-

http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/perl/lib/Lucy/Docs/Cookbook/CustomQuery.pod
----------------------------------------------------------------------
diff --git a/perl/lib/Lucy/Docs/Cookbook/CustomQuery.pod b/perl/lib/Lucy/Docs/Cookbook/CustomQuery.pod
deleted file mode 100644
index 2c78bf1..0000000
--- a/perl/lib/Lucy/Docs/Cookbook/CustomQuery.pod
+++ /dev/null
@@ -1,320 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements.  See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License.  You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-=head1 NAME
-
-Lucy::Docs::Cookbook::CustomQuery - Sample subclass of Query.
-
-=head1 ABSTRACT
-
-Explore Apache Lucy's support for custom query types by creating a
-"PrefixQuery" class to handle trailing wildcards.
-
-    my $prefix_query = PrefixQuery->new(
-        field        => 'content',
-        query_string => 'foo*',
-    );
-    my $hits = $searcher->hits( query => $prefix_query );
-    ...
-
-=head1 Query, Compiler, and Matcher 
-
-To add support for a new query type, we need three classes: a Query, a
-Compiler, and a Matcher.  
-
-=over
-
-=item *
-
-PrefixQuery - a subclass of L<Lucy::Search::Query>, and the only class
-that client code will deal with directly.
-
-=item *
-
-PrefixCompiler - a subclass of L<Lucy::Search::Compiler>, whose primary 
-role is to compile a PrefixQuery to a PrefixMatcher.
-
-=item *
-
-PrefixMatcher - a subclass of L<Lucy::Search::Matcher>, which does the
-heavy lifting: it applies the query to individual documents and assigns a
-score to each match.
-
-=back
-
-The PrefixQuery class on its own isn't enough because a Query object's role is
-limited to expressing an abstract specification for the search.  A Query is
-basically nothing but metadata; execution is left to the Query's companion
-Compiler and Matcher.
-
-Here's a simplified sketch illustrating how a Searcher's hits() method ties
-together the three classes.
-
-    sub hits {
-        my ( $self, $query ) = @_;
-        my $compiler = $query->make_compiler(
-            searcher => $self,
-            boost    => $query->get_boost,
-        );
-        my $matcher = $compiler->make_matcher(
-            reader     => $self->get_reader,
-            need_score => 1,
-        );
-        my @hits = $matcher->capture_hits;
-        return \@hits;
-    }
-
-=head2 PrefixQuery
-
-Our PrefixQuery class will have two attributes: a query string and a field
-name.
-
-    package PrefixQuery;
-    use base qw( Lucy::Search::Query );
-    use Carp;
-    use Scalar::Util qw( blessed );
-    
-    # Inside-out member vars and hand-rolled accessors.
-    my %query_string;
-    my %field;
-    sub get_query_string { my $self = shift; return $query_string{$$self} }
-    sub get_field        { my $self = shift; return $field{$$self} }
-
-PrefixQuery's constructor collects and validates the attributes.
-
-    sub new {
-        my ( $class, %args ) = @_;
-        my $query_string = delete $args{query_string};
-        my $field        = delete $args{field};
-        my $self         = $class->SUPER::new(%args);
-        confess("'query_string' param is required")
-            unless defined $query_string;
-        confess("Invalid query_string: '$query_string'")
-            unless $query_string =~ /\*\s*$/;
-        confess("'field' param is required")
-            unless defined $field;
-        $query_string{$$self} = $query_string;
-        $field{$$self}        = $field;
-        return $self;
-    }
-
-Since this is an inside-out class, we'll need a destructor:
-
-    sub DESTROY {
-        my $self = shift;
-        delete $query_string{$$self};
-        delete $field{$$self};
-        $self->SUPER::DESTROY;
-    }
-
-The equals() method determines whether two Queries are logically equivalent:
-
-    sub equals {
-        my ( $self, $other ) = @_;
-        return 0 unless blessed($other);
-        return 0 unless $other->isa("PrefixQuery");
-        return 0 unless $field{$$self} eq $field{$$other};
-        return 0 unless $query_string{$$self} eq $query_string{$$other};
-        return 1;
-    }
-
-The last thing we'll need is a make_compiler() factory method which kicks out
-a subclass of L<Compiler|Lucy::Search::Compiler>.
-
-    sub make_compiler {
-        my ( $self, %args ) = @_;
-        my $subordinate = delete $args{subordinate};
-        my $compiler = PrefixCompiler->new( %args, parent => $self );
-        $compiler->normalize unless $subordinate;
-        return $compiler;
-    }
-
-=head2 PrefixCompiler
-
-PrefixQuery's make_compiler() method will be called internally at search-time
-by objects which subclass L<Lucy::Search::Searcher> -- such as
-L<IndexSearchers|Lucy::Search::IndexSearcher>.
-
-A Searcher is associated with a particular collection of documents.   These
-documents may all reside in one index, as with IndexSearcher, or they may be
-spread out across multiple indexes on one or more machines, as with
-L<LucyX::Remote::ClusterSearcher>.  
-
-Searcher objects have access to certain statistical information about the
-collections they represent; for instance, a Searcher can tell you how many
-documents are in the collection...
-
-    my $maximum_number_of_docs_in_collection = $searcher->doc_max;
-
-... or how many documents a specific term appears in:
-
-    my $term_appears_in_this_many_docs = $searcher->doc_freq(
-        field => 'content',
-        term  => 'foo',
-    );
-
-Such information can be used by sophisticated Compiler implementations to
-assign more or less heft to individual queries or sub-queries.  However, we're
-not going to bother with weighting for this demo; we'll just assign a fixed
-score of 1.0 to each matching document.
-
-We don't need to write a constructor, as it will suffice to inherit new() from
-Lucy::Search::Compiler.  The only method we need to implement for
-PrefixCompiler is make_matcher().
-
-    package PrefixCompiler;
-    use base qw( Lucy::Search::Compiler );
-
-    sub make_matcher {
-        my ( $self, %args ) = @_;
-        my $seg_reader = $args{reader};
-
-        # Retrieve low-level components LexiconReader and PostingListReader.
-        my $lex_reader
-            = $seg_reader->obtain("Lucy::Index::LexiconReader");
-        my $plist_reader
-            = $seg_reader->obtain("Lucy::Index::PostingListReader");
-        
-        # Acquire a Lexicon and seek it to our query string.
-        my $substring = $self->get_parent->get_query_string;
-        $substring =~ s/\*\s*$//;    # strip the trailing wildcard
-        my $field = $self->get_parent->get_field;
-        my $lexicon = $lex_reader->lexicon( field => $field );
-        return unless $lexicon;
-        $lexicon->seek($substring);
-        
-        # Accumulate PostingLists for each matching term.
-        my @posting_lists;
-        while ( defined( my $term = $lexicon->get_term ) ) {
-            last unless $term =~ /^\Q$substring/;
-            my $posting_list = $plist_reader->posting_list(
-                field => $field,
-                term  => $term,
-            );
-            if ($posting_list) {
-                push @posting_lists, $posting_list;
-            }
-            last unless $lexicon->next;
-        }
-        return unless @posting_lists;
-        
-        return PrefixMatcher->new( posting_lists => \@posting_lists );
-    }
-
-PrefixCompiler gets access to a L<SegReader|Lucy::Index::SegReader>
-object when make_matcher() gets called.  From the SegReader and its
-sub-components L<LexiconReader|Lucy::Index::LexiconReader> and
-L<PostingListReader|Lucy::Index::PostingListReader>, we acquire a
-L<Lexicon|Lucy::Index::Lexicon>, scan through the Lexicon's unique
-terms, and acquire a L<PostingList|Lucy::Index::PostingList> for each
-term that matches our prefix.
-
-Each of these PostingList objects represents a set of documents which match
-the query.
-
-=head2 PrefixMatcher
-
-The Matcher subclass is the most involved.  
-
-    package PrefixMatcher;
-    use base qw( Lucy::Search::Matcher );
-    
-    # Inside-out member vars.
-    my %doc_ids;
-    my %tick;
-    
-    sub new {
-        my ( $class, %args ) = @_;
-        my $posting_lists = delete $args{posting_lists};
-        my $self          = $class->SUPER::new(%args);
-        
-        # Cheesy but simple way of interleaving PostingList doc sets.
-        my %all_doc_ids;
-        for my $posting_list (@$posting_lists) {
-            while ( my $doc_id = $posting_list->next ) {
-                $all_doc_ids{$doc_id} = undef;
-            }
-        }
-        my @doc_ids = sort { $a <=> $b } keys %all_doc_ids;
-        $doc_ids{$$self} = \@doc_ids;
-        
-        # Track our position within the array of doc ids.
-        $tick{$$self} = -1;
-        
-        return $self;
-    }
-    
-    sub DESTROY {
-        my $self = shift;
-        delete $doc_ids{$$self};
-        delete $tick{$$self};
-        $self->SUPER::DESTROY;
-    }
-
-The doc ids must be in order, or some will be ignored; hence the C<sort>
-above.
-
-In addition to the constructor and destructor, there are three methods that
-must be overridden.
-
-next() advances the Matcher to the next valid matching doc.  
-
-    sub next {
-        my $self    = shift;
-        my $doc_ids = $doc_ids{$$self};
-        my $tick    = ++$tick{$$self};
-        return 0 if $tick >= scalar @$doc_ids;
-        return $doc_ids->[$tick];
-    }
-
-get_doc_id() returns the current document id, or 0 if the Matcher is
-exhausted.  (L<Document numbers|Lucy::Docs::DocIDs> start at 1, so 0 is
-a sentinel.)
-
-    sub get_doc_id {
-        my $self    = shift;
-        my $tick    = $tick{$$self};
-        my $doc_ids = $doc_ids{$$self};
-        return $tick < scalar @$doc_ids ? $doc_ids->[$tick] : 0;
-    }
-
-score() conveys the relevance score of the current match.  We'll just return a
-fixed score of 1.0:
-
-    sub score { 1.0 }
-
-=head1 Usage 
-
-To get a basic feel for PrefixQuery, insert the FlatQueryParser module
-described in L<Lucy::Docs::Cookbook::CustomQueryParser> (which supports
-PrefixQuery) into the search.cgi sample app.
-
-    my $parser = FlatQueryParser->new( schema => $searcher->get_schema );
-    my $query  = $parser->parse($q);
-
-If you're planning on using PrefixQuery in earnest, though, you may want to
-change up analyzers to avoid stemming, because stemming -- another approach to
-prefix conflation -- is not perfectly compatible with prefix searches.
-
-    # Polyanalyzer with no SnowballStemmer.
-    my $analyzer = Lucy::Analysis::PolyAnalyzer->new(
-        analyzers => [
-            Lucy::Analysis::StandardTokenizer->new,
-            Lucy::Analysis::Normalizer->new,
-        ],
-    );
-
-=cut
-
