Convert POD to Markdown
Project: http://git-wip-us.apache.org/repos/asf/lucy/repo Commit: http://git-wip-us.apache.org/repos/asf/lucy/commit/5618020f Tree: http://git-wip-us.apache.org/repos/asf/lucy/tree/5618020f Diff: http://git-wip-us.apache.org/repos/asf/lucy/diff/5618020f Branch: refs/heads/master Commit: 5618020ff61ba7dac4b7132b5977ad4119e2c220 Parents: c2363da Author: Nick Wellnhofer <[email protected]> Authored: Wed Jul 8 12:57:18 2015 +0200 Committer: Nick Wellnhofer <[email protected]> Committed: Sat Jul 11 15:03:10 2015 +0200 ---------------------------------------------------------------------- core/Lucy/Docs/Cookbook.md | 33 ++ core/Lucy/Docs/Cookbook/CustomQuery.md | 321 +++++++++++++++++++ core/Lucy/Docs/Cookbook/CustomQueryParser.md | 231 +++++++++++++ core/Lucy/Docs/Cookbook/FastUpdates.md | 140 ++++++++ core/Lucy/Docs/DocIDs.md | 28 ++ core/Lucy/Docs/FileFormat.md | 191 +++++++++++ core/Lucy/Docs/IRTheory.md | 44 +++ core/Lucy/Docs/Tutorial.md | 53 +++ core/Lucy/Docs/Tutorial/AnalysisTutorial.md | 85 +++++ core/Lucy/Docs/Tutorial/BeyondSimpleTutorial.md | 125 ++++++++ core/Lucy/Docs/Tutorial/FieldTypeTutorial.md | 60 ++++ core/Lucy/Docs/Tutorial/HighlighterTutorial.md | 62 ++++ core/Lucy/Docs/Tutorial/QueryObjectsTutorial.md | 185 +++++++++++ core/Lucy/Docs/Tutorial/SimpleTutorial.md | 298 +++++++++++++++++ perl/lib/Lucy/Docs/Cookbook.pod | 61 ---- perl/lib/Lucy/Docs/Cookbook/CustomQuery.pod | 320 ------------------ .../Lucy/Docs/Cookbook/CustomQueryParser.pod | 236 -------------- perl/lib/Lucy/Docs/Cookbook/FastUpdates.pod | 153 --------- perl/lib/Lucy/Docs/DocIDs.pod | 47 --- perl/lib/Lucy/Docs/FileFormat.pod | 239 -------------- perl/lib/Lucy/Docs/IRTheory.pod | 94 ------ perl/lib/Lucy/Docs/Tutorial.pod | 89 ----- perl/lib/Lucy/Docs/Tutorial/Analysis.pod | 94 ------ perl/lib/Lucy/Docs/Tutorial/BeyondSimple.pod | 153 --------- perl/lib/Lucy/Docs/Tutorial/FieldType.pod | 74 ----- perl/lib/Lucy/Docs/Tutorial/Highlighter.pod | 76 ----- 
perl/lib/Lucy/Docs/Tutorial/QueryObjects.pod | 198 ------------ perl/lib/Lucy/Docs/Tutorial/Simple.pod | 298 ----------------- 28 files changed, 1856 insertions(+), 2132 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Cookbook.md ---------------------------------------------------------------------- diff --git a/core/Lucy/Docs/Cookbook.md b/core/Lucy/Docs/Cookbook.md new file mode 100644 index 0000000..ec6994f --- /dev/null +++ b/core/Lucy/Docs/Cookbook.md @@ -0,0 +1,33 @@ +# Apache Lucy recipes + +The Cookbook provides thematic documentation covering some of Apache Lucy's +more sophisticated features. For a step-by-step introduction to Lucy, +see [](cfish:Tutorial). + +## Chapters + +* [](cfish:FastUpdates) - While index updates are fast on + average, worst-case update performance may be significantly slower. To make + index updates consistently quick, we must manually intervene to control the + process of index segment consolidation. + +* [](cfish:CustomQuery) - Explore Lucy's support for + custom query types by creating a "PrefixQuery" class to handle trailing + wildcards. + +* [](cfish:CustomQueryParser) - Define your own custom + search query syntax using [](cfish:lucy.QueryParser) and + Parse::RecDescent. + +## Materials + +Some of the recipes in the Cookbook reference the completed +[](cfish:Tutorial) application. 
These materials can be +found in the `sample` directory at the root of the Lucy distribution: + +~~~ perl +sample/indexer.pl # indexing app +sample/search.cgi # search app +sample/us_constitution # corpus +~~~ + http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Cookbook/CustomQuery.md ---------------------------------------------------------------------- diff --git a/core/Lucy/Docs/Cookbook/CustomQuery.md b/core/Lucy/Docs/Cookbook/CustomQuery.md new file mode 100644 index 0000000..d135c8b --- /dev/null +++ b/core/Lucy/Docs/Cookbook/CustomQuery.md @@ -0,0 +1,321 @@ +# Sample subclass of Query + +Explore Apache Lucy's support for custom query types by creating a +"PrefixQuery" class to handle trailing wildcards. + +~~~ perl +my $prefix_query = PrefixQuery->new( + field => 'content', + query_string => 'foo*', +); +my $hits = $searcher->hits( query => $prefix_query ); +... +~~~ + +## Query, Compiler, and Matcher + +To add support for a new query type, we need three classes: a Query, a +Compiler, and a Matcher. + +* PrefixQuery - a subclass of [](cfish:lucy.Query), and the only class + that client code will deal with directly. + +* PrefixCompiler - a subclass of [](cfish:lucy.Compiler), whose primary + role is to compile a PrefixQuery to a PrefixMatcher. + +* PrefixMatcher - a subclass of [](cfish:lucy.Matcher), which does the + heavy lifting: it applies the query to individual documents and assigns a + score to each match. + +The PrefixQuery class on its own isn't enough because a Query object's role is +limited to expressing an abstract specification for the search. A Query is +basically nothing but metadata; execution is left to the Query's companion +Compiler and Matcher. + +Here's a simplified sketch illustrating how a Searcher's hits() method ties +together the three classes. 
+ +~~~ perl +sub hits { + my ( $self, $query ) = @_; + my $compiler = $query->make_compiler( + searcher => $self, + boost => $query->get_boost, + ); + my $matcher = $compiler->make_matcher( + reader => $self->get_reader, + need_score => 1, + ); + my @hits = $matcher->capture_hits; + return \@hits; +} +~~~ + +### PrefixQuery + +Our PrefixQuery class will have two attributes: a query string and a field +name. + +~~~ perl +package PrefixQuery; +use base qw( Lucy::Search::Query ); +use Carp; +use Scalar::Util qw( blessed ); + +# Inside-out member vars and hand-rolled accessors. +my %query_string; +my %field; +sub get_query_string { my $self = shift; return $query_string{$$self} } +sub get_field { my $self = shift; return $field{$$self} } +~~~ + +PrefixQuery's constructor collects and validates the attributes. + +~~~ perl +sub new { + my ( $class, %args ) = @_; + my $query_string = delete $args{query_string}; + my $field = delete $args{field}; + my $self = $class->SUPER::new(%args); + confess("'query_string' param is required") + unless defined $query_string; + confess("Invalid query_string: '$query_string'") + unless $query_string =~ /\*\s*$/; + confess("'field' param is required") + unless defined $field; + $query_string{$$self} = $query_string; + $field{$$self} = $field; + return $self; +} +~~~ + +Since this is an inside-out class, we'll need a destructor: + +~~~ perl +sub DESTROY { + my $self = shift; + delete $query_string{$$self}; + delete $field{$$self}; + $self->SUPER::DESTROY; +} +~~~ + +The equals() method determines whether two Queries are logically equivalent: + +~~~ perl +sub equals { + my ( $self, $other ) = @_; + return 0 unless blessed($other); + return 0 unless $other->isa("PrefixQuery"); + return 0 unless $field{$$self} eq $field{$$other}; + return 0 unless $query_string{$$self} eq $query_string{$$other}; + return 1; +} +~~~ + +The last thing we'll need is a make_compiler() factory method which kicks out +a subclass of [](cfish:lucy.Compiler). 
+ +~~~ perl +sub make_compiler { + my ( $self, %args ) = @_; + my $subordinate = delete $args{subordinate}; + my $compiler = PrefixCompiler->new( %args, parent => $self ); + $compiler->normalize unless $subordinate; + return $compiler; +} +~~~ + +### PrefixCompiler + +PrefixQuery's make_compiler() method will be called internally at search-time +by objects which subclass [](cfish:lucy.Searcher) -- such as +[IndexSearchers](cfish:lucy.IndexSearcher). + +A Searcher is associated with a particular collection of documents. These +documents may all reside in one index, as with IndexSearcher, or they may be +spread out across multiple indexes on one or more machines, as with +[](cfish:ClusterSearcher). + +Searcher objects have access to certain statistical information about the +collections they represent; for instance, a Searcher can tell you how many +documents are in the collection... + +~~~ perl +my $maximum_number_of_docs_in_collection = $searcher->doc_max; +~~~ + +... or how many documents a specific term appears in: + +~~~ perl +my $term_appears_in_this_many_docs = $searcher->doc_freq( + field => 'content', + term => 'foo', +); +~~~ + +Such information can be used by sophisticated Compiler implementations to +assign more or less heft to individual queries or sub-queries. However, we're +not going to bother with weighting for this demo; we'll just assign a fixed +score of 1.0 to each matching document. + +We don't need to write a constructor, as it will suffice to inherit new() from +Lucy::Search::Compiler. The only method we need to implement for +PrefixCompiler is make_matcher(). + +~~~ perl +package PrefixCompiler; +use base qw( Lucy::Search::Compiler ); + +sub make_matcher { + my ( $self, %args ) = @_; + my $seg_reader = $args{reader}; + + # Retrieve low-level components LexiconReader and PostingListReader.
+ my $lex_reader + = $seg_reader->obtain("Lucy::Index::LexiconReader"); + my $plist_reader + = $seg_reader->obtain("Lucy::Index::PostingListReader"); + + # Acquire a Lexicon and seek it to our query string. + my $substring = $self->get_parent->get_query_string; + $substring =~ s/\*\s*$//; + my $field = $self->get_parent->get_field; + my $lexicon = $lex_reader->lexicon( field => $field ); + return unless $lexicon; + $lexicon->seek($substring); + + # Accumulate PostingLists for each matching term. + my @posting_lists; + while ( defined( my $term = $lexicon->get_term ) ) { + last unless $term =~ /^\Q$substring/; + my $posting_list = $plist_reader->posting_list( + field => $field, + term => $term, + ); + if ($posting_list) { + push @posting_lists, $posting_list; + } + last unless $lexicon->next; + } + return unless @posting_lists; + + return PrefixMatcher->new( posting_lists => \@posting_lists ); +} +~~~ + +PrefixCompiler gets access to a [](cfish:lucy.SegReader) +object when make_matcher() gets called. From the SegReader and its +sub-components [](cfish:lucy.LexiconReader) and +[](cfish:lucy.PostingListReader), we acquire a +[](cfish:lucy.Lexicon), scan through the Lexicon's unique +terms, and acquire a [](cfish:lucy.PostingList) for each +term that matches our prefix. + +Each of these PostingList objects represents a set of documents which match +the query. + +### PrefixMatcher + +The Matcher subclass is the most involved. + +~~~ perl +package PrefixMatcher; +use base qw( Lucy::Search::Matcher ); + +# Inside-out member vars. +my %doc_ids; +my %tick; + +sub new { + my ( $class, %args ) = @_; + my $posting_lists = delete $args{posting_lists}; + my $self = $class->SUPER::new(%args); + + # Cheesy but simple way of interleaving PostingList doc sets.
+ my %all_doc_ids; + for my $posting_list (@$posting_lists) { + while ( my $doc_id = $posting_list->next ) { + $all_doc_ids{$doc_id} = undef; + } + } + my @doc_ids = sort { $a <=> $b } keys %all_doc_ids; + $doc_ids{$$self} = \@doc_ids; + + # Track our position within the array of doc ids. + $tick{$$self} = -1; + + return $self; +} + +sub DESTROY { + my $self = shift; + delete $doc_ids{$$self}; + delete $tick{$$self}; + $self->SUPER::DESTROY; +} +~~~ + +The doc ids must be in order, or some will be ignored; hence the `sort` +above. + +In addition to the constructor and destructor, there are three methods that +must be overridden. + +next() advances the Matcher to the next valid matching doc. + +~~~ perl +sub next { + my $self = shift; + my $doc_ids = $doc_ids{$$self}; + my $tick = ++$tick{$$self}; + return 0 if $tick >= scalar @$doc_ids; + return $doc_ids->[$tick]; +} +~~~ + +get_doc_id() returns the current document id, or 0 if the Matcher is +exhausted. ([Document numbers](cfish:DocIDs) start at 1, so 0 is +a sentinel.) + +~~~ perl +sub get_doc_id { + my $self = shift; + my $tick = $tick{$$self}; + my $doc_ids = $doc_ids{$$self}; + return $tick < scalar @$doc_ids ? $doc_ids->[$tick] : 0; +} +~~~ + +score() conveys the relevance score of the current match. We'll just return a +fixed score of 1.0: + +~~~ perl +sub score { 1.0 } +~~~ + +## Usage + +To get a basic feel for PrefixQuery, insert the FlatQueryParser module +described in [](cfish:CustomQueryParser) (which supports +PrefixQuery) into the search.cgi sample app. + +~~~ perl +my $parser = FlatQueryParser->new( schema => $searcher->get_schema ); +my $query = $parser->parse($q); +~~~ + +If you're planning on using PrefixQuery in earnest, though, you may want to +change up analyzers to avoid stemming, because stemming -- another approach to +prefix conflation -- is not perfectly compatible with prefix searches. + +~~~ perl +# PolyAnalyzer with no SnowballStemmer.
+my $analyzer = Lucy::Analysis::PolyAnalyzer->new( + analyzers => [ + Lucy::Analysis::StandardTokenizer->new, + Lucy::Analysis::Normalizer->new, + ], +); +~~~ + http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Cookbook/CustomQueryParser.md ---------------------------------------------------------------------- diff --git a/core/Lucy/Docs/Cookbook/CustomQueryParser.md b/core/Lucy/Docs/Cookbook/CustomQueryParser.md new file mode 100644 index 0000000..39b1167 --- /dev/null +++ b/core/Lucy/Docs/Cookbook/CustomQueryParser.md @@ -0,0 +1,231 @@ +# Sample subclass of QueryParser. + +Implement a custom search query language using a subclass of +[](cfish:lucy.QueryParser). + +## The language + +At first, our query language will support only simple term queries and phrases +delimited by double quotes. For simplicity's sake, it will not support +parenthetical groupings, boolean operators, or prepended plus/minus. The +results for all subqueries will be unioned together -- i.e. joined using an OR +-- which is usually the best approach for small-to-medium-sized document +collections. + +Later, we'll add support for trailing wildcards. + +## Single-field parser + +Our initial parser implementation will generate queries against a single fixed +field, "content", and it will analyze text using a fixed choice of English +EasyAnalyzer. We won't subclass Lucy::Search::QueryParser just yet.
+ +~~~ perl +package FlatQueryParser; +use Lucy::Search::TermQuery; +use Lucy::Search::PhraseQuery; +use Lucy::Search::ORQuery; +use Carp; + +sub new { + my $analyzer = Lucy::Analysis::EasyAnalyzer->new( + language => 'en', + ); + return bless { + field => 'content', + analyzer => $analyzer, + }, __PACKAGE__; +} +~~~ + +Some private helper subs for creating TermQuery and PhraseQuery objects will +help keep the size of our main parse() subroutine down: + +~~~ perl +sub _make_term_query { + my ( $self, $term ) = @_; + return Lucy::Search::TermQuery->new( + field => $self->{field}, + term => $term, + ); +} + +sub _make_phrase_query { + my ( $self, $terms ) = @_; + return Lucy::Search::PhraseQuery->new( + field => $self->{field}, + terms => $terms, + ); +} +~~~ + +Our private \_tokenize() method treats double-quote delimited material as a +single token and splits on whitespace everywhere else. + +~~~ perl +sub _tokenize { + my ( $self, $query_string ) = @_; + my @tokens; + while ( length $query_string ) { + if ( $query_string =~ s/^\s+// ) { + next; # skip whitespace + } + elsif ( $query_string =~ s/^("[^"]*(?:"|$))// ) { + push @tokens, $1; # double-quoted phrase + } + else { + $query_string =~ s/(\S+)//; + push @tokens, $1; # single word + } + } + return \@tokens; +} +~~~ + +The main parsing routine creates an array of tokens by calling \_tokenize(), +runs the tokens through the EasyAnalyzer, creates TermQuery or +PhraseQuery objects according to how many tokens emerge from the +EasyAnalyzer's split() method, and adds each of the sub-queries to the primary +ORQuery.
+ +~~~ perl +sub parse { + my ( $self, $query_string ) = @_; + my $tokens = $self->_tokenize($query_string); + my $analyzer = $self->{analyzer}; + my $or_query = Lucy::Search::ORQuery->new; + + for my $token (@$tokens) { + if ( $token =~ s/^"// ) { + $token =~ s/"$//; + my $terms = $analyzer->split($token); + my $query = $self->_make_phrase_query($terms); + $or_query->add_child($query); + } + else { + my $terms = $analyzer->split($token); + if ( @$terms == 1 ) { + my $query = $self->_make_term_query( $terms->[0] ); + $or_query->add_child($query); + } + elsif ( @$terms > 1 ) { + my $query = $self->_make_phrase_query($terms); + $or_query->add_child($query); + } + } + } + + return $or_query; +} +~~~ + +## Multi-field parser + +Most often, the end user will want their search query to match not only a +single 'content' field, but also 'title' and so on. To make that happen, we +have to turn queries such as this... + + foo AND NOT bar + +... into the logical equivalent of this: + + (title:foo OR content:foo) AND NOT (title:bar OR content:bar) + +Rather than continue with our own from-scratch parser class and write the +routines to accomplish that expansion, we're now going to subclass Lucy::Search::QueryParser +and take advantage of some of its existing methods. + +Our first parser implementation had the "content" field name and the choice of +English EasyAnalyzer hard-coded for simplicity, but we don't need to do that +once we subclass Lucy::Search::QueryParser. QueryParser's constructor -- +which we will inherit, allowing us to eliminate our own constructor -- +requires a Schema which conveys field +and Analyzer information, so we can just defer to that.
+ +~~~ perl +package FlatQueryParser; +use base qw( Lucy::Search::QueryParser ); +use Lucy::Search::TermQuery; +use Lucy::Search::PhraseQuery; +use Lucy::Search::ORQuery; +use PrefixQuery; +use Carp; + +# Inherit new() +~~~ + +We're also going to jettison our \_make_term_query() and \_make_phrase_query() +helper subs and chop our parse() subroutine way down. Our revised parse() +routine will generate Lucy::Search::LeafQuery objects instead of TermQueries +and PhraseQueries: + +~~~ perl +sub parse { + my ( $self, $query_string ) = @_; + my $tokens = $self->_tokenize($query_string); + my $or_query = Lucy::Search::ORQuery->new; + for my $token (@$tokens) { + my $leaf_query = Lucy::Search::LeafQuery->new( text => $token ); + $or_query->add_child($leaf_query); + } + return $self->expand($or_query); +} +~~~ + +The magic happens in QueryParser's expand() method, which walks the ORQuery +object we supply to it looking for LeafQuery objects, and calls expand_leaf() +for each one it finds. expand_leaf() performs field-specific analysis, +decides whether each query should be a TermQuery or a PhraseQuery, and if +multiple fields are required, creates an ORQuery which expands e.g. `foo` +into `(title:foo OR content:foo)`. + +## Extending the query language + +To add support for trailing wildcards to our query language, we need to +override expand_leaf() to accommodate PrefixQuery, while deferring to the +parent class implementation on TermQuery and PhraseQuery.
+ +~~~ perl +sub expand_leaf { + my ( $self, $leaf_query ) = @_; + my $text = $leaf_query->get_text; + if ( $text =~ /\*$/ ) { + my $or_query = Lucy::Search::ORQuery->new; + for my $field ( @{ $self->get_fields } ) { + my $prefix_query = PrefixQuery->new( + field => $field, + query_string => $text, + ); + $or_query->add_child($prefix_query); + } + return $or_query; + } + else { + return $self->SUPER::expand_leaf($leaf_query); + } +} +~~~ + +Ordinarily, those asterisks would have been stripped when running tokens +through the EasyAnalyzer -- query strings containing "foo\*" would produce +TermQueries for the term "foo". Our override intercepts tokens with trailing +asterisks and processes them as PrefixQueries before `SUPER::expand_leaf` can +discard them, so that a search for "foo\*" can match "food", "foosball", and so +on. + +## Usage + +Insert our custom parser into the search.cgi sample app to get a feel for how +it behaves: + +~~~ perl +my $parser = FlatQueryParser->new( schema => $searcher->get_schema ); +my $query = $parser->parse( decode( 'UTF-8', $cgi->param('q') || '' ) ); +my $hits = $searcher->hits( + query => $query, + offset => $offset, + num_wanted => $page_size, +); +... +~~~ + http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Cookbook/FastUpdates.md ---------------------------------------------------------------------- diff --git a/core/Lucy/Docs/Cookbook/FastUpdates.md b/core/Lucy/Docs/Cookbook/FastUpdates.md new file mode 100644 index 0000000..511310a --- /dev/null +++ b/core/Lucy/Docs/Cookbook/FastUpdates.md @@ -0,0 +1,140 @@ +# Near real-time index updates + +While index updates are fast on average, worst-case update performance may be +significantly slower. To make index updates consistently quick, we must +manually intervene to control the process of index segment consolidation. + +## The problem + +Ordinarily, modifying an index is cheap. 
New data is added to new segments, +and the time to write a new segment scales more or less linearly with the +number of documents added during the indexing session. + +Deletions are also cheap most of the time, because we don't remove documents +immediately but instead mark them as deleted, and adding the deletion mark is +cheap. + +However, as new segments are added and the deletion rate for existing segments +increases, search-time performance slowly begins to degrade. At some point, +it becomes necessary to consolidate existing segments, rewriting their data +into a new segment. + +If the recycled segments are small, the time it takes to rewrite them may not +be significant. Every once in a while, though, a large amount of data must be +rewritten. + +## Procrastinating and playing catch-up + +The simplest way to force fast index updates is to avoid rewriting anything. + +Indexer relies upon [](cfish:lucy.IndexManager)'s +recycle() method to tell it which segments should be consolidated. If we +subclass IndexManager and override recycle() so that it always returns an +empty array, we get consistently quick performance: + +~~~ perl +package NoMergeManager; +use base qw( Lucy::Index::IndexManager ); +sub recycle { [] } + +package main; +my $indexer = Lucy::Index::Indexer->new( + index => '/path/to/index', + manager => NoMergeManager->new, +); +... +$indexer->commit; +~~~ + +However, we can't procrastinate forever. Eventually, we'll have to run an +ordinary, uncontrolled indexing session, potentially triggering a large +rewrite of lots of small and/or degraded segments: + +~~~ perl +my $indexer = Lucy::Index::Indexer->new( + index => '/path/to/index', + # manager => NoMergeManager->new, +); +... +$indexer->commit; +~~~ + +## Acceptable worst-case update time, slower degradation + +Never merging anything at all in the main indexing process is probably +overkill. Small segments are relatively cheap to merge; we just need to guard +against the big rewrites. 
+ +Setting a ceiling on the number of documents in the segments to be recycled +allows us to avoid a mass proliferation of tiny, single-document segments, +while still offering decent worst-case update speed: + +~~~ perl +package LightMergeManager; +use base qw( Lucy::Index::IndexManager ); + +sub recycle { + my $self = shift; + my $seg_readers = $self->SUPER::recycle(@_); + @$seg_readers = grep { $_->doc_max < 10 } @$seg_readers; + return $seg_readers; +} +~~~ + +However, we still have to consolidate every once in a while, and while that +happens content updates will be locked out. + +## Background merging + +If it's not acceptable to lock out updates while the index consolidation +process runs, the alternative is to move the consolidation process out of +band, using Lucy::Index::BackgroundMerger. + +It's never safe to have more than one Indexer attempting to modify the content +of an index at the same time, but a BackgroundMerger and an Indexer can +operate simultaneously: + +~~~ perl +# Indexing process. +use Scalar::Util qw( blessed ); +my $retries = 0; +while (1) { + eval { + my $indexer = Lucy::Index::Indexer->new( + index => '/path/to/index', + manager => LightMergeManager->new, + ); + $indexer->add_doc($doc); + $indexer->commit; + }; + last unless $@; + if ( blessed($@) and $@->isa("Lucy::Store::LockErr") ) { + # Catch LockErr. + warn "Couldn't get lock ($retries retries)"; + $retries++; + } + else { + die "Write failed: $@"; + } +} + +# Background merge process. +my $manager = Lucy::Index::IndexManager->new; +$manager->set_write_lock_timeout(60_000); +my $bg_merger = Lucy::Index::BackgroundMerger->new( + index => '/path/to/index', + manager => $manager, +); +$bg_merger->commit; +~~~ + +The exception handling code becomes useful once you have more than one index +modification process happening simultaneously. By default, Indexer tries +several times to acquire a write lock over the span of one second, then holds +it until commit() completes. 
BackgroundMerger handles most of its work +without the write lock, but it does need it briefly once at the beginning and +once again near the end. Under normal loads, the internal retry logic will +resolve conflicts, but if it's not acceptable to miss an insert, you probably +want to catch LockErr exceptions thrown by Indexer. In contrast, a LockErr +from BackgroundMerger probably just needs to be logged. + http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/DocIDs.md ---------------------------------------------------------------------- diff --git a/core/Lucy/Docs/DocIDs.md b/core/Lucy/Docs/DocIDs.md new file mode 100644 index 0000000..af696b2 --- /dev/null +++ b/core/Lucy/Docs/DocIDs.md @@ -0,0 +1,28 @@ +# Characteristics of Apache Lucy document ids. + +## Document ids are signed 32-bit integers + +Document ids in Apache Lucy start at 1. Because 0 is never a valid doc id, we +can use it as a sentinel value: + +~~~ perl +while ( my $doc_id = $posting_list->next ) { + ... +} +~~~ + +## Document ids are ephemeral + +The document ids used by Lucy are associated with a single index +snapshot. The moment an index is updated, the mapping of document ids to +documents is subject to change. + +Since IndexReader objects represent a point-in-time view of an index, document +ids are guaranteed to remain static for the life of the reader. However, +because they are not permanent, Lucy document ids cannot be used as +foreign keys to locate records in external data sources. If you truly need a +primary key field, you must define it and populate it yourself. + +Furthermore, the order of document ids does not tell you anything about the +sequence in which documents were added to the index. 
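The sentinel convention described above is easy to demonstrate without a real index. Here's a minimal pure-Perl sketch -- `MockPostingList` is a made-up stand-in, not a Lucy class -- of an iterator that follows the same contract: doc ids come back in ascending order starting from 1, and 0 signals exhaustion, so a bare `while` loop terminates cleanly:

```perl
package MockPostingList;

# Hypothetical stand-in mimicking a posting list's next() contract:
# return each doc id in ascending order, then 0 when exhausted.
sub new {
    my ( $class, @doc_ids ) = @_;
    return bless {
        doc_ids => [ sort { $a <=> $b } @doc_ids ],
        tick    => -1,
    }, $class;
}

sub next {
    my $self = shift;
    my $tick = ++$self->{tick};
    return 0 if $tick >= scalar @{ $self->{doc_ids} };
    return $self->{doc_ids}[$tick];
}

package main;

my $plist = MockPostingList->new( 44, 7, 40 );
my @seen;
while ( my $doc_id = $plist->next ) {
    push @seen, $doc_id;    # loop ends when the 0 sentinel comes back
}
print "@seen\n";            # 7 40 44
```

Because 0 is falsy in Perl, the sentinel doubles as the loop's termination condition -- which is exactly why doc id 0 must never be valid.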
+ http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/FileFormat.md ---------------------------------------------------------------------- diff --git a/core/Lucy/Docs/FileFormat.md b/core/Lucy/Docs/FileFormat.md new file mode 100644 index 0000000..c5f606c --- /dev/null +++ b/core/Lucy/Docs/FileFormat.md @@ -0,0 +1,191 @@ +# Overview of index file format + +It is not necessary to understand the current implementation details of the +index file format in order to use Apache Lucy effectively, but it may be +helpful if you are interested in tweaking for high performance, exotic usage, +or debugging and development. + +On a file system, an index is a directory. The files inside have a +hierarchical relationship: an index is made up of "segments", each of which is +an independent inverted index with its own subdirectory; each segment is made +up of several component parts. + + [index]--| + |--snapshot_XXX.json + |--schema_XXX.json + |--write.lock + | + |--seg_1--| + | |--segmeta.json + | |--cfmeta.json + | |--cf.dat-------| + | |--[lexicon] + | |--[postings] + | |--[documents] + | |--[highlight] + | |--[deletions] + | + |--seg_2--| + | |--segmeta.json + | |--cfmeta.json + | |--cf.dat-------| + | |--[lexicon] + | |--[postings] + | |--[documents] + | |--[highlight] + | |--[deletions] + | + |--[...]--| + +## Write-once philosophy + +All segment directory names consist of the string "seg\_" followed by a number +in base 36: seg_1, seg_5m, seg_p9s2 and so on, with higher numbers indicating +more recent segments. Once a segment is finished and committed, its name is +never re-used and its files are never modified. + +Old segments become obsolete and can be removed when their data has been +consolidated into new segments during the process of segment merging and +optimization. A fully-optimized index has only one segment. 
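The base-36 naming convention is simple to reproduce. This hypothetical helper (not part of Lucy's API) converts a segment's generation number to its directory name; the sample names from the paragraph above fall out directly:

```perl
# Hypothetical helper showing how base-36 segment names are formed:
# higher generation numbers always yield later-sorting names.
sub seg_name {
    my $gen    = shift;
    my @digits = ( 0 .. 9, 'a' .. 'z' );
    my $name   = '';
    do {
        $name = $digits[ $gen % 36 ] . $name;    # lowest digit first
        $gen  = int( $gen / 36 );
    } while ( $gen > 0 );
    return "seg_$name";
}

print seg_name(1),       "\n";    # seg_1
print seg_name(202),     "\n";    # seg_5m
print seg_name(1179074), "\n";    # seg_p9s2
```

Base 36 keeps directory names compact while preserving a strict ordering, so a reader can always tell at a glance which segments are most recent.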
+ +## Top-level entries + +There are a handful of "top-level" files and directories which belong to the +entire index rather than to a particular segment. + +### snapshot_XXX.json + +A "snapshot" file, e.g. `snapshot_m7p.json`, is a list of index files and +directories. Because index files, once written, are never modified, the list +of entries in a snapshot defines a point-in-time view of the data in an index. + +Like segment directories, snapshot files also utilize the +unique-base-36-number naming convention; the higher the number, the more +recent the file. The appearance of a new snapshot file within the index +directory constitutes an index update. While a new segment is being written, +new files may be added to the index directory, but until a new snapshot file +gets written, a Searcher opening the index for reading won't know about them. + +### schema_XXX.json + +The schema file is a Schema object describing the index's format, serialized +as JSON. It, too, is versioned, and a given snapshot file will reference one +and only one schema file. + +### locks + +By default, only one indexing process may safely modify the index at any given +time. Processes reserve an index by laying claim to the `write.lock` file +within the `locks/` directory. A smattering of other lock files may be used +from time to time, as well. + +## A segment's component parts + +By default, each segment has up to five logical components: lexicon, postings, +document storage, highlight data, and deletions. Binary data from these +components gets stored in virtual files within the "cf.dat" compound file; +metadata is stored in a shared "segmeta.json" file. + +### segmeta.json + +The segmeta.json file is a central repository for segment metadata. In +addition to information such as document counts and field numbers, it also +warehouses arbitrary metadata on behalf of individual index components. + +### Lexicon + +Each indexed field gets its own lexicon in each segment.
The exact files +involved depend on the field's type, but generally speaking there will be two +parts. First, there's a primary `lexicon-XXX.dat` file which houses a +complete term list associating terms with corpus frequency statistics, +postings file locations, etc. Second, one or more "lexicon index" files may +be present which contain periodic samples from the primary lexicon file to +facilitate fast lookups. + +### Postings + +"Posting" is a technical term from the field of +[information retrieval](cfish:IRTheory), defined as a single +instance of one term indexing one document. If you are looking at the index +in the back of a book, and you see that "freedom" is referenced on pages 8, +86, and 240, that would be three postings, which taken together form a +"posting list". The same terminology applies to an index in electronic form. + +Each segment has one postings file per indexed field. When a search is +performed for a single term, first that term is looked up in the lexicon. If +the term exists in the segment, the record in the lexicon will contain +information about which postings file to look at and where to look. + +The first thing any posting record tells you is a document id. By iterating +over all the postings associated with a term, you can find all the documents +that match that term, a process which is analogous to looking up page numbers +in a book's index. However, each posting record typically contains other +information in addition to document id, e.g. the positions at which the term +occurs within the field. + +### Documents + +The document storage section is a simple database, organized into two files: + +* __documents.dat__ - Serialized documents. + +* __documents.ix__ - Document storage index, a solid array of 64-bit integers + where each integer location corresponds to a document id, and the value at + that location points at a file position in the documents.dat file.
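A fixed-width offset array like documents.ix makes record lookup a constant-time operation: read slot N, then seek to the offset it holds. Here's a toy pure-Perl model of that layout -- illustration only, not Lucy's actual code, and it assumes a Perl built with 64-bit integer support for the `Q>` pack template:

```perl
# Toy model of a documents.ix-style index: a solid array of 64-bit
# big-endian integers, one per doc id, each holding the byte offset of
# that document's record in documents.dat.
my @offsets = ( 0, 0, 112, 389, 1024 );   # slot 0 unused; doc ids start at 1
my $ix_data = pack 'Q>*', @offsets;

sub doc_offset {
    my ( $ix, $doc_id ) = @_;
    # Each slot is 8 bytes, so doc id N's entry lives at byte N * 8.
    return unpack 'Q>', substr( $ix, $doc_id * 8, 8 );
}

print doc_offset( $ix_data, 3 ), "\n";    # 389
```

Because every slot is the same width, no scanning is needed -- the doc id itself is the array index, which is why document fetches stay cheap no matter how large the segment grows.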
+ +### Highlight data + +The files which store data used for excerpting and highlighting are organized +similarly to the files used to store documents. + +* __highlight.dat__ - Chunks of serialized highlight data, one per doc id. + +* __highlight.ix__ - Highlight data index -- as with the `documents.ix` file, a + solid array of 64-bit file pointers. + +### Deletions + +When a document is "deleted" from a segment, it is not actually purged right +away; it is merely marked as "deleted" via a deletions file. Deletions files +contain bit vectors with one bit for each document in the segment; if bit +\#254 is set then document 254 is deleted, and if that document turns up in a +search it will be masked out. + +It is only when a segment's contents are rewritten to a new segment during the +segment-merging process that deleted documents truly go away. + +## Compound Files + +If you peer inside an index directory, you won't actually find any files named +"documents.dat", "highlight.ix", etc. unless there is an indexing process +underway. What you will find instead is one "cf.dat" and one "cfmeta.json" +file per segment. + +To minimize the need for file descriptors at search-time, all per-segment +binary data files are concatenated together in "cf.dat" at the close of each +indexing session. Information about where each file begins and ends is stored +in `cfmeta.json`. When the segment is opened for reading, a single file +descriptor per "cf.dat" file can be shared among several readers. + +## A Typical Search + +Here's a simplified narrative, dramatizing how a search for "freedom" against +a given segment plays out: + +1. The searcher asks the relevant Lexicon Index, "Do you know anything about + 'freedom'?" Lexicon Index replies, "Can't say for sure, but if the main + Lexicon file does, 'freedom' is probably somewhere around byte 21008". + +2. The main Lexicon tells the searcher "One moment, let me scan our records... + Yes, we have 2 documents which contain 'freedom'.
You'll find them in + seg_6/postings-4.dat starting at byte 66991." + +3. The Postings file says "Yep, we have 'freedom', all right! Document id 40 + has 1 'freedom', and document 44 has 8. If you need to know more, like if any + 'freedom' is part of the phrase 'freedom of speech', ask me about positions!" + +4. If the searcher is only looking for 'freedom' in isolation, that's where it + stops. It now knows enough to assign the documents scores against "freedom", + with the 8-freedom document likely ranking higher than the single-freedom + document. + http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/IRTheory.md ---------------------------------------------------------------------- diff --git a/core/Lucy/Docs/IRTheory.md b/core/Lucy/Docs/IRTheory.md new file mode 100644 index 0000000..a9af4ed --- /dev/null +++ b/core/Lucy/Docs/IRTheory.md @@ -0,0 +1,44 @@ +# Crash course in information retrieval + +Just enough Information Retrieval theory to find your way around Apache Lucy. + +## Terminology + +Lucy uses some terminology from the field of information retrieval which +may be unfamiliar to many users. "Document" and "term" mean pretty much what +you'd expect them to, but others such as "posting" and "inverted index" need a +formal introduction: + +* _document_ - An atomic unit of retrieval. +* _term_ - An attribute which describes a document. +* _posting_ - One term indexing one document. +* _term list_ - The complete list of terms which describe a document. +* _posting list_ - The complete list of documents which a term indexes. +* _inverted index_ - A data structure which maps from terms to documents. + +Since Lucy is a practical implementation of IR theory, it loads these +abstract, distilled definitions down with useful traits. For instance, a +"posting" in its most rarefied form is simply a term-document pairing; in +Lucy, the class [](cfish:lucy.MatchPosting) fills this +role.
However, by associating additional information with a posting like the +number of times the term occurs in the document, we can turn it into a +[](cfish:lucy.ScorePosting), making it possible +to rank documents by relevance rather than just list documents which happen to +match in no particular order. + +## TF/IDF ranking algorithm + +Lucy uses a variant of the well-established "Term Frequency / Inverse +Document Frequency" weighting scheme. A thorough treatment of TF/IDF is too +ambitious for our present purposes, but in a nutshell, it means that... + +* in a search for `skate park`, documents which score well for the + comparatively rare term `skate` will rank higher than documents which score + well for the more common term `park`. + +* a 10-word text which has one occurrence each of both `skate` and `park` will + rank higher than a 1000-word text which also contains one occurrence of each. + +A web search for "tf idf" will turn up many excellent explanations of the +algorithm. + http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Tutorial.md ---------------------------------------------------------------------- diff --git a/core/Lucy/Docs/Tutorial.md b/core/Lucy/Docs/Tutorial.md new file mode 100644 index 0000000..57c66b2 --- /dev/null +++ b/core/Lucy/Docs/Tutorial.md @@ -0,0 +1,53 @@ +# Step-by-step introduction to Apache Lucy. + +Explore Apache Lucy's basic functionality by starting with a minimalist CGI +search app based on Lucy::Simple and transforming it, step by step, +into an "advanced search" interface utilizing more flexible core modules like +[](cfish:lucy.Indexer) and [](cfish:lucy.IndexSearcher). + +## Chapters + +* [](cfish:SimpleTutorial) - Build a bare-bones search app using + Lucy::Simple. + +* [](cfish:BeyondSimpleTutorial) - Rebuild the app using core + classes like [](cfish:lucy.Indexer) and + [](cfish:lucy.IndexSearcher) in place of Lucy::Simple. 
+ +* [](cfish:FieldTypeTutorial) - Experiment with different field + characteristics using subclasses of [](cfish:lucy.FieldType). + +* [](cfish:AnalysisTutorial) - Examine how the choice of + [](cfish:lucy.Analyzer) subclass affects search results. + +* [](cfish:HighlighterTutorial) - Augment search results with + highlighted excerpts. + +* [](cfish:QueryObjectsTutorial) - Unlock advanced search features + by using Query objects instead of query strings. + +## Source materials + +The source material used by the tutorial app -- a multi-text-file presentation +of the United States constitution -- can be found in the `sample` directory +at the root of the Lucy distribution, along with finished indexing and search +apps. + +~~~ perl +sample/indexer.pl # indexing app +sample/search.cgi # search app +sample/us_constitution # corpus +~~~ + +## Conventions + +The user is expected to be familiar with OO Perl and basic CGI programming. + +The code in this tutorial assumes a Unix-flavored operating system and the +Apache webserver, but will work with minor modifications on other setups. + +## See also + +More advanced and esoteric subjects are covered in [](cfish:Cookbook). + + http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Tutorial/AnalysisTutorial.md ---------------------------------------------------------------------- diff --git a/core/Lucy/Docs/Tutorial/AnalysisTutorial.md b/core/Lucy/Docs/Tutorial/AnalysisTutorial.md new file mode 100644 index 0000000..a55dd09 --- /dev/null +++ b/core/Lucy/Docs/Tutorial/AnalysisTutorial.md @@ -0,0 +1,85 @@ +# How to choose and use Analyzers. + +Try swapping out the EasyAnalyzer in our Schema for a StandardTokenizer: + +~~~ perl +my $tokenizer = Lucy::Analysis::StandardTokenizer->new; +my $type = Lucy::Plan::FullTextType->new( + analyzer => $tokenizer, +); +~~~ + +Search for `senate`, `Senate`, and `Senator` before and after making the +change and re-indexing. 
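You can also observe the difference programmatically by calling `split()` on each analyzer directly (the same Analyzer method the tutorial later uses to normalize search terms). A sketch, assuming Lucy is installed; the stemmed output depends on the English Snowball stemmer, so treat the comments as illustrative rather than guaranteed:

```perl
use strict;
use warnings;
use Lucy::Analysis::StandardTokenizer;
use Lucy::Analysis::EasyAnalyzer;

my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
my $easy      = Lucy::Analysis::EasyAnalyzer->new( language => 'en' );

# StandardTokenizer only breaks text into tokens; case is preserved,
# so "Senate" and "senate" remain distinct terms.
print join( ' ', @{ $tokenizer->split('The Senate') } ), "\n";

# EasyAnalyzer tokenizes, lowercases, and stems, so "Senate" and
# "Senator" collapse toward a common stem (e.g. "senat").
print join( ' ', @{ $easy->split('The Senate') } ), "\n";
```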
+ +Under EasyAnalyzer, the results are identical for all three searches, but +under StandardTokenizer, searches are case-sensitive, and the result sets for +`Senate` and `Senator` are distinct. + +## EasyAnalyzer + +What's happening is that EasyAnalyzer is performing more aggressive processing +than StandardTokenizer. In addition to tokenizing, it's also converting all +text to lower case so that searches are case-insensitive, and using a +"stemming" algorithm to reduce related words to a common stem (`senat`, in +this case). + +EasyAnalyzer is actually multiple Analyzers wrapped up in a single package. +In this case, it's three-in-one, since specifying an EasyAnalyzer with +`language => 'en'` is equivalent to this snippet: + +~~~ perl +my $tokenizer = Lucy::Analysis::StandardTokenizer->new; +my $normalizer = Lucy::Analysis::Normalizer->new; +my $stemmer = Lucy::Analysis::SnowballStemmer->new( language => 'en' ); +my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new( + analyzers => [ $tokenizer, $normalizer, $stemmer ], +); +~~~ + +You can add or subtract Analyzers from there if you like. Try adding a fourth +Analyzer, a SnowballStopFilter for suppressing "stopwords" like "the", "if", +and "maybe". + +~~~ perl +my $stopfilter = Lucy::Analysis::SnowballStopFilter->new( + language => 'en', +); +my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new( + analyzers => [ $tokenizer, $normalizer, $stopfilter, $stemmer ], +); +~~~ + +Also, try removing the SnowballStemmer. + +~~~ perl +my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new( + analyzers => [ $tokenizer, $normalizer ], +); +~~~ + +The original choice of a stock English EasyAnalyzer probably still yields the +best results for this document collection, but you get the idea: sometimes you +want a different Analyzer. + +## When the best Analyzer is no Analyzer + +Sometimes you don't want an Analyzer at all.
That was true for our "url" +field because we didn't need it to be searchable, but it's also true for +certain types of searchable fields. For instance, "category" fields are often +set up to match exactly or not at all, as are fields like "last_name" (because +you may not want to conflate results for "Humphrey" and "Humphries"). + +To specify that there should be no analysis performed at all, use StringType: + +~~~ perl +my $type = Lucy::Plan::StringType->new; +$schema->spec_field( name => 'category', type => $type ); +~~~ + +## Highlighting up next + +In our next tutorial chapter, [](cfish:HighlighterTutorial), +we'll add highlighted excerpts from the "content" field to our search results. + + http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Tutorial/BeyondSimpleTutorial.md ---------------------------------------------------------------------- diff --git a/core/Lucy/Docs/Tutorial/BeyondSimpleTutorial.md b/core/Lucy/Docs/Tutorial/BeyondSimpleTutorial.md new file mode 100644 index 0000000..00c8e71 --- /dev/null +++ b/core/Lucy/Docs/Tutorial/BeyondSimpleTutorial.md @@ -0,0 +1,125 @@ +# A more flexible app structure. + +## Goal + +In this tutorial chapter, we'll refactor the apps we built in +[](cfish:SimpleTutorial) so that they look exactly the same from +the end user's point of view, but offer the developer greater possibilities for +expansion. + +To achieve this, we'll ditch Lucy::Simple and replace it with the +classes that it uses internally: + +* [](cfish:lucy.Schema) - Plan out your index. +* [](cfish:lucy.FullTextType) - Field type for full text search. +* [](cfish:lucy.EasyAnalyzer) - A one-size-fits-all parser/tokenizer. +* [](cfish:lucy.Indexer) - Manipulate index content. +* [](cfish:lucy.IndexSearcher) - Search an index. +* [](cfish:lucy.Hits) - Iterate over hits returned by a Searcher. + +## Adaptations to indexer.pl + +After we load our modules...
+ +~~~ perl +use Lucy::Plan::Schema; +use Lucy::Plan::FullTextType; +use Lucy::Analysis::EasyAnalyzer; +use Lucy::Index::Indexer; +~~~ + +... the first item we're going to need is a [](cfish:lucy.Schema). + +The primary job of a Schema is to specify what fields are available and how +they're defined. We'll start off with three fields: title, content and url. + +~~~ perl +# Create Schema. +my $schema = Lucy::Plan::Schema->new; +my $easyanalyzer = Lucy::Analysis::EasyAnalyzer->new( + language => 'en', +); +my $type = Lucy::Plan::FullTextType->new( + analyzer => $easyanalyzer, +); +$schema->spec_field( name => 'title', type => $type ); +$schema->spec_field( name => 'content', type => $type ); +$schema->spec_field( name => 'url', type => $type ); +~~~ + +All of the fields are spec'd out using the "FullTextType" FieldType, +indicating that they will be searchable as "full text" -- which means that +they can be searched for individual words. The "analyzer", which is unique to +FullTextType fields, is what breaks up the text into searchable tokens. + +Next, we'll swap our Lucy::Simple object out for a Lucy::Index::Indexer. +The substitution will be straightforward because Simple has merely been +serving as a thin wrapper around an inner Indexer, and we'll just be peeling +away the wrapper. + +First, replace the constructor: + +~~~ perl +# Create Indexer. +my $indexer = Lucy::Index::Indexer->new( + index => $path_to_index, + schema => $schema, + create => 1, + truncate => 1, +); +~~~ + +Next, have the `$indexer` object `add_doc` where we were having the +`$lucy` object `add_doc` before: + +~~~ perl +foreach my $filename (@filenames) { + my $doc = parse_file($filename); + $indexer->add_doc($doc); +} +~~~ + +There's only one extra step required: at the end of the app, you must call +commit() explicitly to close the indexing session and commit your changes. +(Lucy::Simple hides this detail, calling commit() implicitly when it needs to).
+ +~~~ perl +$indexer->commit; +~~~ + +## Adaptations to search.cgi + +In our search app as in our indexing app, Lucy::Simple has served as a +thin wrapper -- this time around [](cfish:lucy.IndexSearcher) and +[](cfish:lucy.Hits). Swapping out Simple for these two classes is +also straightforward: + +~~~ perl +use Lucy::Search::IndexSearcher; + +my $searcher = Lucy::Search::IndexSearcher->new( + index => $path_to_index, +); +my $hits = $searcher->hits( # returns a Hits object, not a hit count + query => $q, + offset => $offset, + num_wanted => $page_size, +); +my $hit_count = $hits->total_hits; # get the hit count here + +... + +while ( my $hit = $hits->next ) { + ... +} +~~~ + +## Hooray! + +Congratulations! Your apps do the same thing as before... but now they'll be +easier to customize. + +In our next chapter, [](cfish:FieldTypeTutorial), we'll explore +how to assign different behaviors to different fields. + + http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Tutorial/FieldTypeTutorial.md ---------------------------------------------------------------------- diff --git a/core/Lucy/Docs/Tutorial/FieldTypeTutorial.md b/core/Lucy/Docs/Tutorial/FieldTypeTutorial.md new file mode 100644 index 0000000..fe6885a --- /dev/null +++ b/core/Lucy/Docs/Tutorial/FieldTypeTutorial.md @@ -0,0 +1,60 @@ +# Specify per-field properties and behaviors. + +The Schema we used in the last chapter specifies three fields: + +~~~ perl +my $type = Lucy::Plan::FullTextType->new( + analyzer => $polyanalyzer, +); +$schema->spec_field( name => 'title', type => $type ); +$schema->spec_field( name => 'content', type => $type ); +$schema->spec_field( name => 'url', type => $type ); +~~~ + +Since they are all defined as "full text" fields, they are all searchable -- +including the `url` field, a dubious choice.
Some URLs contain meaningful +information, but these don't, really: + + http://example.com/us_constitution/amend1.txt + +We may as well not bother indexing the URL content. To achieve that we need +to assign the `url` field to a different FieldType. + +## StringType + +Instead of FullTextType, we'll use a +[](cfish:lucy.StringType), which doesn't use an +Analyzer to break up text into individual tokens. Furthermore, we'll mark +this StringType as unindexed, so that its content won't be searchable at all. + +~~~ perl +my $url_type = Lucy::Plan::StringType->new( indexed => 0 ); +$schema->spec_field( name => 'url', type => $url_type ); +~~~ + +To observe the change in behavior, try searching for `us_constitution` both +before and after changing the Schema and re-indexing. + +## Toggling 'stored' + +For a taste of other FieldType possibilities, try turning off `stored` for +one or more fields. + +~~~ perl +my $content_type = Lucy::Plan::FullTextType->new( + analyzer => $polyanalyzer, + stored => 0, +); +~~~ + +Turning off `stored` for either `title` or `url` mangles our results page, +but since we're not displaying `content`, turning it off for `content` has +no effect -- except on index size. + +## Analyzers up next + +Analyzers play a crucial role in the behavior of FullTextType fields. In our +next tutorial chapter, [](cfish:AnalysisTutorial), we'll see how +changing up the Analyzer changes search results. + + http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Tutorial/HighlighterTutorial.md ---------------------------------------------------------------------- diff --git a/core/Lucy/Docs/Tutorial/HighlighterTutorial.md b/core/Lucy/Docs/Tutorial/HighlighterTutorial.md new file mode 100644 index 0000000..857ee01 --- /dev/null +++ b/core/Lucy/Docs/Tutorial/HighlighterTutorial.md @@ -0,0 +1,62 @@ +# Augment search results with highlighted excerpts.
+ +Adding relevant excerpts with highlighted search terms to your search results +display makes it much easier for end users to scan the page and assess which +hits look promising, dramatically improving their search experience. + +## Adaptations to indexer.pl + +[](cfish:lucy.Highlighter) uses information generated at index +time. To save resources, highlighting is disabled by default and must be +turned on for individual fields. + +~~~ perl +my $highlightable = Lucy::Plan::FullTextType->new( + analyzer => $polyanalyzer, + highlightable => 1, +); +$schema->spec_field( name => 'content', type => $highlightable ); +~~~ + +## Adaptations to search.cgi + +To add highlighting and excerpting to the search.cgi sample app, create a +`$highlighter` object outside the hits iterating loop... + +~~~ perl +my $highlighter = Lucy::Highlight::Highlighter->new( + searcher => $searcher, + query => $q, + field => 'content' +); +~~~ + +... then modify the loop and the per-hit display to generate and include the +excerpt. + +~~~ perl +# Create result list. +my $report = ''; +while ( my $hit = $hits->next ) { + my $score = sprintf( "%0.3f", $hit->get_score ); + my $excerpt = $highlighter->create_excerpt($hit); + $report .= qq| + <p> + <a href="$hit->{url}"><strong>$hit->{title}</strong></a> + <em>$score</em> + <br /> + $excerpt + <br /> + <span class="excerptURL">$hit->{url}</span> + </p> + |; +} +~~~ + +## Next chapter: Query objects + +Our next tutorial chapter, [](cfish:QueryObjectsTutorial), +illustrates how to build an "advanced search" interface using +[](cfish:lucy.Query) objects instead of query strings. 
+ + http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Tutorial/QueryObjectsTutorial.md ---------------------------------------------------------------------- diff --git a/core/Lucy/Docs/Tutorial/QueryObjectsTutorial.md b/core/Lucy/Docs/Tutorial/QueryObjectsTutorial.md new file mode 100644 index 0000000..53d4cea --- /dev/null +++ b/core/Lucy/Docs/Tutorial/QueryObjectsTutorial.md @@ -0,0 +1,185 @@ +# Use Query objects instead of query strings. + +Until now, our search app has had only a single search box. In this tutorial +chapter, we'll move towards an "advanced search" interface, by adding a +"category" drop-down menu. Three new classes will be required: + +* [](cfish:lucy.QueryParser) - Turn a query string into a + [](cfish:lucy.Query) object. + +* [](cfish:lucy.TermQuery) - Query for a specific term within + a specific field. + +* [](cfish:lucy.ANDQuery) - "AND" together multiple Query +objects to produce an intersected result set. + +## Adaptations to indexer.pl + +Our new "category" field will be a StringType field rather than a FullTextType +field, because we will only be looking for exact matches. It needs to be +indexed, but since we won't display its value, it doesn't need to be stored. + +~~~ perl +my $cat_type = Lucy::Plan::StringType->new( stored => 0 ); +$schema->spec_field( name => 'category', type => $cat_type ); +~~~ + +There will be three possible values: "article", "amendment", and "preamble", +which we'll hack out of the source file's name during our `parse_file` +subroutine: + +~~~ perl +my $category + = $filename =~ /art/ ? 'article' + : $filename =~ /amend/ ? 'amendment' + : $filename =~ /preamble/ ? 
'preamble' + : die "Can't derive category for $filename"; +return { + title => $title, + content => $bodytext, + url => "/us_constitution/$filename", + category => $category, +}; +~~~ + +## Adaptations to search.cgi + +The "category" constraint will be added to our search interface using an HTML +"select" element (this routine will need to be integrated into the HTML +generation section of search.cgi): + +~~~ perl +# Build up the HTML "select" object for the "category" field. +sub generate_category_select { + my $cat = shift; + my $select = qq| + <select name="category"> + <option value="">All Sections</option> + <option value="article">Articles</option> + <option value="amendment">Amendments</option> + </select>|; + if ($cat) { + $select =~ s/"$cat"/"$cat" selected/; + } + return $select; +} +~~~ + +We'll start off by loading our new modules and extracting our new CGI +parameter. + +~~~ perl +use Lucy::Search::QueryParser; +use Lucy::Search::TermQuery; +use Lucy::Search::ANDQuery; + +... + +my $category = decode( "UTF-8", $cgi->param('category') || '' ); +~~~ + +QueryParser's constructor requires a "schema" argument. We can get that from +our IndexSearcher: + +~~~ perl +# Create an IndexSearcher and a QueryParser. +my $searcher = Lucy::Search::IndexSearcher->new( + index => $path_to_index, +); +my $qparser = Lucy::Search::QueryParser->new( + schema => $searcher->get_schema, +); +~~~ + +Previously, we have been handing raw query strings to IndexSearcher. Behind +the scenes, IndexSearcher has been using a QueryParser to turn those query +strings into Query objects. Now, we will bring QueryParser into the +foreground and parse the strings explicitly. + +~~~ perl +my $query = $qparser->parse($q); +~~~ + +If the user has specified a category, we'll use an ANDQuery to join our parsed +query together with a TermQuery representing the category. 
+ +~~~ perl +if ($category) { + my $category_query = Lucy::Search::TermQuery->new( + field => 'category', + term => $category, + ); + $query = Lucy::Search::ANDQuery->new( + children => [ $query, $category_query ] + ); +} +~~~ + +Now when we execute the query... + +~~~ perl +# Execute the Query and get a Hits object. +my $hits = $searcher->hits( + query => $query, + offset => $offset, + num_wanted => $page_size, +); +~~~ + +... we'll get a result set which is the intersection of the parsed query and +the category query. + +## Using TermQuery with full text fields + +When querying full text fields, the easiest way is to create query objects +using QueryParser. But sometimes you want to create TermQuery for a single +term in a FullTextType field directly. In this case, we have to run the +search term through the field's analyzer to make sure it gets normalized in +the same way as the field's content. + +~~~ perl +sub make_term_query { + my ($field, $term) = @_; + + my $token; + my $type = $schema->fetch_type($field); + + if ( $type->isa('Lucy::Plan::FullTextType') ) { + # Run the term through the full text analysis chain. + my $analyzer = $type->get_analyzer; + my $tokens = $analyzer->split($term); + + if ( @$tokens != 1 ) { + # If the term expands to more than one token, or no + # tokens at all, it will never match a token in the + # full text field. + return Lucy::Search::NoMatchQuery->new; + } + + $token = $tokens->[0]; + } + else { + # Exact match for other types. + $token = $term; + } + + return Lucy::Search::TermQuery->new( + field => $field, + term => $token, + ); +} +~~~ + +## Congratulations! + +You've made it to the end of the tutorial. + +## See Also + +For additional thematic documentation, see the Apache Lucy +[](cfish:Cookbook). + +ANDQuery has a companion class, [](cfish:lucy.ORQuery), and a +close relative, [](cfish:lucy.RequiredOptionalQuery). 
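As a hypothetical usage of the `make_term_query()` helper above (assuming the tutorial's `$schema` and `$searcher` are in scope), the same call works for both field types, and the resulting TermQueries combine exactly like the parsed query and category query earlier in the chapter:

```perl
# 'title' is a FullTextType field, so its term is run through the
# analysis chain; 'category' is a StringType field and matches exactly.
my $title_query    = make_term_query( 'title',    'Senate'    );
my $category_query = make_term_query( 'category', 'amendment' );

# Intersect the two, just as with the parsed query + category query.
my $query = Lucy::Search::ANDQuery->new(
    children => [ $title_query, $category_query ],
);
my $hits = $searcher->hits( query => $query );
```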
+ + http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/core/Lucy/Docs/Tutorial/SimpleTutorial.md ---------------------------------------------------------------------- diff --git a/core/Lucy/Docs/Tutorial/SimpleTutorial.md b/core/Lucy/Docs/Tutorial/SimpleTutorial.md new file mode 100644 index 0000000..83883e7 --- /dev/null +++ b/core/Lucy/Docs/Tutorial/SimpleTutorial.md @@ -0,0 +1,298 @@ +# Bare-bones search app. + +## Setup + +Copy the text presentation of the US Constitution from the `sample` directory +of the Apache Lucy distribution to the base level of your web server's +`htdocs` directory. + + $ cp -R sample/us_constitution /usr/local/apache2/htdocs/ + +## Indexing: indexer.pl + +Our first task will be to create an application called `indexer.pl` which +builds a searchable "inverted index" from a collection of documents. + +After we specify some configuration variables and load all necessary +modules... + +~~~ perl +#!/usr/local/bin/perl +use strict; +use warnings; + +# (Change configuration variables as needed.) +my $path_to_index = '/path/to/index'; +my $uscon_source = '/usr/local/apache2/htdocs/us_constitution'; + +use Lucy::Simple; +use File::Spec::Functions qw( catfile ); +~~~ + +... we'll start by creating a Lucy::Simple object, telling it where we'd +like the index to be located and the language of the source material. + +~~~ perl +my $lucy = Lucy::Simple->new( + path => $path_to_index, + language => 'en', +); +~~~ + +Next, we'll add a subroutine which parses our sample documents. + +~~~ perl +# Parse a file from our US Constitution collection and return a hashref with +# the fields title, content, and url.
+sub parse_file { + my $filename = shift; + my $filepath = catfile( $uscon_source, $filename ); + open( my $fh, '<', $filepath ) or die "Can't open '$filepath': $!"; + my $text = do { local $/; <$fh> }; # slurp file content + $text =~ /\A(.+?)^\s+(.*)/ms + or die "Can't extract title/bodytext from '$filepath'"; + my $title = $1; + my $bodytext = $2; + return { + title => $title, + content => $bodytext, + url => "/us_constitution/$filename", + }; +} +~~~ + +Add some elementary directory reading code... + +~~~ perl +# Collect names of source files. +opendir( my $dh, $uscon_source ) + or die "Couldn't opendir '$uscon_source': $!"; +my @filenames = grep { $_ =~ /\.txt/ } readdir $dh; +~~~ + +... and now we're ready for the meat of indexer.pl -- which occupies exactly +one line of code. + +~~~ perl +foreach my $filename (@filenames) { + my $doc = parse_file($filename); + $lucy->add_doc($doc); # ta-da! +} +~~~ + +## Search: search.cgi + +As with our indexing app, the bulk of the code in our search script won't be +Lucy-specific. + +The beginning is dedicated to CGI processing and configuration. + +~~~ perl +#!/usr/local/bin/perl -T +use strict; +use warnings; + +# (Change configuration variables as needed.) +my $path_to_index = '/path/to/index'; + +use CGI; +use List::Util qw( max min ); +use POSIX qw( ceil ); +use Encode qw( decode ); +use Lucy::Simple; + +my $cgi = CGI->new; +my $q = decode( "UTF-8", $cgi->param('q') || '' ); +my $offset = decode( "UTF-8", $cgi->param('offset') || 0 ); +my $page_size = 10; +~~~ + +Once that's out of the way, we create our Lucy::Simple object and feed +it a query string. + +~~~ perl +my $lucy = Lucy::Simple->new( + path => $path_to_index, + language => 'en', +); +my $hit_count = $lucy->search( + query => $q, + offset => $offset, + num_wanted => $page_size, +); +~~~ + +The value returned by search() is the total number of documents in the +collection which matched the query. 
We'll show this hit count to the user, +and also use it in conjunction with the parameters `offset` and `num_wanted` +to break up results into "pages" of manageable size. + +Calling search() on our Simple object turns it into an iterator. Invoking +next() now returns hits one at a time as [](cfish:lucy.HitDoc) +objects, starting with the most relevant. + +~~~ perl +# Create result list. +my $report = ''; +while ( my $hit = $lucy->next ) { + my $score = sprintf( "%0.3f", $hit->get_score ); + $report .= qq| + <p> + <a href="$hit->{url}"><strong>$hit->{title}</strong></a> + <em>$score</em> + <br> + <span class="excerptURL">$hit->{url}</span> + </p> + |; +} +~~~ + +The rest of the script is just text wrangling. + +~~~ perl5 +#---------------------------------------------------------------# +# No tutorial material below this point - just html generation. # +#---------------------------------------------------------------# + +# Generate paging links and hit count, print and exit. +my $paging_links = generate_paging_info( $q, $hit_count ); +blast_out_content( $q, $report, $paging_links ); + +# Create html fragment with links for paging through results n-at-a-time. +sub generate_paging_info { + my ( $query_string, $total_hits ) = @_; + my $escaped_q = CGI::escapeHTML($query_string); + my $paging_info; + if ( !length $query_string ) { + # No query? No display. + $paging_info = ''; + } + elsif ( $total_hits == 0 ) { + # Alert the user that their search failed. + $paging_info + = qq|<p>No matches for <strong>$escaped_q</strong></p>|; + } + else { + # Calculate the nums for the first and last hit to display. + my $last_result = min( ( $offset + $page_size ), $total_hits ); + my $first_result = min( ( $offset + 1 ), $last_result ); + + # Display the result nums, start paging info. + $paging_info = qq| + <p> + Results <strong>$first_result-$last_result</strong> + of <strong>$total_hits</strong> + for <strong>$escaped_q</strong>. 
+ </p> + <p> + Results Page: + |; + + # Calculate first and last hits pages to display / link to. + my $current_page = int( $first_result / $page_size ) + 1; + my $last_page = ceil( $total_hits / $page_size ); + my $first_page = max( 1, ( $current_page - 9 ) ); + $last_page = min( $last_page, ( $current_page + 10 ) ); + + # Create a url for use in paging links. + my $href = $cgi->url( -relative => 1 ); + $href .= "?q=" . CGI::escape($query_string); + $href .= ";offset=" . CGI::escape($offset); + + # Generate the "Prev" link. + if ( $current_page > 1 ) { + my $new_offset = ( $current_page - 2 ) * $page_size; + $href =~ s/(?<=offset=)\d+/$new_offset/; + $paging_info .= qq|<a href="$href"><= Prev</a>\n|; + } + + # Generate paging links. + for my $page_num ( $first_page .. $last_page ) { + if ( $page_num == $current_page ) { + $paging_info .= qq|$page_num \n|; + } + else { + my $new_offset = ( $page_num - 1 ) * $page_size; + $href =~ s/(?<=offset=)\d+/$new_offset/; + $paging_info .= qq|<a href="$href">$page_num</a>\n|; + } + } + + # Generate the "Next" link. + if ( $current_page != $last_page ) { + my $new_offset = $current_page * $page_size; + $href =~ s/(?<=offset=)\d+/$new_offset/; + $paging_info .= qq|<a href="$href">Next =></a>\n|; + } + + # Close tag. + $paging_info .= "</p>\n"; + } + + return $paging_info; +} + +# Print content to output. 
+sub blast_out_content { + my ( $query_string, $hit_list, $paging_info ) = @_; + my $escaped_q = CGI::escapeHTML($query_string); + binmode( STDOUT, ":encoding(UTF-8)" ); + print qq|Content-type: text/html; charset=UTF-8\n\n|; + print qq| +<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" + "http://www.w3.org/TR/html4/loose.dtd"> +<html> +<head> + <meta http-equiv="Content-type" + content="text/html;charset=UTF-8"> + <link rel="stylesheet" type="text/css" + href="/us_constitution/uscon.css"> + <title>Lucy: $escaped_q</title> +</head> + +<body> + + <div id="navigation"> + <form id="usconSearch" action=""> + <strong> + Search the + <a href="/us_constitution/index.html">US Constitution</a>: + </strong> + <input type="text" name="q" id="q" value="$escaped_q"> + <input type="submit" value="=>"> + </form> + </div><!--navigation--> + + <div id="bodytext"> + + $hit_list + + $paging_info + + <p style="font-size: smaller; color: #666"> + <em> + Powered by <a href="http://lucy.apache.org/" + >Apache Lucy<small><sup>TM</sup></small></a> + </em> + </p> + </div><!--bodytext--> + +</body> + +</html> +|; +} +~~~ + +## OK... now what? + +Lucy::Simple is perfectly adequate for some tasks, but it's not very flexible. +Many people find that it doesn't do at least one or two things they can't live +without. + +In our next tutorial chapter, +[](cfish:BeyondSimpleTutorial), we'll rewrite our +indexing and search scripts using the classes that Lucy::Simple hides +from view, opening up the possibilities for expansion; then, we'll spend the +rest of the tutorial chapters exploring these possibilities. 
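The paging logic in `generate_paging_info()` above boils down to a couple of arithmetic identities: a paging link's offset is `(page - 1) * page_size`, and the page containing a given offset is `int((offset + 1) / page_size) + 1`. A self-contained sketch of that arithmetic (the subroutine names are ours, not part of the sample app):

```perl
use strict;
use warnings;
use POSIX qw( ceil );

my $page_size = 10;

# Offset used in a paging link for a given 1-based page number.
sub offset_for_page {
    my $page_num = shift;
    return ( $page_num - 1 ) * $page_size;
}

# Page number containing a given 0-based offset, mirroring the
# $current_page computation in search.cgi.
sub page_for_offset {
    my $offset = shift;
    return int( ( $offset + 1 ) / $page_size ) + 1;
}

# Total number of result pages for a hit count.
sub page_count {
    my $total_hits = shift;
    return ceil( $total_hits / $page_size );
}

print offset_for_page(3), "\n";     # 20
print page_for_offset(20), "\n";    # 3
print page_count(95), "\n";         # 10
```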
+
http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/perl/lib/Lucy/Docs/Cookbook.pod
----------------------------------------------------------------------
diff --git a/perl/lib/Lucy/Docs/Cookbook.pod b/perl/lib/Lucy/Docs/Cookbook.pod
deleted file mode 100644
index 6726db9..0000000
--- a/perl/lib/Lucy/Docs/Cookbook.pod
+++ /dev/null
@@ -1,61 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements. See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License. You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-=head1 NAME
-
-Lucy::Docs::Cookbook - Apache Lucy recipes.
-
-=head1 DESCRIPTION
-
-The Cookbook provides thematic documentation covering some of Apache Lucy's
-more sophisticated features. For a step-by-step introduction to Lucy,
-see L<Lucy::Docs::Tutorial>.
-
-=head2 Chapters
-
-=over
-
-=item *
-
-L<Lucy::Docs::Cookbook::FastUpdates> - While index updates are fast on
-average, worst-case update performance may be significantly slower. To make
-index updates consistently quick, we must manually intervene to control the
-process of index segment consolidation.
-
-=item *
-
-L<Lucy::Docs::Cookbook::CustomQuery> - Explore Lucy's support for
-custom query types by creating a "PrefixQuery" class to handle trailing
-wildcards.
-
-=item *
-
-L<Lucy::Docs::Cookbook::CustomQueryParser> - Define your own custom
-search query syntax using Lucy::Search::QueryParser and
-L<Parse::RecDescent>.
-
-=back
-
-=head2 Materials
-
-Some of the recipes in the Cookbook reference the completed
-L<Tutorial|Lucy::Docs::Tutorial> application. These materials can be
-found in the C<sample> directory at the root of the Lucy distribution:
-
-    sample/indexer.pl        # indexing app
-    sample/search.cgi        # search app
-    sample/us_constitution   # corpus
-
-

http://git-wip-us.apache.org/repos/asf/lucy/blob/5618020f/perl/lib/Lucy/Docs/Cookbook/CustomQuery.pod
----------------------------------------------------------------------
diff --git a/perl/lib/Lucy/Docs/Cookbook/CustomQuery.pod b/perl/lib/Lucy/Docs/Cookbook/CustomQuery.pod
deleted file mode 100644
index 2c78bf1..0000000
--- a/perl/lib/Lucy/Docs/Cookbook/CustomQuery.pod
+++ /dev/null
@@ -1,320 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements. See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License. You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-=head1 NAME
-
-Lucy::Docs::Cookbook::CustomQuery - Sample subclass of Query.
-
-=head1 ABSTRACT
-
-Explore Apache Lucy's support for custom query types by creating a
-"PrefixQuery" class to handle trailing wildcards.
-
-    my $prefix_query = PrefixQuery->new(
-        field        => 'content',
-        query_string => 'foo*',
-    );
-    my $hits = $searcher->hits( query => $prefix_query );
-    ...
-
-=head1 Query, Compiler, and Matcher
-
-To add support for a new query type, we need three classes: a Query, a
-Compiler, and a Matcher.
-
-=over
-
-=item *
-
-PrefixQuery - a subclass of L<Lucy::Search::Query>, and the only class
-that client code will deal with directly.
-
-=item *
-
-PrefixCompiler - a subclass of L<Lucy::Search::Compiler>, whose primary
-role is to compile a PrefixQuery to a PrefixMatcher.
-
-=item *
-
-PrefixMatcher - a subclass of L<Lucy::Search::Matcher>, which does the
-heavy lifting: it applies the query to individual documents and assigns a
-score to each match.
-
-=back
-
-The PrefixQuery class on its own isn't enough because a Query object's role is
-limited to expressing an abstract specification for the search. A Query is
-basically nothing but metadata; execution is left to the Query's companion
-Compiler and Matcher.
-
-Here's a simplified sketch illustrating how a Searcher's hits() method ties
-together the three classes.
-
-    sub hits {
-        my ( $self, $query ) = @_;
-        my $compiler = $query->make_compiler(
-            searcher => $self,
-            boost    => $query->get_boost,
-        );
-        my $matcher = $compiler->make_matcher(
-            reader     => $self->get_reader,
-            need_score => 1,
-        );
-        my @hits = $matcher->capture_hits;
-        return \@hits;
-    }
-
-=head2 PrefixQuery
-
-Our PrefixQuery class will have two attributes: a query string and a field
-name.
-
-    package PrefixQuery;
-    use base qw( Lucy::Search::Query );
-    use Carp;
-    use Scalar::Util qw( blessed );
-
-    # Inside-out member vars and hand-rolled accessors.
-    my %query_string;
-    my %field;
-    sub get_query_string { my $self = shift; return $query_string{$$self} }
-    sub get_field        { my $self = shift; return $field{$$self} }
-
-PrefixQuery's constructor collects and validates the attributes.
-
-    sub new {
-        my ( $class, %args ) = @_;
-        my $query_string = delete $args{query_string};
-        my $field        = delete $args{field};
-        my $self         = $class->SUPER::new(%args);
-        confess("'query_string' param is required")
-            unless defined $query_string;
-        confess("Invalid query_string: '$query_string'")
-            unless $query_string =~ /\*\s*$/;
-        confess("'field' param is required")
-            unless defined $field;
-        $query_string{$$self} = $query_string;
-        $field{$$self}        = $field;
-        return $self;
-    }
-
-Since this is an inside-out class, we'll need a destructor:
-
-    sub DESTROY {
-        my $self = shift;
-        delete $query_string{$$self};
-        delete $field{$$self};
-        $self->SUPER::DESTROY;
-    }
-
-The equals() method determines whether two Queries are logically equivalent:
-
-    sub equals {
-        my ( $self, $other ) = @_;
-        return 0 unless blessed($other);
-        return 0 unless $other->isa("PrefixQuery");
-        return 0 unless $field{$$self} eq $field{$$other};
-        return 0 unless $query_string{$$self} eq $query_string{$$other};
-        return 1;
-    }
-
-The last thing we'll need is a make_compiler() factory method which kicks out
-a subclass of L<Compiler|Lucy::Search::Compiler>.
-
-    sub make_compiler {
-        my ( $self, %args ) = @_;
-        my $subordinate = delete $args{subordinate};
-        my $compiler = PrefixCompiler->new( %args, parent => $self );
-        $compiler->normalize unless $subordinate;
-        return $compiler;
-    }
-
-=head2 PrefixCompiler
-
-PrefixQuery's make_compiler() method will be called internally at search-time
-by objects which subclass L<Lucy::Search::Searcher> -- such as
-L<IndexSearchers|Lucy::Search::IndexSearcher>.
-
-A Searcher is associated with a particular collection of documents. These
-documents may all reside in one index, as with IndexSearcher, or they may be
-spread out across multiple indexes on one or more machines, as with
-L<LucyX::Remote::ClusterSearcher>.
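The division of labor described above, Query as pure metadata, Compiler as the search-time plan, Matcher as the per-segment iterator, can be sketched outside of Lucy. The following is an illustrative Python analogy only: the class and method names loosely mirror the Perl sketch, and the plain dict standing in for a Lexicon plus PostingLists is invented for the example, not Lucy's API.

```python
class Query:
    """Abstract search specification: nothing but metadata."""
    def __init__(self, boost=1.0):
        self.boost = boost

    def make_compiler(self, searcher):
        raise NotImplementedError

class PrefixQuery(Query):
    def __init__(self, field, query_string):
        if not query_string.endswith("*"):
            raise ValueError("query_string must end with '*'")
        super().__init__()
        self.field = field
        self.query_string = query_string

    def make_compiler(self, searcher):
        return PrefixCompiler(parent=self, searcher=searcher)

class PrefixCompiler:
    """Binds a Query to a concrete document collection."""
    def __init__(self, parent, searcher):
        self.parent = parent
        self.searcher = searcher

    def make_matcher(self, index):
        # `index` maps term -> doc-id list, a toy stand-in for a
        # Lexicon plus its PostingLists.
        prefix = self.parent.query_string.rstrip("* ")
        doc_ids = sorted({doc
                          for term, docs in index.items()
                          if term.startswith(prefix)
                          for doc in docs})
        return PrefixMatcher(doc_ids)

class PrefixMatcher:
    """Iterates matching doc ids in ascending order; 0 = exhausted."""
    def __init__(self, doc_ids):
        self.doc_ids = doc_ids
        self.tick = -1

    def next(self):
        self.tick += 1
        return self.doc_ids[self.tick] if self.tick < len(self.doc_ids) else 0

    def score(self):
        return 1.0  # fixed score, as in the cookbook demo
```

The point of the layering is the same as in Lucy: the Query can be built long before any index exists, the Compiler is created per search against a particular searcher, and the Matcher is created per segment and merely walks doc ids.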
-
-Searcher objects have access to certain statistical information about the
-collections they represent; for instance, a Searcher can tell you how many
-documents are in the collection...
-
-    my $maximum_number_of_docs_in_collection = $searcher->doc_max;
-
-... or how many documents a specific term appears in:
-
-    my $term_appears_in_this_many_docs = $searcher->doc_freq(
-        field => 'content',
-        term  => 'foo',
-    );
-
-Such information can be used by sophisticated Compiler implementations to
-assign more or less heft to individual queries or sub-queries. However, we're
-not going to bother with weighting for this demo; we'll just assign a fixed
-score of 1.0 to each matching document.
-
-We don't need to write a constructor, as it will suffice to inherit new() from
-Lucy::Search::Compiler. The only method we need to implement for
-PrefixCompiler is make_matcher().
-
-    package PrefixCompiler;
-    use base qw( Lucy::Search::Compiler );
-
-    sub make_matcher {
-        my ( $self, %args ) = @_;
-        my $seg_reader = $args{reader};
-
-        # Retrieve low-level components LexiconReader and PostingListReader.
-        my $lex_reader
-            = $seg_reader->obtain("Lucy::Index::LexiconReader");
-        my $plist_reader
-            = $seg_reader->obtain("Lucy::Index::PostingListReader");
-
-        # Acquire a Lexicon and seek it to our query string, minus the
-        # trailing wildcard and any trailing whitespace.
-        my $substring = $self->get_parent->get_query_string;
-        $substring =~ s/\*\s*$//;
-        my $field   = $self->get_parent->get_field;
-        my $lexicon = $lex_reader->lexicon( field => $field );
-        return unless $lexicon;
-        $lexicon->seek($substring);
-
-        # Accumulate PostingLists for each matching term.
-        my @posting_lists;
-        while ( defined( my $term = $lexicon->get_term ) ) {
-            last unless $term =~ /^\Q$substring/;
-            my $posting_list = $plist_reader->posting_list(
-                field => $field,
-                term  => $term,
-            );
-            if ($posting_list) {
-                push @posting_lists, $posting_list;
-            }
-            last unless $lexicon->next;
-        }
-        return unless @posting_lists;
-
-        return PrefixMatcher->new( posting_lists => \@posting_lists );
-    }
-
-PrefixCompiler gets access to a L<SegReader|Lucy::Index::SegReader>
-object when make_matcher() gets called. From the SegReader and its
-sub-components L<LexiconReader|Lucy::Index::LexiconReader> and
-L<PostingListReader|Lucy::Index::PostingListReader>, we acquire a
-L<Lexicon|Lucy::Index::Lexicon>, scan through the Lexicon's unique
-terms, and acquire a L<PostingList|Lucy::Index::PostingList> for each
-term that matches our prefix.
-
-Each of these PostingList objects represents a set of documents which match
-the query.
-
-=head2 PrefixMatcher
-
-The Matcher subclass is the most involved.
-
-    package PrefixMatcher;
-    use base qw( Lucy::Search::Matcher );
-
-    # Inside-out member vars.
-    my %doc_ids;
-    my %tick;
-
-    sub new {
-        my ( $class, %args ) = @_;
-        my $posting_lists = delete $args{posting_lists};
-        my $self          = $class->SUPER::new(%args);
-
-        # Cheesy but simple way of interleaving PostingList doc sets.
-        my %all_doc_ids;
-        for my $posting_list (@$posting_lists) {
-            while ( my $doc_id = $posting_list->next ) {
-                $all_doc_ids{$doc_id} = undef;
-            }
-        }
-        my @doc_ids = sort { $a <=> $b } keys %all_doc_ids;
-        $doc_ids{$$self} = \@doc_ids;
-
-        # Track our position within the array of doc ids.
-        $tick{$$self} = -1;
-
-        return $self;
-    }
-
-    sub DESTROY {
-        my $self = shift;
-        delete $doc_ids{$$self};
-        delete $tick{$$self};
-        $self->SUPER::DESTROY;
-    }
-
-The doc ids must be in order, or some will be ignored; hence the C<sort>
-above.
-
-In addition to the constructor and destructor, there are three methods that
-must be overridden.
-
-next() advances the Matcher to the next valid matching doc.
-
-    sub next {
-        my $self    = shift;
-        my $doc_ids = $doc_ids{$$self};
-        my $tick    = ++$tick{$$self};
-        return 0 if $tick >= scalar @$doc_ids;
-        return $doc_ids->[$tick];
-    }
-
-get_doc_id() returns the current document id, or 0 if the Matcher is
-exhausted. (L<Document numbers|Lucy::Docs::DocIDs> start at 1, so 0 is
-a sentinel.)
-
-    sub get_doc_id {
-        my $self    = shift;
-        my $tick    = $tick{$$self};
-        my $doc_ids = $doc_ids{$$self};
-        return $tick < scalar @$doc_ids ? $doc_ids->[$tick] : 0;
-    }
-
-score() conveys the relevance score of the current match. We'll just return a
-fixed score of 1.0:
-
-    sub score { 1.0 }
-
-=head1 Usage
-
-To get a basic feel for PrefixQuery, insert the FlatQueryParser module
-described in L<Lucy::Docs::Cookbook::CustomQueryParser> (which supports
-PrefixQuery) into the search.cgi sample app.
-
-    my $parser = FlatQueryParser->new( schema => $searcher->get_schema );
-    my $query  = $parser->parse($q);
-
-If you're planning on using PrefixQuery in earnest, though, you may want to
-change up analyzers to avoid stemming, because stemming -- another approach to
-prefix conflation -- is not perfectly compatible with prefix searches.
-
-    # Polyanalyzer with no SnowballStemmer.
-    my $analyzer = Lucy::Analysis::PolyAnalyzer->new(
-        analyzers => [
-            Lucy::Analysis::StandardTokenizer->new,
-            Lucy::Analysis::Normalizer->new,
-        ],
-    );
-
-=cut
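The warning above about stemming deserves a concrete illustration: a prefix query scans the lexicon for indexed terms, and a stemmer rewrites words before they are indexed, so the prefix the user types may no longer exist in the lexicon. The sketch below uses a toy suffix-chopping function invented for the example (it is not SnowballStemmer, and the terms are illustrative):

```python
def crude_stem(word):
    # Toy stand-in for a real stemmer: chop a few common suffixes.
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# What lands in the lexicon once a stemmer runs at index time:
indexed_terms = {crude_stem(w) for w in ["running", "runs", "runner"]}

# A prefix query for "runni*" scans for terms starting with "runni",
# but the stemmer already rewrote "running" to "run", so nothing matches.
matches = [t for t in sorted(indexed_terms) if t.startswith("runni")]
```

With stemming disabled, "running" is indexed verbatim and "runni*" finds it; with stemming enabled, only prefixes of the stemmed forms (here "run*") would match, which is why the cookbook suggests a PolyAnalyzer without the SnowballStemmer.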
