Paul,

The terminology can be clarified a bit. In search terminology, a "corpus" is a set of "documents," each associated with a set of representative keywords called "terms," and one seeks to locate documents based on a query over terms. This terminology is quite general, so a "document" could be a book, a web page, a paragraph, or even a sentence or line of text (although that last doesn't make much sense). Similarly, a "term" could be understood in all the senses in which you use "token." I think "term" is the more abstract word: "token" reminds me of lexing, and one can apply various transformations to lexed tokens (including discarding stop words) to arrive at the final set of terms for a document.
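
That token-to-term pipeline (lex, normalize, drop stop words) can be sketched in a few lines; the stop list and regex here are invented for illustration:

```perl
use strict;
use warnings;

# From raw text to final terms: lex into tokens, normalize case,
# drop stop words.  (The stop list is just for illustration.)
my %stop = map { $_ => 1 } qw(the a an and or of to in);

sub terms {
    my @tokens = $_[0] =~ /\b[A-Za-z]+\b/g;      # lexing step
    return grep { !$stop{$_} } map { lc } @tokens;
}

my @t = terms('The quick brown fox and the lazy dog');
# @t is ('quick', 'brown', 'fox', 'lazy', 'dog')
```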

In this sense, your bit vectors form an inverted index mapping terms to documents, and in this implementation term positions and term frequencies within documents are not preserved. Rather, the model stores only whether a term does or does not occur in a given document. This is called a "Boolean model," as opposed to a "vector model" (as in Search::VectorSpace). The vector model generalizes the Boolean model: terms within a document are assigned fractional weights based on term frequencies, rather than the Boolean 1 or 0 that encodes mere presence or absence. When a document is further understood as a list of terms (rather than a set), you can do word-proximity searching or phrase matching (strictly speaking, you match consecutive terms representing a phrase, which approximates matching the actual phrase and is often the first step in doing so).
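
The Boolean model in miniature looks something like this (all names and the toy corpus are invented; this is just to make the term-to-bit-vector mapping concrete):

```perl
use strict;
use warnings;

# Minimal Boolean model: one bit vector per term, one bit per
# document number.
my @docs = (
    'the quick brown fox',      # document 0
    'the lazy dog',             # document 1
    'quick brown dogs bark',    # document 2
);

my %index;                      # term => bit vector (inverted index)
for my $docnum (0 .. $#docs) {
    vec($index{$_}, $docnum, 1) = 1 for split ' ', lc $docs[$docnum];
}

# An AND query is just the bitwise intersection of the term vectors.
sub find_and {
    my $result = $index{ shift() } // '';
    $result &= $index{$_} // '' for @_;
    return grep { vec($result, $_, 1) } 0 .. $#docs;
}

my @hits = find_and('quick', 'brown');   # documents 0 and 2
```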

So, I propose the module name Search::Boolean. The name might need to be extended if this module could foreseeably fail to be the only, the most scalable, or the most efficient implementation of Boolean searching, or if it implements some unique form of Boolean searching. I don't think I would call the module Search::BooleanInvertedIndex, since Boolean searching usually implies the use of an inverted index in some form (i.e., it's redundantly redundant). If the module is to be called Search::Boolean, however, the "find" method given

  # Perform searches using AND logic
  @matches = $index->find('hamburger patties');

should implement the full Boolean logic--AND, OR, and NOT.
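On bit vectors, OR and NOT are as cheap as AND, so extending find() to the full logic is natural. A sketch of the three operators (the corpus-size argument to NOT is needed to bound the complement; these helper names are invented):

```perl
use strict;
use warnings;

# Boolean operators over bit-vector strings as built with vec();
# $ndocs is the number of indexed documents.
sub vec_and { return $_[0] & $_[1] }
sub vec_or  { return $_[0] | $_[1] }

sub vec_not {
    my ($v, $ndocs) = @_;
    my $all = '';
    vec($all, $_, 1) = 1 for 0 .. $ndocs - 1;    # every document's bit on
    $v .= "\0" x (length($all) - length($v));    # pad to full width
    return $all & ~$v;    # complement, masked to the real documents
}
```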

So, I think things like this

$index->add("$filename:$linepos", $_) foreach ...;

are not really useful with this module. It may allow me to run the query "Find all lines containing the words 'search' and 'engine'," but it does not then easily or efficiently allow me to run the query "Find all -files- containing the words 'search' and 'engine'." It can be done with some post-processing, but the essential point is that the simple Boolean model is not very appropriate here, especially for a large corpus. Someone seeing the name "Search::Boolean" can instantly recognize the limitations of the underlying model, and for that reason I find the name appropriate.
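
With composite "$filename:$linepos" keys, the file-level query can only be answered by post-processing every line-level match, e.g. (sample keys invented):

```perl
use strict;
use warnings;

# Collapse line-level matches down to unique files.  This works,
# but every matching line must be visited first, which is the
# inefficiency at issue.
my @line_matches = ('a.txt:10', 'a.txt:97', 'b.txt:3');

my %seen;
my @file_matches = grep { !$seen{$_}++ }
                   map  { (split /:/, $_, 2)[0] } @line_matches;
# @file_matches is ('a.txt', 'b.txt')
```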

-davidm


Paul Hoffman wrote:
On Thursday, February 12, 2004, at 10:45 PM, David Manura wrote:

 > I wonder if the module name should imply something about the fact
 > that word positions and frequencies in documents are not stored.

Good point.  Thanks for your comments.  Actually, word positions
*can* be tracked, though it's easier to index words by line
position:

  my $index = Search::TokenIndex->new(
      'tokenizer' => sub { map { lc } $_[0] =~ /\b[A-Za-z]{2,9}/g }
  );
  my $linepos = tell $fh;
  while (my $line = <$fh>) {
      chomp $line;
      $index->add($linepos, $line);
      $linepos = tell $fh;
  }
  ...
  foreach (qw(some words to look for)) {
      my @lineposes = $index->find($_);
      print map { seek $fh, $_, 0; scalar <$fh> } @lineposes;
  }

This code uses one index per file; you can get around that, but it
won't be very efficient:

    $index->add("$filename:$linepos", $_) foreach ...;
                ^^^^^^^^^^^^^^^^^^^^

(Long ramble about gigabyte-sized indexes omitted.)

 > The data structure used precludes the possibility of implementing
 > a useful ranking algorithm or a proximity search (including
 > phrase matching) algorithm.

Yes, though you can do phrase matching if your tokenizer is set up
to return phrases (as well as individual words, presumably).  Still,
it probably wouldn't be a good idea to do it using this module
alone.
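
For example, a tokenizer that emits word bigrams alongside the individual words would let two-word phrases be matched with ordinary AND searching (a sketch, not part of the module):

```perl
use strict;
use warnings;

# Emit each word plus each adjacent-word pair, so a two-word
# phrase becomes a single indexable token.
sub bigram_tokenizer {
    my @words = map { lc } $_[0] =~ /\b[A-Za-z]+\b/g;
    return (@words,
            map { "$words[$_] $words[$_ + 1]" } 0 .. $#words - 1);
}

my @tokens = bigram_tokenizer('Hamburger patties sizzle');
# ('hamburger', 'patties', 'sizzle',
#  'hamburger patties', 'patties sizzle')
```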

 > This is an important point and is something I would want to know
 > right off the bat when selecting a searching module.

Agreed.  I'll probably have to relegate this information to the
README, though, because I can't think how to cover it all in the
module name.  Unless someone else comes up with something clever.

(I thought of calling it Search::Approximate but it can do exact
indexing and searching too.)

 > Searching modules could be named in a variety of ways:
 >
 >   - by data structure (e.g. InvertedIndex, SuffixTree, Signature,
 >     ...)
 >   - by algorithm (e.g. VectorSpace or Probabilistic)
 >   - what it searches (e.g. Word, Character, Token, Phoneme, Byte,
 >     Image, ...)
 >   - search engine name (e.g. SWISH)

Thanks, that's a good summary.  It uses a hash of bit vectors, and
I've considered naming it Search::DocVector ("search documents using
a vector") but don't want to overuse the word "document" because
that might give the impression that it's only for indexing and
searching text files.

It can search by word, byte, file name (if you want an index by
filename of all files in your home directory, for example), and so
on.

I think it belongs in Search::* rather than Text::* because it's not
just for text, though certainly that's the emphasis.  (It's
Text::TokenIndex in my development version.  Grrr, I wish CVS could
handle file renames.)

I'm thinking it should have the word 'index' in the name, since it
builds an index that can be (un)serialized.  It's a lot like
Search::InvertedIndex, actually, but without all the power and
complexity.  (I haven't used Search::InvertedIndex, though, so who
am I to say? :-)
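
Since the index is just a hash of strings, (un)serializing it with core Storable is straightforward (a sketch; presumably not the module's actual dump format):

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# A hash of bit-vector strings serializes like any other hash.
my %index;
vec($index{'fox'}, 2, 1) = 1;     # 'fox' occurs in document 2

my $frozen = freeze(\%index);     # byte string suitable for disk
my %copy   = %{ thaw($frozen) };
# vec($copy{'fox'}, 2, 1) is still 1
```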

I thought of calling it Search::Simple, but that seems much too
vague.

 > Which to use might depend on which is most essential in the
 > module and cannot change.  So, could the data structure,
 > algorithm, or other factors used by the module plausibly change
 > in the future without renaming the module?

It indexes and searches tokens, where a token is any arbitrary
scalar value; I think that's what it boils down to.  (It's possible
to use non-scalar tokens as long as you (un)serialize the index
yourself rather than letting this module do it for you.  I think.)

> -davidm

Thanks again,

Paul.

 > Paul Hoffman wrote:
 > > I would welcome any comments on this module, which is nearing
 > > completion.
 > >
 > > Purpose:
 > >
 > >     Fast, unsophisticated approximate or exact text searching
 > >
 > > Advantages:
 > >
 > >     Simple API, easy to understand
 > >     Easy to get started if you just want to index text files
 > >     Versatile -- can index words, soundex codes, things other
 > >       than text
 > >     Reasonably fast -- indexing is O(n) where n is the number of
 > >       "tokens" in the set of indexed documents, and searching is
 > >       O(m) where m is the number of tokens in the search string
 > >     Can use tied hash for storage
 > >     No prerequisites
 > >     Easy to (un)serialize indexes
 > >
 > > Disadvantages:
 > >
 > >     Doesn't rank search results
 > >     Doesn't remember where tokens occur in a document
 > >     Doesn't know anything about proximity searching
 > >     Probably doesn't scale well to large document sets
 > >     Works strictly at the byte level -- doesn't know anything
 > >       about languages or encodings
 > >     No Boolean searching (everything is AND)
 > >     Doesn't rank search results
 > >     Did I mention it doesn't rank search results?
 > >
 > > Similar modules:
 > >
 > >     Search::VectorSpace
 > >     Search::FreeText
 > >     BitstringSearch
 > >     ... and others that depend on DBI
 > >
 > > A draft README follows...
 > >
 > >
 > > NAME
 > >
 > >     Search::TokenIndex - token-based text indexing and retrieval
 > >
 > >
 > > SYNOPSIS
 > >
 > >     # Index documents
 > >     $index = Search::TokenIndex->new(tokenizer => \&my_tokenizer);
 > >     foreach $file (<*.txt>) {
 > >         $index->add($file, IO::File->new($file));
 > >     }
 > >     $index->add('any unique key', 'text to be indexed');
 > >     $index->add('some other key', \$more_text_to_be_indexed);
 > >     $index->add('yet another key', $a_file_handle);
 > >
 > >     # Perform searches using AND logic
 > >     @matches = $index->find('hamburger patties');
 > >
 > >     # Fancy stuff
 > >     $chapter_index = Search::TokenIndex->new(
 > >         tokenizer => \&my_tokenizer,
 > >         bit_hash  => \%my_tied_hash,
 > >         docid2num => sub { shift()->number },
 > >         docnum2id => sub { $book->chapter(shift()) },
 > >     );
 > >     foreach $chapter ($book->chapters) {
 > >         $chapter_index->add($chapter, $chapter->body);
 > >     }
 > >
 > >     # Dump and load indexes
 > >     $index->dump(IO::File->new('myindex'));
 > >     $index->load(IO::File->new('anotherindex'));
 > >     die "Too big!"
 > >         if $index->dumped_size > $my_disk_space;
 > >
 > >     # Remove and update documents
 > >     $index->remove($somefile);
 > >     $index->update($anotherfile)
 > >         if file_has_changed($anotherfile);
 > >
 > >
 > > DESCRIPTION
 > >
 > >     Search::TokenIndex is designed for reasonably quick text
 > >     indexing and retrieval of moderate-sized document collections.
 > >
 > >     Rather than indexing whole words, it's meant to be used to index
 > >     word tokens -- abbreviated, hashed, or otherwise truncated forms
 > >     of words found in the indexed documents.  (On the other hand,
 > >     it's easy to make it index whole words if that's what you
 > >     prefer.)
 > >
 > >     A token can be anything you wish: a word stem (using, perhaps,
 > >     Lingua::Stem), a soundex code (see Text::Soundex), the output of
 > >     a hash function, etc.  You provide a tokenizer--a function that
 > >     takes a string (or filehandle) as its sole argument and returns
 > >     a list of tokens found in the string (or filehandle contents).
 > >     Once a document has been indexed, searches may be performed on
 > >     the index.  Search results are not ranked in any way.
 > >
 > >     (There's a default tokenizer but it's not very clever.)
 > >
 > >     Search::TokenIndex isn't meant as a complete solution to text
 > >     indexing; it's just one (albeit important) part of the puzzle.
 > >     If your needs are simple, however, the other parts can be very
 > >     easy to make.
 > >
 > >
 > > IMPLEMENTATION
 > >
 > >     Search::TokenIndex uses a hash of bit vectors (constructed using
 > >     Perl's vec() function), one for each token that occurs in the
 > >     documents you've indexed.  This means that each document must be
 > >     represented by an integer that serves as the position of the
 > >     document's bit in the bit vector.  (You identify documents by
 > >     their document id, which can be anything but is normally a
 > >     scalar value.)
 > >
 > >     If a particular token occurs in, say, document number 1000, then
 > >     the bit vector for that token will have a 1 in its 1000th bit.
 > >     If that's the only document the token occurs in, all 999 other
 > >     bits will be 0.  That's not very memory-efficient, but it makes
 > >     for fast searching.  If you index eight hundred documents
 > >     containing a total of ten thousand unique tokens, your index
 > >     will require about one million bytes (800 documents × 10000
 > >     tokens / 8 bits per byte).
 > >
 > >     This hash of bit vectors is kept in core memory by default, but
 > >     you can specify, for example, a tied hash backed by a DBM file.
 > >

--
Paul Hoffman :: Taubman Medical Library :: Univ. of Michigan
[EMAIL PROTECTED] :: [EMAIL PROTECTED] :: http://www.nkuitse.com/




