I wonder if the module name should imply something about the fact that word positions and frequencies in documents are not stored. The data structure used precludes the possibility of implementing a useful ranking algorithm or a proximity search (including phrase matching) algorithm. This is an important point and is something I would want to know right off the bat when selecting a searching module.
Searching modules could be named in a variety of ways:
- by data structure (e.g. InvertedIndex, SuffixTree, Signature, ...)
- by algorithm (e.g. VectorSpace or Probabilistic)
- by what it searches (e.g. Word, Character, Token, Phoneme, Byte, Image, ...)
- by search engine name (e.g. SWISH)
Which to use might depend on what is most essential to the module and least likely to change. So, could the data structure, algorithm, or other factors used by the module plausibly change in the future without renaming the module?
-davidm
Paul Hoffman wrote:
I would welcome any comments on this module, which is nearing completion.
Purpose:
Fast, unsophisticated approximate or exact text searching
Advantages:
- Simple API, easy to understand
- Easy to get started if you just want to index text files
- Versatile -- can index words, soundex codes, things other than text
- Reasonably fast -- indexing is O(n) where n is the number of "tokens" in the set of indexed documents, and searching is O(m) where m is the number of tokens in the search string
- Can use tied hash for storage
- No prerequisites
- Easy to (un)serialize indexes
Disadvantages:
- Doesn't rank search results
- Doesn't remember where tokens occur in a document
- Doesn't know anything about proximity searching
- Probably doesn't scale well to large document sets
- Works strictly at the byte level -- doesn't know anything about languages or encodings
- No Boolean searching (everything is AND)
- Doesn't rank search results
- Did I mention it doesn't rank search results?
Similar modules:
- Search::VectorSpace
- Search::FreeText
- BitstringSearch
- ... and others that depend on DBI
A draft README follows...
NAME
Search::TokenIndex - token-based text indexing and retrieval
SYNOPSIS
  # Index documents
  $index = Search::TokenIndex->new(tokenizer => \&my_tokenizer);

  foreach $file (<*.txt>) {
      $index->add($file, IO::File->new($file));
  }

  $index->add('any unique key', 'text to be indexed');
  $index->add('some other key', \$more_text_to_be_indexed);
  $index->add('yet another key', $a_file_handle);

  # Perform searches using AND logic
  @matches = $index->find('hamburger patties');

  # Fancy stuff
  $chapter_index = Search::TokenIndex->new(
      tokenizer => \&my_tokenizer,
      bit_hash  => \%my_tied_hash,
      docid2num => sub { shift()->number },
      docnum2id => sub { $book->chapter(shift()) },
  );

  foreach $chapter ($book->chapters) {
      $chapter_index->add($chapter, $chapter->body);
  }

  # Dump and load indexes
  $index->dump(IO::File->new('myindex'));
  $index->load(IO::File->new('anotherindex'));

  die "Too big!"
      if $index->dumped_size > $my_disk_space;

  # Remove and update documents
  $index->remove($somefile);
  $index->update($anotherfile)
      if file_has_changed($anotherfile);
DESCRIPTION
Search::TokenIndex is designed for reasonably quick text indexing and retrieval of moderate-sized document collections.
Rather than indexing whole words, it's meant to index word tokens -- abbreviated, hashed, or otherwise truncated forms of the words found in the indexed documents. (On the other hand, it's easy to make it index whole words if that's what you prefer.)
A token can be anything you wish: a word stem (using, perhaps, Lingua::Stem), a soundex code (see Text::Soundex), the output of a hash function, etc. You provide a tokenizer--a function that takes a string (or filehandle) as its sole argument and returns a list of tokens found in the string (or filehandle contents). Once a document has been indexed, searches may be performed on the index. Search results are not ranked in any way.
(There's a default tokenizer but it's not very clever.)
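To make the tokenizer contract concrete, here is a minimal sketch of one. The function name and the 6-byte truncation length are arbitrary choices for illustration, not part of the module; the only requirements stated above are that it accept a string or filehandle and return a list of tokens.

  # Hypothetical tokenizer: lowercase the input, split on non-word
  # characters, and truncate each word to its first 6 bytes.
  sub my_tokenizer {
      my ($input) = @_;
      # A filehandle is a reference; a plain string is not.
      my $text = ref($input) ? do { local $/; <$input> } : $input;
      return map { substr($_, 0, 6) } grep { length } split /\W+/, lc $text;
  }

With this tokenizer, "Hamburger" and "hamburgers" both become the token "hambur", so a search for one would match documents containing the other.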
Search::TokenIndex isn't meant as a complete solution to text indexing; it's just one (albeit important) part of the puzzle. If your needs are simple, however, the other parts can be very easy to make.
IMPLEMENTATION
Search::TokenIndex uses a hash of bit vectors (constructed using Perl's vec() function), one for each token that occurs in the documents you've indexed. This means that each document must be represented by an integer that serves as the position of the document's bit in the bit vector. (You identify documents by their document id, which can be anything but is normally a scalar value.)
If a particular token occurs in, say, document number 1000, then the bit vector for that token will have a 1 at bit position 1000. If that's the *only* document the token occurs in, every other bit will be 0. That's not very memory-efficient, but it makes for fast searching. If you index eight hundred documents containing a total of ten thousand unique tokens, your index will require about one million bytes (800 documents * 10000 tokens / 8 bits per byte).
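The scheme above can be sketched in a few lines of plain Perl. This is only an illustration of the idea, not the module's actual code; the hash name and tokens are made up. An AND search falls out naturally as a bitwise & of the matching tokens' vectors.

  # One bit string per token; document N owns bit N of each string.
  my %bit_for_token = (hambur => '', pattie => '');
  vec($bit_for_token{hambur}, 1000, 1) = 1;  # "hambur" occurs in doc 1000
  vec($bit_for_token{pattie}, 1000, 1) = 1;  # so does "pattie"
  vec($bit_for_token{hambur}, 3, 1)    = 1;  # "hambur" also in doc 3

  # AND search: a document matches only if its bit is set in every vector.
  my $result = $bit_for_token{hambur} & $bit_for_token{pattie};
  print "doc 1000 matches\n" if vec($result, 1000, 1);  # printed
  print "doc 3 matches\n"    if vec($result, 3, 1);     # not printed

Note that there is no per-document position or count anywhere in this structure, which is exactly why ranking and proximity searching are off the table.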
This hash of bit vectors is kept in core memory by default, but you can specify, for example, a tied hash backed by a DBM file.
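As a sketch of the tied-hash option, the hash of bit vectors can be backed by a DBM file using core modules. The file name and token are made up for illustration; also note that SDBM_File has a small per-entry size limit (roughly 1 KB per key/value pair), so a larger DBM such as DB_File would be a better fit for big document collections.

  use Fcntl qw(O_RDWR O_CREAT);
  use SDBM_File;
  use File::Temp qw(tempdir);

  # Tie a hash to an on-disk DBM file instead of keeping it in core.
  my $dir = tempdir(CLEANUP => 1);
  tie my %bit_vectors, 'SDBM_File', "$dir/myindex", O_RDWR | O_CREAT, 0644
      or die "Can't tie index file: $!";

  # Bit vectors are plain strings, so they store in a DBM directly.
  my $vec = '';
  vec($vec, 42, 1) = 1;          # the token occurs in document 42
  $bit_vectors{hambur} = $vec;   # written through to the DBM file

  # The tied hash would then be handed to the constructor, per the
  # SYNOPSIS:  Search::TokenIndex->new(bit_hash => \%bit_vectors, ...)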