I wonder if the module name should imply something about the fact that word positions and frequencies in documents are not stored. The data structure used precludes the possibility of implementing a useful ranking algorithm or a proximity search (including phrase matching) algorithm. This is an important point and is something I would want to know right off the bat when selecting a searching module.
Searching modules could be named in a variety of ways:
- by data structure (e.g. InvertedIndex, SuffixTree, Signature, ...)
- by algorithm (e.g. VectorSpace or Probabilistic)
- by what it searches (e.g. Word, Character, Token, Phoneme, Byte, Image, ...)
- by search engine name (e.g. SWISH)
Which to use might depend on what is most essential to the module and least likely to change. So, could the data structure, algorithm, or other factors used by the module plausibly change in the future without renaming the module?
-davidm
Paul Hoffman wrote:
I would welcome any comments on this module, which is nearing completion.
Purpose:
Fast, unsophisticated approximate or exact text searching
Advantages:
- Simple API, easy to understand
- Easy to get started if you just want to index text files
- Versatile -- can index words, soundex codes, things other than text
- Reasonably fast -- indexing is O(n) where n is the number of "tokens" in the set of indexed documents, and searching is O(m) where m is the number of tokens in the search string
- Can use tied hash for storage
- No prerequisites
- Easy to (un)serialize indexes
Disadvantages:
- Doesn't rank search results
- Doesn't remember where tokens occur in a document
- Doesn't know anything about proximity searching
- Probably doesn't scale well to large document sets
- Works strictly at the byte level -- doesn't know anything about languages or encodings
- No Boolean searching (everything is AND)
- Doesn't rank search results
- Did I mention it doesn't rank search results?
Similar modules:
- Search::VectorSpace
- Search::FreeText
- BitstringSearch
- ... and others that depend on DBI
A draft README follows...
NAME
Search::TokenIndex - token-based text indexing and retrieval
SYNOPSIS
  # Index documents
  $index = Search::TokenIndex->new(tokenizer => \&my_tokenizer);

  foreach $file (<*.txt>) {
      $index->add($file, IO::File->new($file));
  }

  $index->add('any unique key', 'text to be indexed');
  $index->add('some other key', \$more_text_to_be_indexed);
  $index->add('yet another key', $a_file_handle);

  # Perform searches using AND logic
  @matches = $index->find('hamburger patties');

  # Fancy stuff
  $chapter_index = Search::TokenIndex->new(
      tokenizer => \&my_tokenizer,
      bit_hash  => \%my_tied_hash,
      docid2num => sub { shift()->number },
      docnum2id => sub { $book->chapter(shift()) },
  );

  foreach $chapter ($book->chapters) {
      $chapter_index->add($chapter, $chapter->body);
  }

  # Dump and load indexes
  $index->dump(IO::File->new('myindex'));
  $index->load(IO::File->new('anotherindex'));

  die "Too big!"
      if $index->dumped_size > $my_disk_space;

  # Remove and update documents
  $index->remove($somefile);
  $index->update($anotherfile)
      if file_has_changed($anotherfile);
DESCRIPTION
Search::TokenIndex is designed for reasonably quick text indexing and retrieval of moderate-sized document collections.
Rather than indexing whole words, it's meant to index word tokens -- abbreviated, hashed, or otherwise truncated forms of the words found in the indexed documents. (On the other hand, it's easy to make it index whole words if that's what you prefer.)
A token can be anything you wish: a word stem (using, perhaps, Lingua::Stem), a soundex code (see Text::Soundex), the output of a hash function, etc. You provide a tokenizer--a function that takes a string (or filehandle) as its sole argument and returns a list of tokens found in the string (or filehandle contents). Once a document has been indexed, searches may be performed on the index. Search results are not ranked in any way.
(There's a default tokenizer but it's not very clever.)
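To make the tokenizer contract concrete, here is a minimal sketch of one. The function name and the 6-byte truncation length are arbitrary choices for illustration, not part of the module; the only requirements stated above are that it accept a string or filehandle and return a list of tokens.

  # Hypothetical tokenizer: lowercase the input, split on non-word
  # characters, and truncate each word to its first 6 bytes.
  sub my_tokenizer {
      my ($input) = @_;
      # A filehandle is a reference; a plain string is not.
      my $text = ref($input) ? do { local $/; <$input> } : $input;
      return map { substr($_, 0, 6) } grep { length } split /\W+/, lc $text;
  }

With this tokenizer, "Hamburger" and "hamburgers" both become the token "hambur", so a search for one would match documents containing the other.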
Search::TokenIndex isn't meant as a complete solution to text indexing; it's just one (albeit important) part of the puzzle. If your needs are simple, however, the other parts can be very easy to make.
IMPLEMENTATION
Search::TokenIndex uses a hash of bit vectors (constructed using Perl's vec() function), one for each token that occurs in the documents you've indexed. This means that each document must be represented by an integer that serves as the position of the document's bit in the bit vector. (You identify documents by their document id, which can be anything but is normally a scalar value.)
If a particular token occurs in, say, document number 1000, then the bit vector for that token will have a 1 at bit position 1000. If that's the *only* document the token occurs in, every other bit will be 0. That's not very memory-efficient, but it makes for fast searching. If you index eight hundred documents containing a total of ten thousand unique tokens, your index will require about one million bytes (800 documents * 10000 tokens / 8 bits per byte).
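The scheme above can be sketched in a few lines of plain Perl. This is only an illustration of the idea, not the module's actual code; the hash name and tokens are made up. An AND search falls out naturally as a bitwise & of the matching tokens' vectors.

  # One bit string per token; document N owns bit N of each string.
  my %bit_for_token = (hambur => '', pattie => '');
  vec($bit_for_token{hambur}, 1000, 1) = 1;  # "hambur" occurs in doc 1000
  vec($bit_for_token{pattie}, 1000, 1) = 1;  # so does "pattie"
  vec($bit_for_token{hambur}, 3, 1)    = 1;  # "hambur" also in doc 3

  # AND search: a document matches only if its bit is set in every vector.
  my $result = $bit_for_token{hambur} & $bit_for_token{pattie};
  print "doc 1000 matches\n" if vec($result, 1000, 1);  # printed
  print "doc 3 matches\n"    if vec($result, 3, 1);     # not printed

Note that there is no per-document position or count anywhere in this structure, which is exactly why ranking and proximity searching are off the table.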
This hash of bit vectors is kept in core memory by default, but you can specify, for example, a tied hash backed by a DBM file.
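As a sketch of the tied-hash option, the hash of bit vectors can be backed by a DBM file using core modules. The file name and token are made up for illustration; also note that SDBM_File has a small per-entry size limit (roughly 1 KB per key/value pair), so a larger DBM such as DB_File would be a better fit for big document collections.

  use Fcntl qw(O_RDWR O_CREAT);
  use SDBM_File;
  use File::Temp qw(tempdir);

  # Tie a hash to an on-disk DBM file instead of keeping it in core.
  my $dir = tempdir(CLEANUP => 1);
  tie my %bit_vectors, 'SDBM_File', "$dir/myindex", O_RDWR | O_CREAT, 0644
      or die "Can't tie index file: $!";

  # Bit vectors are plain strings, so they store in a DBM directly.
  my $vec = '';
  vec($vec, 42, 1) = 1;          # the token occurs in document 42
  $bit_vectors{hambur} = $vec;   # written through to the DBM file

  # The tied hash would then be handed to the constructor, per the
  # SYNOPSIS:  Search::TokenIndex->new(bit_hash => \%bit_vectors, ...)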