Text::Document

Andrea Spinelli Mon, 26 Nov 2001 03:16:59 -0800

Hello,

we have written three modules in a distribution
that we have called Text-Document, from the
name of the main package (the other two are
called Text::DocumentCollection and
Text::Bloom).


They have been subjected to review in comp.lang.perl.modules
(with little response, so we assume nobody is hurt...).

The aim of the distribution is dealing with text documents
from the perspective of information retrieval, so
we think they belong to the Text:: namespace.

One of us is already a PAUSE author (ASPINELLI), but
up to now he just maintained an already-existing
package, so he took the perhaps misguided step
of uploading the distribution (Text-Document.1.04.tar.gz)
to PAUSE without registering the name.

Now we ask for the registration of Text::Document,
and of course we are ready to
delete the distribution from PAUSE and
upload it with a revised name, if necessary.

Below you will find the README and the pods for the
modules.

Thanks in advance
        Andrea Spinelli, [EMAIL PROTECTED]
        Walter Vannini, [EMAIL PROTECTED]

----------README

Text::Document is a collection of modules that operate
on text documents from the perspective of Information Retrieval.

Text::Document scans documents, extracts terms, and compares pairs
of documents using the Jaccard and Cosine similarity measures.

Text::Bloom computes Bloom filters, which compactly
store information about term presence in documents, thereby
allowing for efficient storage of document 'signatures'.

Text::DocumentCollection is a collection of documents, allowing
for persistence and for such calculations as the Inverse Document
Frequency (IDF).

Version 1.01 of the package Text::Document is
Copyright (C) 2001 Andrea Spinelli  and Walter Vannini

All documents in this package can be used under the same terms
as Perl itself.

In any case, we are eager to hear about your experiences with this
package, at
[EMAIL PROTECTED] and/or [EMAIL PROTECTED]

------Document.pod

=head1 NAME

  Text::Document - a text document subject to statistical analysis

=head1 SYNOPSIS

  my $t = Text::Document->new();
  $t->AddContent( 'foo bar baz' );
  $t->AddContent( 'foo barbaz; ' );

  my $freqList = $t->KeywordFrequency();
  my $u = Text::Document->new();
  ...
  my $sj = $t->JaccardSimilarity( $u );
  my $sc = $t->CosineSimilarity( $u );


=head1 DESCRIPTION

C<Text::Document> performs simple
Information-Retrieval-oriented statistics on pure-text documents.

Text can be added in chunks, so that the document may be
incrementally built, for instance by a class like
C<HTML::Parser>.

A simple algorithm splits the text into terms; the algorithm
may be redefined by subclassing and redefining C<ScanV>.

The C<KeywordFrequency> function computes term frequency
over the whole document.

=head1 FORESEEN REUSE

The package may be {re}used either by simple instantiation,
or by subclassing (defining a descendant package).  In the
latter case the methods which are foreseen to be redefined are
those ending with a C<V> suffix.  Redefining other methods
will require greater attention.

=head1 CLASS METHODS

=head2 new

The constructor.  No arguments.

  my $d = Text::Document->new();

=head2 NewFromString

Take a string written by C<WriteToString> (see below)
and create a new C<Text::Document> with the same contents;
call C<die> whenever the restore is impossible or ill-advised,
for instance when the current version of the package is different
from the original one, or the compression library is unavailable.

  my $b = Text::Document::NewFromString( $str );

The return value is a blessed reference; put in another way,
this is an alternative constructor.

The string should have been written by C<WriteToString>;
you may of course tweak the string contents, but
at this point you're entirely on your own.

=head1 INSTANCE METHODS

=head2 AddContent

Used as

  $d->AddContent( 'foo bar baz foo9' );
  $d->AddContent( 'mary had a little lamb' );

Successive calls accumulate content; there is currently no way
of resetting the content to zero.

=head2 Terms

Returns a list of all distinct terms in the document, in no
particular order.

=head2 Occurrences

Returns the number of occurrences of a given term.

  $d->AddContent( 'foo baz bar foo foo');
  my $n = $d->Occurrences( 'foo' ); # now $n is 3

=head2 ScanV

Scan a string and return a list of terms.

Called internally as:

  my @terms = $self->ScanV( $text );

=head2 KeywordFrequency

Returns a reference to a list of pairs I<[term,frequency]>, sorted by
ascending frequency.

  my $listRef = $d->KeywordFrequency();
  foreach my $pair (@{$listRef}){
      my ($term,$frequency) = @{$pair};
      ...
  }

The occurrences of each term in the document are counted, the
terms are sorted by ascending frequency of occurrence, and
finally the list is returned to the user.
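The counting and sorting just described can be sketched in plain Perl. This is only an illustration, not the module's code: the helper name C<keyword_frequency> is hypothetical, frequencies are assumed to be raw occurrence counts, and the alphabetical tie-break is our own choice to make the output deterministic.

```perl
use strict;
use warnings;

# Hypothetical helper: count each term, then return a reference to a
# list of [term, count] pairs sorted by ascending count.
sub keyword_frequency {
    my @terms = @_;
    my %count;
    $count{$_}++ for @terms;

    # Ties are broken alphabetically (an assumption, for determinism).
    my @sorted = sort { $count{$a} <=> $count{$b} or $a cmp $b } keys %count;
    return [ map { [ $_, $count{$_} ] } @sorted ];
}

my $pairs = keyword_frequency( qw( foo baz bar foo foo baz ) );
print join( ', ', map { "$_->[0]=$_->[1]" } @{$pairs} ), "\n";
```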

=head2 WriteToString

Convert the document (actually, some parameters
and the term counters) into a string which can be saved and
later restored with C<NewFromString>.

  my $str = $d->WriteToString();

The string begins with a header which encodes the
originating package, its version, and the parameters
of the current instance.

Whenever possible, C<Compress::Zlib> is used in order to
compress the saved data in the most efficient way.
On systems without C<Compress::Zlib>, the data are
saved uncompressed.

=head2 JaccardSimilarity

Compute the Jaccard measure of document similarity, which is defined
as follows: given two documents I<D> and I<E>, let I<Ds> and I<Es> be
the sets of terms occurring in I<D> and I<E>, respectively. Define
I<S> as the intersection of I<Ds> and I<Es>, and I<T> as their union.
Then the Jaccard similarity is the number of elements
of I<S> divided by the number of elements of I<T>.

It is called as follows:

  my $sim = $d->JaccardSimilarity( $e );

If neither document has any terms the result is undef (a rare
occurrence).
Otherwise the similarity is a real number between 0.0 (no terms in
common) and 1.0 (all terms in common).
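As a minimal sketch of the definition above, the following plain Perl computes the Jaccard measure from two term lists. It is independent of the module's actual implementation; the helper name C<jaccard> and the use of array references as input are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: Jaccard similarity of two term lists
# (passed as array references).
sub jaccard {
    my ( $d_terms, $e_terms ) = @_;
    my %ds = map { $_ => 1 } @{$d_terms};    # set of terms in D
    my %es = map { $_ => 1 } @{$e_terms};    # set of terms in E

    my %union = ( %ds, %es );                # T, the union
    my $t     = keys %union;
    return undef if $t == 0;                 # neither document has terms

    my $s = grep { exists $es{$_} } keys %ds;    # size of S, the intersection
    return $s / $t;
}

# Two terms in common (foo, bar) out of four distinct terms.
my $sim = jaccard( [qw( foo bar baz )], [qw( foo bar barbaz )] );
print "similarity: $sim\n";    # prints "similarity: 0.5"
```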

=head2 CosineSimilarity

Compute the cosine similarity between two documents I<D> and
I<E>.

Let I<Ds> and I<Es> be the sets
of terms occurring in I<D> and I<E>, respectively. Define I<T> as the
union of I<Ds> and I<Es>, and let I<ti> be the I<i>-th element of
I<T>.

Then the term vectors of I<D> and  I<E> are

  Dv = (nD(t1), nD(t2), ..., nD(tN))
  Ev = (nE(t1), nE(t2), ..., nE(tN))

where nD(ti) is the  number of occurrences of term ti in I<D>,
and nE(ti) the same for I<E>.

Now we are at last ready to define the cosine similarity I<CS>:

  CS = (Dv,Ev) / (Norm(Dv)*Norm(Ev))

Here (... , ...) is the scalar product and Norm is the Euclidean
norm (square root of the sum of squares).

C<CosineSimilarity> is called as

   $sim = $d->CosineSimilarity( $e );

It is C<undef> if either I<D> or I<E> has no occurrence of any term.
Otherwise, it is a number between 0.0 and 1.0. Since term occurrences
are always non-negative, the cosine is always non-negative.
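The formula can be illustrated with a short, self-contained Perl sketch. It is not taken from the module: the helper name C<cosine> is hypothetical and documents are represented simply as array references of terms.

```perl
use strict;
use warnings;

# Hypothetical helper: cosine similarity of two term lists
# (passed as array references).
sub cosine {
    my ( $d_terms, $e_terms ) = @_;
    my ( %nd, %ne );
    $nd{$_}++ for @{$d_terms};    # nD(t): occurrences of t in D
    $ne{$_}++ for @{$e_terms};    # nE(t): occurrences of t in E
    return undef unless %nd && %ne;    # one document has no terms

    # Scalar product: only terms present in both documents contribute.
    my $dot = 0;
    for my $t ( keys %nd ) {
        $dot += $nd{$t} * $ne{$t} if exists $ne{$t};
    }

    # Euclidean norms of the two term vectors.
    my ( $sq_d, $sq_e ) = ( 0, 0 );
    $sq_d += $_**2 for values %nd;
    $sq_e += $_**2 for values %ne;

    return $dot / ( sqrt($sq_d) * sqrt($sq_e) );
}

# Dv = (2, 1, 0) and Ev = (1, 0, 1) over T = (foo, bar, baz),
# so CS = 2 / (sqrt(5) * sqrt(2)).
my $sim = cosine( [qw( foo foo bar )], [qw( foo baz )] );
printf "similarity: %.4f\n", $sim;
```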

=head1 AUTHORS

  [EMAIL PROTECTED] (Andrea Spinelli)
  [EMAIL PROTECTED] (Walter Vannini)

=head1 HISTORY

  2001-11-02 - initial revision

=head1 DISCARDED CHOICES

We did not use C<Storable>, because we wanted to fine-tune
compression and version compatibility.  However, this
choice may be easily reversed by redefining C<WriteToString> and
C<NewFromString>.

---------Bloom.pod

=head1 NAME

  Text::Bloom - Evaluate Bloom signature of a set of terms

=head1 SYNOPSIS

  my $b = Text::Bloom->new();
  $b->Compute( qw( foo bar baz ) );
  my $sig = $b->WriteToString();
  $b->WriteToFile( 'afile.sig' );
  my $b2 = Text::Bloom::NewFromFile( 'afile.sig' );
  my $b3 = Text::Bloom->new();
  $b3->Compute( qw( foo bar barbaz ) );
  my $sim = $b->Similarity( $b3 );
  my $b4 = Text::Bloom::NewFromString( $sig );

=head1 DESCRIPTION

C<Text::Bloom> applies the Bloom filtering technique to
the statistical analysis of documents.

The terms in the document are quantized using a base-36
radix representation; each term thus corresponds to an
integer in the range 0..I<p-1>, where I<p> is a prime,
currently set to the greatest prime less than 2^32.

Each quantized value is mapped to I<d> integers in the range
0..I<size-1>, where I<size> is an integer less than I<p>,
currently 2^17, using a family of hash functions,
computed by the C<HashV> function.

Each hashed value is used as the index in a large bit vector.
Bits corresponding to terms present in the document are set to
1; all other bits are set to 0.

Of course, collisions may cause the same bit to be set twice,
by different terms. It follows that, if the document contains
I<n> distinct terms, in the resulting bit vector at most
I<n * d> bits are set to 1.

The resulting bit string is a very compact representation of the
presence/absence of terms in the document, and is therefore
characterised as a I<signature>. Moreover, it does not
depend on a pre-set dictionary of terms.

The signature may be used for:

=over 4

=item *

testing whether a given set of terms is present in the document,

=item *

computing which fraction of terms are common to two documents.

=back
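The construction described above, together with the term-presence test, can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the helper names (C<quantize>, C<indices>, C<signature>, C<contains>) and the three I<(m, q)> pairs are hypothetical, I<d> is taken to be 3, the bit vector is held in a plain string via Perl's built-in C<vec> rather than a dedicated bit-vector library, and a Perl with 64-bit integers is assumed.

```perl
use strict;
use warnings;

my $P    = 4294967291;    # greatest prime less than 2^32
my $SIZE = 1 << 17;       # filter size in bits

# Hypothetical (m, q) pairs, one per hash function (d = 3).
my @HASH_PARAM = ( [ 31, 7 ], [ 37, 11 ], [ 41, 13 ] );

# Read a term matching /[0-9a-z]+/ as a base-36 number, modulo P.
sub quantize {
    my ($term) = @_;
    my $h = 0;
    for my $c ( split //, $term ) {
        my $digit = $c =~ /[0-9]/ ? ord($c) - ord('0') : ord($c) - ord('a') + 10;
        $h = ( $h * 36 + $digit ) % $P;
    }
    return $h;
}

# The d bit indices of a term: index = m * value + q (mod size).
sub indices {
    my ($term) = @_;
    my $v = quantize($term);
    return map { ( $_->[0] * $v + $_->[1] ) % $SIZE } @HASH_PARAM;
}

# Build the bit string: set the bits of every term, leave the rest 0.
sub signature {
    my @terms = @_;
    my $bits  = '';
    vec( $bits, $SIZE - 1, 1 ) = 0;    # extend the string to SIZE bits
    for my $term (@terms) {
        vec( $bits, $_, 1 ) = 1 for indices($term);
    }
    return $bits;
}

# A term may be present only if all of its d bits are set.
sub contains {
    my ( $bits, $term ) = @_;
    return !grep { !vec( $bits, $_, 1 ) } indices($term);
}

my $sig = signature( qw( foo bar baz ) );
my $set = unpack( '%32b*', $sig );    # number of bits set to 1
print "bits set: $set (at most ", 3 * scalar(@HASH_PARAM), ")\n";
print "foo present? ", ( contains( $sig, 'foo' ) ? 'yes' : 'no' ), "\n";
```

A positive answer from C<contains> is only probable, since collisions can set all the bits of a term that was never added; a negative answer is certain.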

The bit representation may be written to and read from a file.
C<Text::Bloom> prepends a header to the bit stream proper;
moreover, whenever the package C<Compress::Zlib> is available,
the bit vector is compressed, so that disk space requirements
are drastically reduced, especially for small documents.

The hash function is obviously a crucial component of the filter;
the reference implementation uses a radix representation of
strings. Each term must therefore match the regular
expression C</[0-9a-z]+/>.

There are quite a few viable alternatives, which can be pursued
by subclassing and redefining the method C<QuantizeV>.

=head1 FORESEEN REUSE

The package may be {re}used either by simple instantiation,
or by subclassing (defining a descendant package).  In the
latter case the methods which are foreseen to be redefined are
those ending with a C<V> suffix.  Redefining other methods
will require greater attention.

=head1 CLASS METHODS

=head2 new

The constructor. No arguments are required.

  $b = Text::Bloom->new();

=head2 NewFromString

Take a string written by C<WriteToString> (see below)
and create a new C<Text::Bloom> with the same contents;
call C<die> whenever the restore is impossible or ill-advised,
for instance when the current version of the package is different
from the original one, or the compression library is unavailable.

  my $b = Text::Bloom::NewFromString( $str );

The return value is a blessed reference; put in another way,
this is an alternative constructor.

The string should have been written by C<WriteToString>;
you may of course tweak the string contents, but
at this point you're entirely on your own.

=head2 NewFromFile

Utility function that reads a binary file and performs a
C<NewFromString> on its content; see its counterpart, C<WriteToFile>.

  my $b2 = Text::Bloom::NewFromFile( 'foo.sig' );

=head1 INSTANCE METHODS

=head2 Size

Set and get the size of the filter, in bits. The default size
is currently 128K.

  print 'size is ' . $b->Size() . "\n";
  $b->Size( 65536 );

The C<Size> method must be called before the C<Compute> method
in order to take effect.

=head2 Compute

Compute the Bloom signature from the given set of words
and store it internally.

  $b->Compute( qw( foo bar baz foobar bazbaz ) );

Makes use of the C<QuantizeV> method.

=head2 QuantizeV

Convert a term into an integer; must return
an integer in the range 0 .. C<$Text::Bloom::p-1>.

It is called as

  my $hash = $b->QuantizeV( $term );

The current version is designed for strings matching
C</[a-z0-9]+/>. Other characters do not cause errors,
but degrade the hash function performance.

This function is a likely candidate for redefinition.

=head2 HashV

Convert an integer to a (smaller) integer, according
to one of a class of similar functions.

It is internally called as:

  my $index = $b->HashV( $order, $value );

The C<$value> must belong to the interval
0..C<$Text::Bloom::p-1>, while the index must
lie in 0..I<size-1>. C<$order> is
a small integer from 0 to I<d-1>.

The default implementation is

  index = m[order] * value + q[order]   (mod size)

The values of I<m> and I<q> are taken from the array
C<@Text::Bloom::hashParam>; the form of the function
is taken from [2].

=head2 WriteToString

Convert the Bloom signature into a string which can be saved and
later restored with C<NewFromString>. C<Compute> must have
been called previously.

  my $str = $b->WriteToString();

The string begins with a header which encodes the
originating package, its version, and the parameters
of the current instance.

Whenever possible, C<Compress::Zlib> is used in order to
compress the bit vector in the most efficient way.
On systems without C<Compress::Zlib>, the bit string is
saved uncompressed.

=head2 WriteToFile

C<WriteToFile> and C<NewFromFile> are convenience functions: they
just call their String counterparts and write/read the file
specified in the argument.

  $b->WriteToFile( 'foo.sig' );

=head1 AUTHORS

  [EMAIL PROTECTED] (Andrea Spinelli)
  [EMAIL PROTECTED] (Walter Vannini)

=head1 BIBLIOGRAPHY

=over 4

=item [1]

Burton H. Bloom, "Space/time trade-offs in hash coding with allowable
errors", I<Communications of the ACM>, B<13>, 7, July 1970, pages
422-426 (available electronically from the ACM Digital Library).

=item [2]

M. V. Ramakrishna, "Practical Performance of Bloom Filters
and Parallel Free-Text Searching",
I<Communications of the ACM>, B<32>, 10, October 1989, pages
1237-1239 (available electronically from the ACM Digital Library).

=back

=head1 BUGS

On Win32 we have experienced some instabilities when dealing
with a large number of signatures; in this case Perl crashes
without apparent explanation. The main suspect is Bit::Vector,
but we have no evidence.

=head1 HISTORY

  2001-11-02 - initial revision

--------------DocumentCollection.pod

=head1 NAME

  Text::DocumentCollection - a collection of documents

=head1 SYNOPSIS

=head1 DESCRIPTION

=head1 CLASS METHODS

=head2 new

The constructor; arguments must be passed as maps
from keys to values. The key C<file> is mandatory.

  my $c = Text::DocumentCollection->new( file => 'coll.db' );

Documents from the collection are saved in the specified file,
which is currently handled as a C<DB_File> hash.

=head1 INSTANCE METHODS

=head2 Add

Add a document to the collection, tagging it with
a unique key.

  $c->Add( $key, $doc );

C<Add> C<die>s if the key is already present.

To change an existing key, use C<Delete> and then C<Add>.

=head2 Delete

Discard a document from the collection.

=head2 NewFromDB

Loads the collection from the given DB file:

  my $c = Text::DocumentCollection->NewFromDB( file => 'coll.db' );

The file must be either empty or created by a former invocation
of C<new> or C<NewFromDB>, followed by any number of C<Add>
and/or C<Delete>.

Currently, all documents in the collection are revived
(by calling C<NewFromString>). This poses performance problems
for huge collections; a caching mechanism would be an option
in this case.

=head2 IDF

Inverse Document Frequency of a given term.

The definition we used is, given a term I<t>, a set of documents
I<DOC> and the binary relationship I<has-term>:

  IDF(t) = log2( #DOC / #{ d in DOC | d has-term t } )

The logarithm is in base 2, since this is related to an
information measurement, and # is the cardinality operator.
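A small worked example of the formula, as a hedged sketch in plain Perl. The helper name C<idf> is hypothetical, documents are represented as array references of terms, and returning C<undef> when no document contains the term is our assumption (the formula itself is undefined in that case).

```perl
use strict;
use warnings;

# Hypothetical helper: IDF(t) = log2( #DOC / #{ d in DOC | d has-term t } ).
sub idf {
    my ( $term, @docs ) = @_;

    # Number of documents containing the term at least once.
    my $with = grep {
        my $doc = $_;
        grep { $_ eq $term } @{$doc}
    } @docs;

    return undef if $with == 0;    # assumption: undefined for unseen terms
    return log( scalar(@docs) / $with ) / log(2);
}

my @docs = (
    [qw( foo bar )],
    [qw( foo baz baz )],
    [qw( baz )],
    [qw( qux )],
);

# foo occurs in 2 of 4 documents, bar in 1 of 4.
my $idf_foo = idf( 'foo', @docs );
my $idf_bar = idf( 'bar', @docs );
print "IDF(foo) = $idf_foo, IDF(bar) = $idf_bar\n";
```

Rarer terms get a higher value: a term found in half of the documents carries one bit of information, a term found in a quarter of them carries two.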

=head2 EnumerateV

Enumerates all the documents in the collection. Called as:

  my @result = $c->EnumerateV( \&Callback, 'the rock' );

The function C<Callback> will be called on each element
of the collection as:

  my @l = Callback( $c, $key, $doc, $rock );

where C<$rock> is the second argument to C<EnumerateV>.

Since C<$c> is the first argument, the callback may be
an instance method of C<Text::DocumentCollection>.

The final result is obtained by concatenating all the
partial results (C<@l> in the example above).  If you do
not want a result, simply return the empty list ().

There is no particular order of enumeration, so there
is no particular order in which results are concatenated.
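The calling convention can be illustrated with a self-contained sketch that uses a plain hash in place of a real collection. The helper names C<enumerate> and C<keys_with_term> are hypothetical, and documents are stood in for by array references of terms; the real method operates on C<Text::Document> objects stored in the DB file.

```perl
use strict;
use warnings;

# Hypothetical stand-in for EnumerateV: call the callback on every
# (key, document) pair and concatenate the lists it returns.
sub enumerate {
    my ( $collection, $callback, $rock ) = @_;
    my @result;
    for my $key ( keys %{$collection} ) {
        push @result, $callback->( $collection, $key, $collection->{$key}, $rock );
    }
    return @result;
}

# Example callback: return the key if the document contains the term
# passed as the rock, and the empty list otherwise.
sub keys_with_term {
    my ( $c, $key, $doc, $rock ) = @_;
    return ( grep { $_ eq $rock } @{$doc} ) ? ($key) : ();
}

my %collection = (
    a => [qw( foo bar )],
    b => [qw( baz )],
    c => [qw( foo )],
);

# The enumeration order is unspecified, so sort before printing.
my @hits = sort( enumerate( \%collection, \&keys_with_term, 'foo' ) );
print "documents containing foo: @hits\n";    # prints "documents containing foo: a c"
```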

=head1 AUTHORS

  [EMAIL PROTECTED]
  [EMAIL PROTECTED]

--
Andrea Spinelli - Software Architect
e-mail: [EMAIL PROTECTED]
phone: +39-035-636029
fax: +39-035-638129
