Hello,
we have written three modules in a distribution
that we have called Text-Document, from the
name of the main package (the are two are
called Text::DocumentCollection and
Text::Bloom).
Thay have been subjected to review in comp.lang.perl.modules
(with little response, so we assume nobody is hurt...).
The aim of the distribution is dealing with text documents
from the perspective of information retrieval, so
we think they belong to the Text:: namespace.
One of us is already a PAUSE author (ASPINELLI), but
up to now he just maintained an already-existing
package, so he took the perhaps misguided step
of uploading the distribution (Text-Document.1.04.tar.gz)
to PAUSE without registering the name.
Now we ask for the registration of Text::Document,
and of course we are ready to
delete the registration from PAUSE and
upload it with a revised name, if necessary.
Below you find the README and the pods for the
modules.
Thanks in advance
Andrea Spinelli, [EMAIL PROTECTED]
Walter Vannini, [EMAIL PROTECTED]
----------README
Text::Document is a collection of modules which allow to operate
on text documents from the perspective of Information Retrieval.
Text::Document scans documents, extracts terms, compares pairs
of documents using the Jaccard and Cosine similarity measures.
Text::Bloom allows to compute Bloom filters which compactly
store information about term presence in documents, thereby
allowing for efficient storage of document 'signatures'.
Text::DocumentCollection is a collection of documents, allowing
for persistency and for such calculations as the Inverse Document
Frequency (IDF).
Version 1.01 of the package Text::Document is
Copyright (C) 2001 Andrea Spinelli and Walter Vannini
All documents in this package can be used with the same limitations
as Perl itself.
Anyway, we are eager to know about your experiences with this thing,
at
[EMAIL PROTECTED] and/or [EMAIL PROTECTED]
------Document.pod
=head1 NAME
Text::Document - a text document subject to statistical analysis
=head1 SYNOPSIS
my $t = Text::Document->new();
$t->AddContent( 'foo bar baz' );
$t->AddContent( 'foo barbaz; ' );
my @freqList = $t->KeywordFrequency();
my $u = Text::Document->new();
...
my $sj = $t->JaccardSimilarity( $u );
my $sc = $t->CosineSimilarity( $u );
=head1 DESCRIPTION
C<Text::Document> allows to perform simple
Information-Retrieval-oriented statistics on pure-text documents.
Text can be added in chunks, so that the document may be
incrementally built, for instance by a class like
C<HTML::Parser>.
A simple algorithm splits the text into terms; the algorithm
may be redefined by subclassing and redefining C<ScanV>.
The C<KeywordFrequency> function computes term frequency
over the whole document.
=head1 FORESEEN REUSE
The package may be {re}used either by simple instantiation,
or by subclassing (defining a descendant package). In the
latter case the methods which are foreseen to be redefined are
those ending with a C<V> suffix. Redefining other methods
will require greater attention.
=head1 CLASS METHODS
=head2 new
The creator method. No arguments.
my $d = Text::Document->new();
=head2 NewFromString
Take a string written by C<WriteToString> (see below)
and create a new C<Text::Document> with the same contents;
call C<die> whenever the restore is impossible or ill-advised,
for instance when the current version of the package is different
from the original one, or the compression library in unavailable.
my $b = Text::Document::NewFromString( $str );
The return value is a blessed reference; put in another way,
this is an alternative contructor.
The string should have been written by C<WriteToString>;
you may of course tweak the string contents, but
at this point you're entirely on you own.
=head1 INSTANCE METHODS
=head2 AddContent
Used as
$d->AddContent( 'foo bar baz foo9' );
$d->AddContent( 'mary had a little lamb' );
Successive calls accumulate content; there is currently no way
of resetting the content to zero.
=head2 Terms
Returns a list of all distinct terms in the document, in no
particular order.
=head2 Occurrences
Returns the number of occurrences of a given term.
$d->AddContent( 'foo baz bar foo foo');
my $n = $d->Occurrences( 'foo' ); # now $n is 3
=head2 ScanV
Scan a string and return a list of terms.
Called internally as:
my @terms = $self->ScanV( $text );
=head2 KeywordFrequency
Returns a reference list of pairs I<[term,frequency]>, sorted by
ascending frequency.
my $listRef = $d->KeywordFrequency();
foreach my $pair (@{$listRef}){
my ($term,$frequency) = @{$pair};
...
}
Terms in the document are sampled and their frequencies of occurrency
are sorted in ascending order;
finally, the list is returned to the user.
=head2 WriteToString
Convert the document (actually, some parameters
and the term counters) into a string which can be saved and
later restored with C<NewFromString>.
my $str = $d->WriteToString();
The string begins with a header which encodes the
originating package, its version, the parameters
of the current instance.
Whenever possible, C<Compress::Zlib> is used in order to
compress the bit vector in the most efficient way.
On systems without C<Compress::Zlib>, the bit string is
saved uncompressed.
=head2 JaccardSimilarity
Compute the Jaccard measure of document similarity, which is defined
as follows: given two documents I<D> and I<E>, let I<Ds> and I<Es> be
the set
of terms occurring in I<D> and I<E>, respectively. Define I<S> as the
intersection of I<Ds> and I<Es>, and I<T> as their union. Then
the Jaccerd similarity is the the number of elements
of I<S> divided by the number of elements of I<T>.
It is called as follows:
my $sim = $d->JaccardSimilarity( $e );
If neither document has any terms the result is undef (a rare
evenience).
Otherwise the similarity is a real number between 0.0 (no terms in
common)
and 1.0 (all terms in common).
=head2 CosineSimilarity
Compute the cosine similarity between two documents I<D> and
I<E>.
Let I<Ds> and I<Es> be the set
of terms occurring in I<D> and I<E>, respectively. Define I<T> as the
union of I<Ds> and I<Es>, and let I<ti> be the I<i>-th element of
I<T>.
Then the term vectors of I<D> and I<E> are
Dv = (nD(t1), nD(t2), ..., nD(tN))
Ev = (nE(t1), nE(t2), ..., nE(tN))
where nD(ti) is the number of occurrences of term ti in I<D>,
and nE(ti) the same for I<E>.
Now we are at last ready to define the cosine similarity I<CS>:
CS = (Dv,Ev) / (Norm(Dv)*Norm(Ev))
Here (... , ...) is the scalar product and Norm is the Euclidean
norm (square root of the sum of squares).
C<CosineSImilarity> is called as
$sim = $d->CosineSimilarity( $e );
It is C<undef> if either I<D> or I<E> have no occurrence of any term.
Otherwise, it is a number between 0.0 and 1.0. Since term occurrences
are always non-negative, the cosine is obviously always non-negative.
=head1 AUTHORS
[EMAIL PROTECTED] (Andrea Spinelli)
[EMAIL PROTECTED] (Walter Vannini)
=head1 HISTORY
2001-11-02 - initial revision
=head DISCARDED CHOICES
We did not use C<Storable>, because we wanted to fine-tune
compression and version compatibility. However, this
choice may be easily reversed redefining WriteToString and
NewFromString.
---------Bloom.pod
=head1 NAME
Text::Bloom - Evaluate Bloom signature of a set of terms
=head1 SYNOPSIS
my $b = Text::Bloom->new();
$b->Compute( qw( foo bar baz ) );
my $sig = $b->WriteToString();
$b->WriteToFile( 'afile.sig' );
my $b2 = Text::Bloom::NewFromFile( 'afile.sig' );
my $b3 = Text::Bloom->new();
$b3->Compute( qw( foo bar barbaz ) );
my $sim = $b->Similarity( $b2 );
my $b4 = Text::Bloom::NewFromString( $sig );
=head1 DESCRIPTION
C<Text::Bloom> applies the Bloom filtering technique to
the statistical analysis of documents.
The terms in the document are quantized using a base-36
radix representation; each term thus corresponds to an
integer in the range 0..I<p-1>, where I<p> is a prime,
currently set to the greatest prime less than 2^32.
Each quantized value is mapped to I<d> integers in the range
0..I<size-1>, where I<size> is an integer less than I<p>,
currently 2^17, using a family of hash functions,
computed by the C<HashV> function.
Each hashed value is used as the index in a large bit vector.
Bits corresponding to terms present in the document are set to
1; all other bits are set to 0.
Of course, collisions may cause the same bit to be set twice,
by different terms. It follows that, if the document contains
I<n> distinct terms, in the resulting bit vector at most
I<n * d> bits are set to 1.
The resulting bit string is a very compact representation of the
presence/absence of terms in the document, and is therefore
characterised as a I<signature>. Moreover, it does not
depend on a pre-set dictionary of terms.
The signature may be used for:
=over 4
=item *
testing whether a given set of terms is present in the document,
=item *
computing which fraction of terms are common to two documents.
=back
The bit representation may be written to and read from a file.
C<Text::Bloom> prepends a header to the bit stream proper;
moreover, whenever the package C<Compress::Zlib> is available,
the bit vector is compressed, so that disk space requirements
are drastically reduced, especially for small documents.
The hash function is obviously a crucial component of the filter;
the reference implementation uses a radix representation of
strings. Each term must therefore match the regular
expression C</[0-9a-z]+/>.
There are quite a few viable alternatives, which can be pursued
by subclassing and redefining the method C<QuantizeV>.
=head1 FORESEEN REUSE
The package may be {re}used either by simple instantiation,
or by subclassing (defining a descendant package). In the
latter case the methods which are foreseen to be redefined are
those ending with a C<V> suffix. Redefining other methods
will require greater attention.
=head1 CLASS METHODS
=head2 new
The constructor. No arguments are required.
$b = Text::Bloom->new();
=head2 NewFromString
Take a string written by C<WriteToString> (see below)
and create a new C<Text::Bloom> with the same contents;
call C<die> whenever the restore is impossible or ill-advised,
for instance when the current version of the package is different
from the original one, or the compression library in unavailable.
my $b = Text::Bloom::NewFromString( $str );
The return value is a blessed reference; put in another way,
this is an alternative contructor.
The string should have been written by C<WriteToString>;
you may of course tweak the string contents, but
at this point you're entirely on you own.
=head2 NewFromFile
Utility function that reads a binary file and performs a
C<NewFromString>
on its content; see its counterpart, C<WriteToFile>.
my $b2 = Text::Document::NewFromFile( 'foo.sig' );
=head1 INSTANCE METHODS
=head2 Size
Set and get the size of the filter, in bits. The default size
is currently 128K.
print 'size is ' . $b->Size() . "\n";
$b->Size( 65536 );
The C<Size> method must be called before the C<Compute> method
in order to have effect.
=head2 Compute
Compute the Bloom signature from the given set of words
and store it internally.
$b->Compute( qw( foo bar baz foobar bazbaz ) );
Makes use of the C<QuantizeV> method.
=head2 QuantizeV
Convert a term into an integer; must return
an integer in the range 0 .. C<$Text::Bloom::p-1>.
It is called as
my $hash = $b->QuantizeV( $term );
The current version is designed for strings matching
C</[a-z0-9]+/>. Other characters do not cause errors,
but degrade the hash function performance.
This function is a likely candidate for redefinition.
=head2 HashV
Convert an integer to a (smaller) integer, according
to one of a class of similar functions.
It is internally called as:
my $index = $b->HashV( $order, $value );
The C<$value> must belong to the interval
0..C<$Text::Bloom::p-1>, while the index must
lie in 0..I<size-1>. C<$order> is
a small integer from 0 to I<d-1>.
The default implementation is
index = m[order] * value + q[order] (mod size)
the values of I<m> and I<q> are taken from the array
C<@Text::Bloom::hashParam>; the form of the function
is taken from [2].
=head2 WriteToString
Convert the Bloom signature into a string which can be saved and
later restored with C<NewFromString>. C<Compute> must have
been called previously.
my $str = $b->WriteToString();
The string begins with a header which encodes the
originating package, its version, the parameters
of the current instance.
Whenever possible, C<Compress::Zlib> is used in order to
compress the bit vector in the most efficient way.
On systems without C<Compress::Zlib>, the bit string is
saved uncompressed.
=head2 WriteToFile
These convenience functions just call their String counterparts
and read/write the file specified in the argument.
$b->WriteToFile( 'foo.sig' );
=head1 AUTHORS
[EMAIL PROTECTED] (Andrea Spinelli)
[EMAIL PROTECTED] (Walter Vannini)
=head1 BIBLIOGRAPHY
=over 4
=item [1]
Burton H. Bloom, "Space/time trade-offs in hash coding with allowable
errors",
I<Communications of the ACM>, B<13>, 7, July 1970, pages 422-426.
(available
electronically from ACM Digital Library).
=item [2]
M. V. Ramakrishna, "Practical Performance of Bloom FIlters
and Parallel Free-Text Searching",
I<Communications of the ACM>, B<32>, 10, October 1989, pages 1237-
1239.
(available electronically from ACM Digital Library).
=back
=head1 BUGS
On Win32 we have experienced some instabilities when dealing
with a large number of signatures; in this case Perl crashes
without apparent explanation. The main suspect is Bit::Vector,
but without any evidence.
=head1 HISTORY
2001-11-02 - initial revision
--------------DocumentCollection.pod
=head1 NAME
Text::DocumentCollection - a collection of documents
=head1 SYNOPSIS
=head1 DESCRIPTION
=head1 CLASS METHODS
=head2 new
The constructor; arguments must be passed as maps
from keys to values. The key C<file> is mandatory.
my $c = Text::DocumentCollection->new( file => 'coll.db' );
Documents from the collection are saved as in the specified file,
which is currently handled by a C<DB_File> hash.
=head1 INSTANCE METHODS
=head2 Add
Add a document to the collection, tagging it with
a unique key.
$c->Add( $key, $doc );
C<Add> C<die>s if the key is already present.
To change an existing key, use C<Delete> and then C<Add>.
=head2 Delete
Discard a document from the collection.
=head2 NewFromDB
Loads the collection from the given DB file:
my $c = Text::DocumentCollection->NewFromDB( file => 'coll.db' );
The file must be either empty or created by a former invocation
of C<new> or C<NewFromDB>, followed by any number of C<Add>
and/or C<Delete>.
Currently, all documents in the collection are revived
(by calling C<NewFromString>). This poses performance problems
for huge collections; a caching mechanism would be an option
in this case.
=head2 IDF
Inverse Term frequency of a given term.
The definition we used is, given a term I<t>, a set of documents
I<DOC> and the binary relationship I<has-term>:
IDF(t) = log2( #DOC / #{ d in DOC | d has-term t } )
The logarithm is in base 2, since this is related to an
information measurement, and # is the cardinality operator.
=head2 EnumerateV
Enumerates all the document in the collection. Called as:
my @result = $c->EnumerateV( \&Callback, 'the rock' );
The function C<Callback> will be called on each element
of the collection as:
my @l = CallBack( $c, $key, $doc, $rock );
where C<$rock> is the second argument to C<Callback>.
Since C<$c> is the first argument, the callback may be
an instance method of C<Text::DocumentCollection>.
The final result is obtained by concatenating all the
partial results (C<@l> in the example above). If you do
not want a result, simply return the empty list ().
There is no particular order of enumeration, so there
is no particular order in which results are concatenated.
=head1 AUTHORS
[EMAIL PROTECTED]
[EMAIL PROTECTED]
--
Andrea Spinelli - Software Architect
e-mail: [EMAIL PROTECTED]
phone: +39-035-636029
fax: +39-035-638129