I'm trying to understand Bogofilter better. I have been using it with so-so
success for about a year, but always by copy-and-paste of other people's
scripts from the internet. Now I'm attempting to read the docs and understand
them, but it's rather slow going.
In 'man bogofilter', under CLASSIFICATION OPTIONS, there is:
"The -R option tells bogofilter to output an R data frame in text form on the
standard output. See the section on integration with R, below, for further
detail."
and 'below' is:
"The -R option tells bogofilter to generate an R data frame. The data frame
contains one row per token analyzed. Each such row contains the token, the sum
of its database "good" and "spam" counts, the "good" count divided by the
number of non-spam messages used to create the training database, the "spam"
count divided by the spam message count, Robinson's f(w) for the token, the
natural logs of (1 - f(w)) and f(w), and an indicator character (+ if the
token's f(w) value exceeded the minimum deviation from 0.5, - if it didn't).
There is one additional row at the end of the table that contains a label in
the token field, followed by the number of words actually used (the ones with
+ indicators), Robinson's P, Q, S, s and x values and the minimum deviation.
The R data frame can be saved to a file and later read into an R session (see
the R project website[5] for information about the mathematics package R).
Provided with the bogofilter distribution is a simple R script (file bogo.R)
that can be used to verify bogofilter's calculations. Instructions for its use
are included in the script in the form of comments.
"
I have processed some spam and ham to create a bogofilter database. I want to
use the -R option to create the TEXT data frame and examine its contents.
I use the following:
$ bogofilter -R > bogo-rframe
This should, to my understanding, write a text file named bogo-rframe, but it
has been running for about an hour and shows no sign of terminating. What is
wrong?
Please help.
There were about 3500 messages of spam and of ham, and the scoring took well
under a minute. Do I really need to use R to look at what is purported to be a
text file?
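If the data frame really is plain text, I would have expected ordinary shell
tools to be enough. Here is a rough sketch of what I have in mind. The column
order is only my reading of the man page text quoted above, the sample rows
are invented (they are not real bogofilter output), and the function names
sample_frame and used_tokens are my own. A real file may well have a header
line, and the final summary row has a different layout that this ignores.

```shell
# Invented sample in the column order I assume from the man page:
# token, good+spam count, good fraction, spam fraction, f(w),
# ln(1 - f(w)), ln(f(w)), indicator (+ or -).
sample_frame() {
cat <<'EOF'
viagra 40 0.001 0.110 0.9 -2.302585 -0.105361 +
the 900 0.500 0.500 0.5 -0.693147 -0.693147 -
meeting 35 0.090 0.001 0.1 -0.105361 -2.302585 +
EOF
}

# Print the token and f(w) for rows whose last field is "+",
# i.e. the tokens that were actually used in the score.
used_tokens() {
    awk '$NF == "+" { print $1, $5 }'
}

sample_frame | used_tokens
```

With a real file, I would expect to run `used_tokens < bogo-rframe` instead of
piping in the sample.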
TIA