I'm trying to understand Bogofilter better. I have been using it with so-so
success for about a year, but always by copying and pasting other people's
scripts from the internet. Now I'm attempting to read the docs and understand
them, but it's rather slow going:

In 'man bogofilter', under CLASSIFICATION OPTIONS, there is:
"The -R option tells bogofilter to output an R data frame in text form on the 
standard output. See the section on integration with R, below, for further 
detail."
and 'below' is:
"       The -R option tells bogofilter to generate an R data frame. The data 
frame contains one row per token analyzed. Each such row contains the token,
the sum of its database "good" and "spam" counts, the "good" count divided by
the number of non-spam messages used to create the training database, the
"spam" count divided by the spam message count, Robinson's f(w) for the token,
the natural logs of (1 - f(w)) and f(w), and an indicator character (+ if the
token's f(w) value exceeded the minimum deviation from 0.5, - if it didn't).
There is one additional row at the end of the table that contains a label in
the token field, followed by the number of words actually used (the ones with
+ indicators), Robinson's P, Q, S, s and x values and the minimum deviation.

The R data frame can be saved to a file and later read into an R session (see
the R project website[5] for information about the mathematics package R).
Provided with the bogofilter distribution is a simple R script (file bogo.R)
that can be used to verify bogofilter's calculations. Instructions for its use
are included in the script in the form of comments.
"

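To check that I am reading the quoted description correctly, here is a small
Python sketch of how I understand the per-token columns. It is only my
interpretation, not bogofilter's code: it assumes Robinson's usual formula
f(w) = (s*x + n*p(w)) / (s + n), with n the sum of the "good" and "spam" counts
and p(w) computed from the two per-message rates. The function names and the
default values of s, x and min_dev are placeholders I made up for illustration,
not bogofilter's actual defaults.

```python
import math


def robinson_fw(good, spam, ngood, nspam, s=1.0, x=0.5):
    """Robinson's f(w) for one token (my reading, not bogofilter's source).

    good, spam   -- the token's counts in the database
    ngood, nspam -- number of non-spam and spam messages used for training
    s, x         -- Robinson's s and x (placeholder values, see above)
    """
    n = good + spam
    good_rate = good / ngood if ngood else 0.0
    spam_rate = spam / nspam if nspam else 0.0
    total = good_rate + spam_rate
    if total == 0:
        # Token never seen: fall back to the assumed prior x.
        return x
    pw = spam_rate / total
    return (s * x + n * pw) / (s + n)


def indicator(fw, min_dev=0.1):
    """'+' if f(w) is further from 0.5 than min_dev, else '-'."""
    return "+" if abs(fw - 0.5) > min_dev else "-"


def token_row(token, good, spam, ngood, nspam, s=1.0, x=0.5, min_dev=0.1):
    """Build one row with the columns listed in the man page description."""
    fw = robinson_fw(good, spam, ngood, nspam, s, x)
    return (
        token,
        good + spam,
        good / ngood,
        spam / nspam,
        fw,
        math.log(1.0 - fw),
        math.log(fw),
        indicator(fw, min_dev),
    )
```

For example, token_row("viagra", 0, 10, 3500, 3500) would describe a token seen
ten times in spam and never in ham ("viagra" is just a made-up token). If this
reading is wrong, corrections are welcome.
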
I have processed some spam and ham to create a bogofilter database. I want to 
use the -R option to create the TEXT data frame and examine its contents.

I use the following:

$ bogofilter -R > bogo-rframe

This should, to my understanding, write a text file named bogo-rframe, but it
has been running for about an hour and shows no sign of terminating. What is
wrong? Please help.

There were about 3500 messages of spam and of ham, and the scoring took well
under a minute. Do I really need to use R to look at what is purported to be a
text file?

TIA
