Would it be possible to drop the HTML and use text/plain mail? :)
On Fri, 2010-10-15 at 18:21 -0300, Marco Ribeiro wrote:
> I'm sorry I wasn't clear. I am looking for downloadable ham corpora in
> order to try to develop a way to find new rules in an automatic or
> semi-automatic way.
> After I generate new rules, I would need to test their accuracy
> somehow, the mass check seems to be a good way. So I guess my question
> about the mass check is whether or not my rules will be tested on
> others' corpora as well as on my own corpus.
Well, you got that kind of backwards.
The daily mass-check is to evaluate the SA rules' performance and
accuracy, and to drive frequent re-scoring based on recent spam. For
that, the rules already need to be part of the SA rule-set, so to speak,
or at least under evaluation.
The rules used for the mass-check run are in SVN. To commit rules there,
one needs to be a committer to the SA project first.
Again, the mass-check is done for re-scoring of the live rule-set and
publishing new rules, pushed to the users via sa-update. The current
infrastructure and workflow are probably not well suited for
experimentation, but that depends on the nature of your automated rule
generation. Also, depending on the nature and number of rules, it might
impose a considerably increased load on the contributors.
Anyway, without knowing some clear details first, we cannot even know if
it might be possible.
> I read that, but I wasn't sure whether or not it was a warning against
> using others' corpora for means other than evaluating rules. Thanks
> for the clarification and for the quick reply.
Well, not absolutely sure, but I believe most mass-check contributors
are running it locally on their machines, and just upload the logs.
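
For a first look at generated rules on your own corpus, outside the
mass-check infrastructure, something along these lines might do. This is
only a rough sketch, not the mass-check tool and not its log format: the
function name hit_rates, the sample pattern and the tiny in-memory
corpora are all made up for illustration, and a candidate rule is
assumed to be a plain regular expression over the message text.

```python
import re

def hit_rates(pattern, ham, spam):
    """Return (ham_hit_rate, spam_hit_rate) for one regex.

    ham and spam are lists of message texts. A good spam rule should
    hit a large share of spam and (ideally) no ham at all.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    ham_hits = sum(1 for msg in ham if rx.search(msg))
    spam_hits = sum(1 for msg in spam if rx.search(msg))
    ham_rate = ham_hits / len(ham) if ham else 0.0
    spam_rate = spam_hits / len(spam) if spam else 0.0
    return ham_rate, spam_rate

# Made-up stand-ins for a local corpus (hypothetical data).
ham = ["Meeting moved to 3pm", "Patch attached, please review"]
spam = ["Cheap pills, buy now", "Buy now and win a prize", "Hello friend"]

ham_rate, spam_rate = hit_rates(r"\bbuy now\b", ham, spam)
print("ham hit rate: %.2f, spam hit rate: %.2f" % (ham_rate, spam_rate))
```

That only tells you how a rule behaves on your own mail, of course,
which is exactly the limitation the shared mass-check results are meant
to overcome.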
--
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}