Re: Ham Corpora

Marco Ribeiro Sat, 16 Oct 2010 07:06:39 -0700

> Well, you got that kind of backwards.
> The daily mass-check is to evaluate the SA rules' performance and
> accuracy, and to generate frequent re-scoring based on recent spam. For
> that, the rules already need to be part of the SA rule-set, so to speak,
> or at least under evaluation.


Thank you for clarifying that; I had indeed gotten it backwards.

> Again, the mass-check is done for re-scoring of the live rule-set and
> publishing new rules, pushed to the users via sa-update. The current
> infrastructure and workflow probably is not well-suited for experi-
> mentation, but that depends on the nature of your automated rule
> generation. Also, depending on the nature and amount of rules, it might
> impose a considerably increased load to the contributors.

Well, I could just evaluate my rules on my own corpora (once I can
find a good ham corpus) and then submit whichever rules work well to
SA. Since one of my ideas is to generate a ton of low-support but
high-confidence rules, that would probably increase the load on the
contributors considerably, as you said.
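
A minimal sketch of what such a local evaluation could look like,
assuming each candidate rule is a plain regular expression and each
corpus is a list of message bodies. The function names, thresholds,
and the HYP_* rule names in the usage note are hypothetical and not
part of SpamAssassin. Support is taken here as the fraction of all
messages that are spam and hit by the rule, and confidence as the
fraction of the rule's hits that are spam.

```python
import re
from dataclasses import dataclass


@dataclass
class RuleStats:
    name: str
    spam_hits: int
    ham_hits: int
    support: float     # spam hits / all messages (assumed definition)
    confidence: float  # spam hits / all hits (assumed definition)


def evaluate_rule(name, pattern, spam, ham):
    """Count how often one regex rule fires on local spam and ham."""
    rx = re.compile(pattern, re.IGNORECASE)
    spam_hits = sum(1 for msg in spam if rx.search(msg))
    ham_hits = sum(1 for msg in ham if rx.search(msg))
    total = len(spam) + len(ham)
    hits = spam_hits + ham_hits
    support = spam_hits / total if total else 0.0
    confidence = spam_hits / hits if hits else 0.0
    return RuleStats(name, spam_hits, ham_hits, support, confidence)


def select_rules(rules, spam, ham, min_confidence=0.99, min_spam_hits=1):
    """Keep rules that hit at least some spam and almost no ham."""
    stats = [evaluate_rule(n, p, spam, ham) for n, p in rules.items()]
    return [
        s for s in stats
        if s.spam_hits >= min_spam_hits and s.confidence >= min_confidence
    ]
```

Calling select_rules({"HYP_PILLS": r"\bpills\b"}, spam, ham) returns
the statistics of the rules that pass both thresholds; only those
would then be candidates for submission.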

Once again, thanks for making things clearer,
Marco Túlio Ribeiro

2010/10/15 Karsten Bräckelmann <[email protected]>
>
> Would it be possible to drop the HTML and use text/plain mail? :)
>
> On Fri, 2010-10-15 at 18:21 -0300, Marco Ribeiro wrote:
> > I'm sorry I wasn't clear. I am looking for downloadable ham corpora in
> > order to try to develop a way to find new rules in an automatic or
> > semi-automatic way.
>
> > After I generate new rules, I would need to test their accuracy
> > somehow, and the mass-check seems to be a good way. So I guess my
> > question about the mass-check is whether or not my rules will be
> > tested on others' corpora as well as on my own corpus.
>
> Well, you got that kind of backwards.
>
> The daily mass-check is to evaluate the SA rules' performance and
> accuracy, and to generate frequent re-scoring based on recent spam. For
> that, the rules already need to be part of the SA rule-set, so to speak,
> or at least under evaluation.
>
> The rules used for the mass-check run are in SVN. To commit rules there,
> one needs to be a committer to the SA project first.
>
> Again, the mass-check is done for re-scoring of the live rule-set and
> publishing new rules, pushed to the users via sa-update. The current
> infrastructure and workflow probably is not well-suited for experi-
> mentation, but that depends on the nature of your automated rule
> generation. Also, depending on the nature and amount of rules, it might
> impose a considerably increased load to the contributors.
>
> Anyway, without knowing some clear details first, we cannot even know if
> it might be possible.
>
>
> > I read that, but I wasn't sure whether or not it was a warning against
> > using others' corpora for purposes other than evaluating rules.  Thanks
> > for the clarification and for the quick reply.
>
> Well, not absolutely sure, but I believe most mass-check contributors
> are running it locally on their machines, and just upload the logs.
>
>
> --
> char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
> main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
> (c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
>
