Re: updates being published

2006-01-18 Thread Henry Stern
There's a problem with the updates_spamassassin_org.cf file. It contains:
include updates_spamassassin_org/MIRRORED.BY
include updates_spamassassin_org/languages
include updates_spamassassin_org/triplets.txt
include updates_spamassassin_org/user_prefs.template
The first file does not exist.
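A check for this kind of problem can be sketched in a few lines of Python. This is only an illustration, not part of the update tooling: the function name `missing_includes` is hypothetical, and it assumes that include paths are resolved relative to the directory holding the .cf file.

```python
from pathlib import Path


def missing_includes(cf_text, base_dir):
    """Return the include targets in cf_text that do not exist under base_dir.

    Assumption: include paths are relative to base_dir (the directory
    that holds the .cf file).
    """
    missing = []
    for line in cf_text.splitlines():
        parts = line.split(None, 1)
        # Only lines of the form "include <path>" are of interest here.
        if len(parts) == 2 and parts[0] == "include":
            target = parts[1].strip()
            if not (Path(base_dir) / target).exists():
                missing.append(target)
    return missing
```

Running it over the text of updates_spamassassin_org.cf before publishing would list any include line that points at a file missing from the update directory.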

Re: Nightly corpus run issues w/ hstern plugin

2005-12-27 Thread Henry Stern
Should be fixed now. I forgot to check whether str2time was able to parse the date that it was given. Can you check one of the messages that was generating the warning to verify? Cheers, Henry Theo Van Dinter wrote: I got 500K of: Use of uninitialized value in gmtime at
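The shape of that fix, checking the parser's result before handing it to gmtime, looks roughly like this in Python. It is an analogue only: the plugin itself is Perl and uses str2time, whereas this sketch uses the standard library's `email.utils.parsedate_tz`, and the function name `header_time` is hypothetical.

```python
import time
from email.utils import mktime_tz, parsedate_tz


def header_time(date_header):
    """Return a UTC struct_time for a Date header, or None if it cannot be parsed."""
    parsed = parsedate_tz(date_header)
    if parsed is None:
        # The parser could not make sense of the header; do not pass
        # an undefined value on to gmtime.
        return None
    return time.gmtime(mktime_tz(parsed))
```

Callers then have to handle the None case explicitly instead of triggering a warning for every unparseable date.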

Re: 3.0.5 rescoring

2005-12-01 Thread Henry Stern
I'd expect that the 700k message corpus will be more prone to errors than the 2M message corpus. It still might be good enough. I'm not convinced that rescoring (as opposed to putting in new rules) will do much for 3.0.5's accuracy. If people really want to go to the trouble of running the

Re: SA-Train (fwd)

2005-11-20 Thread Henry Stern
Hi Alexander, Does your implementation respect the additional constraints required by SpamAssassin? The constraints are as follows: 1. Only nice rules may have scores less than 0. 2. No rule may have a score above 5. Constraint 1 is required because it must be impossible for a spammer to add
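The two constraints stated above can be checked mechanically. The following Python sketch is an illustration only; the function name, the rule names in the usage note, and the data layout (a dict of scores plus a set of nice rule names) are assumptions, not anything taken from SpamAssassin's score generation code.

```python
MAX_SCORE = 5.0  # constraint 2: no rule may have a score above 5


def constraint_violations(scores, nice_rules):
    """Return (rule, reason) pairs for scores that break either constraint.

    scores: dict mapping rule name to score.
    nice_rules: set of rule names that are allowed to score below 0.
    """
    violations = []
    for rule, score in sorted(scores.items()):
        # Constraint 1: only nice rules may have scores less than 0.
        if score < 0 and rule not in nice_rules:
            violations.append((rule, "negative score on a non-nice rule"))
        # Constraint 2: no rule may have a score above 5.
        if score > MAX_SCORE:
            violations.append((rule, "score above 5"))
    return violations
```

For example, with the hypothetical rules `{"FAKE_NICE": -1.0, "FAKE_BAD": -0.5, "FAKE_BIG": 6.5}` and `nice_rules={"FAKE_NICE"}`, the function reports `FAKE_BAD` and `FAKE_BIG` and accepts `FAKE_NICE`.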

Re: SpamAssassin perceptron curiousity

2005-09-07 Thread Henry Stern
Most of this stuff is legacy code from the craig-evolve.c days. I didn't modify logs-to-c's output function. If it ain't broke, don't fix it. num_mutable is the number of mutable tests (instead of immutable tests). Thanks for your attention to detail. Henry Justin Mason wrote:

Mass-checks

2005-07-27 Thread Henry Stern
As far as I know, I am only waiting on one person's mass-check results. Unless you speak up before he uploads them, I'm going to start the score generation without you! ;) Henry

Re: Mass-checks

2005-07-27 Thread Henry Stern
Mass check submissions are closed. I won't be picking up any more. Thanks everyone! Henry Stern wrote: As far as I know, I am only waiting on one person's mass-check results. Unless you speak up before he uploads them, I'm going to start the score generation without you! ;) Henry

Re: NOTICE: rescore mass-checks

2005-07-24 Thread Henry Stern
I'm not sure what I'll *need* to make good scores. Last time around, the results were pants (--reuse was broken), so I don't have much to go on as far as numbers are concerned. Cheers, Henry On Wed, 20 Jul 2005, Justin Mason wrote: Theo Van

Re: PROPOSAL: create SpamAssassin Rules Project

2005-07-24 Thread Henry Stern
+1 On Tue, 19 Jul 2005, Daniel Quinlan wrote: I propose we create a Rules Project as a part of Apache SpamAssassin. Initially, the project will consist of the existing (empty) rules directory in Subversion (the CVS replacement used by the ASF). Each committer will have their own sandbox to

Re: mass-checks flux redux

2005-07-10 Thread Henry Stern
+1 Daniel Quinlan wrote: I propose we start over with the rules unzeroed (not a massively significant change, but I think it is helpful) and Michael's reuse patch so messages without X-Spam-Status will have non-realtime results (a more significant change). Please vote on this and we'll repeat

Take trunk/masses out of R-T-C mode

2005-07-09 Thread Henry Stern
(09:29:43) Henry: can we take masses/* out of R-T-C mode, since this is the rare time that it gets any attention?
(09:29:58) Daniel Quinlan: yes
(09:30:04) Daniel Quinlan: for MINOR changes ;-)
(09:30:52) Daniel Quinlan: post to dev@ about it and note my agreement
(09:31:01) Daniel Quinlan: just

Re: CEAS chat and a hackathon

2005-07-07 Thread Henry Stern
I'm making the trek across the pond! Henry Theo Van Dinter wrote: On Wed, Jun 22, 2005 at 04:28:41AM -0400, Duncan Findlay wrote: It'd be nice to get as many developers as possible in the same room - in fact it'll probably be a record. I think there'll be 5 of us in the area during CEAS?

Re: Question about perceptron scoring

2005-06-26 Thread Henry Stern
Sidney Markowitz wrote: As part of a term project I'm about to finish I've been looking at some aspects of the perceptron scoring we do and have some ideas for alternatives I would like to try. Can someone tell me how many email samples and how many rules typically go into the perceptron run

Re: [OT] Boosting and other potential research topics

2005-05-24 Thread Henry Stern
I was thinking of you when I wrote that. The open research question is: Can we find all the matches for n regexes in o(n^2+m)? Can we tell which of the component regexes have matched? Henry Scott A Crosby wrote: On Tue, 17 May 2005 14:01:09 +0100, Henry Stern [EMAIL PROTECTED] writes: 3
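For comparison, the obvious baseline scans the text once per regex. It answers the second question trivially (it knows exactly which component regexes matched), but its cost grows with the number of regexes times the length of the text, which is what the research question hopes to beat. A minimal Python sketch follows; the function name and the rule names in the usage note are hypothetical.

```python
import re


def matching_rules(patterns, text):
    """Return the sorted names of all patterns that match somewhere in text.

    patterns: dict mapping a rule name to a regex string.
    Each regex is searched separately, so the text is scanned once per rule.
    """
    compiled = {name: re.compile(pattern) for name, pattern in patterns.items()}
    return sorted(name for name, regex in compiled.items() if regex.search(text))
```

Given `{"HAS_URL": r"https?://", "HAS_DIGITS": r"\d{5}"}` and the text `"see http://example.com"`, only `HAS_URL` is reported.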

Re: [OT] Boosting and other potential research topics

2005-05-17 Thread Henry Stern
I've only just noticed this thread now. Sorry for the delay in response. -- Re: Boosting. I'm really not a fan of ensemble learning algorithms such as boosting and bagging. IMO, it is a hack used to prop up unstable learning algorithms such as ID3 and C5.0. What would be far more useful is an

Re: RFC: New subproject, BlogSpamAssassin

2005-01-31 Thread Henry Stern
Hello all, Sorry for the delay here. The list was created a few days ago, but I am in the middle of an overseas move. The list is [EMAIL PROTECTED] To subscribe, send e-mail to [EMAIL PROTECTED] I won't be able to participate much (if at all) for the time being but for an initial topic of

RFC: BlogSpamAssassin, proposal

2005-01-03 Thread Henry Stern
Hello everyone, I hope that you have all had safe and enjoyable holidays. My apologies for starting this discussion so close to Christmas. In all honesty, I had forgotten that Christmas was coming. To start things off, I propose that we create a sub-project of SpamAssassin consisting of mailing

Re: RFC: New subproject, BlogSpamAssassin

2004-12-30 Thread Henry Stern
I'm going to get back to work on this on January 2nd once my apartment is cleaned up from the NYE party. ;) Henry Dougal Campbell wrote: Harry wrote: As for starting a project, I think it would be good idea. I think there may be a cat herding issue though. So, any news on the cat-herding front,

Re: A Feature I've always wanted - Test for multiple hits on same rule

2004-12-27 Thread Henry Stern
I'd have to take this into account when optimising the scores. Then, since the scores would be optimised for multiple hits, spammers would only have to reduce the number of hits to evade SpamAssassin. It's the same reason why we use a Bernoulli event model in Bayes. Henry Marc Perkel wrote: This
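The point about the event model can be made concrete with a small Python sketch. Under a Bernoulli model a rule contributes its score if it hits at all, so the number of hits does not change the total. The function and rule names are hypothetical, and this is an illustration rather than SpamAssassin's actual scoring code.

```python
def total_score(hit_counts, scores):
    """Sum the scores of rules that hit at least once (presence only).

    hit_counts: dict mapping rule name to the number of times it hit.
    scores: dict mapping rule name to its score.
    """
    return sum(scores[rule] for rule, count in hit_counts.items() if count > 0)
```

With `scores = {"RULE_A": 1.5, "RULE_B": 2.0}`, a message that hits `RULE_A` once and a message that hits it seven times both score 1.5, so trimming repeated hits gains a spammer nothing.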

Re: RFC: New subproject, BlogSpamAssassin

2004-12-23 Thread Henry Stern
There is no permanent solution to email spam (not yet anyway) and I doubt there will be one for weblogs, it's an arms race ;) SA3 could go a Weblog spam is completely different from e-mail spam. The objective of the e-mail spammer is for you to read their message and respond quickly. The opposite

RFC: New subproject, BlogSpamAssassin

2004-12-22 Thread Henry Stern
as well as to other weblog software developers that I have missed. I look forward to collaborating with you in the future. Best regards, Henry Stern Committer, SpamAssassin

Re: RFC: New subproject, BlogSpamAssassin

2004-12-22 Thread Henry Stern
. Rather than porting SpamAssassin to weblogs, I'm suggesting that we take what we know from the spam e-mail domain and help to come up with a permanent solution to weblog spam. Henry Michael Parker wrote: On Wed, Dec 22, 2004 at 03:00:16PM -0400, Henry Stern wrote: I'm very interested to hear any

Re: Idea: New way to train Bayes

2004-12-06 Thread Henry Stern
Sidney Markowitz wrote: Nick Leverton said that papers he has seen found that learn on error always works better than learn everything. But I recall one that looked more carefully at longer term results and found that learn on error degrades over time. They found it best to retrain on fresh data
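The difference between the two training regimes is easy to show in code. The Python sketch below uses a deliberately crude token-count classifier as a stand-in; it is not SpamAssassin's Bayes implementation, and the names `TokenCounter` and `train` are hypothetical.

```python
from collections import Counter


class TokenCounter:
    """Toy classifier: compares how often tokens were seen in spam and in ham."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def predict(self, tokens):
        spam_votes = sum(self.spam[t] for t in tokens)
        ham_votes = sum(self.ham[t] for t in tokens)
        # Ties (including "never seen anything") default to ham.
        return "spam" if spam_votes > ham_votes else "ham"

    def learn(self, tokens, label):
        (self.spam if label == "spam" else self.ham).update(tokens)


def train(model, messages, on_error_only):
    """Train on (tokens, label) pairs; return how many messages were learned.

    With on_error_only=True, a message is learned only when the current
    model misclassifies it (learn on error). Otherwise every message is
    learned (learn everything).
    """
    learned = 0
    for tokens, label in messages:
        if on_error_only and model.predict(tokens) == label:
            continue
        model.learn(tokens, label)
        learned += 1
    return learned
```

Learn on error touches far fewer messages, which is why its database stays small; whether it holds up over time is the question the quoted paper examined.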

Re: Spam assassin corpus

2004-12-05 Thread Henry Stern
Hi Vaishnavi, I wrote a parser for the 12000 message SpamAssassin public corpus (http://spamassassin.apache.org/publiccorpus) based on SpamAssassin's Bayes code. If you would like to use it, you can download both the parser and a pre-tokenized corpus from

Re: proposal: an automated rule-qa system

2004-11-21 Thread Henry Stern
- (g) -- possibly -- do a quick perceptron run to evaluate if the rule overlaps with other rules too much. The perceptron won't tell us much about overlap, but I'm sure that I can come up with something to help out in that department... after I finish my thesis. Henry P.S. Writing a thesis is

Re: Possible large whitelist from DMOZ data

2004-10-06 Thread Henry Stern
Hi Jeff, You might want to reconsider your use of the entire DMOZ directory. There may be some subtrees that you can ignore. Of the 1338 DMOZ false positives, how many of them are from the same sections on DMOZ? Henry Jeff Chan wrote: Daniel Quinlan, one of the principal SpamAssassin architects

Re: Cluster analysis in Mac spam filter

2004-10-03 Thread Henry Stern
To the best of my knowledge, Apple Mail uses latent semantic analysis for clustering. I wrote a Slashdot comment about this a while back: http://slashdot.org/comments.pl?sid=108111&cid=9194254 Henry Sidney Markowitz wrote: I stumbled across this article