Re: svn commit: r169334 - in /spamassassin/trunk: MANIFEST lib/Mail/SpamAssassin/Conf.pm lib/Mail/SpamAssassin/HTML.pm lib/Mail/SpamAssassin/PerMsgStatus.pm lib/Mail/SpamAssassin/Plugin/URIDNSBL.pm lib/Mail/SpamAssassin/Util.pm rules/20_uri_tests.cf t/uri.t t/uri_html.t

2005-05-09 Thread Sidney Markowitz
Theo Van Dinter wrote:
 Sorry to be a killjoy here.

I have no problem with the criticism, but I think I've hit the end of what
I'm going to do on this one now that it's working without breaking anything.
I'm running out of time for some schoolwork that's due in a month and will
have to concentrate on that.

The changes you are talking about are about cleaner design, and I'm +1 for
that. And now is a good time to do it, while the issues are fresh, so that
it doesn't become some awkward design embedded in old code. But I won't be
making those changes. We're still in CTR mode in 3.1, right? I was acting
like it was RTC on this one because it felt right to get feedback first when
it looked like it would require some changes to the object design, but I
don't think it needs a lot of discussion for the last cleanup, so go to it.

 -- sidney


Re: proposal for release

2004-09-22 Thread Sidney Markowitz
Daniel Quinlan said:
 I propose that we make a 3.0.0 release

Are we going to be able to close bug 3675 first (he asks innocently, after
having made trouble on reaching consensus on that very bug :-) )? That's
the only one left marked with a 3.0 target.

 -- sidney




Re: proposal for release

2004-09-22 Thread Sidney Markowitz
Daniel Quinlan said:
 Yes.  It's 4-to-3 in favor of orange

+1 for 3.0 release!




Re: [SpamAssassin Wiki] Updated: InstallingOnWindows

2004-09-23 Thread Sidney Markowitz
  (Note: someone may want to address
 [http://it.slashdot.org/comments.pl?sid=122734&cid=10320250 these
 complaints] about this document.)

I did post a response in
http://it.slashdot.org/comments.pl?sid=122734&cid=10325272 [Anyone got
some spare mod points? :-) ].

There is one issue I missed and I would like someone who can install
SpamAssassin on a Windows machine to confirm something for me as I am
temporarily Windows-deprived while my laptop is being repaired.

The slashdot post complains about the complexity of the steps the Wiki
page lists for generating the HTML doc files.

Near as I can tell

 nmake text_html_doc

should be all that is required and would work under Windows.

Can someone please verify that and then we can update the Wiki?

Thanks,

 -- sidney




Cluster analysis in Mac spam filter

2004-10-03 Thread Sidney Markowitz
I stumbled across this article
http://www.macdevcenter.com/pub/a/mac/2004/05/18/spam_pt2.html
while Googling around for anything that relates cluster analysis 
techniques to spam filtering.

This may be old knowledge to some people here, but was new to me. 
Apparently the trainable spam filter in Apple's Mail program does not 
use the Bayesian approach that we are familiar with. It uses a cluster 
discovery tool that was developed for document search and retrieval.

It would be interesting to compare this approach to Bayes. I'm also 
curious if this provides some hints about using some techniques from 
bioinformatics (as Justin referred to in a recent message to this list) 
such as UPGMA cluster analysis( http://www.nmsr.org/upgma.htm ).

 -- sidney


Re: Cluster analysis in Mac spam filter

2004-10-03 Thread Sidney Markowitz
Henry Stern wrote:
 Apple Mail uses latent semantic analysis
 for clustering

That sounds right. Some people there were looking at that for document
retrieval when I worked at Apple Research in the mid-90's.

By the way, have you seen the work applying case-based reasoning to
spam filtering? There are two articles on that at

http://www.cs.tcd.ie/publications/tech-reports/tr-index.04.html
with a bit more at the home page of one of the authors:
http://www.comp.dit.ie/sjdelany/

I've been thinking about whether there might be benefit in making
finer distinctions than just spam or not-spam, by clustering into
perhaps spam topics. Why should the characteristics for porn spam, 
multilevel marketing spam, Nigerian 419, etc., be combined? Would there 
be benefit from making their differences explicit?

 -- sidney


Re: Cluster analysis in Mac spam filter

2004-10-03 Thread Sidney Markowitz
Henry,

In the paper "An Assessment of Case-Based Reasoning for Spam Filtering"
 http://www.comp.dit.ie/sjdelany/publications/AICS%202004%20(crc).pdf
the authors compare CBR and naive Bayes (NB), with one conclusion (on
their test data, with their implementation of NB) that daily updating of 
the training data using misclassified mails caused an improvement in FPs 
but a degradation in FN rate that led to an overall negative effect on 
their measure of performance.

How does that compare to your results on the effect of training and 
learn on error vs learn on everything?

If CBR does end up better than NB when used with learn on error, that is 
an advantage in terms of computational resources required.

 -- sidney


Re: Cluster analysis in Mac spam filter

2004-10-03 Thread Sidney Markowitz
Sidney Markowitz wrote:
 caused an improvement in FPs
 but a degradation in FN rate

Typo - I left out mention that the result was using NB, and not using CBR.

 -- sidney


Re: reporting to spamcop

2004-10-05 Thread Sidney Markowitz
andrew collier wrote:
 i have the following problem when reporting spam

This mailing list is used by SpamAssassin developers to discuss ongoing
development work on SpamAssassin. Your question has nothing to do with that.

Your question is appropriate for the SpamAssassin users mailing list 
(see the SpamAssassin wiki article 
http://wiki.apache.org/spamassassin/MailingLists )

Be sure to search the list archives and the wiki for an answer before 
you post your question. You get an answer faster by finding it already 
posted than by asking it again, if the answer is already available.

If you have identified a bug in SpamAssassin (which is not in evidence 
in the message you posted) the appropriate action is to confirm it on 
the SpamAssassin users mailing list and by searching the SpamAssassin 
wiki and the Bugzilla database, then report it there.

 -- sidney


Re: limit on number of URIs decoded?

2004-10-13 Thread Sidney Markowitz
Justin Mason wrote:
 The first fix is truncation of the text before passing to TextCat.
 Michael, I think you were looking at this?  the results are impressive,
 if the text is truncated to 32k bytes:

It was me. I've been looking at ways to not have to create so much
garbage (I'm a lisp hacker -- I'm not using the word in the pejorative
sense) in that loop in create_lm, but the simplest way of dealing with
this is to truncate $input to perhaps 10,000 bytes in the call to
create_lm. Since TextCat is just a heuristic for determining the 
language and there is no incentive for spammers to, for example, prefix 
a Spanish language message with 10,000 bytes of English words just to 
slip through the spam filters of English-only speakers, the first 10,000 
bytes is plenty as a limit. Language recognition accuracy does not 
improve noticeably past one or two thousand characters, while going to 
less than 10,000 does not provide much additional speed or memory 
benefit. If there is no real language text in the first 10,000 
characters of rendered body, then it will not be recognized as any 
language and the rule will not fire, failing safely.

I propose putting in the truncate for 3.0.1 as a quick and safe way
around the problem we saw with that malformed MIME message. I'll keep
playing with the loop just in case I can speed it up enough for the 3.1 
time frame to not have to truncate, but we should do the quick fix right 
away.
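
A minimal sketch of the quick fix proposed above, written in Python for illustration rather than in SpamAssassin's Perl. The function name and constant are hypothetical; only the 10,000-byte limit comes from the message.

```python
# Hypothetical names; the 10,000-byte limit is the one proposed in the message.
MAX_LANG_INPUT_BYTES = 10_000


def truncate_for_language_guess(text: bytes, limit: int = MAX_LANG_INPUT_BYTES) -> bytes:
    """Return at most `limit` leading bytes of the rendered body.

    The language guesser only needs a sample, so everything past the
    limit is dropped before the expensive n-gram model is built.
    """
    return text[:limit]


if __name__ == "__main__":
    body = b"hola mundo " * 5000
    sample = truncate_for_language_guess(body)
    print(len(body), len(sample))
```

The cut is made once, before the costly step, so the memory used by the language model is bounded regardless of message size.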

 -- sidney


Re: svn commit: rev 54716 - in spamassassin/trunk: . t

2004-10-14 Thread Sidney Markowitz
 Added: spamassassin/trunk/t/memory_cycles.t

I just noticed this now while trying to make test on a machine that
doesn't have Devel::Cycle. Is that going to be a documented requirement now?

 -- sidney


Re: svn commit: rev 54716 - in spamassassin/trunk: . t

2004-10-14 Thread Sidney Markowitz
Justin Mason wrote:
 the test should be a no-op without that module; did that not work?

This is extracted from output of make test, running under Cygwin with
perl 5.8.5

t/memory_cycles.Can't locate Devel/Cycle.pm in @INC (@INC 
contains:
t . ../blib/lib /c/sasvn/trunk/blib/lib /c/sasvn/trunk/blib/arch 
/usr/lib/perl5/5.8.5/cygwin-thread-multi-64int /usr/lib/perl5/5.8.5 
/usr/lib/perl5/site_perl/5.8.5/cygwin-thread-multi-64int 
/usr/lib/perl5/site_perl/5.8.5 /usr/lib/perl5/site_perl 
/usr/lib/perl5/vendor_perl/5.8.5/cygwin-thread-multi-64int 
/usr/lib/perl5/vendor_perl/5.8.5 /usr/lib/perl5/vendor_perl) at 
t/memory_cycles.t line 66.
BEGIN failed--compilation aborted at t/memory_cycles.t line 66.

 -- sidney


Re: [Query] Whitelist

2004-10-14 Thread Sidney Markowitz
ratan kamath wrote:
 Query: If a mail arrives [...]

This mailing list is used by SpamAssassin developers to discuss ongoing
development work on SpamAssassin. Your question has nothing to do with that.

Your question is appropriate for the SpamAssassin users mailing list 
(see the SpamAssassin wiki article 
http://wiki.apache.org/spamassassin/MailingLists )

Be sure to search the list archives and the wiki for an answer before 
you post your question. You get an answer faster by finding it already 
posted than by asking it again, if the answer is already available.



Help with bug 3917

2004-10-27 Thread Sidney Markowitz
Fred,
I noticed you mentioned in a bug comment about getting some information 
using Ethereal. If you are also running Cygwin, could you help a bit 
with bug #3917? I'm stuck because of some firewall issues that I have 
not yet tracked down on the home machine where I can test.

What I'm trying to do is get a network capture of the problem to see 
what exactly is failing when there is the protocol error. I have shown 
that running the test case using spamd on Cygwin and spamc on another 
box (which could be linux) will demonstrate the problem. Unfortunately, 
Ethereal (or anything using winpcap) will not capture anything when the 
client and server are on the same machine, and I can only get my 
machines to talk through an ssh tunnel which prevents sniffing.

So if you have Cygwin and another box and the time and can reproduce the 
problem, that would help.

Thanks,
 -- sidney


Re: ?

2004-11-02 Thread Sidney Markowitz
Alexandr Orlov wrote:
 X-Spam-Status: SpamAssassin Failed

That header does not appear anywhere within the SpamAssassin source code.
Googling for that exact header showed up a number of messages with it, 
all spam. At first I thought it must be a fake header added by some 
spammers to try to fool SpamAssassin, but it always appears at the top 
of the mail, after an Envelope-To header and before the first Received 
header. I don't see how a spammer could place a header there.

Check with a sysadmin for the mail server from which you receive mail to 
see if they add that header and why.

It might make a good spam sign... Does anyone here see the header in 
corpus mail?

 -- sidney


Re: svn commit: r106170 - /spamassassin/trunk/spamd/spamd.raw

2004-11-22 Thread Sidney Markowitz
Daniel Quinlan wrote:
 Please try to use the more standard perl formatting:

Do you see anything wrong other than two of the lines being more than 80
characters? I'll check in an update to fix that as soon as I finish
running a make test on the change.

 -- sidney


Re: svn commit: r106170 - /spamassassin/trunk/spamd/spamd.raw

2004-11-22 Thread Sidney Markowitz
Justin Mason wrote:
 Sidney -- I think it's the
 foo( bar )
 vs.
 foo(bar)

I prefer that too. I copied the style that was already in the code, and
I looked for something about that in the style guide and did not see any 
mention of it one way or the other. Unless it is there and I missed it, 
you or Daniel should add something about it on the wiki page.

 -- sidney


Re: svn commit: r106173 - /spamassassin/trunk/spamd/spamd.raw

2004-11-22 Thread Sidney Markowitz
Daniel Quinlan wrote:
 Heh, I was mostly talking about the paren style, actually, not the line
 length (although now that you mention it).

There are a few hundred spaced parens in spamd.raw. I'll fix the lines I
changed if you want, but if it's ok with you I won't do a massive edit 
of the file.

Or I can just keep it in mind for the next time I check something in.
 -- sidney


Re: svn commit: r106170 - /spamassassin/trunk/spamd/spamd.raw

2004-11-22 Thread Sidney Markowitz
Daniel Quinlan wrote:
    *   No space between function name and its opening parenthesis.

I did see that. That would allow foo( bar ) which is what I did. If you
want foo(bar) as a preferred style it would have to be added to the wiki 
page.

 -- sidney


Re: svn commit: r106600 - /spamassassin/trunk/t/SATest.pm

2004-11-26 Thread Sidney Markowitz
I just tried a quick build and make test in Windows XP to see what it 
would do, and

1. I could not reach the svn server from svn, although I could ping it. 
Is it down?

2. I got lots and lots of
 Use of uninitialized value in concatenation (.) or string at 
..\lib/Mail/SpamAssassin/ArchiveIterator.pm line 1023.

3. I realized that I would not be able to test the use of netstat anyway 
because Windows does not run spamd. You can set environment variables to 
tell the spamc tests to assume that spamd is already running on some ip 
address and port, but that isn't relevant to this issue.

 -- sidney


Re: svn commit: r106600 - /spamassassin/trunk/t/SATest.pm

2004-11-26 Thread Sidney Markowitz
The error message from ArchiveIterator.pm is because Windows does not 
define $HOME environment variable by default. It has $HOMEDRIVE and 
$HOMEPATH which together serve the same purpose. The code in
ArchiveIterator.pm has to be changed to check for Windows, or else we 
can document the need to set a $HOME. Do we use $HOME anywhere else?

I just found it because I used to have HOME defined in my XP environment 
for other reasons.

 -- sidney


Re: svn commit: r106600 - /spamassassin/trunk/t/SATest.pm

2004-11-26 Thread Sidney Markowitz
Malte S. Stretz wrote:
 What does getpwuid() say on Windows?

Not implemented :-)

You can't use getpwuid in Windows. The usual portable implementation
checks for running under Windows and uses $ENV{'HOMEDRIVE'} . 
$ENV{'HOMEPATH'} if it is instead of $ENV{'HOME'}, being careful about 
the former using '\' separators instead of '/'.

 -- sidney


Re: svn commit: r106600 - /spamassassin/trunk/t/SATest.pm

2004-11-26 Thread Sidney Markowitz
Malte S. Stretz wrote:
 So maybe we should add a M::SA::Util::get_home() which first
 tries $ENV{HOME}, then on Windows $ENV{HOMEDRIVE}\$ENV{HOMEDIR}, then
 portable_getpwuid()[7], then... foo?

portable_getpwuid() doesn't seem to do anything useful under Windows for
this purpose and shouldn't be needed anyway. It just returns 'unknown' 
for the name, which works when you don't care about the actual user name.

The first two steps are fine, and probably enough, except that you would 
  not have to add the '\' separator, it is already in HOMEDIR.
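
The first two steps can be sketched as follows. This is an illustration in Python, not the Perl that would go into Mail::SpamAssassin::Util; the function signature is hypothetical, and converting backslashes to forward slashes is only one assumed way of handling the separator question.

```python
import os


def get_home(env=None, is_windows=None):
    """Return a home directory string, or None if none can be determined.

    Step 1: use HOME if it is set.
    Step 2: on Windows, fall back to HOMEDRIVE + HOMEPATH.
    """
    env = os.environ if env is None else env
    if is_windows is None:
        is_windows = os.name == "nt"

    home = env.get("HOME")
    if home:
        return home

    if is_windows:
        drive = env.get("HOMEDRIVE", "")
        path = env.get("HOMEPATH", "")
        if drive or path:
            # HOMEPATH normally begins with a backslash, so the two values
            # are concatenated without adding a separator.
            # Assumption: callers prefer forward slashes.
            return (drive + path).replace("\\", "/")

    return None


if __name__ == "__main__":
    print(get_home({"HOMEDRIVE": "C:", "HOMEPATH": "\\Users\\sidney"}, is_windows=True))
```

Passing the environment and platform flag as parameters keeps the lookup testable on any machine.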

Question: In ArchiveIterator.pm does everything work if that is what it 
uses for HOME or does anything have to be done to convert \ to / ?

 -- sidney


Re: svn commit: r106600 - /spamassassin/trunk/t/SATest.pm

2004-11-26 Thread Sidney Markowitz
Malte S. Stretz wrote:
 oops :)  But I'm glad you didn't notice my HOMEx debugging glitch :)

I did, but I understood what it was for :-)

I spoke too soon about it working. When I add a -w to the perl command
it barfs in catpath, I think because it expects to be passed all three 
arguments, volume, dir, and file. I'll try adding a third argument of '' 
and see what it does. Or I could try reading the doc on catpath first :)

 -- sidney


Re: MIT spam conference

2004-11-27 Thread Sidney Markowitz
Daniel Quinlan wrote:
 [EMAIL PROTECTED] (Justin Mason) writes:
  CFP ends in 4 days though.

 If the trend in conference quality continues

Oh, then I _do_ have time to design, research, write, and propose a
paper for it!

 -- sidney :-)


Re: Can anyone here write some plain English?

2004-12-06 Thread Sidney Markowitz
Loren Wilton wrote:
 Doesn't the free VC install include nmake?  The normal one does.

No, that's the problem. No nmake, no winsock.h, necessitating two more
big downloads in addition to the free toolkit.

 The DDK also includes Nmake, and a considerably newer version than what

Well, I guess that can be mentioned as yet another alternative if it is
more practical to order a CD than to download a few hundred megabytes. 
At some point it may be too much information for a readme file and more 
appropriate for a page in the wiki.

 -- sidney


Re: Idea: New way to train Bayes

2004-12-06 Thread Sidney Markowitz
 Any comments?  Interest in co-authoring a research paper (*poke*,

I might have some ideas about it... especially if it could be related to
classification of cancer cells based on microarray gene expression data :-)

Now I have something to think about on my ferry commute this morning.
 -- sidney
p.s. Finish your thesis first :-)


signature.asc
Description: OpenPGP digital signature


Re: Idea: New way to train Bayes

2004-12-06 Thread Sidney Markowitz
Nick Leverton said that papers he has seen found that learn on error 
always works better than learn everything. But I recall one that looked 
more carefully at longer term results and found that learn on error 
degrades over time. They found it best to retrain on fresh data every 
few months. (I don't have the reference handy).

That makes sense if you consider that spam (and possibly ham) patterns 
change over time, even more so to the degree that spam patterns are 
actively adapting to try to beat spam filters.

BTW, at least one spam learning filter I've seen reduces its memory 
requirements by using a small hash size (like 32 bits) for representing 
tokens. Such systems will show poorer results for learn everything 
compared to learn on error simply because of collision effects once they 
learn too many tokens.
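
The collision effect can be estimated with a standard balls-into-bins calculation. The sketch below assumes tokens hash uniformly at random; it is a rough model, not a measurement of any particular filter, and the function name is hypothetical.

```python
import math


def expected_collisions(n_tokens: int, hash_bits: int) -> float:
    """Expected number of tokens that land in an already-occupied slot.

    With m = 2**hash_bits slots, the expected number of occupied slots
    after n insertions is m * (1 - (1 - 1/m)**n). Tokens beyond that
    count share a slot with some earlier token.
    """
    m = 2.0 ** hash_bits
    # expm1/log1p keep the result accurate when 1/m is tiny.
    occupied = -m * math.expm1(n_tokens * math.log1p(-1.0 / m))
    return n_tokens - occupied


if __name__ == "__main__":
    for n in (10_000, 1_000_000, 100_000_000):
        print(n, expected_collisions(n, 32))
```

For small loads the estimate grows roughly with the square of the token count, which is why a filter that learns everything reaches the collision regime much sooner than one that learns only on error.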

What I haven't seen discussed is the effect of token expiration as is
done in SpamAssassin. Wouldn't that produce the same effect as periodic
retraining, thereby allowing learn on everything to work well? Doesn't 
that prevent the problems of converging to a mean and slowing down the 
learning? How does the effect of token expiration compare to the use of 
back-propagation?

 -- sidney




Re: YOU ARE ON THE WAY TO DESTRUCTION

2004-12-16 Thread Sidney Markowitz
Daniel Quinlan wrote:
 Bugzilla says we can release 3.0.2 so I therefore propose we release 3.0.2.

+1!
 -- sidney
http://www.sidney.com




Re: buildbot failure in [...]

2004-12-17 Thread Sidney Markowitz
Justin Mason wrote:
 (b) however the -parker- and -sidney- ones *are* getting annoying. ;)  I
 suggest we turn off those slaves until we can figure out how to get
 buildbot to work with dynamic-IP slaves...

I'm running three slaves on one machine, two of them on the same VMWare
virtual machine and one running native. Most of the time they do not 
generate errors. I have a static ip. The problem cannot be that buildbot 
doesn't work with such a configuration, or else it would never work.

I wonder if svn has trouble with all the clients trying to run at the 
same time on the same physical machine.

 -- sidney


Re: buildbot failure in [...]

2004-12-17 Thread Sidney Markowitz
Justin Mason wrote:
 Sidney, have you tried setting  --keepalive=300

I'll try that. What Michael says does make sense. I'm behind a NAT.
Is there a way of setting a port that the slave listens on? I can 
configure my NAT to let the slaves be designated servers on some port if 
I can make it a fixed port and assign a different port number to each of 
them. I'm sure if it is possible I could find it by RTFM, but I have not 
had a lot of time to learn about buildbot and twistd.

By the way I have to call twistd directly instead of buildbot in order 
to get everything to work in Cygwin and Win32. They need the -n option 
in order to run, and in Win32 I have to give it the -r win32, which I 
would have expected to be automatic when running a win32 buildbot.

Cygwin command: twistd -l - -n -f ../buildbot.tap
Win32 command:  twistd -l - -n -r win32 -f ..\buildbot.tap
 -- sidney


Re: buildbot failure in [...]

2004-12-17 Thread Sidney Markowitz
Justin Mason wrote:
 might be worth signing up to buildbot-devel (it's very low traffic)
 and mention that...

I'm going away on holiday soon for a couple of weeks. I'll look at that
after I come back. There may be some issues to work out if I'm going to 
test with their latest cvs version and that's not what our server is 
running, and I won't have time for it before I leave.

I did find where to stick the --keepalive option. Buildbot doesn't take 
it, so I hardcoded it at the end of the mktap command line that is put 
together in runner.py.

Cygwin and Win32 are running with it now. I'll restart the Fedora Core 3 
one as soon as I finish a system update I'm doing on that machine right now.

 -- sidney


Make test failure in SPF test

2004-12-18 Thread Sidney Markowitz
I'm seeing the following in make test in the spf test. It doesn't show 
in the buildbot test because they skip SPF. (As an aside, why do they 
skip it?)

$ t/spf.t
1..2
# Running under perl version 5.008005 for cygwin
# Current time local: Sun Dec 19 09:49:57 2004
# Current time GMT:   Sat Dec 18 20:49:57 2004
# Using Test.pm version 1.25
/usr/bin/perl -T -w ../spamassassin -C log/test_rules_copy 
--siteconfig path log/localrules.tmp -p log/test_default.cf  -t  
data/nice/spf1
Checking helo_pass
Not found: helo_pass =  SPF_HELO_PASS
not ok 1
# Failed test 1 in t/SATest.pm at line 549
Checking pass
Not found: pass =  SPF_PASS
not ok 2
# Failed test 2 in t/SATest.pm at line 549 fail #2

 -- sidney
http://www.sidney.com




Re: buildbot failure in [...]

2004-12-18 Thread Sidney Markowitz
I had a power glitch here which rebooted the server. I think it happened
in the middle of the svn update, causing all three slave jobs to fail. I'm
not going to bother to bring the buildbot slaves online again before I
leave on holiday. The keepalive may have kept them going ok before the
power failure, but it was too short a timespan to be sure.

 -- sidney


Re: Unsuscribe?

2004-12-21 Thread Sidney Markowitz
William Holman wrote:
 I've been over-ruled by those who pay the bills, so I can't use
 SpamAssassin since it's open source

What bills? -- It's open source! :-)

If you look at the SpamAssassin wiki you can find a list of products
that are based on SpamAssassin that your billpayers can feel happy 
paying for while not getting access to all the source code. That way 
they can continue to be blissfully ignorant suckers and you can still 
use SpamAssassin. (I don't mean to reflect on the commercial products, 
only on the apparent attitude of your bill payers).

http://wiki.apache.org/spamassassin/CommercialProducts

 How do I unsubscribe from the lists?

If you view the headers of any email on these mailing lists you will see
a header like this one on this list:

list-unsubscribe: mailto:[EMAIL PROTECTED]

Send an email to that address from the address that you want to
unsubscribe and it will be done. Subject and body text are ignored.

 -- sidney
http://www.sidney.com/


Re: Target Milestone of Future is harmful

2005-01-10 Thread Sidney Markowitz
Justin Mason said:
 It's a manageability thing.

Another way to look at it is that the only person you can be certain has an
interest in a new bug report is the one who submitted it. If the target
milestone is still Future and you feel strongly about it, then it is up
to you to evangelize the bug report until a developer is convinced to
change the milestone. Submitting a patch is one of the most effective
ways to do that.

I think that a bug remaining with target Future is more symptom than
cause. It allows bug reports that appear not to be important to address
right now to not get lost while still being mostly ignored. Anyone who
cares about a specific bug report should speak up and make their case for
it.

 -- Sidney Markowitz
http://www.sidney.com




Re: SURBL whitelist volume chicken-egg problem

2005-01-15 Thread Sidney Markowitz
Robert Menschel said:
 Don't drop the 125, but simply add to the
 whitelist a number of the new top 100

I like that idea as a second choice. If the list is only updated when
there is a new release of SpamAssassin then it will not grow too rapidly.
It would be quite a few years to get to a table of a thousand entries.

There would be a minor problem with a whitelisted domain expiring and
getting snapped up by a spammer. That could be taken care of by the SURBL
people checking if a domain that is being added to the SURBL is on the
whitelist and informing the SA team so it can be removed.

But it is only my second choice, if there is no way for SURBL to monitor
domains in email independent of the SA queries. I like the idea of them
getting feeds from ISPs like the one sonic.net offered. That way they can
maintain a current list of most common domains in ham mail independent of
the SpamAssassin release cycle. SpamAssassin could download the list more
or less often depending on how volatile the list is. My guess is that
monthly is fine, as that is much better than once per SA release cycle.

 Sidney Markowitz
 http://www.sidney.com




Re: SURBL whitelist volume chicken-egg problem

2005-01-15 Thread Sidney Markowitz
Jeff Chan said:
 There are a number of reasons for not doing a whitelist RBL:

 1.  Excessive queries:  Whitehat domains come up a lot
 in messages.

I was thinking along the lines of something that SpamAssassin downloads
once a month, or queries to find out if there is an update available and
only downloads if there is. Since the idea is to limit DNS queries, of
course it would not be implemented as a DNS-based whitelist that is
checked for every URI. It could be stored on a DNS if you could trust
people not to misuse it, but it must be designed for infrequent downloads
in bulk, with queries of URIs done to a local database.

 2.  Potential misuse:  Inadvertently blacklisting whitehats,
 i.e. user error.

If it is separate enough from the blacklist, i.e., it is queried and used
in a totally different way than a DNS query of each URI domain, then I
don't see much potential for misuse. You simply have a list of the top n
non-spam domains that can be downloaded in bulk and document how to do it
and that it is to be used to reduce the number of DNS queries.
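
A minimal sketch of that usage, in Python for illustration only. The function name and the example domains are hypothetical; the point is that the bulk-downloaded list is consulted locally and a match only suppresses the DNS query.

```python
def domains_to_query(uri_domains, skip_list):
    """Return the URI domains that still need a blocklist DNS lookup.

    `skip_list` is the locally stored copy of the bulk-downloaded list of
    common non-spam domains. A match means "do not bother querying";
    it says nothing about whether the message is spam or ham.
    """
    skip = {d.lower() for d in skip_list}
    return [d for d in uri_domains if d.lower() not in skip]


if __name__ == "__main__":
    local_list = ["example.com", "example.org"]       # hypothetical entries
    found_in_mail = ["Example.com", "unknown.example"]
    print(domains_to_query(found_in_mail, local_list))
```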

 3.  Possibility of negative scoring:  Some application would
 probably try to negative score them

SpamAssassin would not do it. You would not encourage that. Your
documentation would make it clear that it is a list of domains not to
bother DNS querying that do not indicate either spam or ham when they
appear in an email. Even if some misguided programmer missed all that, I
don't see how it would be in a mainstream popular antispam program with
enough use to affect spammers' behavior.

 Sidney Markowitz
 http://sidney.com




Re: SURBL whitelist volume chicken-egg problem

2005-01-17 Thread Sidney Markowitz
Daryl C. W. O'Shea said:
 The emails generated could be used to calculate
 the domains most often seen.

I would be afraid of it being too easy for malicious people to hack by
sending in false data, DoS attacks on the email addresses, etc. Also
there is no reason to load down some email address with data from
everyone who is running SpamAssassin. Feeds from a few large ISPs would
be accurate enough for the purpose and more trustworthy.

 Sidney Markowitz
 http://www.sidney.com




spamassassin svn server appears to be down

2005-02-07 Thread Sidney Markowitz
Is anyone else seeing problems accessing the SpamAssassin svn? I can't
connect to the server using svn from my machine and
http://svn.apache.org/viewcvs.cgi/spamassassin/trunk/?root=Apache-SVN
does not respond either. Ping works.

 -- sidney




Re: [Bug 4124] New: New spamassassin script doesn't work due to tainting

2005-02-16 Thread Sidney Markowitz
Malte S. Stretz wrote:
 Ok, I added some code for this in r153131.  Could you please test it (just 
 do a 'make clean; make'), especially on Windows?

Ok, my Windows machine is working and the disk is mostly restored now,
and I found the right thread to report this...

Malte, the current makefile is broken for Windows in one place.

Line 1152 of Makefile.PL has

ifeq "$(INSTALLDIRS)" "site"
INSTALLSCRIPTREALLY = $(INSTALLSITEBIN)
else
INSTALLSCRIPTREALLY = $(INSTALLSCRIPT)
endif

In Windows we use nmake instead of GNU make, which has different
conditional syntax:

!IF "$(INSTALLDIRS)" == "site"
INSTALLSCRIPTREALLY = $(INSTALLSITEBIN)
!ELSE
INSTALLSCRIPTREALLY = $(INSTALLSCRIPT)
!ENDIF


Everything else seems fine when I change that in the generated makefile.
I'll leave it up to you to decide the cleanest way to either
conditionalize that for Windows or eliminate the need for a preprocessor
conditional.

 -- sidney


Re: [Bug 4124] New: New spamassassin script doesn't work due to tainting

2005-02-16 Thread Sidney Markowitz
Daniel Quinlan wrote:
 We support nmake?

That's the Microsoft nmake, not to be confused with any other make
program of the same name. It's what is available on Windows. For
compatibility we have to put all the fancy logic in the perl of
Makefile.PL so the resulting makefile is written to a dumbed down common
denominator.
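
As an illustration of moving the logic into the generator, the sketch below decides the value once and emits a plain assignment that any make program can read. It is written in Python rather than the Perl of Makefile.PL, and the function name is hypothetical.

```python
def install_script_line(installdirs: str) -> str:
    """Emit a makefile assignment with the conditional already resolved.

    The choice is made when the makefile is generated, so the output
    contains no ifeq/!IF directive and works with both GNU make and nmake.
    """
    if installdirs == "site":
        target = "$(INSTALLSITEBIN)"
    else:
        target = "$(INSTALLSCRIPT)"
    return f"INSTALLSCRIPTREALLY = {target}"


if __name__ == "__main__":
    print(install_script_line("site"))
    print(install_script_line("vendor"))
```

One trade-off of this approach is that the choice is fixed at generation time, so overriding INSTALLDIRS later on the make command line would no longer change the result.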

I'll test out Malte's fix when he checks it in.

 -- sidney


Re: [Bug 4124] New: New spamassassin script doesn't work due to tainting

2005-02-17 Thread Sidney Markowitz
Malte S. Stretz wrote:
 Sidney, could you test r154095 on Windows please?

It works. BTW, my buildbot slaves are running again so you can see
immediately, e.g.,

http://bugzilla.spamassassin.org:8010/trunk-sidney-win32/builds/51

 -- sidney


Re: [PATCH] Config file for spamc

2005-02-23 Thread Sidney Markowitz
John Madden wrote:
 So, I put together this patch. It causes spamc to read
 /etc/mail/spamassassin/spamc.conf (if it exists)

John,

Would you open a ticket for this on Bugzilla at
http://bugzilla.spamassassin.org as an RFE (severity: enhancement) and
attach your patch there using the Create a New Attachment button?

That will keep it from getting lost or overlooked.

I don't think the patch is complete, as it is more *nix specific than
spamc itself is. There needs to be something to specify a different
configuration file for Windows, VMS, etc., either by compile-time
conditionalization or a preprocessor variable that can be set in the
makefile used on different platforms.

That said, if you post what you have in Bugzilla, anyone who wants to
finish it can work on it, and anyone who wants to use it as is on their
own site will be able to find it when they search the bug list for this
problem.

Thanks,

 -- Sidney Markowitz
http://www.sidney.com/




make test failures

2005-02-27 Thread Sidney Markowitz
t/debug.t and t/spf.t both have failures. I'm not sure how long ago they
started failing as the failures are hidden by the warning-only failures
in rule_names.t.

Is there a way that we can distinguish between rule_names and the other
failures so that we can go back to sending notification emails on
buildbot failures?

 -- sidney


Re: make test failures

2005-02-27 Thread Sidney Markowitz
I fixed the test failure in t/debug.t checking in to r155617.

The test was just missing a new dbg message tag, replacetags, so I added
it to the list.

I'm less sure about what is the correct thing to do for the failure in
t/spf.t. In that case there is a test for SPF_HELO_FAIL in the test
spam. But as far as I can tell, spamassassin.org has a ?all in its SPF
record, which should mean that the result code of 'neutral' for the helo
test is correct. Should it fail? Do we need a new test case that
generates an SPF HELO failure?
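
To make the qualifier question concrete, here is a minimal Python sketch of how the qualifier on a record's trailing "all" mechanism maps to a result code under the SPF specification. The record strings are made-up examples, not the real spamassassin.org record, and result_for_all is a hypothetical helper, not part of any SPF library.

```python
# Map the qualifier on an SPF mechanism (here the trailing "all") to the
# result a checker reports when that mechanism matches.
SPF_QUALIFIER_RESULT = {
    "+": "pass",
    "-": "fail",
    "~": "softfail",
    "?": "neutral",
}


def result_for_all(record):
    """Return the result implied by the record's "all" mechanism, or None."""
    for term in record.split()[1:]:          # skip the "v=spf1" version tag
        qualifier, name = "+", term          # no qualifier means "+"
        if term[0] in SPF_QUALIFIER_RESULT:
            qualifier, name = term[0], term[1:]
        if name.lower() == "all":
            return SPF_QUALIFIER_RESULT[qualifier]
    return None


# Hypothetical records, used only for illustration.
for rec in ("v=spf1 a mx ?all", "v=spf1 a mx ~all", "v=spf1 a mx -all"):
    print(rec, "->", result_for_all(rec))
```

Only a record ending in -all yields a hard "fail" for a non-matching host, which is worth keeping in mind when picking a record to drive a rule named SPF_HELO_FAIL.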

 -- sidney


Re: make test failures

2005-02-28 Thread Sidney Markowitz
Justin Mason wrote:
 According to the SPF people, we shouldn't
 be using -all on a domain that may possible emit mail. So I changed
 the record...

That can't be right. Try out the wizard at
 http://spf.pobox.com/wizard.html?mydomain=spamassassin.org

It gives you two choices in the last question, "Do the above lines
describe all the hosts that send mail from spamassassin.org?" If the
answer is yes, you get ~all in the record; if it is no, you get ?all.

If you can list all sending domains, sending ip addresses, and ISP mail
servers that are allowed to send mail from a spamassassin.org address,
then you can use ~all and we can use from spamassassin.org in the SPF
test for a failed HELO. If you can't list all of them in the record, we
are forced to use ?all and we need a different domain to use for the test.

I don't want to make the change to enable the test for Windows until we
have the test fixed to not fail.

 -- sidney


Re: make test failures

2005-03-01 Thread Sidney Markowitz
Justin Mason wrote:
  According to the SPF people, we shouldn't
 be using -all on a domain that may possible emit mail

Even if, as I think, ~all is correct if you can enumerate all legal
senders for the domain, there still is a problem with making our test
depend on the current configuration of something that is being used for
some other purpose. There is always the risk that there will be a reason
for changing the configuration.

I got an idea from the tests in Mail::SPF::Query. How about if you
define a spf-test.spamassassin.org domain with an SPF record with ~all.
Then you are guaranteed that it will generate a fail but it can't mess
up any real email.

 -- sidney


Re: svn commit: r156102 - in spamassassin/trunk: lib/Mail/SpamAssassin/Plugin/Razor2.pm rules/50_scores.cf

2005-03-04 Thread Sidney Markowitz
Shelby,

This mailing list is for developer discussions. Developers consist of
the people who have commit access to our source control system, SVN.

As per Apache Foundation policies, the development process is
transparent. That means that the technical and design discussions we
developers have and all other parts of our decision process are held in
public view. Suggestions from the public and our responses to them are
in this same public forum.

In the end, we make the decisions, following a documented consensus-like
process in which any developer has veto power and only developers have a
vote.

The process would not work if every message from any one of us is
followed by a disagreeing comment from a non-developer. At some point we
need to be able to have our discussion and not to rehash those aspects
of the discussion on which we as developers agree. We certainly should
not have to deal with basic questions about the terminology we use every
day in our discussions such as the acronym for our source control
system, SVN.

You have made your points about your software. I believe that we have
stated clearly that we are not interested in hearing more about it until
you have some code and results that we can look at and test and that you
are ready to offer in a form that is compatible with our license.

This mailing list is for developer discussions. I could try to explain
what that means, but I'm afraid that you may not have the awareness of
personal or social boundaries to be able to use the explanation. I'll
put it in quantitative terms: 50% of the last 24 messages in my mailbox
for this list are from you. If you can keep the proportion of emails
from you to this list down to an amount typical of any other single
non-developer, then you will have some assurance that you are not making
inappropriate posts.

Please do not reply with a rebuttal to this email. Or even an apology if
you are so inclined. The fact that I think that your recent posts to
this list are off topic for the list is not debatable, it is my opinion.
It is shared by others of us who are responsible for keeping this
project together. We need to keep the noise level down and get back to
having developer discussions and writing code.

We'll see you again if and when you have some code to share.

Thank you,

 -- sidney


Re: svn commit: r156102 - in spamassassin/trunk: lib/Mail/SpamAssassin/Plugin/Razor2.pm rules/50_scores.cf

2005-03-04 Thread Sidney Markowitz
Daniel Quinlan wrote:
 aspects of the AL 2.0 don't really translate to services, but use does
 and that's my main concern with Razor2.

I find Theo's argument that use of the razor server is always free to a
user of a free SA distribution compelling.

Code being free but charging for service is in the best tradition of
Free and of Open Source software. Redhat's up2date is open source code
(GPL?), using it to access their server possibly costs money. Email
client software can be free while the account on the mailhost it talks
to costs money.

If the razor services are free to anyone who has not paid for the
client, that is even more liberal than most service-based systems. It
also means that anyone to whom we distribute SpamAssassin can use the
razor servers for free, which seems compatible not just with the letter
but also the spirit of the Apache License.

 -- sidney


Re: bug squash next week?

2005-03-04 Thread Sidney Markowitz
I vote +0.5 for Fri Mar 11.

I'm voting for that date because it is a weekend here on the other side
of the world, which is the only time I can do anything.

I'm only voting 0.5 because I probably still won't have much time, even
on a weekend :-(.

 -- sidney


Re: header modification

2005-03-04 Thread Sidney Markowitz
Frederik Eaton wrote:
 Is it possible to configure spamassassin to get back the original
 functionality of only modifying headers of spam

1. Look up the doc on rewrite_header and report_safe in man
Mail::SpamAssassin::Conf or other documentation

2. Any further questions about this or similar topics should be directed
to the SpamAssassin users mailing list, not to here. This list is for
developer discussions only. Don't even reply to this with an apology or
a thank you. I'll pretend that you have replied politely and leave it at
that :-)

 -- sidney


Re: svn commit: r156102 - in spamassassin/trunk: lib/Mail/SpamAssassin/Plugin/Razor2.pm rules/50_scores.cf

2005-03-04 Thread Sidney Markowitz
Daryl C. W. O'Shea wrote:
 Shouldn't people evaluate whether or not they are eligible to use Razor2
 before downloading (and installing) the razor-agents from Vipul's website?

That was the substance of the reply I tried to write last night but was
too sleepy to finish.

I thought about how I never configure razor in my test installations and
wondered how that happened when I was pretty much taking defaults if we
supposedly support razor out of the box.

I realized it's because you don't get razor unless you explicitly
install Razor2 module from CPAN.

So we do not distribute SpamAssassin configured to run Razor. We
distribute it configured to use Razor if the Razor2 module is installed
on the machine. Installing Razor2 is what gets someone involved with the
license to use the service.

As a result of that, I am now +1 on having the line that loads razor in
init.pre, and I am an agnostic +0 on whether it is commented out. As far
as I can see, it makes no difference whether enabling the razor plugin
requires only installing Razor2, or installing Razor2 and also
uncommenting a line in init.pre.

 -- sidney


Re: svn commit: r156102 - in spamassassin/trunk: lib/Mail/SpamAssassin/Plugin/Razor2.pm rules/50_scores.cf

2005-03-04 Thread Sidney Markowitz
Duncan Findlay wrote:
 That's arguably a bug in the operating system then

I don't think it is even that, but I agree with you that it is not our
place to work around it.

Consider this: Razor is free to use if the client software is free. The
client module may come freely with the OS. The client plugin is freely
available from SpamAssassin. The only way it costs money to use it from
SpamAssassin is when somebody packages SpamAssassin with something else
as a commercial product and sells it. (Is that true? Does a large ISP
who uses SpamAssassin have to pay to enable razor on their high volume
site?) Someone who sells such a commercial package is responsible for
the configuration that they ship it with. At that point we are not
talking about the default configuration of the free core distribution of
SpamAssassin.

So I don't see the need from a licensing point of view of disabling
razor in init.pre.

And I'm still +0 on commenting it out anyway. At some point I guess it
will be time to stop discussing this, if the tally is that nobody is -1
on commenting out the line in init.pre and Daniel is strongly +1 on
commenting it out.

 -- sidney


Daniel and SpamAssassin are on Slashdot!

2005-03-04 Thread Sidney Markowitz
Daniel and SpamAssassin are on Slashdot!

http://it.slashdot.org/article.pl?sid=05/03/04/2010218tid=111

 -- sidney


Re: svn commit: r156102 - in spamassassin/trunk: lib/Mail/SpamAssassin/Plugin/Razor2.pm rules/50_scores.cf

2005-03-05 Thread Sidney Markowitz
Shelby Moore wrote:
 Sidney Markowitz wrote:
 
This mailing list is for developer discussions. I could try to explain
what that means, but I'm afraid that you may not have the awareness of
personal or social boundaries to be able to use the explanation.
 
 There you go again trying to ERRONEOUSLY inpune my character.

Again? I stand by the politeness of the one other message I posted in
reply to your original proposal. I apologize if you consider my
statement about personal boundaries an insult. To me it is a
cultural/personality difference that is not worth trying to work around,
hence my appeal to numbers instead of trying to convince you that there
was anything one might find annoying in your posts.

This mailing list is for developer discussions. Developers consist of
the people who have commit access to our source control system, SVN.
  
 No where is that stated in public:

The only description of this mailing list, the one linked to by the
Lists link on the spamassassin.org home page, says that explicitly:

http://wiki.spamassassin.org/MailingLists#head-f67bc6dad74f08d4d8b6187fc92476b5a2aa4a2b

Unless you are looking for a definition of developers. The one I
provided (all committers) is not stated explicitly anywhere, but no
reasonable definition of developer would include more than the larger
list of contributors at
 http://svn.apache.org/repos/asf/spamassassin/trunk/CREDITS

In any case this is way off topic for a mailing list for developer
discussions. This is not a developer discussion. I will take your advice
about ignoring mail from you to avoid further off topic discussion.

 -- sidney


Re: header modification

2005-03-05 Thread Sidney Markowitz
Frederik Eaton wrote:
 As developers, you might want to add that information to the
 part of the man page I quoted

I assume that you are referring to the released version of SpamAssassin.
Looking at our latest development version I see that the wording has
already been changed to make that clearer in the next release.

 -- sidney



Re: header modification

2005-03-05 Thread Sidney Markowitz
Frederik Eaton wrote:
 Also, with all due respect, you really didn't have to be
 such an asshole

Reading my words quoted back to me, I agree. The question as you asked
it was more appropriate for the users list. My response to that effect
was posted to the list because people reading this list should see that
before they post. The "Don't even reply..." part was intended to be in a
lighthearted tone, and I see now that it does not come across as I
intended. For that I apologize.

Your post pointing out the misleading documentation once you had the
correct information was completely on topic for this list. I thank you
for pointing out the error even if it was one that had been found and
corrected in our development tree.

 -- sidney


Re: Fw: Spam - Internet gaming industry, Gaming Transac

2005-03-06 Thread Sidney Markowitz
Thanks for your interest in helping improve things, but please read

 http://wiki.spamassassin.org/DoYouWantMySpam

for the FAQ about not sending spam samples to our mailing lists.

 -- sidney



Re: client SMTP authorization

2005-03-10 Thread Sidney Markowitz
Tony Finch wrote:
 Is anyone planning to implement CSA for SpamAssassin?

I'm not, but I do have a question about it. Is it something that would
best be implemented on the MTA to reject fake SMTP servers, or does it
have a "maybe" case which would be best handled by a SpamAssassin rule
without outright rejecting the mail?

 -- sidney


Re: Proposal: 3.0.3 release schedule

2005-04-21 Thread Sidney Markowitz
Duncan Findlay wrote:
 That's a pretty significant change for a maintenance release.

Yes, and I mention it to bring it to his attention. I guess it's up to him
to decide whether or not to back port the patch, and then it is up to us
whether to accept it in an official 3.0.3 release, just like it is up to us
whether there is any official 3.0.3 release, and it is up to the Fedora crew
what they want to go into their FC4 distro.

I do think that with most of the change being encapsulated in a new object
and with the old code being definitely wrong, they might decide that it is
worth fixing all of those bugs with one change. Personally, I'm the reckless
type who would try to get 3.1 into FC 4. Lucky for them I'm not involved
with that :-)

 -- sidney


Re: svn commit: r164278 - /spamassassin/trunk/t/uri.t

2005-04-22 Thread Sidney Markowitz
[EMAIL PROTECTED] wrote:
 Added testcase from Bug4191

This test fails on my Fedora Core 3 system with svn trunk even though bug
4191 has a comment that says that it is fixed in 3.1.

t/uri...FAILED test 77
Failed 1/76 tests, 98.68% okay

 -- sidney


Re: uridnsbl: bogus rr run ...

2005-04-24 Thread Sidney Markowitz
Theo Van Dinter wrote:
 I have ~300K of them.  http://www.kluge.net/~felicity/set1.txt

This should not be happening anymore since the patch for bug #4260 was
committed to trunk. Are you still getting them? The warning was only there
to help us track down that problem.

If we are sure that the problem has been fixed I'm also +1 on removing it.
It would be nice to know if it happens, but if the problem has been fixed it
is just some extra code that will never be run.

 -- sidney


Re: uridnsbl: bogus rr run ...

2005-04-24 Thread Sidney Markowitz
Theo Van Dinter wrote:
 The output is from my Saturday weekly net run.  It looks like 4260 was
 committed as r161778, the nightly run was r164362.

Yuck, this looks like you are still getting DNS records in the wrong order.
Look at that first log entry. It says that a query for
usafreemerchantsource.com.multi.surbl.org is getting the response that is
correct for query for dns7.hichina.com. That's the problem that the warning
is supposed to help us catch. That should no longer be possible given that
there is a unique ID associated with each query and it is supposed to match
the ID in the response.

This is serious. It certainly proves the worth of having the warning in there.

 -- sidney


Re: uridnsbl: bogus rr run ...

2005-04-24 Thread Sidney Markowitz
Theo Van Dinter wrote:
 It looks like 4260 was
 committed as r161778, the nightly run was r164362.

Do people think we should reopen 4260?

This could happen if the random ID isn't random enough or 16 bits isn't
large enough to avoid collisions. I don't see how that would happen if
different processes choose different ports to listen on, as there should be
no way then for queries to collide across processes and with the ID being
incremented each time there should be no collision within the same process.

If somehow the IDs are colliding, the fix would be to include some
information from the question along with the 16 bit ID to prevent that. I
have a small patch that will do that, but I would like to see it used in a
test to find out if it has anything to do with the problem before proposing
to use it for real.
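
As an illustration only, here is the keying idea in Python rather than Perl: the lookup key is derived from both the query name and the 16 bit ID, so a reply carrying a recycled ID but a different question cannot match. reply_key is a hypothetical name; the trailing-dot stripping and the "last 8 characters of the unpadded base64 SHA1" choice are assumptions that mirror what the Perl patch in this message does.

```python
import base64
import hashlib


def reply_key(host, packet_id):
    """Build a lookup key from the query name and the 16-bit DNS ID."""
    if host.endswith("."):
        host = host[:-1]                      # same effect as s/\.$// in Perl
    digest = hashlib.sha1((host + str(packet_id)).encode("utf-8")).digest()
    # Perl's sha1_base64 returns unpadded base64, so drop the "=" padding.
    encoded = base64.b64encode(digest).decode("ascii").rstrip("=")
    return encoded[-8:]


# Same ID, different question: the keys differ, so the replies cannot be
# confused even if the 16-bit ID has been reused.
print(reply_key("usafreemerchantsource.com.multi.surbl.org", 4711))
print(reply_key("dns7.hichina.com", 4711))
```

The key has to be computable from the reply packet alone, which is why it uses the question name and ID and nothing that only the sender knows.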

Theo, would you be willing to run a mass test with this to see if it helps?

$ svn diff lib/Mail/SpamAssassin/DnsResolver.pm
Index: lib/Mail/SpamAssassin/DnsResolver.pm
===================================================================
--- lib/Mail/SpamAssassin/DnsResolver.pm  (revision 164463)
+++ lib/Mail/SpamAssassin/DnsResolver.pm  (working copy)
@@ -45,7 +45,7 @@
 use Mail::SpamAssassin::Logger;

 use IO::Socket::INET;
-
+use Digest::SHA1 qw(sha1_base64);
 our @ISA = qw();

 # a counter value to use for DNS ID numbers in new_dns_packet().
@@ -243,8 +243,8 @@
   return if $self->{no_resolver};

   my $pkt = $self->new_dns_packet($host, $type, $class);
-
-  my $id = $pkt->header->id;
+  $host =~ s/\.$//;
+  my $id = substr(sha1_base64($host . $pkt->header->id), -8);
   my $data = $pkt->data;
   my $dest = $self->{dest};
   if (!$self->{sock}->send ($pkt->data, 0, $self->{dest})) {
@@ -291,8 +291,11 @@
   defined $packet->answer)
   {
 my $header = $packet->header;
-my $id = $header->id;
-
+my @questions = $packet->question;
+my $ques = $questions[0];
+my $host = $ques->qname;
+my $nid = $header->id;
+my $id = substr(sha1_base64($host . $nid), -8);
 # dbg("dns: reply id=$id");

 my $cb = delete $self->{id_to_callback}->{$id};



Re: Question about dnsbl.t test

2005-04-24 Thread Sidney Markowitz
Daniel Quinlan wrote:
 It's not used in the t test itself.

Thanks, that helps. I suspect that whatever is causing the hang in bug 4278
also shows up, on the runs where it doesn't hang, as a DNS query that fails.
Now that I know that the $bind variable has nothing to do with it I can
track that down.

 -- sidney


Re: uridnsbl: bogus rr run ...

2005-04-25 Thread Sidney Markowitz
Loren Wilton wrote:
 How about a simple debug printout of the id value sent and the id value
 received?  Maybe it is as simple as the id matching code is failing.

That's definitely a better idea considering that there is a bug in the patch
I posted that prevents any of the DNS stuff from working :-).

On the other hand, it does look like the id matching code is working and it
is difficult to see just from looking at tons of debug logs if IDs are
getting reused across processes and getting mixed up through use of the same
port. I'll see if I can get the sha1 version working better in case Theo is
inclined to try it to see what it does.

 -- sidney


Re: uridnsbl: bogus rr run ...

2005-04-25 Thread Sidney Markowitz
This is the corrected patch that ensures that IDs are not colliding by
including the host name in an SHA1 hash with the 16 bit ID counter.

It is written a bit crudely, but if Theo or someone else who is seeing the
problem would try this in a mass test it would demonstrate whether the
problem has anything to do with this:

 -- sidney

Index: lib/Mail/SpamAssassin/Dns.pm
===================================================================
--- lib/Mail/SpamAssassin/Dns.pm  (revision 164570)
+++ lib/Mail/SpamAssassin/Dns.pm  (working copy)
@@ -22,6 +22,7 @@
 use Mail::SpamAssassin::Conf;
 use Mail::SpamAssassin::PerMsgStatus;
 use Mail::SpamAssassin::Constants qw(:ip);
+use Digest::SHA1 qw(sha1_base64);
 use File::Spec;
 use IO::Socket;
 use IPC::Open2;
@@ -145,7 +146,10 @@

   return $self->{resolver}->bgsend($host, $type, undef, sub {
   my $pkt = shift;
-  $self->{dnsfinished}->{$pkt->header->id} = $pkt;
+  my $h = $host;
+  $h =~ s/\.$//;
+  my $id = substr(sha1_base64($h . $pkt->header->id), -8);
+  $self->{dnsfinished}->{$id} = $pkt;
 });
 }

Index: lib/Mail/SpamAssassin/DnsResolver.pm
===================================================================
--- lib/Mail/SpamAssassin/DnsResolver.pm  (revision 164570)
+++ lib/Mail/SpamAssassin/DnsResolver.pm  (working copy)
@@ -45,7 +45,7 @@
 use Mail::SpamAssassin::Logger;

 use IO::Socket::INET;
-
+use Digest::SHA1 qw(sha1_base64);
 our @ISA = qw();

 # a counter value to use for DNS ID numbers in new_dns_packet().
@@ -243,8 +243,8 @@
   return if $self->{no_resolver};

   my $pkt = $self->new_dns_packet($host, $type, $class);
-
-  my $id = $pkt->header->id;
+  $host =~ s/\.$//;
+  my $id = substr(sha1_base64($host . $pkt->header->id), -8);
   my $data = $pkt->data;
   my $dest = $self->{dest};
   if (!$self->{sock}->send ($pkt->data, 0, $self->{dest})) {
@@ -291,8 +291,11 @@
   defined $packet->answer)
   {
 my $header = $packet->header;
-my $id = $header->id;
-
+my @questions = $packet->question;
+my $ques = $questions[0];
+my $host = $ques->qname;
+my $nid = $header->id;
+my $id = substr(sha1_base64($host . $nid), -8);
 # dbg("dns: reply id=$id");

 my $cb = delete $self->{id_to_callback}->{$id};


Re: uridnsbl: bogus rr run ...

2005-04-26 Thread Sidney Markowitz
Matt Sergeant wrote:
 May be a problem with forking. Here's part of the fork replacement I use
 in my code that uses the single-packet-DNS stuff:

Justin's code generates a number from the pid to initialize the ID
counter and keeps track of it itself instead of relying on the Net::DNS
code. Are there some systems in which fork does not result in a new pid?
Is it the case that the socket created in each process would use a
different source port on the local host? I don't see how there can be so
many collisions without both the pid and the source port being the same.

 -- sidney




Re: uridnsbl: bogus rr run ...

2005-04-26 Thread Sidney Markowitz
Theo Van Dinter wrote:
 The patch does make things *much* slower though, around 3x:
[...]
 Without the patch, lots of issues starting after 80%.

I don't claim that the patch is the most efficient way of dealing with it...
I just wanted to use SHA1 to ensure that there was no chance of an ID
collision. I think we have now verified that ID collision is the likely
proximate cause of the problem.

Still, can three SHA1 calculations compare to the time it takes for a DNS
query? I don't see how the computation would slow things down by a factor of
three.

Perhaps what you are seeing is the difference in wall clock time between
processing a reply to an old packet when it arrives right away vs rejecting
those packets and waiting for the actual reply. If that's what's happening
you are not going to get the faster time when everything works, as it is the
nameserver's response time that is slowing down the run. The faster time is
perhaps just a symptom of the bug?

There is still the question of where the collisions are coming from. Here's
another idea -- Instead of using substr(sha1_base64($host . $id), -8) use
something that combines the pid and id into a six byte string in the three
places in the code where sha1 is used. That will be faster than using SHA1,
which will let you know if the slowdown is due to computation or waiting for
good packets to arrive, and it will let you know if the problem is with
different processes using the same source port for sending the UDP queries.
If it is the latter, we may be able to avoid the collisions if we are better
about picking the source ports.

Are you up for some more few-hour tests? :-)

 -- sidney


Re: uridnsbl: bogus rr run ...

2005-04-26 Thread Sidney Markowitz
Sidney Markowitz wrote:
 use something that combines the pid and id

Brain fade... This patch works by matching information that is in the reply
packet to information in the query packet, which means it has to use the
host name and the packet ID. Duh! Sorry.

Still, we could try some debug log output to determine if the different
processes are using the same source ports. I don't see how we could have
collisions in the ID unless the source ports are the same. If that's it, we
would not have to use the host name to ensure that the reply matches the
query if we had a way of making the source ports different across processes.

Could you run with debug output that shows the pid, packet ID and source
port for the packets that are created in DnsResolver in a run that
demonstrates the bug?
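
For anyone who wants to see what such a log line could carry, here is a rough Python sketch, not the SpamAssassin code. It assumes a UDP socket bound to an OS-chosen port on 127.0.0.1; open_query_socket and debug_line are hypothetical helpers, and the "dns: pid=... id=... sport=..." format is invented for illustration.

```python
import os
import socket


def open_query_socket():
    """Open a UDP socket on an OS-chosen (ephemeral) source port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))   # port 0 asks the OS to pick a free port
    return sock


def debug_line(sock, packet_id):
    """Format the three values worth logging for each outgoing query."""
    port = sock.getsockname()[1]
    return "dns: pid=%d id=%d sport=%d" % (os.getpid(), packet_id, port)


first = open_query_socket()
second = open_query_socket()
print(debug_line(first, 4711))
print(debug_line(second, 4711))
first.close()
second.close()
```

Two sockets that are open at the same time get different source ports, so matching pid, ID and port values across log lines from different processes would point at socket sharing rather than at the ID counter.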

 -- sidney


Re: uridnsbl: bogus rr run ...

2005-04-26 Thread Sidney Markowitz
I haven't been running mass-checks until now, but I just tried it with svn
trunk and got a couple of bogus rr warnings between the 50% and 60% marks.
It's taken two and a half hours to get that far, so this is a very slow
process. I just shut down the vmware session that was running the Windows
and Cygwin botslaves on that machine, and I hope that speeds things up a
bit. It's a 1 GHz Athlon machine running Fedora Core 3.

In any case, it looks like I'll be able to run my own painfully slow tests
to try things out.

 -- sidney


Re: uridnsbl: bogus rr run ...

2005-04-26 Thread Sidney Markowitz
Matt Sergeant wrote:
 I didn't think you could do that because in newer versions of Net::DNS
 the id is a lexical variable. The only way to reinitialise it is to
 reload the module.

If I remember it correctly, Justin's code keeps its own counter and sets the
packet ID after creating the packet, making it independent of Net::DNS's counter.

I guess that would break down if there are any uses of Net::DNS by the same
process that do not go through his code. If that is what is happening and it
results in ID collision, the fix would be to use code like yours to reload
the module and rely on its own counter. I'll try that now that I can
reproduce the problem myself (painful as it is).

 -- sidney


Re: uridnsbl: bogus rr run ...

2005-04-26 Thread Sidney Markowitz
Sidney Markowitz wrote:
 I guess that would break down if there are any uses of Net::DNS by the same
 process that do not go through his code

grep doesn't find any other use of Net::DNS :-(

I just got another 10 bogus rr hits between the 60% and 70% marks on my mass
test run. I wonder what it could mean that it happens more towards the end
of a run that takes so long. Could nameservers take on the order of minutes
or a half hour to send back a UDP reply to a query?

At least now I know that the problem is reproducible here. Even though I
can't figure out why reloading Net::DNS should make a difference, I'll try it
just in case.

 -- sidney


Re: uridnsbl: bogus rr run ...

2005-04-26 Thread Sidney Markowitz
Matt Sergeant wrote:
 May be a problem with forking

Do you think that this code fragment I see in SpamAssassin.pm should work as
well as your fork code, or could relying on this be part of the problem?

sub init {
  my ($self, $use_user_pref) = @_;

  # Allow init() to be called multiple times, but only run once.
  if (defined $self->{_initted}) {
    # If the PID changes, reseed the PRNG and the DNS ID counter
    if ($self->{_initted} != $$) {
      $self->{_initted} = $$;
      srand;
      $self->{resolver}->reinit_post_fork();
    }
    return;
  }

  # Note that this PID has run init()
  $self->{_initted} = $$;


 -- sidney


Re: uridnsbl: bogus rr run ...

2005-04-26 Thread Sidney Markowitz
Theo Van Dinter wrote:
 I'm trying a small patch which basically calls the reinit function when
 the counter wraps to 0, as well as using rand when initializing.  This way
 it'll get a new random starting point and a new socket occasionally.

I think I understand the problem now. It's similar to what you said. I
noticed when debugging t/dnsbl.t that the one message in it generates 52 DNS
queries. When there are tens of thousands of messages in a mass check and
-j=4, there are going to be several wraparounds of the 16 bit ID. Apparently
nameserver responses can arrive quite late.
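
The arithmetic behind that can be spelled out in a few lines of Python. The 52 queries per message and the four processes come from the text above; the corpus size of 40000 messages is purely an assumed figure for illustration, as is the simplification that every message costs the same number of queries and that each process keeps its own counter.

```python
ID_SPACE = 2 ** 16          # DNS header IDs are 16 bits wide
QUERIES_PER_MESSAGE = 52    # figure observed for the one message in t/dnsbl.t


def wraps(messages, processes):
    """Full trips around the ID space made by each process's counter."""
    queries_per_process = messages * QUERIES_PER_MESSAGE // processes
    return queries_per_process // ID_SPACE


# Messages one process can handle before its counter wraps once.
print(ID_SPACE // QUERIES_PER_MESSAGE)

# Wraps per process for an assumed 40000-message corpus split four ways.
print(wraps(40000, 4))
```

Under those assumptions a counter comes back to an already-used ID after roughly twelve hundred messages, so a reply that is late by that much work can land on a live ID.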

My guess about the slowdown you saw when using the sha1 patch is that while
it avoided errors from collisions, all the old reply packets were still read
and hashed before being discarded.

A fix would be to close and reopen the socket at each message, or as you
suggested when the counter wraps. But it should not be when the counter
wraps to zero, it should be when it wraps to its initial value.
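
A small Python sketch of why the comparison has to be against the starting value: a counter seeded somewhere in the middle of the range passes zero long before it has used every ID. IdCounter is a hypothetical class written for this illustration, not the counter in DnsResolver.pm.

```python
class IdCounter:
    """16-bit ID counter that reports when it is back at its starting value."""

    def __init__(self, start):
        self.start = start % 65536
        self.current = self.start

    def next_id(self):
        """Return (id, wrapped); wrapped is True once the ID space is used up."""
        value = self.current
        self.current = (self.current + 1) % 65536
        wrapped = self.current == self.start
        return value, wrapped


counter = IdCounter(start=65000)
handed_out = 0
while True:
    _, wrapped = counter.next_id()
    handed_out += 1
    if wrapped:
        break   # this is the point at which a fresh socket would be opened

print(handed_out)   # number of IDs used before the first reuse is possible
```

With a start of 65000, a test for zero would fire after only 536 IDs, while the test against the starting value fires after all 65536 have been handed out.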

I think it would be better to create the new socket with each message. If
old replies are arriving as they seem to, wouldn't it be more efficient to
not have a listener on the socket when they arrive?

 -- sidney


Re: uridnsbl: bogus rr run ...

2005-04-26 Thread Sidney Markowitz
Sidney Markowitz wrote:
 I think it would be better to create the new socket with each message. If
 old replies are arriving as they seem to, wouldn't it be more efficient to
 not have a listener on the socket when they arrive?

I got confused when I reread this, so I thought I should clarify it.

If the socket is not changed with each message, then a process sends out
queries on a port with one message, then continues to send out queries on
the same port in subsequent messages. The ID is incremented and the socket
is changed when the ID wraps, so there are no collisions.

However, packets still arrive in reply to queries sent for old messages.
Until the ID wraps, the replies are received, the ID is not found in the
pending list, and the packet is discarded.

There is no collision problem, but processing may take a lot longer than if
a new socket is created for each message causing the old replies to find no
listener on their port.
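
The receive path described here amounts to a table lookup. A minimal Python sketch follows; the id_to_callback name is borrowed from the patch earlier in this thread, while PendingQueries and its methods are hypothetical.

```python
class PendingQueries:
    """Track callbacks for queries that are still waiting for a reply."""

    def __init__(self):
        self.id_to_callback = {}
        self.discarded = 0

    def sent(self, packet_id, callback):
        """Remember who wants the reply to the query with this ID."""
        self.id_to_callback[packet_id] = callback

    def received(self, packet_id, packet):
        """Deliver a reply to its callback; drop it if nobody is waiting."""
        callback = self.id_to_callback.pop(packet_id, None)
        if callback is None:
            self.discarded += 1      # late reply to a query no longer pending
            return False
        callback(packet)
        return True


answers = []
pending = PendingQueries()
pending.sent(101, answers.append)
pending.received(101, "reply for 101")      # delivered to the callback
pending.received(101, "duplicate reply")    # nobody waiting: discarded
pending.received(999, "stale reply")        # unknown ID: discarded
print(answers, pending.discarded)
```

Every discarded packet still has to be read and parsed before the lookup can fail, which is the cost a fresh socket per message would avoid.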

 -- sidney


Re: uridnsbl: bogus rr run ...

2005-04-26 Thread Sidney Markowitz
I'm going to respond to your and John Gardiner Myers' replies in the bug
4260 discussion to keep everything tracked there now that I've re-opened the
bug.

 -- sidney


Re: uridnsbl: bogus rr run ...

2005-04-27 Thread Sidney Markowitz
Loren Wilton wrote:
 Depending on the value of the parameter
 that Perl is deducing from that statement, you may or may not be getting the
 results you expect.

From the doc:

srand
Sets the random number seed for the rand operator. If EXPR is omitted,
uses a semi-random value based on the current time and process ID, among
other things.


Since this call is only done when the process id is different (i.e., in a
fork) then srand with no arguments is correct for initializing rand for the
process.
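
The same pattern, sketched in Python as an analogue and not as a translation of SpamAssassin.pm: remember which pid ran the initialisation and reseed only when the current pid differs. Initializer is a hypothetical class, and the getpid parameter exists only so the pid change can be simulated without forking. Python's random.seed() with no argument, like Perl's bare srand, picks its own seed from the system.

```python
import os
import random


class Initializer:
    """Run one-time setup, reseeding the PRNG if the process has forked."""

    def __init__(self, getpid=os.getpid):
        self._getpid = getpid
        self._initted = None     # pid that last ran init(), if any
        self.reseeds = 0

    def init(self):
        pid = self._getpid()
        if self._initted is not None:
            if self._initted != pid:     # we are running in a forked child
                self._initted = pid
                random.seed()            # no argument: the system picks a seed
                self.reseeds += 1
            return
        self._initted = pid              # note that this pid has run init()


# Simulate a parent (pid 100) calling init() twice, then a child (pid 101).
pids = iter([100, 100, 101, 101])
state = Initializer(getpid=lambda: next(pids))
for _ in range(4):
    state.init()
print(state.reseeds)
```

Repeated calls in the same process do nothing, and the reseed happens exactly once per new pid.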

 -- sidney



Re: Moving on to 3.0.4

2005-04-28 Thread Sidney Markowitz
Warren Togami wrote:
 Why bother pushing another tarball just for a single patch that
 affects only one distribution?

If I understand the preceding discussion correctly this is not a matter of
"release early, release often" carried to an extreme. It is an abort of the
release process for 3.0.3 after the version number has been frozen into a
tarball but before it has been announced. The idea is to abort the release
in order to accommodate another patch that would have gone into 3.0.3 if
there had been a 24 hour waiting period for final votes to be collected.

Michael, is that a correct assessment of the situation?

 -- sidney


failed test

2005-04-28 Thread Sidney Markowitz
I just saw this in a make test in Win32 that I am running right now.

I'm posting this to sa-dev because I have to go to sleep before the make
test finishes and so cannot see if it dies the same in Cygwin or elsewhere,
and I can't look at it right now:



t\meta..'..' is not recognized as an internal or
external command, operable program or batch file.
parse-rules-for-masses failed! at t\meta.t line 42.
tmp/rules.pl is unparseable: Can't locate log/rules-0.pl in @INC (@INC contains:
 ../blib/lib D:\sasvn\trunk\blib\lib D:\sasvn\trunk\blib\arch C:/Perl/lib
C:/Perl/site/lib . C:/Perl/lib C:/Perl/site/lib .) at t\meta.t line 45.
t\meta..dubious
Test returned status 2 (wstat 512, 0x200)
DIED. FAILED tests 1-2
Failed 2/2 tests, 0.00% okay


Another failed test in Win32

2005-04-28 Thread Sidney Markowitz
Sleep and kids don't always go together.

Here's the other test that failed in Win32, posted here in case anyone can
do anything with it. It works in Cygwin. After I post, I _will_ sleep...

t\bayessdbm.ok 48/52# Failed test 49 in t\bayessdbm.t at \
line 262
t\bayessdbm.NOK 49# Failed test 50 in t\bayessdbm.t at line 263
t\bayessdbm.NOK 50# Failed test 51 in t\bayessdbm.t at line 264
t\bayessdbm.NOK 51# Failed test 52 in t\bayessdbm.t at line 265
t\bayessdbm.FAILED tests 49-52
Failed 4/52 tests, 92.31% okay


Re: SpamAssassin 3.0.3 Released

2005-04-29 Thread Sidney Markowitz
Sidney Markowitz wrote:
 The correct fix for 3.0 branch, assuming that spf.t there is still testing a
 DNS record over which we have no control

Hmm, I looked. It doesn't. I'm downloading 3.0 branch now to see what is wrong.

 -- sidney


Re: SpamAssassin 3.0.3 Released

2005-04-29 Thread Sidney Markowitz
Now I remember what happened. We weren't using something like aol.com; we
were using spamassassin.org, our real SPF record. We changed it to do more
of the right thing and broke the test that counted on its old value.

I'll look into it more to see if there is a way to make the test work the
way it is without breaking the mail.

 -- sidney


Re: svn commit: r168050 - /spamassassin/trunk/lib/Mail/SpamAssassin/PerMsgStatus.pm

2005-05-04 Thread Sidney Markowitz
I just did a little experiment. I placed an entry for the ip address of one
of my web servers in /etc/hosts (or rather the Windows equivalent of it on
my PC) with host name www_host.exam_ple.com. I emailed myself a message
containing the text http://www_host.exam_ple.com

When I looked at the message in Thunderbird the URL was a hot link. Clicking
on it opened my browser looking at the site at that ip address.

Even if there are no host names in the SURBL right now with _ in them, if
SpamAssassin skips over those, then, just as Justin said, spammers will
start using them.
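
To make the concern concrete, here is a minimal sketch in Python (not the
Perl that SpamAssassin uses). Both patterns are made up for illustration and
are not the ones in SpamAssassin; the only point is that a matcher which
refuses "_" in a label never sees the host at all, so there is nothing left
to look up in a blocklist.

```python
import re

# Hypothetical strict pattern: labels may contain letters, digits and "-".
STRICT_HOST = re.compile(r"https?://((?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,})")
# Hypothetical lenient pattern: labels may additionally contain "_".
LENIENT_HOST = re.compile(r"https?://((?:[A-Za-z0-9_-]+\.)+[A-Za-z]{2,})")


def extract_hosts(text, pattern):
    """Return every host name that the given pattern finds in text."""
    return pattern.findall(text)


body = "Visit http://www_host.exam_ple.com today"
strict_hits = extract_hosts(body, STRICT_HOST)    # the host is skipped
lenient_hits = extract_hosts(body, LENIENT_HOST)  # the host is extracted
```

Since the mail client made the link clickable anyway, the lenient behaviour
is the one that matches what the recipient actually sees.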

 -- sidney


Re: boosting

2005-05-04 Thread Sidney Markowitz
Frederik Eaton wrote:
 How are the rule weights for spamassassin generated? There is a method
 called boosting

The rule weights are generated using a single-layer perceptron, as described
in the wiki link that Daniel mentioned.
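
For readers who have not met the term, the following is a minimal sketch of
a single-layer perceptron learning one weight per rule. It is written in
Python with a toy data set and the textbook update rule; the function names,
the data and the learning rate are all assumptions for illustration, and the
actual SpamAssassin score generator may well differ in its details.

```python
def train_perceptron(samples, n_rules, epochs=50, rate=1.0):
    """Learn one weight per rule plus a bias.

    samples is a list of (hits, is_spam) pairs, where hits holds a 0 or 1
    for each rule and is_spam is 1 for spam, 0 for ham.
    """
    weights = [0.0] * n_rules
    bias = 0.0
    for _ in range(epochs):
        for hits, is_spam in samples:
            activation = bias + sum(w * h for w, h in zip(weights, hits))
            predicted = 1 if activation > 0 else 0
            error = is_spam - predicted
            if error:
                # Nudge only the weights of the rules that fired.
                for i, h in enumerate(hits):
                    weights[i] += rate * error * h
                bias += rate * error
    return weights, bias


def classify(weights, bias, hits):
    """Return True if the weighted sum of rule hits says spam."""
    return bias + sum(w * h for w, h in zip(weights, hits)) > 0


# Toy corpus: rules 0 and 1 tend to fire on spam, rule 2 on ham.
TOY_SAMPLES = [
    ([1, 1, 0], 1),
    ([1, 0, 0], 1),
    ([0, 1, 0], 1),
    ([0, 0, 1], 0),
    ([0, 0, 0], 0),
    ([1, 0, 1], 0),
]
toy_weights, toy_bias = train_perceptron(TOY_SAMPLES, n_rules=3)
```

The learned weights play the role of rule scores: a message is called spam
when the sum of the weights of the rules it hits crosses the threshold.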

I'm writing a paper this semester [I hope :-)] looking at the applicability
of the simple methods used by SpamAssassin to some classification problems
in microarray gene expression data. I expect to look at boosting along with
that, along the lines of Jackson, J. and Craven, M., "Learning Sparse
Perceptrons", Advances in Neural Information Processing Systems 8 (Conference
Proceedings of NIPS*95), 1996 http://www.mathcs.duq.edu/~jackson/bbp.pdf

So far I don't think that it has been tried. If anyone has looked at it, it
would have been Henry Stern, who came up with the perceptron for
SpamAssassin rule scoring.

 -- sidney


Re: boosting

2005-05-04 Thread Sidney Markowitz
Fred wrote:
 There was similar work being done in the past to identify rules to be
 grouped into new meta rules, this (w|c)ould achieve similar results.
 http://bugzilla.spamassassin.org/show_bug.cgi?id=1363

I think I'm missing something here. Are you saying that automatically
grouping rules into meta rules that have similar classification properties
is equivalent to boosting? Or do you mean that it is another approach that
also can improve performance of weak learners?

In any case, you have given me an idea for the microarray gene expression
problem, so thanks! :-)

 -- sidney


Re: svn commit: r169047 - in /spamassassin/trunk: masses/corpora/mass-find-nonspam sa-learn.raw tools/speedtest

2005-05-07 Thread Sidney Markowitz
Theo Van Dinter wrote:
 -1
 
 Don't use M::SA unless its necessary (no reason to load a bajillion
 things).  Just use M::SA::Message.

I see that Mail::SpamAssassin->parse just calls
Mail::SpamAssassin::Message->new and returns the Message object. Is this the
correct syntax to use then instead of the call to parse() ?

 my $ma = Mail::SpamAssassin::Message->new({message => $dataref});

If that's it I'll make the change.

 -- sidney


Question about a proposed change

2005-05-10 Thread Sidney Markowitz
Does anyone have any objection to my checking in the following change? It
makes the code in Dns.pm independent of the format of the key that is used
to check the reply packets so that it will be easier to play with using
different keys such as hashes by changing only code in DnsResolver.pm.
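
The idea behind the change can be sketched independently of the real code.
The following is a minimal illustration in Python, not the Perl in Dns.pm
or DnsResolver.pm: ToyResolver, ToyCaller and deliver() are hypothetical
names, and the key format is invented. The resolver alone decides what the
key looks like and hands it to the callback, so the caller never has to
derive it from the reply packet.

```python
class ToyResolver:
    """Stand-in resolver; only this class knows how reply keys are built."""

    def __init__(self):
        self._pending = {}
        self._counter = 0

    def _make_key(self, host, rtype):
        # The key format lives here only, so it can be swapped
        # (for example for a hash) without touching any caller.
        self._counter += 1
        return f"{self._counter}/{host}/{rtype}"

    def bgsend(self, host, rtype, callback):
        """Register a query and return the key it was filed under."""
        key = self._make_key(host, rtype)
        self._pending[key] = callback
        return key

    def deliver(self, key, packet):
        """Simulate a reply: pass the packet AND its key to the callback."""
        callback = self._pending.pop(key, None)
        if callback is None:
            return False
        callback(packet, key)
        return True


class ToyCaller:
    """Stand-in for the calling code that collects finished lookups."""

    def __init__(self, resolver):
        self.resolver = resolver
        self.finished = {}

    def lookup(self, host, rtype):
        def on_reply(packet, key):
            # Store under whatever key the resolver supplied.
            self.finished[key] = packet

        return self.resolver.bgsend(host, rtype, on_reply)
```

With this shape, changing the key format means editing _make_key() and
nothing else, which is the property the proposed change is after.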

  -- sidney

  --

Index: lib/Mail/SpamAssassin/Dns.pm
===
--- lib/Mail/SpamAssassin/Dns.pm(revision 169513)
+++ lib/Mail/SpamAssassin/Dns.pm(working copy)
@@ -145,7 +145,8 @@

   return $self->{resolver}->bgsend($host, $type, undef, sub {
   my $pkt = shift;
-  $self->{dnsfinished}->{$pkt->header->id} = $pkt;
+  my $id = shift;
+  $self->{dnsfinished}->{$id} = $pkt;
 });
 }

Index: lib/Mail/SpamAssassin/DnsResolver.pm
===
--- lib/Mail/SpamAssassin/DnsResolver.pm(revision 169513)
+++ lib/Mail/SpamAssassin/DnsResolver.pm(working copy)
@@ -296,7 +296,7 @@
   return 0;
 }

-$cb->($packet);
+$cb->($packet, $id);
 return 1;
   }
   else {

  --




Re: Question about a proposed change

2005-05-10 Thread Sidney Markowitz
Justin Mason wrote:
 looks fine to me -- however there are other calls to that bgsend() method
 elsewhere.  it may need to be made there too.

Good point. I forgot to grep to make sure I wasn't missing anything. Make
test didn't show problems, but it wouldn't until I actually tried to
change from using the packet id to using something else. Grep found sub
res_bgsend in URIDNSBL.pm that needed the same two line change as in
Dns.pm. There were no others.

There is a call to bgsend in sub search in DnsResolver.pm, but it doesn't
use the id to verify the replies and so doesn't need any changes. Once
everything else is working I would like to change that to make it more
robust.

I'll wait a while to give everyone else a chance to respond across time
zones before I check it in.

 -- sidney




Re: Humorous to me ...

2005-05-10 Thread Sidney Markowitz
It's funny except I'm getting one of those challenge messages for each
one I send to this list. I don't want to give in to that crap by
responding to register my email address with a stranger. I guess that's
what blacklist-from is for.

I wonder if that service has the ability to whitelist-to a mailing list
address and if that person is just being clueless?

This is the second mailing list I've encountered this on in the past two
weeks, BTW. I hope it isn't a trend.

 -- sidney




Build broken?

2005-05-11 Thread Sidney Markowitz
Is the build broken or is it something I screwed up locally? mimeheader.t and
uri_html.t are breaking when I run them:

$ t/mimeheader.t
1..2
# Running under perl version 5.008006 for cygwin
# Current time local: Wed May 11 23:12:38 2005
# Current time GMT:   Wed May 11 11:12:38 2005
# Using Test.pm version 1.25
/usr/bin/perl -T -w ../spamassassin -C log/test_rules_copy --siteconfigpath log/localrules.tmp -p log/tst.cf  -L -t  data/nice/004
[2300] warn: config: invalid regexp for rule MIMEHEADER_TEST1: /(?-xism:application/msword)/: Search pattern not terminated
[2300] info: config: SpamAssassin failed to parse line, MIMEHEADER_TEST1 content-type =~ /application/msword/ is not valid for mimeheader, skipping: mimeheader MIMEHEADER_TEST1 content-type =~ /application/msword/
[2300] warn: config: invalid regexp for rule MIMEHEADER_TEST2: /(?-xism:(?i)APPLICATION/MSWORD)/: Unmatched ( in regex; marked by -- HERE in m/-xism:( -- HERE /
[2300] info: config: SpamAssassin failed to parse line, MIMEHEADER_TEST2 content-type =~ m!APPLICATION/MSWORD!i is not valid for mimeheader, skipping: mimeheader MIMEHEADER_TEST2 content-type =~ m!APPLICATION/MSWORD!i
Checking  test1
Not found:  test1  =  MIMEHEADER_TEST1
not ok 1
# Failed test 1 in t/SATest.pm at line 575
Checking  test2
Not found:  test2  =  MIMEHEADER_TEST2
not ok 2
# Failed test 2 in t/SATest.pm at line 575 fail #2

   -

$ t/uri_html.t
1..2
# Running under perl version 5.008006 for cygwin
# Current time local: Wed May 11 23:12:56 2005
# Current time GMT:   Wed May 11 11:12:56 2005
# Using Test.pm version 1.25
did not find http://neverp4yretail.com/bam/[?]man=mic49
not ok 1
# Failed test 1 in t/uri_html.t at line 52
ok 2


Re: svn commit: r169596 - /spamassassin/trunk/lib/Mail/SpamAssassin/Conf/Parser.pm

2005-05-11 Thread Sidney Markowitz
Justin Mason wrote:
 Are there tests in the test suite for the redirector usage case btw?

Excuse me if I'm misunderstanding the question in my fog-before-first-coffee
of the morning...

The redirector patterns are hardcoded in sub try_canon in uri.t so any
change to them in 20_uri.cf has to be copied there.

The redirector patterns in 20_uri.cf are tested by one case in uri_html.t
which does not appear in uri_text.t. Once it is working, we should probably
add a case for each pattern and have them be in both uri_html.t and uri_text.t.
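
As a rough picture of what "a case for each pattern" could look like, here
is a table-driven sketch in Python rather than in the Perl of the test
suite. The two redirector patterns and all the URLs are hypothetical
stand-ins; the real patterns are the ones in 20_uri.cf and are not
reproduced here.

```python
import re

# Hypothetical redirector patterns, for illustration only.
REDIRECTOR_PATTERNS = [
    re.compile(r"^https?://redirect\.example\.com/\?url=(.+)$", re.IGNORECASE),
    re.compile(r"^https?://go\.example\.net/r/(.+)$", re.IGNORECASE),
]


def redirect_target(uri):
    """Return the URI hidden behind a known redirector, or None."""
    for pattern in REDIRECTOR_PATTERNS:
        match = pattern.match(uri)
        if match:
            return match.group(1)
    return None


# One case per pattern plus a negative case, kept in a single table so
# that more than one test script could share it.
CASES = [
    ("http://redirect.example.com/?url=http://spam.example.org/",
     "http://spam.example.org/"),
    ("http://go.example.net/r/http://spam.example.org/x",
     "http://spam.example.org/x"),
    ("http://plain.example.org/", None),
]


def failed_cases():
    """Return the cases whose extracted target differs from the expected one."""
    return [(uri, want) for uri, want in CASES if redirect_target(uri) != want]
```

Keeping the cases in one table is what would avoid the current situation
where a change to the patterns has to be copied by hand into another test.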

 -- sidney




Re: t/dnsbl.t failing

2005-05-11 Thread Sidney Markowitz
Theo Van Dinter wrote:
 I don't know if this is a known issue, but it seems like tests 1-18 fail (of
 22) for t/dnsbl.t ...   From what I can see, most of the lookups timeout at
 15s which blows the tests out of the water.

I've been seeing that with varying regularity depending on which network
I'm on. It's pretty consistently bad on my home DSL, but works if I run
it again immediately after. I guess then the queries are cached
somewhere.

Could it be that bugzilla just isn't a good machine to be using as the
spamassassin.org nameserver as it's too slow to respond?

 -- sidney




Re: t/dnsbl.t failing

2005-05-12 Thread Sidney Markowitz
I got some debugging output and it looks like something is quite wrong, but
I don't have time to look at it right now. Maybe tonight or tomorrow if
nobody else catches it first.

 -- sidney




Re: Weekly net run, still has issues

2005-05-22 Thread Sidney Markowitz
Theo,

Could you try running with the "bogus rr for domain" warn statement in
URIDNSBL modified to output $packet and $ent->{id} instead of
$packet->header->id? That will make the warning message a bit more
verbose, but you aren't seeing that many of them anyway, and it will
provide helpful debug information.

 -- sidney




Re: weekly net run bugs

2005-05-23 Thread Sidney Markowitz
Justin Mason said:
 mystery solved ;)

Aww, I was looking forward to tracking down a really mysterious bug :)

 -- sidney




Buildbot question

2005-05-23 Thread Sidney Markowitz
Does anyone know if we should be able to use the latest version of
Buildbot, 0.6.5 with buildbot.spamassassin.org? I know that I could just
try it, but I don't want to spend time trying to get it to work only to
find that the master has to be upgraded first.

 -- sidney




Re: Buildbot question

2005-05-23 Thread Sidney Markowitz
Justin Mason said:
 yep, should be possible -- create a t/config file that enables it in
 the buildbot slave's checkout ;)

I thought that's under the control of the master. Doesn't the script
recreate the entire trunk every time? Oh, of course that would be too
expensive. Ok, I'll edit the t/config file.

Right now I'm having a problem getting buildbot 0.6.2 running under Cygwin
after it upgraded python to 2.4. It dies with a permission denied error
trying to switchuid. I haven't decided whether to try to track down the
problem or take a chance on upgrading to buildbot 0.6.5 first to see what
happens.

 -- sidney




Re: Buildbot question

2005-05-23 Thread Sidney Markowitz
Justin Mason said:
 afaik you can.

Ok, I'll try it. First I'll confirm that I can get the 0.6.2 that I have
installed running again, as I've had it down for a while.

Another question -- Can we have a way of enabling network test for the
buildbot runs? I can see how it should be an option, as some people might
not want to load their network every time they run an automated test, but
I don't mind, and the network stuff should be tested too.

 -- sidney



