RE: [Razor-users] Spam sent to mailinglists

Cyrus J. Lesser Mon, 07 Jun 2004 11:24:37 -0700

I agree with whoever isn't you.

The only reason to send in a revoke request would be if you had
received a non-spam message and it had been flagged as spam by Razor.

If it ain't in the system, then sending in a revoke is just wasting
everybody's re[...]

-----Original Message-----
From: Mabry Tyson [mailto:[EMAIL PROTECTED]]
Sent: Sunday, 30 May 2004 8:40 PM
To: Matt Kettler
Cc: [EMAIL PROTECTED]
Subject: Re: [Razor-users] Spam sent to mailinglists

Matt Kettler wrote:

> At 03:16 PM 5/27/2004, Mabry Tyson wrote:
>
>> >
>> This illustrates one of my complaints of most spam reporting 
>> (including for Razor).   The basic idea people have is "I should 
>> report spam".
>> THAT ISN'T ENOUGH.  You have to report non-spam as well.   If all 
>> that is reported is spam, then there is nothing to compare it 
>> against.  You bias the system to recognize everything as spam and 
>> your false positives (non-spam classified as spam) soar (where 0.5% 
>> is too high).
>
>
> Dude, razor is NOT a bayes subsystem.
>
Dude, I didn't say it was Bayesian.  

> Do not try to apply bayes concepts to razor, as it doesn't work in 
> anything REMOTELY resembling the bayes model.

I surely don't see the word "probability" in anything I said.   
Performance statistics are not probabilities.  There is nothing in what 
I said that is remotely related to the Bayes rule or the propagation of 
probabilities.

>
> Unless the message is already in the razor database, reporting a 
> nonspam message to razor-revoke is pointless and a waste of bandwidth.

>
> Razor is a database of hashes of known spam messages. Period.

Wrong.

Obviously, you've never looked under the hood of razor and have no idea 
how it really works.  Try turning debugging on (see below).  First off, 
it doesn't do a hash of a message as a whole except in the simplest of 
cases.  It breaks it up into the mime parts (or, at least a rough 
approximation of the innermost mime parts).  That way, a spam attachment
is hashed separately (and has a separate score) from the body of a
normal mail message.
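
As a rough illustration of the per-part hashing described above, here is a minimal Python sketch. It is not Razor's code: SHA-1 over the decoded payload stands in for Razor's real signature engines (the "e4" in the log further down), the function name `part_digests` is made up for this example, and the sample message is hypothetical apart from the GIF payload, which is taken from the diff later in this post.

```python
import hashlib
from email import message_from_string


def part_digests(raw_message):
    """Return (content_type, sha1_hex) for every leaf MIME part.

    Assumption: SHA-1 of the decoded payload is only a stand-in for
    whatever signature Razor actually computes per part.
    """
    msg = message_from_string(raw_message)
    digests = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        payload = part.get_payload(decode=True) or b""
        digests.append((part.get_content_type(),
                        hashlib.sha1(payload).hexdigest()))
    return digests


# Hypothetical two-part message: a text body plus a tiny inline GIF.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain

Hello, see the attached image.
--XYZ
Content-Type: image/gif
Content-Transfer-Encoding: base64

R0lGODlhAQABAIAAAP///wAAACH5BAAAAAAALAAAAAABAAEAAAICRAEAOw==
--XYZ--
"""

digests = part_digests(RAW)
for content_type, digest in digests:
    print(content_type, digest)
```

Each part gets its own digest, so the image can match a database entry even when the text body is unique to this message.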

The Razor documentation is vague on how the score we get back from the 
razor servers is computed, but it does say:

> Note that even after a successful revoke, a mail might still be 
> considered spam in the Razor Catalogue. For instance, this can occur 
> if more trusted users consider the mail spam than not spam.

That is a voting system.   True, there isn't anything that says a revoke
on some mime part will be recorded if there wasn't a previous report.  
But, if a mime part is submitted in some number (say 100) of trusted 
reports of spam, then it will poison any message it is in and cause it 
to be judged as spam.   If on the other hand, there were 200 trusted 
reports of that same mime part being in non-spam messages, then Razor may 
(I don't know the algorithms) report that part as non-spam.

Razor consists of agents (which is what you and I run), servers, and a 
database.   The database has hashes of mime parts of messages (or the 
body, if the message does not have mime parts) and the number of spam 
and non-spam reports for each hash.   When a message is reported as 
spam, the hash of each part of that message is given a vote as spam.  
The part may have nothing to do with the spam qualities of the message.

If one of those pieces has had enough more votes as spam than as
non-spam, then the server will report that the part is spam.  In the
agent, if one part is reputed to be spam, the whole message is marked
as spam.

(There are clearly more parts to the database that deal with the 
reporters (such as you and me) and each one's trust score.)
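
The model described above (per-part vote counts, and one bad part condemning the whole message) can be sketched roughly as follows. This is a toy under loud assumptions, not Razor's algorithm: the class and method names are hypothetical, every vote counts equally (no trust scores), a revoke is recorded even without a prior report, and the `margin` threshold is invented.

```python
from dataclasses import dataclass


@dataclass
class Votes:
    spam: int = 0
    ham: int = 0


class Catalogue:
    """Toy stand-in for the server-side database (hypothetical)."""

    def __init__(self, margin=1):
        self.margin = margin  # assumed: spam votes must lead by this much
        self.votes = {}       # part hash -> Votes

    def report(self, part_hashes):
        # Every part of a reported message gets a spam vote,
        # whether or not that part is what made the message spam.
        for h in part_hashes:
            self.votes.setdefault(h, Votes()).spam += 1

    def revoke(self, part_hashes):
        # Assumption: revokes are recorded even with no prior report.
        for h in part_hashes:
            self.votes.setdefault(h, Votes()).ham += 1

    def part_is_spam(self, h):
        v = self.votes.get(h)
        return v is not None and v.spam - v.ham >= self.margin

    def message_is_spam(self, part_hashes):
        # Agent-side rule: one spam part marks the whole message.
        return any(self.part_is_spam(h) for h in part_hashes)


catalogue = Catalogue()
for _ in range(100):
    catalogue.report(["spam-body", "spacer-gif"])
before = catalogue.message_is_spam(["article-body", "spacer-gif"])

for _ in range(200):
    catalogue.revoke(["spacer-gif"])
after = catalogue.message_is_spam(["article-body", "spacer-gif"])

print(before, after)  # True False
```

In this toy, the shared "spacer-gif" part poisons an otherwise unknown message until enough non-spam votes outweigh the spam votes, which is the 100-versus-200 scenario sketched earlier.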

I'm not sure how Razor computes scores that are different than 0% or 
100% spam.  For all I know, the razor servers could be doing some sort 
of Bayesian analysis based upon whether a mime part shows up in messages
that are scored as spam or non-spam.   (But I think not.)

>
> Razor is NOT a tokenizer. Razor is NOT a learning system. You don't 
> "train" razor. You can't teach it to recognize nonspam by feeding it 
> random nonspam messages. Razor recognizes only the exact message you 
> report. If that message never appears again in the world, reporting it
> is a waste.
>
Of course you train razor.  Otherwise you couldn't get today's spam into
it so you can catch that spam tomorrow.   It learns what you tell it 
(but no more).
But you don't train it in a way that it can make a judgement whether a 
mime part that it has never seen is or is not spam.

-----

For those of you who have never bothered looking at how the razor agent
works, read on.  You may be surprised at how little evidence is enough 
to damn a message to be judged as spam.

Here is a bit of the debug logging of razor processing a message where a
colleague mailed an article from the Fortune magazine site
(fortune.com).

Razor finds 24 pieces of the message (corresponding to the images, etc. 
which are each a separate mime part in this HTML-encoded message).

> [ 8] mail 1 Subject: [aic-coffee] Re: Fortune.com - Investing - Is a 
> Futures Marke
> [ 6] preproc: mail 1.0 went from 29847 bytes to 8160
> [ 6] preproc: mail 1.1 went from 275 bytes to 43
> [ 6] preproc: mail 1.2 went from 13675 bytes to 9957
> ...
> [ 6] preproc: mail 1.21 went from 15720 bytes to 11468
> [ 6] preproc: mail 1.22 went from 309 bytes to 67
> [ 6] preproc: mail 1.23 went from 654 bytes to 425
> [ 6] computing sigs for mail 1.0, len 8160
> [ 6] computing sigs for mail 1.1, len 43
> [ 6] computing sigs for mail 1.2, len 9957
> ...
> [ 6] computing sigs for mail 1.21, len 11468
> [ 6] computing sigs for mail 1.22, len 67
> [ 6] computing sigs for mail 1.23, len 425
> ...
> [ 8] mail 1.0 e4 sig: EaeokfZ-zSmUii2CHD5uKJQ3nUsA
> [ 8] mail 1.1 e4 sig: TpyKug4ao2UsqHh0NPybuTtPM7EA
> [ 8] mail 1.2 e4 sig: 4uZWZ0BZsF64MPMCPwW5KW4eLb8A
> ...
> [ 8] mail 1.21 e4 sig: hlWqzXmY-zHYSsS2mmIeKN0QtPoA
> [ 8] mail 1.22 e4 sig: SIbJGi0MsRoEHK1GKbJAZghvUqQA
> [ 8] mail 1.23 e4 sig: 3dwMVJbdCPV73VMlHmC-Xfsbo-IA
> [ 8] preparing 24 queries
> ...
> [ 6] mail 1.0 e=4 sig=EaeokfZ-zSmUii2CHD5uKJQ3nUsA: sig not found.
> [ 6] mail 1.1 e=4 sig=TpyKug4ao2UsqHh0NPybuTtPM7EA: Is spam: cf 100 >= min_cf 6
> [ 6] mail 1.2 e=4 sig=4uZWZ0BZsF64MPMCPwW5KW4eLb8A: sig not found.
> ...
> [ 6] mail 1.21 e=4 sig=hlWqzXmY-zHYSsS2mmIeKN0QtPoA: sig not found.
> [ 6] mail 1.22 e=4 sig=SIbJGi0MsRoEHK1GKbJAZghvUqQA: Is spam: cf 100 >= min_cf 6
> [ 6] mail 1.23 e=4 sig=3dwMVJbdCPV73VMlHmC-Xfsbo-IA: sig not found.
> [ 7] method 4: mail 1.0: no-contention part, spam=0
> [ 7] method 4: mail 1.1: no-contention part, spam=1
> [ 7] method 4: mail 1.2: no-contention part, spam=0
> ...
> [ 7] method 4: mail 1.21: no-contention part, spam=0
> [ 7] method 4: mail 1.22: no-contention part, spam=1
> [ 7] method 4: mail 1.23: no-contention part, spam=0
> [ 7] method 4: mail 1: a non-contention part was spam, mail spam
> [ 3] mail 1 is known spam.
>
>
Only those two parts (1.1 and 1.22) were judged as spam.  I removed 
those two parts and re-ran razor.  All sizes and hashes of the remaining
parts were the same, but there are two fewer parts.  The message had no 
"spam=1" results and was not considered "known spam".   (It only takes 
one such part to cause the message to be considered spam.)
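
For anyone who wants to pull the offending parts out of their own debug output, here is a small Python sketch. The regular expression is inferred only from the log excerpt above, so treat the line format as an assumption; the function name `flagged_parts` is made up, and the sample lines are copied from that excerpt with the mail-client line wrapping undone.

```python
import re

# Pattern inferred from the excerpt above; other Razor versions may differ.
LINE_RE = re.compile(
    r"mail (?P<part>\d+\.\d+) e=(?P<engine>\d+) sig=(?P<sig>\S+): "
    r"(?:Is spam: cf (?P<cf>\d+) >= min_cf (?P<min_cf>\d+)|sig not found\.)"
)


def flagged_parts(log_text):
    """Return (part, cf, min_cf) for each part the server called spam."""
    flagged = []
    for m in LINE_RE.finditer(log_text):
        if m.group("cf") is not None:
            flagged.append(
                (m.group("part"), int(m.group("cf")), int(m.group("min_cf")))
            )
    return flagged


LOG = """\
[ 6] mail 1.0 e=4 sig=EaeokfZ-zSmUii2CHD5uKJQ3nUsA: sig not found.
[ 6] mail 1.1 e=4 sig=TpyKug4ao2UsqHh0NPybuTtPM7EA: Is spam: cf 100 >= min_cf 6
[ 6] mail 1.22 e=4 sig=SIbJGi0MsRoEHK1GKbJAZghvUqQA: Is spam: cf 100 >= min_cf 6
[ 6] mail 1.23 e=4 sig=3dwMVJbdCPV73VMlHmC-Xfsbo-IA: sig not found.
"""

hits = flagged_parts(LOG)
print(hits)             # [('1.1', 100, 6), ('1.22', 100, 6)]
print(bool(hits))       # True: one flagged part is enough
```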

By the way, SpamAssassin will score this as though Razor has judged it 
to be 100% likely to be spam.

Here is the diff of the two messages to show the two parts. (The lines 
beginning with a single "-" are what was removed.  The output of the 
diff has been tweaked a bit to reduce confusion and unneeded verbiage)

> manresa<2> 30: diff -c 1264.msg 1264.2.msg
> ***************
> *** 633,645 ****
> - --------------DC5785B898CE5064281096E1
> - Content-Type: image/gif
> - Content-ID: <[EMAIL PROTECTED]>
> - Content-Transfer-Encoding: base64
> - Content-Disposition: inline; filename="/tmp/nsmail3F27ED9766514A3.gif"
> -
> - R0lGODlhAQABAIAAAP///wAAACH5BAAAAAAALAAAAAABAAEAAAICRAEAOw==
>   --------------DC5785B898CE5064281096E1
>  --- 633,638 ----
> ***************
> *** 1958,1971 ****
> - --------------DC5785B898CE5064281096E1
> - Content-Type: image/gif
> - Content-ID: <[EMAIL PROTECTED]>
> - Content-Transfer-Encoding: base64
> - Content-Disposition: inline; filename="/tmp/nsmail3F27ED9767A14A3.gif"
> -
> - R0lGODlhAQABAJH/AP///wAAAP///wAAACH/C0FET0JFOklSMS4wAt7tACH5BAEAAAIALAAA
> - AAABAAEAAAICVAEAOw==
>   --------------DC5785B898CE5064281096E1--
> --- 1951,1956 ----

Such obvious spam!   Those two small sections caused this 125KB message 
to be judged as spam.

The first gif file is 43 bytes long and contains a 1x1 pixel image that 
is white.
The second gif is 67 bytes long and contains a 1x1 pixel image that is 
transparent.
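
The sizes and dimensions can be checked directly from the base64 text in the diff above. A short Python sketch (the helper name `gif_info` is made up; it reads only the GIF signature and the little-endian width and height from the logical screen descriptor, and does not check the pixel colour or transparency):

```python
import base64
import struct

# Base64 payloads copied from the two removed parts in the diff.
SPACER_1 = "R0lGODlhAQABAIAAAP///wAAACH5BAAAAAAALAAAAAABAAEAAAICRAEAOw=="
SPACER_2 = (
    "R0lGODlhAQABAJH/AP///wAAAP///wAAACH/C0FET0JFOklSMS4wAt7tACH5BAEAAAIALAAA"
    "AAABAAEAAAICVAEAOw=="
)


def gif_info(b64_text):
    """Return (signature, width, height, size_in_bytes) for a GIF."""
    data = base64.b64decode(b64_text)
    signature = data[:6]
    # Bytes 6-9 hold width and height as 16-bit little-endian integers.
    width, height = struct.unpack("<HH", data[6:10])
    return signature, width, height, len(data)


print(gif_info(SPACER_1))  # (b'GIF89a', 1, 1, 43)
print(gif_info(SPACER_2))  # (b'GIF89a', 1, 1, 67)
```

The decoded lengths, 43 and 67 bytes, match the "preproc" sizes reported for parts 1.1 and 1.22 in the debug log.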

Either one of those is enough....

Here's how those two pieces are used:

A (graphical) line used to break between sections in a table or to set 
the width of the table.
   <img SRC="cid:[EMAIL PROTECTED]" BORDER=0 height=5 width=767>
A single pixel which was used in an otherwise empty cell in a table 
(apparently used for spacing)
   <img SRC="cid:[EMAIL PROTECTED]" BORDER=0 height=1 width=1>

I suspect the person at Fortune magazine that created the web page that 
was included in this message has never constructed spam with these 
pieces.  It might also be that no message including a web page from 
Fortune was ever submitted as spam to razor.  But it might very well be 
that some spammer used the same web design tools as the author of the 
web page and his spam included the same small images.   Or maybe, he 
just happened to create identical small images.

I've used these small mime parts to illustrate the point, but there is 
no size limit for this effect.  If spammers included a popular icon of 
the Linux penguin and enough got reported, then any mail that also 
included that icon would be labelled spam.  (Of course it would have to 
be the same icon, bit-by-bit.)

----

I also have non-image versions of this same effect.   For instance, if 
you use IMAP for reading your mail, you probably have seen this as the 
body of the first message of your mail folder/file.

> This text is part of the internal format of your mail folder, and is not
> a real message.  It is created automatically by the mail system software.
> If deleted, important folder data will be lost, and it will be re-created
> with the data reset to initial values.

Yep, some luser has reported this as spam and not enough people have
reported it as non-spam.   Razor scores it as spam.  Sure, you're 
unlikely to get this in a new mail message, but it just shows that 
people report non-spam as spam.  (Oh, if you run a program that removes 
all your razor-reported spam in this folder, you'll lose this message.  
The claim of "important folder data will be lost" is a bit exaggerated.)

----

I've also seen a mime section scored by razor as spam that was basically
an empty part of a forwarded message.  I'm sure some mail was reported 
as spam that had that section, but it is also reasonably common in 
regular mail (which is, of course, how I found it).

> manresa<2> 32: diff -c 1379.msg 1379.1.msg
> *** 1379.msg    Sat Apr 17 10:43:16 2004
> --- 1379.1.msg  Sat Apr 17 10:53:41 2004
> ***************
> *** 169,189 ****
>   > =A0
>   >
>
> - --Apple-Mail-108--11777278
> - Content-Transfer-Encoding: quoted-printable
> - Content-Type: text/enriched;
> -       charset=ISO-8859-1
> -
> -
> -
> - <excerpt>
> -
> -
> - =A0
> -
> -
> - </excerpt>=
> -
>   --Apple-Mail-108--11777278--
>
>   --Apple-Mail-106--11777282
> --- 169,174 ----

----

Then there's the whole other issue about whether what John reports as 
spam is non-spam to Bob.   I don't consider mail bounce messages to be 
spam, but maybe you do.  Maybe you are sick and tired of all those 
bounce messages of mail that a mail worm forged as though from you and 
sent off to some non-existent address.

Yep, those get reported as spam to Razor.   As a result, real bounce 
messages where you typo'ed someone's email address are being judged as 
spam because they include some mime part that is common to all of that 
site's (or all sites using that MTA) bounce messages.

I have more examples, but I believe the point has been made.
---------------

> You can't teach it to recognize nonspam by feeding it random nonspam 
> messages. Razor recognizes only the exact message you report. If that 
> message never appears again in the world, reporting it is a waste.

But non-spam messages are not necessarily "random".    If you don't feed
razor non-spam, then mime parts that are common to both spam and 
non-spam will cause the non-spam to be considered spam.   What is a 
waste is that message you sent that got rejected as spam because it 
contained a one-pixel image that razor once found in a spam message.

You have to report non-spam as well.   If all that is reported is spam, 
then there is nothing to compare it against.  You bias the system to 
recognize everything as spam and your false positives (non-spam 
classified as spam) soar (where 0.5% is too high).

_______________________________________________
Razor-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/razor-users
