I agree with whoever isn't you. The only reason to send in a revoke request would be if you had received a non-spam message and it had been flagged as spam by Razor.
If it ain't in the system, then sending in a revoke is just wasting everybody's resources.

-----Original Message-----
From: Mabry Tyson [mailto:[EMAIL PROTECTED]
Sent: Sunday, 30 May 2004 8:40 PM
To: Matt Kettler
Cc: [EMAIL PROTECTED]
Subject: Re: [Razor-users] Spam sent to mailinglists

Matt Kettler wrote:

> At 03:16 PM 5/27/2004, Mabry Tyson wrote:
>
>> This illustrates one of my complaints of most spam reporting
>> (including for Razor). The basic idea people have is "I should
>> report spam".
>> THAT ISN'T ENOUGH. You have to report non-spam as well. If all
>> that is reported is spam, then there is nothing to compare it
>> against. You bias the system to recognize everything as spam and
>> your false positives (non-spam classified as spam) soar (where 0.5%
>> is too high).
>
> Dude, razor is NOT a bayes subsystem.

Dude, I didn't say it was Bayesian.

> Do not try to apply bayes concepts to razor, as it doesn't work in
> anything REMOTELY resembling the bayes model.

I surely don't see the word "probability" in anything I said. Performance statistics are not probabilities. There is nothing in what I said that is remotely related to the Bayes rule or the propagation of probabilities.

> Unless the message is already in the razor database, reporting a
> nonspam message to razor-revoke is pointless and a waste of bandwidth.
>
> Razor is a database of hashes of known spam messages. Period.

Wrong. Obviously, you've never looked under the hood of razor and have no idea how it really works. Try turning debugging on (see below).

First off, it doesn't do a hash of a message as a whole except in the simplest of cases. It breaks it up into the mime parts (or, at least a rough approximation of the innermost mime parts). That way, a spam attachment is hashed separately from (and has a separate score from) the body of a normal mail message.
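To make the per-part hashing concrete, here is a minimal Python sketch. It is an illustration only: it uses the standard library's email parser and plain SHA-1 as a stand-in, and it does not reproduce Razor's preprocessing or its real signature engines. The names build_message and part_signatures, the boundary string, and the sample texts are made up for this example; the GIF payload is the 43-byte one from the diff further down.

```python
import hashlib
from email import message_from_string

# 43-byte 1x1 GIF, copied from the diff later in this message.
PIXEL_B64 = "R0lGODlhAQABAIAAAP///wAAACH5BAAAAAAALAAAAAABAAEAAAICRAEAOw=="


def build_message(text):
    """Build a tiny two-part MIME message (hypothetical test data)."""
    return "\n".join([
        "MIME-Version: 1.0",
        'Content-Type: multipart/mixed; boundary="XYZ"',
        "",
        "--XYZ",
        "Content-Type: text/plain",
        "",
        text,
        "--XYZ",
        "Content-Type: image/gif",
        "Content-Transfer-Encoding: base64",
        "",
        PIXEL_B64,
        "--XYZ--",
        "",
    ])


def part_signatures(raw_message):
    """Return (content_type, size_in_bytes, sha1_hex) for each leaf MIME part.

    SHA-1 over the decoded payload is an assumption standing in for
    whatever signature scheme the real agent uses.
    """
    msg = message_from_string(raw_message)
    sigs = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        payload = part.get_payload(decode=True) or b""
        digest = hashlib.sha1(payload).hexdigest()
        sigs.append((part.get_content_type(), len(payload), digest))
    return sigs


if __name__ == "__main__":
    for raw in (build_message("Lunch on Friday?"),
                build_message("Buy cheap pills now!")):
        for ctype, size, digest in part_signatures(raw):
            print(ctype, size, digest)
        print()
```

The two sample messages have different text parts but carry the same image, so their image signatures are identical. That shared signature is what lets a report against one message affect the other.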
The Razor documentation is vague on how the score we get back from the razor servers is computed, but it does say:

> Note that even after a successful revoke, a mail might still be
> considered spam in the Razor Catalogue. For instance, this can occur
> if more trusted users consider the mail spam than not spam.

That is a voting system. True, there isn't anything that says a revoke on some mime part will be recorded if there wasn't a previous report. But, if a mime part is submitted in some number (say 100) of trusted reports of spam, then it will poison any message it is in and cause it to be judged as spam. If, on the other hand, there were 200 trusted reports of that same mime part being in non-spam messages, then it may (I don't know the algorithms) report that part as non-spam.

Razor consists of agents (which is what you and I run), servers, and a database. The database has hashes of mime parts of messages (or the body, if the message does not have mime parts) and the number of spam and non-spam reports for each hash. When a message is reported as spam, the hash of each part of that message is given a vote as spam. The part may have nothing to do with the spam qualities of the message. If one of those pieces has had enough more votes as spam than as non-spam, then the server will report that the part is spam. In the agent, if one part is reputed to be spam, the whole message is marked as spam. (There are clearly more parts to the database that deal with the reporters (such as you and me) and each one's trust score.)

I'm not sure how Razor computes scores that are different from 0% or 100% spam. For all I know, the razor servers could be doing some sort of Bayesian analysis based upon whether a mime part shows up in messages that are scored as spam or non-spam. (But I think not.)

> Razor is NOT a tokenizer. Razor is NOT a learning system. You don't
> "train" razor. You can't teach it to recognize nonspam by feeding it
> random nonspam messages.
> Razor recognizes only the exact message you
> report. If that message never appears again in the world, reporting it
> is a waste.

Of course you train razor. Otherwise you couldn't get today's spam into it so you can catch that spam tomorrow. It learns what you tell it (but no more). But you don't train it in a way that it can make a judgement whether a mime part that it has never seen is or is not spam.

-----

For those of you that have never bothered looking at how the razor agent works, read on. You may be surprised at how little evidence is enough to damn a message to be judged as spam.

Here is a bit of the debug logging of razor processing a message where a colleague mailed an article from the Fortune magazine site (fortune.com). Razor finds 24 pieces of the message (corresponding to the images, etc. which are each a separate mime part in this HTML-encoded message).

> [ 8] mail 1 Subject: [aic-coffee] Re: Fortune.com - Investing - Is a Futures Marke
> [ 6] preproc: mail 1.0 went from 29847 bytes to 8160
> [ 6] preproc: mail 1.1 went from 275 bytes to 43
> [ 6] preproc: mail 1.2 went from 13675 bytes to 9957
> ...
> [ 6] preproc: mail 1.21 went from 15720 bytes to 11468
> [ 6] preproc: mail 1.22 went from 309 bytes to 67
> [ 6] preproc: mail 1.23 went from 654 bytes to 425
> [ 6] computing sigs for mail 1.0, len 8160
> [ 6] computing sigs for mail 1.1, len 43
> [ 6] computing sigs for mail 1.2, len 9957
> ...
> [ 6] computing sigs for mail 1.21, len 11468
> [ 6] computing sigs for mail 1.22, len 67
> [ 6] computing sigs for mail 1.23, len 425
> ...
> [ 8] mail 1.0 e4 sig: EaeokfZ-zSmUii2CHD5uKJQ3nUsA
> [ 8] mail 1.1 e4 sig: TpyKug4ao2UsqHh0NPybuTtPM7EA
> [ 8] mail 1.2 e4 sig: 4uZWZ0BZsF64MPMCPwW5KW4eLb8A
> ...
> [ 8] mail 1.21 e4 sig: hlWqzXmY-zHYSsS2mmIeKN0QtPoA
> [ 8] mail 1.22 e4 sig: SIbJGi0MsRoEHK1GKbJAZghvUqQA
> [ 8] mail 1.23 e4 sig: 3dwMVJbdCPV73VMlHmC-Xfsbo-IA
> [ 8] preparing 24 queries
> ...
> [ 6] mail 1.0 e=4 sig=EaeokfZ-zSmUii2CHD5uKJQ3nUsA: sig not found.
> [ 6] mail 1.1 e=4 sig=TpyKug4ao2UsqHh0NPybuTtPM7EA: Is spam: cf 100 >= min_cf 6
> [ 6] mail 1.2 e=4 sig=4uZWZ0BZsF64MPMCPwW5KW4eLb8A: sig not found.
> ...
> [ 6] mail 1.21 e=4 sig=hlWqzXmY-zHYSsS2mmIeKN0QtPoA: sig not found.
> [ 6] mail 1.22 e=4 sig=SIbJGi0MsRoEHK1GKbJAZghvUqQA: Is spam: cf 100 >= min_cf 6
> [ 6] mail 1.23 e=4 sig=3dwMVJbdCPV73VMlHmC-Xfsbo-IA: sig not found.
> [ 7] method 4: mail 1.0: no-contention part, spam=0
> [ 7] method 4: mail 1.1: no-contention part, spam=1
> [ 7] method 4: mail 1.2: no-contention part, spam=0
> ...
> [ 7] method 4: mail 1.21: no-contention part, spam=0
> [ 7] method 4: mail 1.22: no-contention part, spam=1
> [ 7] method 4: mail 1.23: no-contention part, spam=0
> [ 7] method 4: mail 1: a non-contention part was spam, mail spam
> [ 3] mail 1 is known spam.

Only those two parts (1.1 and 1.22) were judged as spam. I removed those two parts and re-ran razor. All sizes and hashes of the remaining parts were the same, but there were two fewer parts. The message had no "spam=1" results and was not considered "known spam". (It only takes one such part to cause the message to be considered spam.)

By the way, SpamAssassin will score this as though Razor has judged it to be 100% likely to be spam.

Here is the diff of the two messages to show the two parts. (The lines beginning with a single "-" are what was removed.
The output of the diff has been tweaked a bit to reduce confusion and unneeded verbiage.)

> manresa<2> 30: diff -c 1264.msg 1264.2.msg
> ***************
> *** 633,645 ****
> - --------------DC5785B898CE5064281096E1
> - Content-Type: image/gif
> - Content-ID: <[EMAIL PROTECTED]>
> - Content-Transfer-Encoding: base64
> - Content-Disposition: inline; filename="/tmp/nsmail3F27ED9766514A3.gif"
> -
> - R0lGODlhAQABAIAAAP///wAAACH5BAAAAAAALAAAAAABAAEAAAICRAEAOw==
> --------------DC5785B898CE5064281096E1
> --- 633,638 ----
> ***************
> *** 1958,1971 ****
> - --------------DC5785B898CE5064281096E1
> - Content-Type: image/gif
> - Content-ID: <[EMAIL PROTECTED]>
> - Content-Transfer-Encoding: base64
> - Content-Disposition: inline; filename="/tmp/nsmail3F27ED9767A14A3.gif"
> -
> - R0lGODlhAQABAJH/AP///wAAAP///wAAACH/C0FET0JFOklSMS4wAt7tACH5BAEAAAIALAAA
> - AAABAAEAAAICVAEAOw==
> --------------DC5785B898CE5064281096E1--
> --- 1951,1956 ----

Such obvious spam! Those two small sections caused this 125KB message to be judged as spam. The first gif file is 43 bytes long and contains a 1x1 pixel image that is white. The second gif is 67 bytes long and contains a 1x1 pixel image that is transparent. Either one of those is enough....

Here's how those two pieces are used:

A (graphical) line used to break between sections in a table or to set the width of the table.

    <img SRC="cid:[EMAIL PROTECTED]" BORDER=0 height=5 width=767>

A single pixel which was used in an otherwise empty cell in a table (apparently used for spacing)

    <img SRC="cid:[EMAIL PROTECTED]" BORDER=0 height=1 width=1>

I suspect the person at Fortune magazine that created the web page that was included in this message has never constructed spam with these pieces. It might also be that no message including a web page from Fortune was ever submitted as spam to razor. But it might very well be that some spammer used the same web design tools as the author of the web page and his spam included the same small images.
Or maybe, he just happened to create identical small images.

I've used these small mime parts to illustrate the point, but there is no size limit for this effect. If spammers included a popular icon of the Linux penguin and enough got reported, then any mail that also included that icon would be labelled spam. (Of course it would have to be the same icon, bit-by-bit.)

----

I also have non-image versions of this same effect. For instance, if you use IMAP for reading your mail, you probably have seen this as the body of the first message of your mail folder/file.

> This text is part of the internal format of your mail folder, and is not
> a real message. It is created automatically by the mail system software.
> If deleted, important folder data will be lost, and it will be re-created
> with the data reset to initial values.

Yep, some luser has reported this as spam and not enough people have reported it as non-spam. Razor scores it as spam. Sure, you're unlikely to get this in a new mail message, but it just shows that people report non-spam as spam. (Oh, if you run a program that removes all your razor-reported spam in this folder, you'll lose this message. The claim of "important folder data will be lost" is a bit exaggerated.)

----

I've also seen a mime section scored by razor as spam that was basically an empty part of a forwarded message. I'm sure some mail was reported as spam that had that section, but it is also reasonably common in regular mail (which is, of course, how I found it).
> manresa<2> 32: diff -c 1379.msg 1379.1.msg
> *** 1379.msg Sat Apr 17 10:43:16 2004
> --- 1379.1.msg Sat Apr 17 10:53:41 2004
> ***************
> *** 169,189 ****
> > =A0
> >
>
> - --Apple-Mail-108--11777278
> - Content-Transfer-Encoding: quoted-printable
> - Content-Type: text/enriched;
> - charset=ISO-8859-1
> -
> -
> -
> - <excerpt>
> -
> -
> - =A0
> -
> -
> - </excerpt>=
> -
> --Apple-Mail-108--11777278--
>
> --Apple-Mail-106--11777282
> --- 169,174 ----

----

Then there's the whole other issue about whether what John reports as spam is non-spam to Bob. I don't consider mail bounce messages to be spam, but maybe you do. Maybe you are sick and tired of all those bounce messages of mail that a mail worm forged as though from you and sent off to some non-existent address. Yep, those get reported as spam to Razor. As a result, real bounce messages where you typo'ed someone's email address are being judged as spam because they include some mime part that is common to all of that site's (or all sites using that MTA) bounce messages.

I have more examples, but I believe the point has been made.

---------------

> You can't teach it to recognize nonspam by feeding it random nonspam
> messages. Razor recognizes only the exact message you report. If that
> message never appears again in the world, reporting it is a waste.

But non-spam messages are not necessarily "random". If you don't feed razor non-spam, then mime parts that are common to both spam and non-spam will cause the non-spam to be considered spam. What is a waste is that message you sent that got rejected as spam because it contained a one-pixel image that razor once found in a spam message.

You have to report non-spam as well. If all that is reported is spam, then there is nothing to compare it against. You bias the system to recognize everything as spam and your false positives (non-spam classified as spam) soar (where 0.5% is too high).
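The voting argument above can be sketched as a toy model in Python. Everything here is an assumption made for illustration: the class ToyCatalogue, the margin parameter, and the rule "spam votes minus non-spam votes" are invented, reporter trust is ignored, and a revoke is always recorded even if nothing was reported first (which the real servers may not do). Razor's actual server-side algorithm is not documented in this thread.

```python
from collections import defaultdict


class ToyCatalogue:
    """Hypothetical vote store: part signature -> [spam votes, non-spam votes]."""

    def __init__(self, margin=1):
        self.votes = defaultdict(lambda: [0, 0])
        self.margin = margin  # assumed lead that spam votes need over non-spam

    def report(self, part_sigs):
        """Reporting a message as spam votes 'spam' on every one of its parts."""
        for sig in part_sigs:
            self.votes[sig][0] += 1

    def revoke(self, part_sigs):
        """Revoking a message votes 'non-spam' on every one of its parts."""
        for sig in part_sigs:
            self.votes[sig][1] += 1

    def part_is_spam(self, sig):
        if sig not in self.votes:
            return False  # corresponds to "sig not found" in the debug log
        spam, nonspam = self.votes[sig]
        return spam - nonspam >= self.margin

    def message_is_spam(self, part_sigs):
        """One spam part is enough to condemn the whole message."""
        return any(self.part_is_spam(sig) for sig in part_sigs)


if __name__ == "__main__":
    cat = ToyCatalogue(margin=1)
    spam_a = ["spam-body-a", "pixel"]     # placeholder signatures
    spam_b = ["spam-body-b", "pixel"]
    article = ["article-text", "pixel"]   # legitimate mail sharing the pixel

    cat.report(spam_a)
    cat.report(spam_b)
    print(cat.message_is_spam(article))   # shared pixel has 2 spam votes, 0 non-spam

    cat.revoke(article)
    print(cat.message_is_spam(article))   # pixel now 2 vs 1
    cat.revoke(article)
    print(cat.message_is_spam(article))   # pixel now 2 vs 2
    print(cat.message_is_spam(spam_a))    # spam-body-a still 1 vs 0
```

In this model the legitimate article stays condemned until the non-spam votes on the shared pixel catch up with the spam votes, while the spam itself is still caught through its own body part. If only spam is ever reported, the shared part never gets a counter-vote.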
_______________________________________________
Razor-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/razor-users