Hello Loren,

Tuesday, July 26, 2005, 7:07:40 AM, you wrote:

>> Rather than the primary site language(s), I'd be more interested in
>> the email's language.

LW> I don't know if that is available, anyone know? If so it could be
LW> useful as aggregate information.

I don't know how reliable the algorithm is, but "ok_languages xx" has
to test against something to determine whether the language of the
email is OK or not...

LW> I think site language (or actually the allowed_languages values)
LW> would be useful too, since that would help shed light on FP hits
LW> in rules.

Agreed.

>> The way to get that would be to receive mass-check-like info, perhaps
>> a log of every email scanned, just the default (unmodified!)
>> X-Spam-Status line. These can then be examined for overlaps, freqs,
>> etc.

LW> The log of course could be huge for a site getting a couple
LW> million emails a day. I've never made an attempt to count the
LW> number of rules in either the SA or SARE distributions, but I
LW> would imagine it is in the small thousands, maybe less. A simple
LW> 2D matrix of which rules hit with each other might be more compact
LW> for a site that gets more than maybe 4K mails/day.

My latest pre4 mass-check of 86,140 spam takes 68,961,570 bytes.
That's 800.57545 bytes per message. Let's double that to allow for
test rules: 1,600 bytes.

Compressed (using gzip) it's 7,036,131. That's 81.682505 bytes per
message, or 163.4 bytes doubled.

A log of 10 million emails would therefore take 16 Gig of disk space,
or 1.6 Gig compressed. Yes, that's too much to store daily, and too
much to email.

Any reason why your email was back to me privately, and not on the
list?

Bob Menschel
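
For anyone who wants to rerun the storage estimate above with their own mail volume, here is a minimal sketch in Python. The three totals are the figures quoted in the message; the function names, the doubling factor as a parameter, and the 10 million message volume are illustrative choices, not part of mass-check or SpamAssassin.

```python
# Figures quoted in the message above (pre4 mass-check of spam).
SPAM_COUNT = 86_140        # messages in the mass-check log
RAW_BYTES = 68_961_570     # uncompressed log size
GZIP_BYTES = 7_036_131     # same log after gzip

# Assumption taken from the message: double the per-message size
# to leave room for test rules.
TEST_RULE_FACTOR = 2


def per_message(total_bytes, messages=SPAM_COUNT):
    """Average log bytes per scanned message."""
    return total_bytes / messages


def projected_bytes(total_bytes, message_volume, factor=TEST_RULE_FACTOR):
    """Estimated log size for a given number of messages."""
    return per_message(total_bytes) * factor * message_volume


if __name__ == "__main__":
    volume = 10_000_000  # hypothetical volume used in the estimate
    for label, total in (("raw", RAW_BYTES), ("gzip", GZIP_BYTES)):
        size_gb = projected_bytes(total, volume) / 1e9
        print(f"{label}: {per_message(total):.2f} bytes/msg, "
              f"{size_gb:.1f} GB for {volume:,} messages")
```

Changing `volume` to a site's own daily count gives a rough idea of whether a full per-message log or a compact rule matrix is the more practical thing to collect.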
