Tony Earnshaw wrote:
> Tom Bombadil skrev, on 13-09-2007 20:01:
> 
>>> Not this way, you can train a batch of both spam and non-spam with
>>> dspam_train, but it wants both together.
>>>
>>>> If the signature is not present, the command above fails.
>>> That's what the signature's for ... you're teaching it to reverse the
>>> tokens that the signature points to, and they don't exist.
>>
>> Yes... my point is why having signatures at all? My signatures table is
>> HUGE.
> 
> You say nothing about what DB backend you're running, your user base, or
> anything else.
> 
> Our 1500+ user base (using a single shared group) MySQL 5.0 InnoDB
> ibdata1 file (that comprises all tables of all of our 3 MySQL databases)
> is 105MB and stable in size. I run dspam_clean -p on it each Sunday (man
> dspam_clean). I use dspam_clean rather than the SQL purge script.

I too have a MySQL 5 InnoDB backend (one file per table, though). Because
I purge entries weekly, I have 2 to 3 weeks' worth of signatures in my
table. Last time I checked, the dspam_signature_data table was over
200GB.

As far as I understand, when using shared groups, the
dspam_signature_data table grows with the number of messages we
receive, not the number of users (even though the two are correlated).
We process about a million messages a day.
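
To put those numbers in perspective, here is a rough back-of-the-envelope
calculation. It uses only the figures above and assumes (my assumption)
that every processed message leaves exactly one signature row and that
"GB" means 10^9 bytes; the function name is just for illustration.

```python
# Rough estimate of the average signature size, from the figures in this post.
MSGS_PER_DAY = 1_000_000       # "about a million messages a day"
TABLE_BYTES = 200 * 10**9      # "over 200GB", taken as decimal gigabytes


def bytes_per_signature(retention_days: int) -> float:
    """Average bytes per message, assuming one signature row per message."""
    return TABLE_BYTES / (MSGS_PER_DAY * retention_days)


if __name__ == "__main__":
    # 2 to 3 weeks of signatures are retained between purges.
    for days in (14, 21):
        kb = bytes_per_signature(days) / 1000
        print(f"{days} days retained: roughly {kb:.1f} kB per signature")
```

So on these assumptions each signature costs somewhere around 10 to 14 kB,
which is why the table size tracks message volume rather than user count.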

> 
>> If I'm feeding dspam with the message in pristine format (without dspam
>> headers, and stuff), dspam could correct errors even if there is no
>> signatures... Am I dreaming here?
> 
> You're dreaming, that's not how it works ... if there's no signature, it
> can't correlate the signature data with the tokens, doesn't know what
> the original tokens were.

Yes... I understand that this is not the way it works (although that is
not documented). But I see no reason why it couldn't (or shouldn't) work
this way. The message itself could provide the tokens, instead of
pre-storing them in the dspam_signature_data table.

The setup would be much more IO/disk-space friendly if we only had to
store the tokens once (in the message itself), instead of twice (in the
message and in the dspam_signature_data table).
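
To illustrate what I mean, here is a minimal toy sketch. It is NOT how
DSPAM is implemented; the tokenizer, the class and the method names are
all made up for this example. It only shows the idea of re-deriving the
tokens from the pristine message at correction time, which works only if
tokenizing the same message always yields the same tokens.

```python
import re
from collections import Counter


def tokenize(message: str) -> list[str]:
    # Toy tokenizer (hypothetical): lowercase alphanumeric words only.
    return re.findall(r"[a-z0-9']+", message.lower())


class ToyTokenStore:
    """Per-class token counts, with no per-message signature kept."""

    def __init__(self) -> None:
        self.counts = {"spam": Counter(), "innocent": Counter()}

    def learn(self, message: str, label: str) -> None:
        # Initial training: count the message's tokens under the given class.
        self.counts[label].update(tokenize(message))

    def relearn(self, message: str, wrong: str, right: str) -> None:
        # Correction: re-tokenize the pristine message instead of
        # looking up a stored signature.
        tokens = Counter(tokenize(message))
        self.counts[wrong].subtract(tokens)
        # Unary plus drops tokens whose count fell to zero or below.
        self.counts[wrong] = +self.counts[wrong]
        self.counts[right].update(tokens)


if __name__ == "__main__":
    store = ToyTokenStore()
    store.learn("Cheap pills now", "innocent")               # misclassified
    store.relearn("Cheap pills now", "innocent", "spam")     # corrected
    print(store.counts)
```

The obvious catch is that the message fed back must be byte-for-byte what
was originally tokenized, otherwise the wrong counts get reversed.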

Thanks Tony... Have a great weekend!
