When looking at the "Use this HTML Parser" section on the GUI, I found this
line:

it is recommended to set MaxBytes to 50000 (be carefull on heavy load
systems - spam bomb regular expressions will take longer using 50000!).

I'm going to change my settings and see how bad the rebuild time is.  I've
got enough processing power and RAM now, but the disks aren't SSDs, just a
4-disk RAID 1+0 array of traditional HDDs.  We'll see...

Since HTML email accounts for a big percentage of all mail, might it be a
good idea to update/expand the guidance in the MaxBytes section of the
GUI?
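
To make the idea from my earlier message (quoted below) concrete, here is a
rough sketch, in Python purely for illustration (ASSP itself is Perl, and
this is not how ASSP is implemented).  MaxBytesAdditionalForBombs is only the
proposed setting and does not exist today; the values, the function names and
the phone-number pattern are all made up for the example.  The point is just
that the bomb regex sees a longer prefix of the message than Bayesian/HMM
does:

```python
import re

# Assumed values for illustration. MaxBytes is the existing setting name;
# MaxBytesAdditionalForBombs is only the *proposed* setting and does not exist.
MAX_BYTES = 3000
MAX_BYTES_ADDITIONAL_FOR_BOMBS = 47000

# Made-up pattern for a scam phone number (fictional 555-01xx range);
# not a real bombDataRE entry.
BOMB_RE = re.compile(rb"\b1[-. ]800[-. ]555[-. ]0\d{3}\b")


def scan_windows(message: bytes):
    """Return (bayes_window, bomb_window) as prefixes of the raw message."""
    # Bayesian/HMM would keep looking at MaxBytes only.
    bayes_window = message[:MAX_BYTES]
    # Bomb checks would look at MaxBytes plus the additional allowance.
    bomb_window = message[:MAX_BYTES + MAX_BYTES_ADDITIONAL_FOR_BOMBS]
    return bayes_window, bomb_window


def bomb_hit(message: bytes) -> bool:
    """True if the bomb pattern matches anywhere in the larger bomb window."""
    _, bomb_window = scan_windows(message)
    return BOMB_RE.search(bomb_window) is not None
```

With a 20kb message that carries the number at the very end, bomb_hit() finds
it, while a search limited to the first 3000 bytes does not.  The extra cost
is the regex running over the longer window for every incoming message, which
is exactly the performance question.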



On Fri, Oct 29, 2021 at 8:40 PM K Post <nntp.p...@gmail.com> wrote:

> Summary:
> *Should/could any consideration be given to having ASSP scan the entire
> message at the time it is received for Bombs (only), while still using
> MaxBytes for Bayesian/HMM?*
>
> We've been having some cleverly crafted messages slipping through all
> filters that would be easy to catch with Bombs if only the catchable
> content came before MaxBytes.  These messages are 20kb+ and have a scam
> phone number at the very end, well past the MaxBytes cutoff.  I
> want/need to use bombs to catch the scam phone numbers.
>
> With MaxBytes set to 3000, which is useful for a faster RebuildSpamDB, these
> BombDataRE matches just aren't being caught.  If I increase MaxBytes, my
> BombDataRE catches them, but then RebuildSpamDB (probably; see below) takes
> longer than it needs to.
>
> So, is there any value in considering a *MaxBytesAdditionalForBombs* variable
> which would be *added to MaxBytes* and only used when scanning for bombs
> as messages arrive?  Would that kill performance?  Other downsides?
>
> We could still only look at MaxBytes for Bayesian/HMM since it's only
> MaxBytes used when building those databases.
>
> What do you think?
>
> And while we're talking MaxBytes:
> I've asked this before: is the guidance of 3kb for MaxBytes, once there's
> a mature corpus, still a valid recommendation?  With unlimited horsepower
> and RAM, sure, why not, do 30kb or 100kb.  That's not my reality, so I want
> to see where to best allocate resources.  If 3kb is still the guidance, even
> though the spam files I'm seeing have a median size around 20kb, so be it.
> I feel like when that guidance was written, HTML wasn't used as
> prolifically in spam.  The median size of notspam in my corpus is about
> 40kb.  That's determined unscientifically by sorting by size and scrolling
> to approximately halfway down.
>
> Thanks.  Have a good weekend.
> Ken
>
>
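
A slightly less unscientific way to get the median sizes mentioned above,
sketched in Python.  The directory names in the usage note are placeholders,
not necessarily where a given install keeps its corpus, and the function name
is made up for the example:

```python
from pathlib import Path
from statistics import median


def median_file_size(corpus_dir: str) -> float:
    """Median size in bytes of the regular files directly inside corpus_dir."""
    sizes = [p.stat().st_size for p in Path(corpus_dir).iterdir() if p.is_file()]
    if not sizes:
        raise ValueError(f"no files found in {corpus_dir}")
    return median(sizes)
```

Calling median_file_size("/path/to/spam") and
median_file_size("/path/to/notspam") gives numbers that can be compared
directly against the current MaxBytes value.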
