http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4691





------- Additional Comments From [EMAIL PROTECTED]  2006-02-12 18:32 -------
I've been running it for 2+ months on a URI scraper that I use for uribl.com,
and for about 4-5 weeks on ~1000 MTAs.  Since I'm scraping a lot of HTML data,
I have to use lots of rawbody rules, and it comes in quite handy.  Something
that I couldn't catch before was this:

<html>
www.myspamuri.com
</html>

and now I can:

rawbody         HTML_URI_ONLY           m'<html> ?(<body> )?(www\.)?[a-z0-9\-]{5,64}\.(com|net|info|biz) ?(</body>)? ?</html>'i
range           HTML_URI_ONLY           bytetrim 0:256
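
A quick way to sanity-check the pattern outside SpamAssassin is sketched below in
Python (not the Perl that SpamAssassin itself uses; the regex syntax is the same
for this expression).  The normalize() helper is hypothetical: it assumes that
line breaks in the raw body end up as single spaces by the time the rule runs,
which is only a guess based on the " ?" separators in the pattern.

```python
import re

# Pattern transcribed from the HTML_URI_ONLY rule; the trailing "i" flag
# becomes re.IGNORECASE.
HTML_URI_ONLY = re.compile(
    r"<html> ?(<body> )?(www\.)?[a-z0-9\-]{5,64}\.(com|net|info|biz)"
    r" ?(</body>)? ?</html>",
    re.IGNORECASE,
)


def normalize(raw: str) -> str:
    # ASSUMPTION (not confirmed by the patch): runs of whitespace, including
    # newlines, collapse to one space before the pattern is applied.
    return " ".join(raw.split())


# The message body from the comment: nothing but a bare hostname.
uri_only = "<html>\nwww.myspamuri.com\n</html>"

# A body with ordinary markup and text around the hostname.
with_text = "<html>\n<p>Hello, see www.myspamuri.com for details</p>\n</html>"

uri_only_hit = HTML_URI_ONLY.search(normalize(uri_only)) is not None
with_text_hit = HTML_URI_ONLY.search(normalize(with_text)) is not None

print("uri only: ", uri_only_hit)
print("with text:", with_text_hit)
```

Under that assumption the bare-hostname body matches and the one with
surrounding text does not.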


A couple of things I'll note about my previous patch.  The first dbg() call in
get_range_data() produces a lot of debug output because it's called once per
rule, so it should be removed or commented out.

Also, negative offsets supplied on a byte range do not work because of this line:

+  if ($args && $args =~ m/(\d+)(:(\d+))?/) {

should be

+  if ($args && $args =~ m/(\-?\d+)(:(\d+))?/) {

This makes a rule like this start to work.

body            __FREEBIE_FOOTER        /Home.{1,5}Disclaimer.{1,5}Privacy Policy.{1,5}Unsubscribe/i
range           __FREEBIE_FOOTER        byte -256:256
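
To see why the original pattern loses the sign, here is a small sketch of the
two expressions applied to the "-256:256" argument.  It is written in Python
rather than the patch's Perl (the regex syntax is identical for these
patterns), and parse_range() is a hypothetical helper: it assumes the first
capture group is the offset and the third is the length, which the patch
excerpt does not show.

```python
import re

# The pattern from the previous patch and the corrected one, as quoted above.
ORIGINAL = re.compile(r"(\d+)(:(\d+))?")
FIXED = re.compile(r"(\-?\d+)(:(\d+))?")


def parse_range(pattern, args):
    """Return (offset, length) from a range argument, or None if no match.

    ASSUMPTION: group 1 is the offset and group 3 the optional length.
    """
    # re.search is unanchored, like Perl's "$args =~ m/.../".
    m = pattern.search(args)
    if m is None:
        return None
    offset = int(m.group(1))
    length = int(m.group(3)) if m.group(3) is not None else None
    return offset, length


# The unanchored original skips the leading "-" and starts matching at "256".
old = parse_range(ORIGINAL, "-256:256")

# The fixed pattern captures the sign as part of the offset.
new = parse_range(FIXED, "-256:256")

print("original:", old)
print("fixed:   ", new)
```

With the original pattern the offset comes back as 256 instead of -256, so the
rule silently looks at the wrong part of the message instead of failing.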

I can make a new patch if necessary, but nothing else has changed.  Range
checking should probably be added for full rule types as well.

d







