Hi,

There's a thread on clamav-users with the title
"PhishingScanURLs is dreadfully slow/CPU-intensive" that
I think developers should be aware of.  Below is my
latest post on the topic.  Basically, I strongly urge the
developers to make PhishingScanURLs default to off instead
of on.

Regards,

David.

-------- Original Message --------
Subject: Re: [Clamav-users] PhishingScanURLs is dreadfully slow/CPU-intensive
Date: Tue, 30 Oct 2007 11:15:21 -0400
From: David F. Skoll <[EMAIL PROTECTED]>
To: ClamAV users ML <[EMAIL PROTECTED]>
References: <[EMAIL PROTECTED]> <[EMAIL PROTECTED]>

Graham Toal wrote:

> In fact with a decent string search algorithm (using a trie of
> strings) there should be very little extra overhead in adding more
> strings to be searched in parallel.

PhishingScanURLs does not use plain string matching.  It uses regexes,
and regex matching with backreferences is NP-hard.  (I don't think
Clam uses backreferences, which are the worst culprits, but a
backtracking matcher can still take exponential time on some patterns
even without them.)
It also involves calls to cli_html_normalise, which looks scary/expensive.
cli_html_normalise is almost 1100 lines long and is filled with
fixed-length buffer declarations.  While that does not necessarily mean
that it's a security risk, it still sends shivers up my spine.
Nobody should be writing 1100-line functions!

See libclamav/phishcheck.c and libclamav/htmlnorm.c for the code in question.
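
To illustrate the regex cost concern above, here is a toy sketch.  It is
not ClamAV code and not how any particular regex engine is implemented;
the function name and the pattern ^(a|aa)*b$ are made up for
illustration.  It is a naive backtracking matcher that simply counts how
many match attempts it makes.

```python
def backtrack_steps(text):
    """Return (matched, calls) for a naive backtracking match of the
    hypothetical pattern ^(a|aa)*b$ against text."""
    calls = 0

    def match(i):
        nonlocal calls
        calls += 1
        if text[i:] == "b":                          # trailing 'b' completes the match
            return True
        if text[i:i + 1] == "a" and match(i + 1):    # alternative 1: consume 'a'
            return True
        if text[i:i + 2] == "aa" and match(i + 2):   # alternative 2: consume 'aa'
            return True
        return False                                 # both alternatives failed: backtrack

    return match(0), calls


if __name__ == "__main__":
    # Input that almost matches (no trailing 'b') forces every split to be tried.
    for n in (10, 20, 25):
        print(n, backtrack_steps("a" * n))
```

On input that almost matches, the attempt count grows like the Fibonacci
numbers: 10 a's cost 232 attempts and 20 a's cost 28656, while "aaab"
matches in 4.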

> You're right in your assessment above.  It should be simple and
> lightweight.  That doesn't rule out scanning for URLs in the body
> text, it just means you have to do so efficiently, and IMHO using
> regexps is not efficient and seldom justified.

Exactly. :-)  So the Clam people should not be using regexes.
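
The trie-based search Graham describes is commonly done with the
Aho-Corasick algorithm.  Below is a minimal sketch of that technique,
not anything taken from ClamAV; the function names and sample strings
are invented for illustration.  The text is scanned once, left to right,
no matter how many strings are in the set.

```python
from collections import deque


def build_automaton(patterns):
    """Build an Aho-Corasick automaton: a trie plus failure links."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # longest proper suffix of this state that is also a trie prefix
    out = [[]]    # patterns that end when this state is reached
    for pat in patterns:
        state = 0
        for ch in pat:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(pat)

    # Breadth-first pass: shallower states get their failure links first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit matches from the suffix
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


if __name__ == "__main__":
    ac = build_automaton(["http://", "https://", "www."])
    print(find_all("see http://www.example.com and https://x", ac))
```

Adding more strings makes the trie bigger, but the scan still visits
each input character once, plus time proportional to the number of
matches reported.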

(Our customers, in fact, always run ClamAV in conjunction with an anti-spam
scanner, so it's no benefit to them to have Clam try to do anti-spam.)

Regards,

David.
_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
