On Mon, Jul 2, 2012 at 5:07 PM, Alexandre Dias <lexx...@gmail.com> wrote:

> Hello,
>
> I'm studying multi-pattern matching and I was browsing the source code for
> ClamAV's implementation of a multi-pattern matcher (Wu-Maber based)
> algorithm.
>
> I've got a question regarding the block and minimum size values.
>
> At the moment, both the block size and the minimum pattern length are set
> to 3 bytes.
>
> If I understood the algorithm correctly, this means that the only possible
> shift values are either 0 (at which point a match is possible), or 1
> (minimum pattern size - block size + 1).
>
> If this is the case, given that the algorithm can only move at most one
> byte at a time, what is the advantage of using this algorithm instead of
> Aho-Corasick (besides space efficiency) ?
>
> Thank you for your time.
>
> Best regards,
>
> -Alexandre Dias
> _______________________________________________
> http://lurker.clamav.net/list/clamav-devel.html
> Please submit your patches to our Bugzilla: http://bugs.clamav.net
>

Space efficiency is important. We do need to care about memory usage. But
ruling that out, consider that ClamAV has different places and different
ways it uses pattern matching. For the sake of consistency with how it is
named in the code, I'll refer to the two modified styles of matching as B-M
(for Boyer-Moore/Wu-Manber style) and A-C (for Aho-Corasick).

ClamAV has over 113,000 signatures right now and they are split between the
A-C and B-M categories. ClamAV is not using pure pattern matching of either
style and has pre-filtering steps. Some signatures are scanning direct file
content. Other signatures are matching hashes [or in some cases, hashes of
file segments]. Files can have wildly varying lengths, while the hashes
have predetermined lengths. There are logical signatures that require
certain combinations of matches. ClamAV even uses pattern matching when
checking signatures at load time to filter out those that have been added
to the ignore lists. Any optimization would be impacted daily with each new
signature that is added. To sum up, there are quite a variety of needles
and haystacks involved in the searching.

Back to your question. You are correct that the shift values will be 0 or
1. While I cannot give you an analytical defense to the choice of minimum
pattern size & block size, there is a natural tension between the two. From
what I read, Wu & Manber used a block size of 3 in their May 1994 paper.
And any efficiency gained from longer shifts (which would be based on
values which never appear in any signature) could be targeted by malware
writers to eliminate it by forcing creation of signatures that fill that
gap. I also don't know the difference in effective cost of frequent partial
matches between A-C and B-M. These are things that could be measurable but
I do not have statistics at hand.

There is more history on the topic of algorithms and their use in ClamAV to
be found in the back history of the mailing list. Discussions of everything
from extended Boyer-Moore to bloom filters.

Hope this helps,

Dave R.
---
Dave Raynor
Sourcefire Vulnerability Research Team
dray...@sourcefire.com
_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net

Reply via email to