On Mon, Jul 2, 2012 at 5:07 PM, Alexandre Dias <lexx...@gmail.com> wrote:
> Hello, > > I'm studying multi-pattern matching and I was browsing the source code for > ClamAV's implementation of a multi-pattern matcher (Wu-Maber based) > algorithm. > > I've got a question regarding the block and minimum size values. > > At the moment, both the block size and the minimum pattern length are set > to 3 bytes. > > If I understood the algorithm correctly, this means that the only possible > shift values are either 0 (at which point a match is possible), or 1 > (minimum pattern size - block size + 1). > > If this is the case, given that the algorithm can only move at most one > byte at a time, what is the advantage of using this algorithm instead of > Aho-Corasick (besides space efficiency) ? > > Thank you for your time. > > Best regards, > > -Alexandre Dias > _______________________________________________ > http://lurker.clamav.net/list/clamav-devel.html > Please submit your patches to our Bugzilla: http://bugs.clamav.net > Space efficiency is important. We do need to care about memory usage. But ruling that out, consider that ClamAV has different places and different ways it uses pattern matching. For the sake of consistency with how it is named in the code, I'll refer to the two modified styles of matching as B-M (for Boyer-Moore/Wu-Manber style) and A-C (for Aho-Corasick). ClamAV has over 113,000 signatures right now and they are split between the A-C and B-M categories. ClamAV is not using pure pattern matching of either style and has pre-filtering steps. Some signatures are scanning direct file content. Other signatures are matching hashes [or in some cases, hashes of file segments]. Files can have wildly varying lengths, while the hashes have predetermined lengths. There are logical signatures that require certain combinations of matches. ClamAV even uses pattern matching when checking signatures at load time to filter out those that have been added to the ignore lists. Any optimization would be impacted daily with each new signature that is added. To sum up, there are quite a variety of needles and haystacks involved in the searching. Back to your question. You are correct that the shift values will be 0 or 1. While I cannot give you an analytical defense to the choice of minimum pattern size & block size, there is a natural tension between the two. From what I read, Wu & Manber used a block size of 3 in their May 1994 paper. And any efficiency gained from longer shifts (which would be based on values which never appear in any signature) could be targeted by malware writers to eliminate it by forcing creation of signatures that fill that gap. I also don't know the difference in effective cost of frequent partial matches between A-C and B-M. These are things that could be measurable but I do not have statistics at hand. There is more history on the topic of algorithms and their use in ClamAV to be found in the back history of the mailing list. Discussions of everything from extended Boyer-Moore to bloom filters. Hope this helps, Dave R. --- Dave Raynor Sourcefire Vulnerability Research Team dray...@sourcefire.com _______________________________________________ http://lurker.clamav.net/list/clamav-devel.html Please submit your patches to our Bugzilla: http://bugs.clamav.net