On Sun, 22 Aug 2010, Tim Kientzle wrote:

On Aug 22, 2010, at 9:30 AM, Sean C. Farley wrote:
On Sun, 22 Aug 2010, Dag-Erling Smørgrav wrote:
Mike Haertel <m...@ducky.net> writes:
GNU grep uses the well-known Boyer-Moore algorithm, which looks first for the final letter of the target string, and uses a lookup table to tell it how far ahead it can skip in the input whenever it finds a non-matching character.

Boyer-Moore is for fixed search strings. I don't see how that optimization can work with a regexp search unless the regexp is so simple that you break it down into a small number of cases with known length and final character.

When I was working on making FreeGrep faster (years ago), I wrote down a few notes about possible algorithms, especially those that could be useful for fgrep functionality. I am just passing these onto the list.

Some algorithms:
1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm
2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
3. GNU fgrep:  Commentz-Walter
4. GLIMPSE:  http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore variant)

Also, this may be a useful book:
http://www.dcc.uchile.cl/~gnavarro/FPMbook/

And of course, Russ Cox' excellent series of articles starting at:

 http://swtch.com/~rsc/regexp/regexp1.html

I saved that link from an E-mail earlier because it looked very interesting.

Later on, he summarizes some of the existing implementations, including comments about the Plan 9 implementation and his own RE2, both of which efficiently handle international text (which seems to be a major concern of Gabor's).

I believe Gabor is considering TRE for a good replacement regex library.

The key comment in Mike's GNU grep notes is the one about not breaking into lines. That's simply double-scanning the input; instead, run the matcher over blocks of text and, when it finds a match, work backwards from the match to find the appropriate line beginning. This is efficient because most lines don't match.

I do like the idea.

Boyer-Moore is great for fixed strings (a very common use case for grep) and for more complex patterns that contain long fixed strings (helps to discard most lines early). Sophisticated regex matchers implement a number of strategies and choose different ones depending on the pattern.

That is what fastgrep (in bsdgrep) attempts to accomplish with very simply regex lines (beginning of line, end of line and dot).

In the case of bsdgrep, it might make sense to use the regex library for the general case but implement a hand-tuned search for fixed strings that can be heavily optimized for that case. Of course, international text support complicates the picture; you have to consider the input character set (if you want to auto-detect Unicode encodings by looking for leading BOMs, for example, you either need to translate the fixed-string pattern to match the input encoding or vice-versa).

BTW, the fastgrep portion of bsdgrep is my fault/contribution to do a faster search bypassing the regex library. :) It certainly was not written with any encodings in mind; it was purely ASCII. As I have not kept up with it, I do not know if anyone improved it or not.

Sean
--
s...@freebsd.org
_______________________________________________
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"

Reply via email to