Re: why GNU grep is fast

Sean C. Farley Sun, 22 Aug 2010 18:37:51 -0700

On Sun, 22 Aug 2010, Tim Kientzle wrote:

On Aug 22, 2010, at 9:30 AM, Sean C. Farley wrote:
On Sun, 22 Aug 2010, Dag-Erling Smørgrav wrote:
Mike Haertel <m...@ducky.net> writes:
GNU grep uses the well-known Boyer-Moore algorithm, which looks first for the final letter of the target string, and uses a lookup table to tell it how far ahead it can skip in the input whenever it finds a non-matching character.
Boyer-Moore is for fixed search strings. I don't see how that optimization can work with a regexp search unless the regexp is so simple that you break it down into a small number of cases with known length and final character.
When I was working on making FreeGrep faster (years ago), I wrote down a few notes about possible algorithms, especially those that could be useful for fgrep functionality. I am just passing these onto the list.
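As a rough illustration of the "final letter plus lookup table" idea, here is a minimal sketch of the Horspool simplification of Boyer-Moore. It is not GNU grep's code; the function name horspool_find and the 256-entry byte table are assumptions made for this example.

```python
def horspool_find(haystack: bytes, needle: bytes) -> int:
    """Return the index of the first occurrence of needle, or -1.

    Boyer-Moore-Horspool variant: only the bad-character skip table is used.
    """
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    # Skip table: for each possible byte value, how far the window may shift
    # when that byte is the one aligned with the needle's final position.
    # Bytes that do not occur in needle[:-1] allow a full shift of m.
    skip = [m] * 256
    for i in range(m - 1):
        skip[needle[i]] = m - 1 - i

    pos = 0
    while pos <= n - m:
        last = haystack[pos + m - 1]
        # Look at the final byte first; compare the whole window only
        # when that byte matches.
        if last == needle[m - 1] and haystack[pos:pos + m] == needle:
            return pos
        pos += skip[last]
    return -1
```

The longer the needle, the larger the typical skip, so most input bytes are never examined at all.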
Some algorithms:
1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm
2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
3. GNU fgrep:  Commentz-Walter
4. GLIMPSE:  http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore variant)

Also, this may be a useful book:
http://www.dcc.uchile.cl/~gnavarro/FPMbook/
And of course, Russ Cox' excellent series of articles starting at:

 http://swtch.com/~rsc/regexp/regexp1.html

I saved that link from an E-mail earlier because it looked very interesting.

Later on, he summarizes some of the existing implementations, including comments about the Plan 9 implementation and his own RE2, both of which efficiently handle international text (which seems to be a major concern of Gabor's).


I believe Gabor is considering TRE for a good replacement regex library.

The key comment in Mike's GNU grep notes is the one about not breaking into lines. That's simply double-scanning the input; instead, run the matcher over blocks of text and, when it finds a match, work backwards from the match to find the appropriate line beginning. This is efficient because most lines don't match.
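A minimal sketch of that approach, assuming a fixed-string needle with no newline in it and a buffer already in memory (a real grep would also have to handle matches and lines that straddle block boundaries). The name matching_lines is made up for this example, and bytes.find stands in for whatever matcher is actually used.

```python
def matching_lines(buf: bytes, needle: bytes):
    """Yield each line of buf containing needle, without pre-splitting lines."""
    if not needle:
        raise ValueError("needle must be non-empty")
    pos = 0
    while True:
        # Run the matcher over the raw block, ignoring line structure.
        hit = buf.find(needle, pos)
        if hit < 0:
            return
        # Only now work backwards from the match to the line beginning...
        start = buf.rfind(b"\n", 0, hit) + 1
        # ...and forwards to the line end.
        end = buf.find(b"\n", hit)
        if end < 0:
            end = len(buf)
        yield buf[start:end]
        # Resume after this line so it is reported once.
        pos = end + 1
```

Newlines are only searched for around actual matches, so non-matching lines cost nothing beyond the matcher's own scan.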


I do like the idea.

Boyer-Moore is great for fixed strings (a very common use case for grep) and for more complex patterns that contain long fixed strings (helps to discard most lines early). Sophisticated regex matchers implement a number of strategies and choose different ones depending on the pattern.

That is what fastgrep (in bsdgrep) attempts to accomplish with very simple regex lines (beginning of line, end of line and dot).

In the case of bsdgrep, it might make sense to use the regex library for the general case but implement a hand-tuned search for fixed strings that can be heavily optimized for that case. Of course, international text support complicates the picture; you have to consider the input character set (if you want to auto-detect Unicode encodings by looking for leading BOMs, for example, you either need to translate the fixed-string pattern to match the input encoding or vice-versa).

BTW, the fastgrep portion of bsdgrep is my fault/contribution to do a faster search bypassing the regex library. :) It certainly was not written with any encodings in mind; it was purely ASCII. As I have not kept up with it, I do not know if anyone improved it or not.
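To make the BOM remark concrete, here is a small sketch of the "translate the pattern to match the input encoding" option. It is not taken from bsdgrep; the function name encode_pattern_for, the UTF-8 fallback, and the decision to ignore UTF-32 are all assumptions for this example.

```python
import codecs

# BOMs recognised by this sketch. UTF-32 is deliberately left out; its
# little-endian BOM begins with the same two bytes as UTF-16 LE, so a
# fuller version would have to test for it first.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def encode_pattern_for(data: bytes, pattern: str, default: str = "utf-8"):
    """Return (pattern encoded like data, number of BOM bytes to skip)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return pattern.encode(encoding), len(bom)
    # No BOM found: fall back to the assumed default encoding.
    return pattern.encode(default), 0
```

The encoded pattern can then be handed to a byte-oriented fixed-string search. One caveat: with UTF-16 input a raw byte search may report a hit at an odd offset that does not fall on a character boundary, so such hits would need to be rejected.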


Sean
--
s...@freebsd.org

_______________________________________________
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
