On Sun, 22 Aug 2010, Tim Kientzle wrote:
On Aug 22, 2010, at 9:30 AM, Sean C. Farley wrote:
On Sun, 22 Aug 2010, Dag-Erling Smørgrav wrote:
Mike Haertel <m...@ducky.net> writes:
GNU grep uses the well-known Boyer-Moore algorithm, which looks
first for the final letter of the target string, and uses a lookup
table to tell it how far ahead it can skip in the input whenever it
finds a non-matching character.
Boyer-Moore is for fixed search strings. I don't see how that
optimization can work with a regexp search unless the regexp is so
simple that you break it down into a small number of cases with
known length and final character.
When I was working on making FreeGrep faster (years ago), I wrote
down a few notes about possible algorithms, especially those that
could be useful for fgrep functionality. I am just passing these
onto the list.
Some algorithms:
1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm
2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
3. GNU fgrep: Commentz-Walter
4. GLIMPSE: http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore variant)
Also, this may be a useful book:
http://www.dcc.uchile.cl/~gnavarro/FPMbook/
And of course, Russ Cox' excellent series of articles starting at:
http://swtch.com/~rsc/regexp/regexp1.html
I saved that link from an E-mail earlier because it looked very
interesting.
Later on, he summarizes some of the existing implementations,
including comments about the Plan 9 implementation and his own RE2,
both of which efficiently handle international text (which seems to be
a major concern of Gabor's).
I believe Gabor is considering TRE for a good replacement regex library.
The key comment in Mike's GNU grep notes is the one about not breaking
into lines. That's simply double-scanning the input; instead, run the
matcher over blocks of text and, when it finds a match, work backwards
from the match to find the appropriate line beginning. This is
efficient because most lines don't match.
I do like the idea.
Boyer-Moore is great for fixed strings (a very common use case for
grep) and for more complex patterns that contain long fixed strings
(helps to discard most lines early). Sophisticated regex matchers
implement a number of strategies and choose different ones depending
on the pattern.
That is what fastgrep (in bsdgrep) attempts to accomplish with very
simply regex lines (beginning of line, end of line and dot).
In the case of bsdgrep, it might make sense to use the regex library
for the general case but implement a hand-tuned search for fixed
strings that can be heavily optimized for that case. Of course,
international text support complicates the picture; you have to
consider the input character set (if you want to auto-detect Unicode
encodings by looking for leading BOMs, for example, you either need to
translate the fixed-string pattern to match the input encoding or
vice-versa).
BTW, the fastgrep portion of bsdgrep is my fault/contribution to do a
faster search bypassing the regex library. :) It certainly was not
written with any encodings in mind; it was purely ASCII. As I have not
kept up with it, I do not know if anyone improved it or not.
Sean
--
s...@freebsd.org
_______________________________________________
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"