John Machin wrote:

>> Hmm, unfortunately it's still orders of magnitude slower than grep in
>> my own application, which involves matching lots of strings and
>> regexps against large files (I killed it after 400 seconds, compared
>> to 1.5 seconds for grep), and that's leaving aside the much longer
>> compilation time (over a minute).  If the matching were fast then I
>> could possibly pickle the lexer (but it's not).


> Can you give us some examples of the kinds of patterns that you are
> using in practice and that are slow using Python re?

Trivial stuff like:

          (Str('error in pkg_delete'), ('mtree', 'mtree')),
          (Str('filesystem was touched prior to .make install'), ('mtree', 'mtree')),
          (Str('list of extra files and directories'), ('mtree', 'mtree')),
          (Str('list of files present before this port was installed'), ('mtree', 'mtree')),
          (Str('list of filesystem changes from before and after'), ('mtree', 'mtree')),

          (re('Configuration .* not supported'), ('arch', 'arch')),

          (re('(configure: error:|Script.*configure.*failed unexpectedly|script.*failed: here are the contents of)'),
           ('configure_error', 'configure')),
          ...

There are about 150 of them, and I want to find the first match in a text file that ranges from a few KB up to 512MB in size.
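The obvious workaround would be to fold the whole table into a single alternation so each file is scanned only once. A rough sketch of that idea in plain re (names like PATTERNS and first_match are made up, and the table is trimmed to two entries):

    import re

    # Made-up stand-ins for the real ~150-entry table: Str() literals
    # become re.escape()d strings, regexps pass through unchanged.
    PATTERNS = [
        (re.escape('error in pkg_delete'), ('mtree', 'mtree')),
        (r'Configuration .* not supported', ('arch', 'arch')),
        # ... the remaining entries ...
    ]

    # One non-capturing alternation, so each file is scanned once no
    # matter how many patterns there are.  (Non-capturing branches also
    # sidestep the 100-capture-group limit in older re versions.)
    combined = re.compile('|'.join('(?:%s)' % p for p, _ in PATTERNS))
    compiled = [(re.compile(p), tag) for p, tag in PATTERNS]

    def first_match(text):
        m = combined.search(text)      # leftmost match in the text
        if m is None:
            return None
        # Recover which pattern fired by re-testing at that position;
        # table order breaks ties between patterns matching there.
        for pat, tag in compiled:
            if pat.match(text, m.start()):
                return tag

Even folded together, though, Python's backtracking engine still tries the branches position by position, while grep-class tools compile to a DFA, which is where the orders-of-magnitude gap tends to come from.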

> How large is "large"?  What kind of text?

It's compiler/build output.

> Instead of grep, you might like to try nrgrep ... google("nrgrep
> Navarro Raffinot"): PDF paper about it on Citeseer (if it's up),
> postscript paper and C source findable from Gonzalo Navarro's
> home page.

Thanks, that looks interesting, but I don't think it's the best fit here: I would like to avoid spawning hundreds of processes to handle each file, since I have tens of thousands of files to process.
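The hope is to keep everything in one Python process instead. One way that could look, sticking with the combined pattern sketched above (compiled from bytes this time) and an mmap so even the 512MB files are never read into memory wholesale; scan_file is a made-up name:

    import mmap
    import os

    def scan_file(path, pattern):
        # `pattern` must be a compiled *bytes* regex: the re module can
        # search an mmap directly, no read() of the whole file needed.
        if os.path.getsize(path) == 0:
            return None              # mmap refuses empty files
        with open(path, 'rb') as f:
            buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                return pattern.search(buf)
            finally:
                buf.close()

Compile the pattern once, then call scan_file() across all tens of thousands of files: no per-file fork/exec cost at all.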

Kris
