On Sun Nov 29 14:03:23 EST 2009, jason.cat...@gmail.com wrote:

> I wrote a wrapper around grep to search for words regardless of
> accents.  I didn't want to worry about whether I used accents on
> characters (I sometimes use them inconsistently, and others decidedly
> do), but I still wanted to limit the results to exact matches if I
> supplied an accent.  Here's an example run.

hey, this is great stuff!  i really like the approach.  i played with
this a little bit, but quickly ran into problems.  the patterns get
really big in a hurry.  "reasonable" re size limits of say 300 characters
just don't work if you're doing expansion.  expanding "cooperate"
results in a 460-byte string!
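
to make the blowup concrete, here is a rough sketch of per-letter
expansion.  this is not the wrapper's actual code.  the names
(base, build_variants, expand) and the code point range scanned are
assumptions for illustration, so the exact pattern size will differ
from the figure above.

```python
import re
import unicodedata

def base(ch):
    # hypothetical helper: strip combining marks via NFD and return
    # the unmodified base letter, or ch itself if nothing was stripped
    d = unicodedata.normalize("NFD", ch)
    if len(d) > 1 and all(unicodedata.combining(c) for c in d[1:]):
        return d[0]
    return ch

def build_variants(lo=0x00C0, hi=0x1EFF):
    # map each base letter to its accented forms in an assumed range
    variants = {}
    for cp in range(lo, hi + 1):
        ch = chr(cp)
        b = base(ch)
        if b != ch:
            variants.setdefault(b, []).append(ch)
    return variants

VARIANTS = build_variants()

def expand(word):
    # an unaccented letter becomes a bracket class of every variant;
    # a letter typed with an accent stays literal, so it matches exactly
    out = []
    for ch in word:
        alts = VARIANTS.get(ch)
        if alts:
            out.append("[" + ch + "".join(alts) + "]")
        else:
            out.append(re.escape(ch))
    return "".join(out)

pat = expand("cooperate")
print(len("cooperate"), "bytes in,", len(pat.encode("utf-8")), "bytes out")
```

every plain vowel drags in dozens of multi-byte variants, so the
pattern grows far faster than the word does.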

so i went back to an old idea.  i hope you won't accuse me of topperism,
but you finally motivated me to work on something i threatened
to do at iwp9 2e: add folding to grep.  

it was right up my alley since i just recently redid the rune tables
that i've been using.  they're built directly from UnicodeData.txt.
it wasn't too hard to build a table that folds modified letters to
a base with the unicode data.  from there, i reused the same
technique used for case folding.  since the tables i'm using don't
fold case, "grep -Ii" makes sense.
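
a minimal sketch of the folding idea, in python rather than the c
that grep is written in.  it is not the grepfold source.  it uses
python's unicodedata module as a stand-in for reading
UnicodeData.txt directly, and it assumes that folding means following
the first code point of each canonical decomposition.  fold_base,
build_fold_table, fold and grep_fold are made-up names, and folding
the pattern as well as the text is an assumption about the semantics.

```python
import re
import unicodedata

def fold_base(ch):
    # follow canonical decompositions until the character has none;
    # compatibility mappings (tagged "<...>") are left alone
    while True:
        d = unicodedata.decomposition(ch)
        if not d or d.startswith("<"):
            return ch
        ch = chr(int(d.split()[0], 16))

def build_fold_table(limit=0x3000):
    # table of code point -> base code point, usable with str.translate;
    # the upper limit is an arbitrary choice for this sketch
    table = {}
    for cp in range(0x80, limit):
        b = ord(fold_base(chr(cp)))
        if b != cp:
            table[cp] = b
    return table

FOLD = build_fold_table()

def fold(s):
    # folds modifiers only; case is deliberately left untouched
    return s.translate(FOLD)

def grep_fold(pattern, lines, ignore_case=False):
    # fold both the pattern and each line, then match as usual;
    # case folding stays a separate, optional step (like -Ii)
    flags = re.IGNORECASE if ignore_case else 0
    rx = re.compile(fold(pattern), flags)
    return [l for l in lines if rx.search(fold(l))]

lines = ["co\u00f6perate", "Cooperate", "cooperate", "corporate"]
print(grep_fold("cooperate", lines))
print(grep_fold("cooperate", lines, ignore_case=True))
```

the pattern stays the size the user typed, since the work moves into
a table lookup per character instead of a bigger regular expression.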

performance is pretty good.  worst case is about 2x the user time.
there's no overhead when the I flag isn't given.

the source is in /n/sources/contrib/quanstro/src/grepfold.
please let me know of any bugs.  i'm sure there are a few weird
cases.

- erik
