i like the approach. back in the basser computational linguistics days
frank was indexing a greek verb dictionary. to sort the keys he used
tr | sort | tr.

i'm glad you didn't screw with grep. it's brilliant but the
implementation is not easily understood. i was in the room at the
time, so i have a head start.

brucee

On 11/30/09, Jason Catena wrote:
> I wrote a wrapper around grep to search for words regardless of
> accents.  I didn't want to worry about whether I used accents on
> characters (I sometimes use them inconsistently, and others decidedly
> do), but I still wanted to limit the results to exact matches if I
> supplied an accent.  Here's an example run.
>
>
> $ grep facade word
> treatment <a museum's east facade>.  A false, superficial, or artificial
>
> $ grëp facade word
> 89: to bow to man. façade. circa 1681.  French façade, from Italian
> 92: treatment <a museum's east facade>.  A false, superficial, or artificial
>
> $ grëp façade *
> style:21: crucial difference to pronunciation: cliché, soupçon, façade, café,
> wabisabi:51: or the crumbling stone façade of an old building.   Transience,
> word:89: to bow to man. façade. circa 1681.  French façade, from Italian
>
>
> Note that line word:92 (output by the second command) is not output by
> the third command, since I supplied an accent on that particular
> character (ç) in my input pattern.  I chose the umlaut or diæresis to
> remind me that grëp provides the -n option by default, so I'll get a
> line number and : in the output.  (I should probably just pass through
> all of grep's command-line options.)
>
>
> <grëp>=
> #!/usr/local/plan9/bin/rc
>
> regex=$1
> shift
>
> classes=`{cptmp classes}
> sed '/-/d;s,^\[(.),s/\1/\[\1,;s,$,/g,' charclass > $classes
>
> grep -n `{echo $regex | sed -f $classes} $*
>
>
> I translate each ordinary Latin character in the input pattern (e.g.
> [0-9A-Za-z]) into a character class (the attached charclass file,
> which doesn't cut-and-paste well), and then call grep with the updated
> pattern.  The first sed command in grëp turns the character classes in
> charclass into s commands for sed.  The charclass file contains the
> square brackets because I also use it to cut-and-paste from when I
> need a character class for a sed script.
>
> The script cptmp creates a temporary copy of an existing file, or a
> temporary new file.
>
>
> <cptmp>=
> #!/usr/local/plan9/bin/rc
> flag e +
>
> if(~ $#TMPDIR 0)
>        TMPDIR=/tmp
> base=`{basename $1}
> tmp=$TMPDIR/$base.$USER.$pid
>
> if (test -f $1) {
>        cp -pr $1 $tmp
> }
> if not {
>        touch $tmp
> }
> chmod +wx $tmp
> echo $tmp
>
>
> Jason Catena
>
>
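
The pattern translation described in the quoted message can be sketched
roughly as follows. This is a POSIX sh rendering rather than the original
rc, the function name translate is hypothetical, and the charclass entries
are invented for illustration, since the real attachment is not reproduced
in this thread. The first sed command is the one from grëp, with the group
written in basic-regex form.

```shell
#!/bin/sh
# Sketch only: POSIX sh stand-in for the rc script quoted above.
# The four charclass lines below are made-up samples, not the real file.

translate() {
    classes=$(mktemp) || return 1
    # One bracketed class per line, keyed by its first (unaccented) character.
    # Lines containing '-' (ranges such as [0-9]) are dropped by /-/d.
    # Each remaining line becomes a sed rule such as: s/a/[aàáâãäå]/g
    printf '%s\n' '[aàáâãäå]' '[cç]' '[eèéêë]' '[0-9]' |
        sed '/-/d;s,^\[\(.\),s/\1/[\1,;s,$,/g,' > "$classes"
    # Apply the generated rules to the search pattern.
    printf '%s\n' "$1" | sed -f "$classes"
    rm -f "$classes"
}

translate facade   # plain letters widen into character classes
translate façade   # ç has no rule of its own, so it stays an exact match
```

The second call shows why supplying an accent narrows the search: only
unaccented letters have a rule, so an accented character in the pattern
passes through unchanged and grep must match it literally.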
