i like the approach. back in basser computational linguistics days frank was indexing a greek verb dictionary. to sort the keys - he used tr | sort | tr.
i'm glad you didn't screw with grep. it's brilliant but the implementation is not easily understood. i was in the room at the time, so i have a headstart.

brucee

On 11/30/09, Jason Catena <[email protected]> wrote:
> I wrote a wrapper around grep to search for words regardless of
> accents. I didn't want to worry about whether I used accents on
> characters (I sometimes use them inconsistently, and others decidedly
> do), but I still wanted to limit the results to exact matches if I
> supplied an accent. Here's an example run.
>
>
> $ grep facade word
> treatment <a museum's east facade>. A false, superficial, or artificial
>
> $ grëp facade word
> 89: to bow to man. façade. circa 1681. French façade, from Italian
> 92: treatment <a museum's east facade>. A false, superficial, or artificial
>
> $ grëp façade *
> style:21: crucial difference to pronunciation: cliché, soupçon, façade, café,
> wabisabi:51: or the crumbling stone façade of an old building. Transience,
> word:89: to bow to man. façade. circa 1681. French façade, from Italian
>
>
> Note that line word:92 (output by the second command) is not output by
> the third command, since I supplied an accent on that particular
> character (ç) in my input pattern. I chose the umlaut or diæresis to
> remind me that grëp provides the -n option by default, so I'll get a
> line number and : in the output. (I should probably just pass through
> all of grep's command-line options.)
>
>
> <grëp>=
> #!/usr/local/plan9/bin/rc
>
> regex=$1
> shift
>
> classes=`{cptmp classes}
> sed '/-/d;s,^\[(.),s/\1/\[\1,;s,$,/g,' charclass > $classes
>
> grep -n `{echo $regex | sed -f $classes} $*
>
>
> I translate each ordinary latin character in the input pattern (eg
> [0-9A-Za-z]) into a character class (the attached charclass file,
> which doesn't cut-and-paste well), and then call grep with the updated
> pattern. The first sed command in grëp turns the character classes in
> charclass into s commands for sed. The charclass file contains the
> square brackets because I also use it to cut-and-paste from when I
> need a character class for a sed script.
>
> The script cptmp creates a temporary copy of an existing file, or a
> temporary new file.
>
>
> <cptmp>=
> #!/usr/local/plan9/bin/rc
> flag e +
>
> if(~ $#TMPDIR 0)
>     TMPDIR=/tmp
> base=`{basename $1}
> tmp=$TMPDIR/$base.$USER.$pid
>
> if (test -f $1) {
>     cp -pr $1 $tmp
> }
> if not {
>     touch $tmp
> }
> chmod +wx $tmp
> echo $tmp
>
>
> Jason Catena
>
>
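
The pattern translation described in the quoted message can be sketched outside rc as well. What follows is a minimal Python illustration, not the author's script. The attached charclass file is not reproduced in the message, so the CHARCLASS text below is a hypothetical stand-in that assumes one bracketed class per line with the plain letter first. The names load_classes and expand are likewise invented for this sketch.

```python
import re

# Hypothetical stand-in for the attached charclass file, which the
# message does not reproduce: one bracketed class per line, plain
# letter first. The real file's contents are an assumption here.
CHARCLASS = """\
[aàáâãäå]
[cç]
[eèéêë]
[0-9]
"""

def load_classes(text):
    """Map each plain letter to its bracketed class."""
    classes = {}
    for line in text.splitlines():
        # Skip range lines such as [0-9], as the /-/d step in the
        # sed command does.
        if "-" in line or not line.startswith("["):
            continue
        classes[line[1]] = line
    return classes

def expand(pattern, classes):
    """Replace each plain letter that has a class with that class.

    Accented characters have no entry, so they pass through unchanged
    and still demand an exact match.
    """
    return "".join(classes.get(ch, ch) for ch in pattern)

classes = load_classes(CHARCLASS)
loose = expand("facade", classes)    # plain c: matches c or ç
strict = expand("façade", classes)   # accented ç: matches only ç

for line in ["a museum's east facade", "French façade, from Italian"]:
    print(bool(re.search(loose, line)), bool(re.search(strict, line)), line)
```

Under these assumptions the unaccented pattern matches both sample lines and the accented pattern matches only the second, which is the behaviour shown in the example run. The sketch substitutes character by character rather than generating a list of sed s commands; the two approaches agree as long as each class contains only one plain letter.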
