On Tue, Feb 2, 2010 at 5:16 PM, Rajagopal Swaminathan
<[email protected]> wrote:
> Greetings,
>
> I am able to get a wordlist in <file> by using the following command
> cat <file> | tr -sc A-Za-z '\012'
>
> My question is how to specify unicode character and ASCII such as
> hindi, tamiz etc. in the tr command.
>
> I am new to unicode.

If your intention is to get a list of words from a file containing
Tamil Unicode data, try this:


perl -e 'binmode(STDIN, ":utf8"); binmode(STDOUT, ":utf8");
    while (defined(my $c = getc(STDIN))) {
        print $c if $c =~ /[\x{0b80}-\x{0bff}\s]/;
    }' < infile.txt |
  perl -ne 'foreach my $f (split) { print $f, "\n" }' |
  sort -u

U+0B80 to U+0BFF is the Tamil Unicode block. Use gucharmap or
kcharselect to find the ranges for other scripts (for example,
Devanagari, which is used for Hindi, is U+0900 to U+097F).
infile.txt is your input file.

The pipeline above also sorts the words and removes duplicates
(that is what the final sort -u does).

-santhosh
_______________________________________________
ILUGC Mailing List:
http://www.ae.iitm.ac.in/mailman/listinfo/ilugc
