Greetings,
On Wed, Feb 3, 2010 at 9:30 PM, Santhosh Thottingal <[email protected]> wrote:
> On Tue, Feb 2, 2010 at 5:16 PM, Rajagopal Swaminathan
> <[email protected]> wrote:
> If your intention is to get a list of words from a file containing
> Tamil Unicode data, try this:
>
> perl -e 'binmode(STDIN, ":utf8"); binmode(STDOUT, ":utf8"); while(
> defined (my $c=getc(STDIN)) ) { if( $c =~ /[\x{0b80}-\x{0bff}\s]/ ) {
> print $c }}' < infile.txt | perl -ne 'my @fields=split(); foreach my
> $f ( @fields ) {print $f,"\n"}' | sort -u
>
> 0b80 - 0bff is the Tamil Unicode range. Use gucharmap or kcharselect to
> find out the ranges.
> infile.txt is your input file.
>
> The above script sorts the words and removes duplicates too.

I was also looking at Perl's \p{Devanagari}, but it did not work.

Whew, thanks a lot. It really helped.

Now I will have a go at the Devanagari (Hindi, Marathi and Sanskrit) scripts.

Thanks again. You made my day :)

Regards,
Rajagopal
_______________________________________________
ILUGC Mailing List:
http://www.ae.iitm.ac.in/mailman/listinfo/ilugc
