Greetings,

On Wed, Feb 3, 2010 at 9:30 PM, Santhosh Thottingal
<[email protected]> wrote:
> On Tue, Feb 2, 2010 at 5:16 PM, Rajagopal Swaminathan
> <[email protected]> wrote:
> If your intention is to get a list of words from a file containing
> Tamil Unicode data, try this:
>
>
> perl -e 'binmode(STDIN, ":utf8"); binmode(STDOUT, ":utf8"); while(
> defined (my $c=getc(STDIN)) ) { if( $c =~ /[\x{0b80}-\x{0bff}\s]/ ) {
> print $c }}' < infile.txt | perl -ne 'my @fields=split(); foreach my
> $f ( @fields ) {print $f,"\n"}' | sort -u
>
> 0b80 - 0bff is the Tamil Unicode range. Use gucharmap or kcharselect to
> find out the ranges.
> infile.txt is your input file.
>
> The above script also sorts the words and removes duplicates.

I was also looking at Perl's \p{Devanagari}, but it did not work.

Whew, thanks a lot.

It really helped. Now I will have a go at the Devanagari (Hindi, Marathi
and Sanskrit) scripts.

Thanks again

You made my day :)

Regards,

Rajagopal
_______________________________________________
ILUGC Mailing List:
http://www.ae.iitm.ac.in/mailman/listinfo/ilugc
