For people who wish to process Chinese text (Traditional, but also Simplified via Encode::HanConvert), I have just uploaded Lingua::ZH::Toke to CPAN.
That module is a 'use utf8;'-friendly frontend to my Lingua::ZH::TaBE module; it allows you to manipulate linguistic objects as shown below (in big5):

    use Lingua::ZH::Toke; # add 'utf8' to use unicode strings

    # Create Lingua::ZH::Toke::Sentence object (->Sentence also works)
    my $token = Lingua::ZH::Toke->new( '那人卻在/燈火闌珊處/益發意興闌珊' );

    # Easy tokenization via array dereferencing
    print $token->[0]   # Fragment      - 那人卻在
                ->[2]   # Phrase        - 卻在
                ->[0]   # Character     - 卻
                ->[0]   # Pronunciation - ㄑㄩㄝˋ
                ->[2];  # Phonetic      - ㄝ

    # Magic histogram via hash dereferencing
    print $token->{'那人卻在'};   # 1     - One such fragment there
    print $token->{'意興闌珊'};   # 1     - One such phrase there
    print $token->{'發意興闌'};   # undef - That's not a phrase
    print $token->{'珊'};         # 2     - Two such characters there
    print $token->{'ㄧˋ'};        # 2     - Two such pronunciations: 益意
    print $token->{'ㄨ'};         # 3     - Three such phonetics: 那火處

    # Iteration over fragments
    while (my $fragment = <$token>) {
        # Iteration over phrases within the current fragment
        while (my $phrase = <$fragment>) {
            # ...
        }
    }

The 'phonetic' symbols are expressed in BoPoMoFo notation.

There are also various utility methods (complex segmentation, etc.); see Lingua::ZH::TaBE for details.

Comments welcome. :-)

Thanks,
/Autrijus/