[HanoiLUG] Identify Vietnamese Text

Jean Christophe André Wed, 14 May 2008 06:15:43 +0700

Tom Lancaster wrote:
> I posed this question to JC earlier today IRL, but he asked me to remind him 
> about it by posing it again here:
> a) how does one use a computer to determine whether a piece of text is in 
> Vietnamese or not;
> b) are there any existing implementations of these heuristics / this 
> algorithm that I can use?
>
> I think JC answered a) sufficiently by stating the heuristic: strip the sample 
> text of diacritics and split it into words; then compare each word with a list 
> of known Vietnamese words that has also been stripped of diacritics. You can 
> use the percentage of matches to estimate the probability that a given text 
> is Vietnamese.
>
> question b) remains: are there existing implementations of this solution that 
> I can use?
>
> Apologies for lack of VN translation - perhaps someone else can help?
>   
For the best results, you should probably use a dictionary lookup, as
described above. It will be a bit slow, but not too slow if you optimize
the code (e.g. using a hash table or a binary search). And by using
several dictionaries, it will also let you rank the match results for
multiple languages at the same time.
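
As a rough illustration of the dictionary approach, here is a minimal
sketch using an awk associative array as the hash lookup. The function
names match_percent and rank_languages and the file layout (one word per
line, already stripped of diacritics and lowercased) are assumptions made
for this example, not an existing tool.

```shell
#!/bin/bash
# Hypothetical helper: print the percentage of input words (read from stdin,
# one per line) that appear in the word list given as first argument.
match_percent() {
  awk '
    NR == FNR { known[$0] = 1; next }   # first file: load the word list into a hash
    $0 != ""  { total += 1; if ($0 in known) hits += 1 }
    END       { if (total == 0) print 0; else print int(100 * hits / total) }
  ' "$1" -
}

# Hypothetical helper: score one words file against several word lists
# (one list per language) and print the results, best match first.
# Usage: rank_languages WORDS_FILE WORDLIST...
rank_languages() {
  local words=$1; shift
  local dict
  for dict in "$@"; do
    printf '%s %s\n' "$(match_percent "$dict" < "$words")" "$dict"
  done | sort -rn
}
```

Both the word lists and the sample text would need the same preparation
first (transliteration to ASCII, lowercasing, one word per line), otherwise
the comparison is meaningless.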


On the other hand, as I told you last Saturday, you could also try a
heuristic based on common knowledge about Vietnamese "words".

Here is a code example I wrote this Sunday:

#!/bin/bash
# Here is the algorithm:
# 1. transliterate the text to ASCII-only characters
# 2. remove everything that is not a letter, an apostrophe or white space,
#    then put one lowercase "word" per line
# 3. separate "not Vietnamese" "words" from those that could be Vietnamese
# 4. count occurrences to get the percentage of "not Vietnamese" "words"
iconv -f UTF-8 -t ASCII//TRANSLIT |
tr -cs "[:alpha:]'[:space:]" ' ' | tr -s '[:space:]' '\n' |
tr '[:upper:]' '[:lower:]' |
sed -n -e "
  /^$/d
  s/^.*'.*$/not|&/p;t
  s/^.*[aeiouy][^aeiouy[:space:]]\{1,\}[aeiouy].*$/not|&/p;t
  s/^.*$/may|&/p
" | tee /tmp/stats.txt |
sed -e 's/|.*$//' | awk '
  /^may$/ { may += 1 }
  /^not$/ { not += 1 }
  END {
    # avoid a division by zero on empty input
    if (may + not == 0) { print "No words found"; exit }
    print "Probably not Vietnamese at", int(100 * not / (may + not)), "%"
  }'

It will give you the percentage of "words" that are surely *not*
Vietnamese, based only on the knowledge that a Vietnamese "word" can not
have two vowels separated by one or more consonants (words containing an
apostrophe are also counted as not Vietnamese). Try it on some random web
content: pure Vietnamese text will always give you 0%, and other text will
give you a positive result, meaning that it contains some proportion of
non-Vietnamese "words". You can also check the details in
"/tmp/stats.txt".
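
To make the rule easier to try on individual words, here is a minimal
sketch of the same test as a bash function. The name could_be_vietnamese
is hypothetical, and the word is assumed to be already transliterated to
ASCII; this only mirrors the vowel / consonant(s) / vowel pattern and the
apostrophe check of the sed script above.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit status 0) if a single ASCII word could be
# Vietnamese according to the heuristic, fail (exit status 1) otherwise.
could_be_vietnamese() {
  local word re
  # Lowercase the word so that uppercase vowels are recognized too.
  word=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  # A vowel, then one or more non-vowels, then another vowel.
  re='[aeiouy][^aeiouy]+[aeiouy]'
  # Words containing an apostrophe are treated as not Vietnamese.
  [[ $word == *"'"* ]] && return 1
  [[ $word =~ $re ]] && return 1
  return 0
}
```

For example, "nguyen" and "viet" pass the test, while "computer" and
"don't" are rejected.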

About the properties of the Vietnamese language, there is an interesting
document in English here:
  http://vietnamese-grammar.group.shef.ac.uk/

Still, I'm not a Vietnamese language expert, so anybody else who has
better information or links, please contribute them! :-)

-- 
Jean Christophe "????" ANDRÉ - Regional technical manager
Bureau Asie-Pacifique (BAP) - http://asie-pacifique.auf.org/
Agence universitaire de la Francophonie (AuF) - http://www.auf.org/
Postal address: AUF, 21 Lê Thánh Tông, T.T. Hoàn Kiếm, Hà Nội, Việt Nam
Tel.: +84 4 9331108   Fax: +84 4 8247383   Mobile: +84 91 3248747
Personal note: please avoid sending me PowerPoint or Word files,
see http://www.gnu.org/philosophy/no-word-attachments.fr.html

