Tom Lancaster wrote:
> I posed this question to JC earlier today IRL, but he asked me to remind him
> about it by posing it again here:
> a) how does one use a computer to determine whether a piece of text is in
> Vietnamese or not;
> b) are there any existing implementations of these heuristics / this
> algorithm that I can use?
>
> I think JC answered a) sufficiently by stating the heuristic: strip the sample
> text of diacritics and split it into words; then compare each word with a list
> of known Vietnamese words that has also been stripped of diacritics - you can
> use the percentage of matches to estimate the probability of a given text
> being Vietnamese.
>
> question b) remains: are there existing implementations of this solution that
> I can use?
>
> Apologies for lack of VN translation - perhaps someone else can help?
>
For the best results, you should probably use a dictionary lookup, as
described above. It will be a bit slow, but not too slow if you optimize
the code (e.g. with a hash table or a binary search). And by using
several dictionaries, it lets you rank the matches for multiple
languages at the same time.
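To make the dictionary idea concrete, here is a minimal sketch, not an existing implementation. The function names (match_percent, rank_languages) and the dictionary files are hypothetical; each dictionary is assumed to hold one lowercase, diacritic-stripped word per line, and the input is assumed to be already normalized the same way, one word per line.

```shell
#!/bin/bash
# Sketch only: function names and dictionary format are assumptions.

# match_percent DICT: read one word per line on stdin and print the
# integer percentage of those words that appear in DICT.
match_percent() {
  awk -v dict="$1" '
    BEGIN { while ((getline w < dict) > 0) known[w] = 1 }  # hash lookup table
    $1 != "" { total++; if ($1 in known) hits++ }
    END { print (total ? int(100 * hits / total) : 0) }
  '
}

# rank_languages DICT...: print "percent dictionary" lines, best match first.
rank_languages() {
  local words dict
  words=$(cat)   # keep stdin so it can be replayed for each dictionary
  for dict in "$@"; do
    printf '%s %s\n' "$(printf '%s\n' "$words" | match_percent "$dict")" "$dict"
  done | sort -rn
}

# Possible usage (vi.txt and en.txt are hypothetical word lists):
#   iconv -f utf-8 -t ascii//TRANSLIT < sample.txt |
#     tr '[:upper:]' '[:lower:]' | tr -cs '[:alpha:]' '\n' |
#     rank_languages vi.txt en.txt
```

The awk array is the hash lookup mentioned above, so each word costs one lookup however large the dictionary is.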
On the other hand, as I told you last Saturday, you could also try
a heuristic based on common knowledge about Vietnamese "words".
Here is a code example I wrote this Sunday:
#!/bin/bash
# Here is the algorithm (the text is read from standard input):
# 1. transliterate the text to ASCII-only characters and lowercase it
# 2. replace everything not alphabetic, apostrophe or white space by a space
# 3. separate the "not Vietnamese" "words" from those that could be
# 4. count occurrences to get the percentage of "not Vietnamese" "words"
iconv -f utf-8 -t ascii//TRANSLIT |
tr '[:upper:]' '[:lower:]' |
tr -cs "[:alpha:]'[:space:]" " " | tr -s '[:space:]' '\n' |
sed -ne "
/^$/d
s/^.*'.*$/not|&/p;t
s/^.*[aeiouy][^aeiouy[:space:]]\+[aeiouy].*$/not|&/p;t
s/^.*$/may|&/p
" | tee /tmp/stats.txt |
sed -e 's/|.*$//' |
awk '/may/{may+=1} /not/{not+=1}
END{if(may+not) print "Probably not Vietnamese at",int(100*not/(may+not)),"%"}'
It reads the text on standard input and gives you the percentage of
"words" that are surely *not* Vietnamese, based only on the knowledge
that a Vietnamese "word" cannot have two vowels separated by one or
more consonants ("words" containing an apostrophe are also counted as
not Vietnamese). Try it on some random web content: pure Vietnamese
text should give you 0%, and other text will give you a positive
result, meaning there is some proportion of not Vietnamese "words" in
it. You can also check the details in "/tmp/stats.txt".
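To see the rule at work on single words, here is a small sketch. The function name classify_word is hypothetical, and the word is assumed to be ASCII already (diacritics stripped beforehand).

```shell
#!/bin/bash
# Sketch only: classify_word is a hypothetical helper, ASCII input assumed.

# classify_word WORD: print "not" if WORD contains an apostrophe or two
# vowels separated by one or more consonants, "may" otherwise.
classify_word() {
  local word
  word=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  if printf '%s\n' "$word" | grep -Eq "'|[aeiouy][^aeiouy]+[aeiouy]"; then
    echo not
  else
    echo may
  fi
}

classify_word nguyen     # prints "may": u, y, e are adjacent vowels
classify_word computer   # prints "not": o and u are separated by "mp"
classify_word "don't"    # prints "not": contains an apostrophe
```

Note that Vietnamese names written without spaces, such as "Hanoi", are reported as "not" because the rule applies to a single syllable.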
About the properties of the Vietnamese language, there is an interesting
document in English here:
http://vietnamese-grammar.group.shef.ac.uk/
Still, I'm not a Vietnamese language expert, so if anybody else has
better information or links, please contribute them! :-)
--
Jean Christophe ANDRÉ, Regional technical manager
Bureau Asie-Pacifique (BAP), http://asie-pacifique.auf.org/
Agence universitaire de la Francophonie (AuF), http://www.auf.org/
Postal address: AUF, 21 Lê Thánh Tông, T.T. Hoàn Kiếm, Hà Nội, Việt Nam
Tel.: +84 4 9331108 Fax: +84 4 8247383 Mobile: +84 91 3248747
Personal note: please avoid sending me PowerPoint or Word files,
see http://www.gnu.org/philosophy/no-word-attachments.fr.html