[HanoiLUG] A better Vietnamese spellchecker

Clytie Siddall Mon, 2 Jun 2008 15:57:05 +0930

Hello everyone :)

The hanoi-spell mailing list (spell at lists.hanoilug.org) has been
created to discuss the problem of creating a more effective
spellchecking tool for Vietnamese.


This discussion will take place at least partly in English, due to the
key participation of Ivan Garcia (who created the Hunspell Vietnamese
spellchecking dictionary for OpenOffice.org [1]).

This improved Vietnamese spellchecker will be based on the essential
development efforts of Nguyễn Thái Ngọc Duy (also well-known
online as pclouds).

The goal here is to discuss this issue, collecting and coordinating
relevant information about the process of creating and implementing a
better Vietnamese spellchecker.

Our discussion so far:

Duy said:

> I have been working on vspell, a Vietnamese spell checker, since 2003
> (there were huge gaps when I did not work on it at all, though). The
> source code can be found in [2]. The core idea is quite simple. It is
> trained with a word-segmented corpus. When a sentence is given
> (actually a phrase, because it still does not understand "sentence"),
> it generates similar sentences based on common spelling errors. It
> then uses statistics from the corpus to determine which sentence (the
> original one or one of the generated ones) is "better". If a generated
> one is better, it assumes the original one is misspelled. That's
> all. The rest of the work is matching the original one against the
> "right" one to find the differences between them and tell users about
> them.
>
> The result as of three months ago was not very promising: precision
> was about 60%-70% (I expected at least 80% to be useful). I was
> investigating why the precision was low, and had some
> technical difficulties. I am currently away from home so I can't do
> any development until Sept. If you are interested in it, I will write
> a proper front-end for it when I get back home so that you can run and
> test it.
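
[A minimal sketch of the generate-and-score idea described above. It is
not taken from vspell: the confusion table, the three-sentence corpus and
all function names are made up for illustration, and syllable bigram
counts stand in for whatever statistics vspell really gathers from its
word-segmented corpus.]

```python
from collections import Counter
from itertools import product

# Hypothetical confusion sets: syllables commonly typed for one another.
CONFUSIONS = {"màu": ["màn"], "màn": ["màu"]}

# Tiny stand-in for a segmented training corpus.
CORPUS = [
    ["bật", "màn", "hình"],
    ["màn", "hình", "lớn"],
    ["màu", "đỏ"],
]

unigrams = Counter(s for sent in CORPUS for s in sent)
bigrams = Counter(pair for sent in CORPUS for pair in zip(sent, sent[1:]))
vocab_size = len(unigrams)


def score(sentence):
    """Add-one smoothed bigram probability of a syllable sequence."""
    p = 1.0
    for a, b in zip(sentence, sentence[1:]):
        p *= (bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size)
    return p


def candidates(sentence):
    """The original sentence first, then every confusion-set variant."""
    options = [[s] + CONFUSIONS.get(s, []) for s in sentence]
    return [list(c) for c in product(*options)]


def check(sentence):
    """Return the best-scoring candidate and the positions where it differs."""
    best = max(candidates(sentence), key=score)  # ties keep the original
    diffs = [i for i, (a, b) in enumerate(zip(sentence, best)) if a != b]
    return best, diffs
```

Calling check(["màu", "hình"]) prefers ["màn", "hình"] and reports
position 0 as the suspect syllable, while check(["màn", "hình"]) reports
no differences. A real checker would need far larger confusion sets and
corpus, which is where the training problems discussed below come in.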

Duy replied:

> I guess [Clytie] told you about my spelling checker for OpenOffice.
> The problems are:
> - It requires input as phrases, not words. As far as I can tell,
> hunspell and almost all spelling checkers take input as words.
> - It uses a lot of memory (>= 256MB for itself)
> - It needs to be trained to be able to differentiate good phrases and
> bad phrases. There are some difficulties in my training method (I
> don't know, I may give up on my method, as it is becoming infeasible,
> something with number explosion).
> - It's not stable. I spent most of my time training it and testing it
> with a small sample. It liked to crash back then ;)
>
> The first problem means we must interact with openoffice without
> hunspell. It's sort of a pain to do.
>
> But the main problem is the third. My approach is statistics-based. It
> requires lots of (correct) word-aligned sentences to be trained on.
> That kind of corpus for Vietnamese does not exist (at least freely).
> So my workaround is to take a raw corpus and train repeatedly to get
> a better result with each iteration. The workaround has a number
> explosion problem: it easily breaks the 32-bit integer limit and also
> the "long double" limit. In short, until I find a feasible training
> method, my spell checker is no use.
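
[The thread does not say which quantities explode during training, so
the following is only an illustration of the kind of overflow described
and of one common workaround, keeping running products as sums of
logarithms. Whether this would fit Duy's training method is an
assumption; the numbers are invented.]

```python
import math


def direct_product(factors):
    """Multiply the factors directly; overflows to inf for long inputs."""
    total = 1.0
    for x in factors:
        total *= x
    return total


def log_product(factors):
    """Sum of logarithms: the same quantity, kept in a safe numeric range."""
    return sum(math.log(x) for x in factors)


# 100 factors of 100000 multiply to 1e500, beyond a 64-bit double.
factors = [1e5] * 100
overflowed = direct_product(factors)   # becomes inf
logged = log_product(factors)          # stays an ordinary float

# Comparisons still work in log space: a larger product has a larger log.
smaller = log_product([1e5] * 99)
```

Two candidates can then be ranked by comparing their log scores, without
ever materialising the huge number itself.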

Ivan added:

> IMHO the statistical method you are trying to implement doesn't seem
> to be the most efficient way to build a spell checker, which needs to
> be lightweight and run at real-time speed.
>
> The current hunspell dictionary is a temporary solution but not
> perfect at all, because of the error mentioned by Clytie

>> [Clytie]
>> Vietnamese "words" are often composed of more than one word. Words  
>> are usually monosyllabic, so we can think of them as syllables of  
>> these longer words. However, current spellchecking tools treat each  
>> Vietnamese "syllable" as a separate word. This means that when you  
>> make a mistake that is still a valid word, e.g. typing
>>
>> màu hình
>>
>> instead of
>>
>> màn hình
>>
>> current spellchecking tools will still recognize "màu" and
>> "hình" as valid separate words, and not detect the error.

> and because of your effort to solve it with your program (composed  
> words).
>
> My question here is, shouldn't we be talking about a Grammar checker  
> instead of a Spellchecker? Spellcheckers are only supposed to check  
> the spelling of the smallest piece of information separated by  
> spaces, in the Vietnamese case, each syllable.
>
> I suggest that we start comparing the current Grammar tools for
> OpenOffice and their rules, and ask their mailing list for advice on
> which would be the fastest and most accurate way to deploy an
> efficient Vietnamese Grammar checker.
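
[To make Clytie's example concrete, here is a small sketch contrasting a
per-syllable check with a lookup of adjacent syllable pairs. The word
lists and function names are hypothetical, and the pair heuristic is
deliberately crude: it is not how hunspell or any grammar checker works.]

```python
# Hypothetical word lists, for illustration only.
SYLLABLES = {"màu", "màn", "hình", "đỏ"}
COMPOUNDS = {("màn", "hình")}  # known two-syllable words


def syllable_errors(text):
    """What a per-token checker sees: syllables missing from the list."""
    return [s for s in text.split() if s not in SYLLABLES]


def suspicious_pairs(text):
    """Adjacent pairs that differ from a known compound in one syllable."""
    syls = text.split()
    found = []
    for a, b in zip(syls, syls[1:]):
        if (a, b) in COMPOUNDS:
            continue
        for x, y in COMPOUNDS:
            if (a == x) != (b == y):  # exactly one syllable matches
                found.append(((a, b), (x, y)))
    return found
```

For "màu hình", syllable_errors finds nothing because both syllables are
valid, while suspicious_pairs points at the near miss with "màn hình".
The heuristic also flags perfectly good pairs that merely share a
syllable with a compound, which is one reason the discussion turns to
statistics and rules for deciding which reading is more plausible.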

Then Duy said:

> I know my statistical method isn't ideal. A rule-based approach
> would be more realistic. But then a rule-based one requires human
> power to build the ruleset. A rule generation approach like TBL [3]
> requires an annotated corpus, which I don't have.


> [Identifying composed words is] also what I want my spellchecker to
> do.

> [I don't think a grammar-checker is viable, due to the complexity of
> Vietnamese grammar.] My spellchecker is basically a spellchecker,
> although it could also detect some semantic/grammar mistakes.

> [Even using current grammar-checking tools,] you will have more
> trouble with Vietnamese ;) Before you discuss grammar, you must split
> a sentence into words (actually annotated words, but that's not the
> point). It's already difficult to do that in Vietnamese. Now you are
> supposed to do that on a _misspelt_ text. Good luck :D
>
> European languages don't have this problem, as distinct words can be
> easily recognized. CJK languages do, though, but I guess the CJK
> spellchecker status in OOo is just the same as for Vietnamese.
>
> To get an idea of how hard it is to split a sentence into words, let's
> take a corner case: "Ông già đi nhiều quá". You can understand the
> sentence in a couple of ways:
>
> - An old man goes a long way (Ông-già đi nhiều quá, notice
> "Ông già" is a word)
> - He gets very old (Ông già đi nhiều quá, notice "Ông già" is
> two separate words)
>
> Now suppose "già" is mistakenly written as "dà", then pass the
> sentence to a grammar checker ;)
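
[The ambiguity Duy describes can be reproduced with a few lines of code.
This is a minimal sketch assuming a toy lexicon in which "ông già" is
listed both as one word and as two; the lexicon and the two-syllable
limit are assumptions made for the example.]

```python
# Toy lexicon: "ông già" is a word, and so are "ông" and "già" alone.
LEXICON = {"ông", "già", "ông già", "đi", "nhiều", "quá"}
MAX_WORD_SYLLABLES = 2


def segmentations(syllables):
    """Return every way to split the syllables into lexicon words."""
    if not syllables:
        return [[]]
    results = []
    for n in range(1, min(MAX_WORD_SYLLABLES, len(syllables)) + 1):
        word = " ".join(syllables[:n])
        if word in LEXICON:
            for rest in segmentations(syllables[n:]):
                results.append([word] + rest)
    return results


readings = segmentations("ông già đi nhiều quá".split())
misspelt = segmentations("ông dà đi nhiều quá".split())
```

The correct sentence yields two segmentations, one per reading, and
nothing in the lexicon says which is meant. The misspelt one yields none
at all, so a grammar checker that relies on segmentation has nothing to
work with.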


Clytie contributed:

> One solution to creating a rule-based corpus might be to enlist some  
> volunteers. I use a Bayesian [4] spam program (SpamSieve for OSX)  
> which depends on the user creating his or her own corpus of emails.  
> The more you train the program with the real emails you receive, the  
> better it gets at its job. I like this approach, so I'd be happy to  
> volunteer to train a spellchecking program.
>
> Ivan is quite right, the whole "spellchecker" vs "grammar-checker"  
> is an arbitrary distinction imposed on us by European languages. It  
> really doesn't fit Vietnamese, or any of the ideographic languages.  
> Maybe we need a new description ... just "language checker" ?

> It's a complex task. We need to focus on the spelling, but perhaps  
> use some of the techniques that work for grammar-checkers, to define  
> a "term". Where English words just get longer, Vietnamese words  
> concatenate.
>
> msgid "spellchecker"
> msgstr "công cụ kiểm tra chính tả"
>
> We don't have the option of actually gluing much-used words together  
> over time to create a distinct term
>
> spell checker
> spell-checker
> spellchecker
>
> but it's not as bad as it looks. The term "spellchecker" does not  
> have to be checked as a six-word entity. It is a combination of  
> three terms:
>
> công cụ       tool
> kiểm tra      check
> chính tả      spelling
>
> The combination into a more complex word is redundant from the point  
> of view of our putative spellchecker.
>
> So, if we have even a small team of people training the software  
> with distinct terms, it wouldn't take long before the checker would  
> be pulling its own weight for the testers. Every bit of data we  
> accumulate this way is a benefit.
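
[A rough sketch of what volunteer training on distinct terms could look
like. It is not Bayesian and not based on any existing tool: the class
name, the min_votes threshold and the greedy longest-match split are all
assumptions chosen to keep the example short.]

```python
from collections import Counter


class TermTrainer:
    """Collects terms confirmed by volunteers and splits phrases with them."""

    def __init__(self, min_votes=1):
        self.min_votes = min_votes
        self.counts = Counter()

    def train(self, term):
        """Record one volunteer confirmation of a term such as 'kiểm tra'."""
        self.counts[tuple(term.split())] += 1

    def known(self, syllables):
        """True once a term has been confirmed at least min_votes times."""
        return self.counts.get(tuple(syllables), 0) >= self.min_votes

    def split_terms(self, phrase):
        """Greedy longest-match split into (text, is_known) pairs."""
        syls = phrase.split()
        longest = max((len(t) for t in self.counts), default=1)
        out, i = [], 0
        while i < len(syls):
            for n in range(min(longest, len(syls) - i), 0, -1):
                if self.known(syls[i:i + n]):
                    out.append((" ".join(syls[i:i + n]), True))
                    i += n
                    break
            else:
                # No trained term starts here: report the lone syllable.
                out.append((syls[i], False))
                i += 1
        return out


trainer = TermTrainer()
for term in ["công cụ", "kiểm tra", "chính tả"]:
    trainer.train(term)
```

After training on the three terms, "công cụ kiểm tra chính tả" splits
into three known terms rather than being treated as one six-syllable
entity, and a phrase containing an untrained syllable comes back with
that syllable marked as unknown for a volunteer to review.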

> It would be useful to talk to the CJK teams about this. It's really  
> a pan-project issue: this is one of the times when the parochial  
> structure of projects really gets in the way. Perhaps the best place  
> to discuss it outside OOo would be the Translate Toolkit list. They  
> already have a general spellchecking tool, and are involved in  
> OpenOffice.org. (These are the people who created Pootle, too.  
> They're mostly into Python and JToolkit, I think.)

>>
>> - An old man goes a long way (Ông-già đi nhiều quá, notice
>> "Ông già" is a word)
>> - He gets very old (Ông già đi nhiều quá, notice "Ông già"
>> is two separate words)
>
> He's aging quickly because he walks too far. :D
>
> Yes, the interchangeability of word positions is a difficult  
> problem. But either way, "già" is correct. The software compares
> the sentence against its corpus, and won't find an error. If there  
> is an error, the volunteer trains the software with it. Do you think  
> this could work in practice?

Duy answered:

> It would be a bit more work than just marking some mails spam or not.
> It might also require parts-of-speech tagging. I hadn't thought about
> using such an approach. I'll look into it.

Ivan added:

> I totally agree that we should get feedback from other translators'  
> and language tools developers' mailing lists.
>
> I'll start asking in the HanoiLUG and SaigonLUG groups to get more  
> feedback about what should be the best technique for achieving the  
> "word" spellchecker function in Vietnamese.
>
> The HanoiLUG administrator has been kind enough to create our own  
> mailing list for this purpose.
> http://lists.hanoilug.org/listinfo/spell
>
> I hope you don't mind if I include part of our interesting email  
> conversation concerning the spellchecker.

[And that's how we got to this point...]

Please feel free to contribute your own ideas to this discussion. :)

Clytie

Nhóm Việt hoá Phần mềm Tự do
http://vnoss.net/dokuwiki/doku.php?id=projects:l10n

[1] http://vi.openoffice.org/about-spellcheck.html
http://code.google.com/p/hunspell-spellcheck-vi/

[2] http://repo.or.cz/w/vspell.git
(click on the first "snapshot" link to get a tarball)

[3] Transformation-Based Learning. Google will give you more  
information.

[4] http://en.wikipedia.org/wiki/Bayesian_spam_filtering