IMO the hardest part of tokenizing CJK text, especially Chinese, is
word segmentation, because words are not separated by delimiters such
as spaces: a sentence is written as one unbroken string of characters.
Another problem is that in Chinese the same sentence can have entirely
different meanings depending on how it is segmented.
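
A toy illustration of that ambiguity, as a minimal sketch: the
five-entry lexicon and the function name below are made up for this
example and are not taken from apertium-zho. It simply lists every way
a string can be split into lexicon entries.

```python
def all_segmentations(text, lexicon):
    """Return every way to split `text` into pieces found in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no pieces
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this prefix and segment the remainder recursively.
            for rest in all_segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Hypothetical toy lexicon, for illustration only.
TOY_LEXICON = {"研究", "研究生", "生命", "命", "起源"}

readings = all_segmentations("研究生命起源", TOY_LEXICON)
for reading in readings:
    print(" / ".join(reading))
# 研究 / 生命 / 起源    (study / life / origin)
# 研究生 / 命 / 起源    (graduate student / fate / origin)
```

Both splits use only valid dictionary words, so the dictionary alone
cannot decide between them.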

Apertium already has a *Chinese dictionary*[1], and I have compiled
and tested it with lt-comp and lt-proc before.

Apertium's tokenization of Chinese seems to go something like this:
a dictionary holds commonly used characters and commonly used words.
When the program reads a string of characters, any run of characters
that forms a word in the dictionary is treated as a unit, i.e. it is
tokenized and analyzed as one word. The remaining characters, which do
not combine into any dictionary word, are each tokenized and analyzed
individually as a lexeme. As far as I know, Apertium has not yet
implemented translation from Chinese into other languages.
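
The behaviour described above can be sketched roughly as a greedy,
left-to-right longest match with a single-character fallback. This is
only my approximation of the idea, not lt-proc's actual code; the
function name and the toy lexicon are hypothetical.

```python
def longest_match_tokenize(text, lexicon):
    """Greedy left-to-right longest match; unmatched characters
    become single-character tokens."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # size == 1 is the fallback: emit the character on its own.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, for illustration only.
TOY_LEXICON = {"研究", "研究生", "生命", "起源"}

tokens = longest_match_tokenize("研究生命起源", TOY_LEXICON)
print(tokens)  # ['研究生', '命', '起源']
```

Note that the greedy choice commits to 研究生 ("graduate student") and
leaves 命 stranded as a single character, although 研究 / 生命 / 起源
is also a valid split. That is the kind of case an improved tokenizer
would have to handle.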

Weizhe

[1] https://github.com/apertium/apertium-zho

On Fri, Mar 27, 2020 at 9:49 PM Tommi A Pirinen <
tommi.antero.piri...@uni-hamburg.de> wrote:

> On Fri, Mar 27, 2020 at 09:58:53AM +0800, 杨伟哲 wrote:
> >
> > Of course, as a Chinese student, I would also be very happy to work
> > on the CJK. We can keep communicating about the tweaks of the plan
> > and the other details.
>
> Awesome, could you perhaps then make even a small example of how
> apertium would currently tokenise any Chinese language and how that
> would be improved. If/when there is no existing apertium dictionary you
> can make a toy example with just a handful of words, this would be very
> interesting.
>
>
> --
> Doktor Tommi A Pirinen, Computational Linguist,
> <https://flammie.github.io/purplemonkeydishwasher/>, Universität
> Hamburg, Hamburger Zentrum für Sprachkorpora <http://hzsk.de>. CLARIN-D
> Entwickler.  President of ACL SIGUR SIG for Uralic languages
> <http://gtweb.uit.no/sigur/>.
> I tend to follow inline-posting style in desktop e-mail messages.
> _______________________________________________
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>