IMO the hardest part of tokenizing CJK text, especially Chinese, is word segmentation. Chinese does not separate words with delimiters such as spaces, so a sentence appears as one unbroken string of characters. Another problem is that the same Chinese sentence can have entirely different meanings depending on how it is segmented.
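To make the ambiguity concrete, here is a minimal Python sketch that lists every way a string can be split into dictionary words. The lexicon is a hypothetical toy list chosen for illustration, not taken from any Apertium dictionary, and the function name is made up for this example.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words that all appear in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon (an assumption, for illustration only).
TOY_LEXICON = {"研究", "研究生", "生命", "命", "起源"}

for seg in segmentations("研究生命起源", TOY_LEXICON):
    print(" / ".join(seg))
```

With this lexicon the string has two segmentations: 研究 / 生命 / 起源 (roughly "research the origin of life") and 研究生 / 命 / 起源 (starting with "graduate student"). A tokenizer has to choose between them.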
Apertium already has a *Chinese dictionary*[1], and I have compiled and tested it with lt-comp and lt-proc. Apertium's tokenization of Chinese seems to work roughly like this: the dictionary contains commonly used characters and commonly used words. When the program reads a string of characters, any run of characters that forms a word in the dictionary is treated as a single unit and is tokenized and analyzed as a word. The remaining characters, which do not form part of any dictionary word, are each tokenized and analyzed individually as lexemes. As far as I know, Apertium does not yet implement translation from Chinese to other languages.

Weizhe

[1] https://github.com/apertium/apertium-zho

On Fri, Mar 27, 2020 at 9:49 PM Tommi A Pirinen <tommi.antero.piri...@uni-hamburg.de> wrote:

> On Fri, Mar 27, 2020 at 09:58:53AM +0800, 杨伟哲 wrote:
> >
> > Of course, as a Chinese student, I would also be very happy to work
> > on the CJK. We can keep communicating about the tweaks of the plan
> > and the other details.
>
> Awesome, could you perhaps then make even a small example of how
> apertium would currently tokenise any Chinese language and how that
> would be improved. If/when there is no existing apertium dictionary you
> can make a toy example with just a handful of words, this would be very
> interesting.
>
> --
> Doktor Tommi A Pirinen, Computational Linguist,
> <https://flammie.github.io/purplemonkeydishwasher/>, Universität
> Hamburg, Hamburger Zentrum für Sprachkorpora <http://hzsk.de>. CLARIN-D
> Entwickler. President of ACL SIGUR SIG for Uralic languages
> <http://gtweb.uit.no/sigur/>.
> I tend to follow inline-posting style in desktop e-mail messages.
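As a toy illustration of the dictionary-based behaviour described in the reply above, the following Python sketch does greedy left-to-right longest matching and falls back to single characters. It is not Apertium or lt-proc code. The function name and the lexicon are hypothetical, and the assumption that the longest dictionary match is preferred is my reading of the description, not something confirmed here.

```python
def tokenize(text, lexicon):
    """Greedy left-to-right longest match; unmatched characters become single tokens."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                # A single character is always accepted as a fallback token.
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon (an assumption, for illustration only).
TOY_LEXICON = {"研究", "研究生", "生命", "起源"}

print(tokenize("研究生命起源", TOY_LEXICON))
print(tokenize("我爱研究", TOY_LEXICON))
```

On 研究生命起源 the greedy strategy picks 研究生 first and leaves 命 as a single-character token, although 研究 / 生命 / 起源 is the more plausible reading. This is the kind of case where a purely dictionary-driven longest match could be improved.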
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff