> I'd like to participate in Google Summer of Code 2023 at Apertium.
> In particular, I'm interested in adding new language pair and I am
> thinking to add Japanese-English as I speak Japanese. I took summer
> school at Tokyo University online on natural language processing
> before.
> Could you tell me more about the project?
Hi,

Getting some support for Japanese would be great!

I'm not sure if you saw the whole IRC discussion, but what we really need in that regard is support for the *tokenisation* step, where our regular methods[1] fail us, since the text might have no spaces and lots of tokenisation ambiguity. There has been some prior work[2] and it's already listed as a potential GSoC project[3].

Support for anything Japanese depends on tokenisation. It's also a big enough job that it would qualify as a full GSoC project, so if you were hoping for jpn-eng in a summer you will be disappointed (but having a toy language pair to test with would help!).

On the other hand, if we get good spaceless tokenisation we open up the possibility for not just Japanese, but Thai, Lao, Chinese etc. – and of course all those writing systems used before the invention of the space character :)

regards,
Kevin

[1] https://wiki.apertium.org/wiki/LRLM
[2] http://hdl.handle.net/10066/20002
[3] https://wiki.apertium.org/wiki/Task_ideas_for_Google_Code-in/Tokenisation_for_spaceless_orthographies
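
To make the tokenisation problem above a bit more concrete, here is a rough Python sketch of greedy left-to-right longest match over text with no spaces. It is only a toy under stated assumptions: the function names, the tiny lexicons and the single-character fallback for unknown input are all made up for illustration, and none of this is Apertium's actual implementation.

```python
def lrlm_tokenise(text, lexicon):
    """Greedy left-to-right longest match (toy version).

    At each position take the longest lexicon entry that matches.
    If nothing matches, emit one character as an unknown token
    (an assumption made for this sketch only).
    """
    tokens = []
    max_len = max(map(len, lexicon))
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens


def all_segmentations(text, lexicon):
    """Every way of splitting text into lexicon entries (exponential; toy only)."""
    if not text:
        return [[]]
    results = []
    for j in range(1, len(text) + 1):
        if text[:j] in lexicon:
            for rest in all_segmentations(text[j:], lexicon):
                results.append([text[:j]] + rest)
    return results


# Hypothetical lexicons, with English standing in for a spaceless script.
ambiguous = {"i", "am", "no", "now", "here", "where", "nowhere"}
dead_end = {"a", "abc", "bcd"}

print(lrlm_tokenise("iamnowhere", ambiguous))      # the one greedy analysis
print(all_segmentations("iamnowhere", ambiguous))  # all candidate analyses
print(lrlm_tokenise("abcd", dead_end))             # greedy choice strands "d"
print(all_segmentations("abcd", dead_end))         # a full analysis does exist
```

With spaces in the input, each word is matched more or less on its own, so the greedy choice rarely matters. Without spaces, the longest match commits to one reading out of several, and can even block a segmentation that would have covered the whole string.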
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff