> I'd like to participate in Google Summer of Code 2023 at Apertium.
> In particular, I'm interested in adding a new language pair, and I am
> thinking of adding Japanese-English as I speak Japanese. I previously
> took an online summer school at Tokyo University on natural language
> processing.
> Could you tell me more about the project?

Hi,

Getting some support for Japanese would be great! I'm not sure if you
saw the whole IRC discussion, but what we really need in that regard is
support for the *tokenisation* step, where our regular methods[1] fail
us, since the text might have no spaces and lots of
tokenisation-ambiguity. There has been some prior work[2] and it's
already listed as a potential GSoC project[3].
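
To make the ambiguity concrete, here is a minimal toy sketch in Python.
It is not Apertium's actual implementation: the lexicon, the input
string and both function names are made up for illustration. It
compares greedy left-to-right longest match with an exhaustive listing
of every segmentation the lexicon allows.

```python
# Toy lexicon and input; both are hypothetical, chosen only to show ambiguity.
LEXICON = {"the", "them", "me", "men", "end", "mend"}


def longest_match(text, lexicon):
    """Greedy left-to-right longest match.

    Returns a list of tokens, or None if no lexicon entry matches
    at some position (the greedy choice got stuck).
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            return None  # nothing in the lexicon starts here
    return tokens


def all_segmentations(text, lexicon):
    """Return every way to split text into lexicon entries."""
    if not text:
        return [[]]
    results = []
    for j in range(1, len(text) + 1):
        head = text[:j]
        if head in lexicon:
            for rest in all_segmentations(text[j:], lexicon):
                results.append([head] + rest)
    return results


if __name__ == "__main__":
    print(longest_match("themend", LEXICON))
    print(all_segmentations("themend", LEXICON))
```

With spaces in the input, the greedy pass only has to decide things
within short stretches. Without them, it commits to "them" + "end" even
though "the" + "mend" is just as valid according to the lexicon, and
nothing in the greedy procedure tells us which one the writer meant.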

Support for anything-Japanese depends on tokenisation. It's also a big
enough job that it would qualify as a full GSoC project, so if you were
hoping for jpn-eng in a summer you will be disappointed (but having a
toy language pair to test with would help!). On the other hand, if we
get good spaceless tokenisation we open up the possibility for not just
Japanese, but Thai, Lao, Chinese etc. – and of course all those writing
systems used before the invention of the space character :)

regards,
Kevin

[1] https://wiki.apertium.org/wiki/LRLM
[2] http://hdl.handle.net/10066/20002
[3] https://wiki.apertium.org/wiki/Task_ideas_for_Google_Code-in/Tokenisation_for_spaceless_orthographies


_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
