Hello, I am thinking of working on integrating the apertium-iii tokeniser into apertium-jpn, as Jonathan-san suggested. Do I need language data for it? I have already installed the dev tools locally.
Also, I’ve found an issue in apertium-jpn, and I wonder whether I should work on it as something like a coding challenge?

Cheers,

*Sorry for the inconvenience of asking by email. IRC seems to be behaving oddly for my account at the moment.

On Mon, 27 Feb 2023 at 01:08, Jonathan Washington <jonathan.n.washing...@gmail.com> wrote:

> Hi Eijisan,
>
> There's also the tokeniser used for Nuosu, which uses the transducer
> itself to tokenise:
> https://github.com/apertium/apertium-iii
>
> I believe this is a later implementation of what's described in the thesis
> sent by Kevin in [2].
>
> This method has some downsides, but it also has some advantages over a
> statistical model. Perhaps a way to get started would be to explore the
> pros and cons of each approach, and think about what a hybrid model could
> achieve. It would be good to join the IRC channel to discuss all this with
> the mentors.
>
> Another good way to get started (and it would help you do the above too)
> would be to integrate the tokeniser from apertium-iii into apertium-jpn:
> https://github.com/apertium/apertium-jpn
>
> You would need to modify the Makefile.am, the modes.xml file, drop in the
> tokeniser script, and that's about it? Then see if you can get it to
> analyse text without spaces (test it first with the same text,
> hand-tokenised, to see what the output is). Again, come to IRC for
> guidance.
>
> The tokeniser.py script is a bit slow, mainly because of Python string
> processing. Rewriting it in C/C++ would be useful, and also a good way to
> get a better handle on how it works.
>
> --
> Jonathan
>
>
> On Fri, Feb 24, 2023, 13:03 Eiji Miyamoto <motopo...@gmail.com> wrote:
>
>> Thank you for your reply. The project seems cool to work on for GSoC 2023,
>> and I would like to participate. I reckon there are two tasks on the
>> page; could you tell me where to start?
>>
>> On Fri, 24 Feb 2023 at 08:20, Kevin Brubeck Unhammer <unham...@fsfe.org>
>> wrote:
>>
>>> > I'd like to participate in Google Summer of Code 2023 at Apertium.
>>> > In particular, I'm interested in adding new language pair and I am
>>> > thinking to add Japanese-English as I speak Japanese. I took summer
>>> > school at Tokyo University online on natural language processing
>>> > before.
>>> > Could you tell me more about the project?
>>>
>>> Hi,
>>>
>>> Getting some support for Japanese would be great! I'm not sure if you
>>> saw the whole IRC discussion, but what we really need in that regard is
>>> support for the *tokenisation* step, where our regular methods[1] fail
>>> us, since the text might have no spaces and lots of
>>> tokenisation-ambiguity. There has been some prior work[2] and it's
>>> already listed as a potential GSoC project.
>>>
>>> Support for anything-Japanese depends on tokenisation. It's also a big
>>> enough job that it would qualify as a full GSoC project, so if you were
>>> hoping for jpn-eng in a summer you will be disappointed (but having a
>>> toy language pair to test with would help!). On the other hand, if we
>>> get good spaceless tokenisation we open up the possibility for not just
>>> Japanese, but Thai, Lao, Chinese etc.
>>> – and of course all those writing
>>> systems used before the invention of the space character :)
>>>
>>> regards,
>>> Kevin
>>>
>>> [1] https://wiki.apertium.org/wiki/LRLM
>>> [2] http://hdl.handle.net/10066/20002
>>> [3] https://wiki.apertium.org/wiki/Task_ideas_for_Google_Code-in/Tokenisation_for_spaceless_orthographies
>>> _______________________________________________
>>> Apertium-stuff mailing list
>>> Apertium-stuff@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
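
To make the tokenisation-ambiguity point from the thread above concrete, here is a rough Python sketch. It is not code from apertium-iii or from tokeniser.py: the lexicon is a toy set invented for illustration (a real system would query the morphological transducer instead, and the set is assumed to be non-empty), and the function names `longest_match` and `all_segmentations` are hypothetical. It compares a simplified greedy left-to-right longest-match split with an enumeration of every split the lexicon allows for an unspaced string.

```python
from functools import lru_cache

# Toy lexicon, invented for this example; a stand-in for a real
# morphological transducer.
LEXICON = frozenset({"すもも", "もも", "も", "の", "うち"})


def longest_match(text, lexicon=LEXICON):
    """Greedy split: at each position take the longest lexicon entry."""
    max_len = max(len(w) for w in lexicon)
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No entry starts here: emit the single character as a token.
            tokens.append(text[i])
            i += 1
    return tokens


def all_segmentations(text, lexicon=LEXICON):
    """Every way of splitting text into lexicon entries only."""
    max_len = max(len(w) for w in lexicon)

    @lru_cache(maxsize=None)
    def from_pos(i):
        if i == len(text):
            return ((),)  # one way to segment the empty remainder
        found = []
        for j in range(i + 1, min(len(text), i + max_len) + 1):
            if text[i:j] in lexicon:
                found.extend((text[i:j],) + rest for rest in from_pos(j))
        return tuple(found)

    return [list(seg) for seg in from_pos(0)]


if __name__ == "__main__":
    proverb = "すもももももももものうち"  # "sumomo mo momo mo momo no uchi"
    print(longest_match(proverb))
    print(len(all_segmentations(proverb)))  # 13 candidate splits
```

With this toy lexicon the greedy split gives すもも/もも/もも/もも/の/うち, while the intended reading すもも/も/もも/も/もも/の/うち is only one of the 13 candidates. Choosing among such candidates is the part where a transducer-based, statistical, or hybrid approach would have to do the real work.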