Okay, thank you. I will work on the integration and then try to do other tasks!
On Wed, 1 Mar 2023 at 14:56, Daniel Swanson <awesomeevildu...@gmail.com> wrote:
> You're certainly welcome to submit pull requests on the Japanese
> repository, but due to the tokenization problems, that probably
> shouldn't be your entire coding challenge, since we also need to see
> that you can work on that aspect of the project.
>
> Daniel
>
> On Wed, Mar 1, 2023 at 9:49 AM Eiji Miyamoto <motopo...@gmail.com> wrote:
> >
> > Hello, I am thinking of working on the integration of apertium-3 into
> > apertium-jpn, as Jonathan-san suggested. Do I need language data for it?
> > I have already installed the dev tools locally.
> >
> > Also, I’ve found an issue in apertium-jpn, and I wonder whether I should
> > work on it as something like a coding challenge?
> >
> > Cheers,
> >
> > *Sorry for the inconvenience of asking by email. IRC is behaving
> > strangely for my account right now.
> >
> > On Mon, 27 Feb 2023 at 01:08, Jonathan Washington <jonathan.n.washing...@gmail.com> wrote:
> >>
> >> Hi Eiji-san,
> >>
> >> There's also the tokeniser used for Nuosu, which uses the transducer
> >> itself to tokenise:
> >> https://github.com/apertium/apertium-iii
> >>
> >> I believe this is a later implementation of what's described in the
> >> thesis sent by Kevin in [2].
> >>
> >> This method has some downsides, but it also has some advantages over a
> >> statistical model. Perhaps a way to get started would be to explore the
> >> pros and cons of each approach, and think about what a hybrid model
> >> could achieve. It would be good to join the IRC channel to discuss all
> >> this with the mentors.
> >>
> >> Another good way to get started (and it would help you do the above
> >> too) would be to integrate the tokeniser from apertium-iii into
> >> apertium-jpn:
> >> https://github.com/apertium/apertium-jpn
> >>
> >> You would need to modify the Makefile.am, the modes.xml file, drop in
> >> the tokeniser script, and that's about it? Then see if you can get it
> >> to analyse text without spaces (test it first with the same text,
> >> hand-tokenised, to see what the output is). Again, come to IRC for
> >> guidance.
> >>
> >> The tokeniser.py script is a bit slow, mainly because of Python string
> >> processing. Rewriting it in C/C++ would be useful, and also a good way
> >> to get a better handle on how it works.
> >>
> >> --
> >> Jonathan
> >>
> >>
> >> On Fri, Feb 24, 2023, 13:03 Eiji Miyamoto <motopo...@gmail.com> wrote:
> >>>
> >>> Thank you for your reply. The project seems cool to work on for
> >>> GSoC 2023, and I would like to participate in it. I reckon there are
> >>> two tasks on the page; could you tell me where to start?
> >>>
> >>> On Fri, 24 Feb 2023 at 08:20, Kevin Brubeck Unhammer <unham...@fsfe.org> wrote:
> >>>>
> >>>> > I'd like to participate in Google Summer of Code 2023 at Apertium.
> >>>> > In particular, I'm interested in adding a new language pair and I
> >>>> > am thinking of adding Japanese-English, as I speak Japanese. I took
> >>>> > an online summer school at Tokyo University on natural language
> >>>> > processing before.
> >>>> > Could you tell me more about the project?
> >>>>
> >>>> Hi,
> >>>>
> >>>> Getting some support for Japanese would be great! I'm not sure if you
> >>>> saw the whole IRC discussion, but what we really need in that regard
> >>>> is support for the *tokenisation* step, where our regular methods[1]
> >>>> fail us, since the text might have no spaces and lots of
> >>>> tokenisation-ambiguity. There has been some prior work[2] and it's
> >>>> already listed as a potential GSoC project[3].
> >>>>
> >>>> Support for anything-Japanese depends on tokenisation. It's also a
> >>>> big enough job that it would qualify as a full GSoC project, so if
> >>>> you were hoping for jpn-eng in a summer you will be disappointed (but
> >>>> having a toy language pair to test with would help!). On the other
> >>>> hand, if we get good spaceless tokenisation we open up the
> >>>> possibility for not just Japanese, but Thai, Lao, Chinese etc. – and
> >>>> of course all those writing systems used before the invention of the
> >>>> space character :)
> >>>>
> >>>> regards,
> >>>> Kevin
> >>>>
> >>>> [1] https://wiki.apertium.org/wiki/LRLM
> >>>> [2] http://hdl.handle.net/10066/20002
> >>>> [3] https://wiki.apertium.org/wiki/Task_ideas_for_Google_Code-in/Tokenisation_for_spaceless_orthographies
> >>>> _______________________________________________
> >>>> Apertium-stuff mailing list
> >>>> Apertium-stuff@lists.sourceforge.net
> >>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
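
As a rough illustration of the tokenisation problem Kevin describes in the quoted thread (left-to-right longest match breaking down on text without spaces), here is a minimal Python sketch. Everything in it is an assumption made up for illustration: the toy English lexicon, the test string, and the function names `longest_match` and `all_segmentations`. It uses a plain set of words rather than a transducer, so it is not how Apertium's LRLM or the apertium-iii tokeniser is actually implemented.

```python
# Toy lexicon and input (hypothetical, chosen only to show the ambiguity).
LEXICON = {"the", "theta", "table", "bled", "own", "down", "there"}
TEXT = "thetabledownthere"


def longest_match(text, lexicon):
    """Greedy left-to-right pass: always take the longest known word."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate starting at i first.
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No lexicon entry starts here: emit one character as unknown.
            tokens.append(text[i])
            i += 1
    return tokens


def all_segmentations(text, lexicon):
    """Every way of splitting the text entirely into lexicon words."""
    if not text:
        return [[]]
    results = []
    for j in range(1, len(text) + 1):
        if text[:j] in lexicon:
            for rest in all_segmentations(text[j:], lexicon):
                results.append([text[:j]] + rest)
    return results


if __name__ == "__main__":
    print(longest_match(TEXT, LEXICON))
    for segmentation in all_segmentations(TEXT, LEXICON):
        print(segmentation)
```

On this toy input the greedy pass commits to `theta` and never recovers `the table down there`, while the exhaustive search finds both readings and leaves the choice to a later disambiguation step.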