Re: [Apertium-stuff] GSOC2023

Daniel Swanson Wed, 01 Mar 2023 06:56:10 -0800

You're certainly welcome to submit pull requests on the Japanese
repository, but due to the tokenization problems, that probably
shouldn't be your entire coding challenge, since we also need to see
that you can work on that aspect of the project.


Daniel

On Wed, Mar 1, 2023 at 9:49 AM Eiji Miyamoto <motopo...@gmail.com> wrote:
>
> Hello, I am thinking of working on the integration of apertium-iii into
> apertium-jpn, as Jonathan-san suggested. Do I need language data for it? I
> have already installed the dev tools locally.
>
> Also, I’ve found an issue in apertium-jpn, and I wonder whether I should work
> on it as something like a coding challenge?
>
> Cheers,
>
> *Sorry for the inconvenience of asking by email. IRC seems to be behaving
> oddly for my account at the moment.
>
> On Mon, 27 Feb 2023 at 01:08, Jonathan Washington 
> <jonathan.n.washing...@gmail.com> wrote:
>>
>> Hi Eiji-san,
>>
>> There's also the tokeniser used for Nuosu, which uses the transducer itself 
>> to tokenise:
>> https://github.com/apertium/apertium-iii
>>
>> I believe this is a later implementation of what's described in the thesis 
>> sent by Kevin in [2].
>>
>> This method has some downsides, but it also has some advantages over a 
>> statistical model.  Perhaps a way to get started would be to explore the 
>> pros and cons of each approach, and think about what a hybrid model could 
>> achieve.  It would be good to join the IRC channel to discuss all this with 
>> the mentors.
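
One possible reading of "hybrid" is sketched below in Python: a lexicon proposes the candidate tokens (standing in for the transducer) and word frequencies choose among the resulting segmentations. The counts, the lexicon, and the name best_segmentation are all made up for illustration; none of this is taken from apertium-iii or apertium-jpn.

```python
import math
from typing import Dict, List, Optional, Tuple

# Hypothetical counts, invented for illustration only (not from any corpus).
COUNTS: Dict[str, int] = {
    "ここ": 50, "で": 400, "では": 120,
    "はきもの": 30, "きもの": 20, "を": 500, "ぬぐ": 10,
}
TOTAL = sum(COUNTS.values())
MAX_LEN = max(len(word) for word in COUNTS)


def best_segmentation(text: str) -> Optional[List[str]]:
    """Return the lowest-cost split of text into known words, or None."""
    # best[i] holds (cost, start of last word) for the cheapest split of text[:i].
    best: List[Optional[Tuple[float, int]]] = [None] * (len(text) + 1)
    best[0] = (0.0, 0)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - MAX_LEN), end):
            word = text[start:end]
            if best[start] is None or word not in COUNTS:
                continue
            # Cost is the negative log of the word's relative frequency.
            cost = best[start][0] - math.log(COUNTS[word] / TOTAL)
            if best[end] is None or cost < best[end][0]:
                best[end] = (cost, start)
    if best[-1] is None:
        return None  # no split into known words exists
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens: List[str] = []
    end = len(text)
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]


if __name__ == "__main__":
    print(best_segmentation("ここではきものをぬぐ"))
```

With these invented counts the cheapest path splits the sentence as ここ で はきもの を ぬぐ, whereas a purely greedy longest match would take では first. Different counts would flip the result, which is part of the trade-off to explore.
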
>>
>> Another good way to get started (and it would help you do the above too) 
>> would be to integrate the tokeniser from apertium-iii into apertium-jpn:
>> https://github.com/apertium/apertium-jpn
>>
>> You would need to modify the Makefile.am, the modes.xml file, drop in the 
>> tokeniser script, and that's about it?  Then see if you can get it to 
>> analyse text without spaces (test it first with the same text, 
>> hand-tokenised, to see what the output is).  Again, come to IRC for guidance.
>>
>> The tokeniser.py script is a bit slow, mainly because of Python string 
>> processing.  Rewriting it in C/C++ would be useful, and also a good way to 
>> get a better handle on how it works.
>>
>> --
>> Jonathan
>>
>>
>> On Fri, Feb 24, 2023, 13:03 Eiji Miyamoto <motopo...@gmail.com> wrote:
>>>
>>> Thank you for your reply. The project seems cool to work on for GSOC2023,
>>> and I would like to participate in it. I reckon there are two tasks on the
>>> page; could you tell me where to start?
>>>
>>> On Fri, 24 Feb 2023 at 08:20, Kevin Brubeck Unhammer <unham...@fsfe.org> 
>>> wrote:
>>>>
>>>> > I'd like to participate in Google Summer of Code 2023 at Apertium.
>>>> > In particular, I'm interested in adding a new language pair, and I am
>>>> > thinking of adding Japanese-English as I speak Japanese. I have
>>>> > previously taken an online summer school on natural language
>>>> > processing at Tokyo University.
>>>> > Could you tell me more about the project?
>>>>
>>>> Hi,
>>>>
>>>> Getting some support for Japanese would be great! I'm not sure if you
>>>> saw the whole IRC discussion, but what we really need in that regard is
>>>> support for the *tokenisation* step, where our regular methods[1] fail
>>>> us, since the text might have no spaces and lots of
>>>> tokenisation-ambiguity. There has been some prior work[2] and it's
>>>> already listed as a potential GSoC project[3].
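
As a rough illustration of why longest-match struggles here, the Python sketch below runs a greedy left-to-right longest match over a toy lexicon and then enumerates every lexicon-compatible split of the same spaceless sentence. The lexicon and function names are hypothetical stand-ins, not Apertium's actual analyser or its LRLM implementation.

```python
from typing import List, Set

# Toy lexicon: a hypothetical stand-in for a real morphological analyser.
LEXICON: Set[str] = {"ここ", "で", "では", "はきもの", "きもの", "を", "ぬぐ"}
MAX_LEN = max(len(word) for word in LEXICON)


def longest_match(text: str) -> List[str]:
    """Greedy left-to-right longest match: commits to one token at a time."""
    tokens: List[str] = []
    pos = 0
    while pos < len(text):
        for length in range(min(MAX_LEN, len(text) - pos), 0, -1):
            if text[pos:pos + length] in LEXICON:
                tokens.append(text[pos:pos + length])
                pos += length
                break
        else:
            # Unknown character: emit it alone so the loop always advances.
            tokens.append(text[pos])
            pos += 1
    return tokens


def all_segmentations(text: str) -> List[List[str]]:
    """Every split of text into lexicon entries (exponential in the worst case)."""
    if not text:
        return [[]]
    results: List[List[str]] = []
    for length in range(1, min(MAX_LEN, len(text)) + 1):
        head = text[:length]
        if head in LEXICON:
            results.extend([head] + rest for rest in all_segmentations(text[length:]))
    return results


if __name__ == "__main__":
    sentence = "ここではきものをぬぐ"
    print(" ".join(longest_match(sentence)))
    for segmentation in all_segmentations(sentence):
        print(" ".join(segmentation))
```

The greedy pass commits to では and returns only the "take off the kimono here" reading, while the exhaustive search also finds ここ で はきもの を ぬぐ ("take off footwear here"); nothing in the longest-match rule itself says which one is intended.
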
>>>>
>>>> Support for anything-Japanese depends on tokenisation. It's also a big
>>>> enough job that it would qualify as a full GSoC project, so if you were
>>>> hoping for jpn-eng in a summer you will be disappointed (but having a
>>>> toy language pair to test with would help!). On the other hand, if we
>>>> get good spaceless tokenisation we open up the possibility for not just
>>>> Japanese, but Thai, Lao, Chinese etc. – and of course all those writing
>>>> systems used before the invention of the space character :)
>>>>
>>>> regards,
>>>> Kevin
>>>>
>>>> [1] https://wiki.apertium.org/wiki/LRLM
>>>> [2] http://hdl.handle.net/10066/20002
>>>> [3] https://wiki.apertium.org/wiki/Task_ideas_for_Google_Code-in/Tokenisation_for_spaceless_orthographies
>>>> _______________________________________________
>>>> Apertium-stuff mailing list
>>>> Apertium-stuff@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
