Re: [Apertium-stuff] GSOC2023

Eiji Miyamoto Wed, 01 Mar 2023 15:26:04 -0800

Okay, thank you. I will work on the integration and then try to do other
tasks!



On Wed, 1 Mar 2023 at 14:56, Daniel Swanson <awesomeevildu...@gmail.com>
wrote:

> You're certainly welcome to submit pull requests on the Japanese
> repository, but due to the tokenization problems, that probably
> shouldn't be your entire coding challenge, since we also need to see
> that you can work on that aspect of the project.
>
> Daniel
>
> On Wed, Mar 1, 2023 at 9:49 AM Eiji Miyamoto <motopo...@gmail.com> wrote:
> >
> > Hello, I am thinking of working on the integration of apertium-iii into
> > apertium-jpn, as Jonathan-san suggested. Do I need language data for it?
> > I have already installed the dev tools locally.
> >
> > Also, I’ve found an issue in apertium-jpn. Should I work on it as
> > something like a coding challenge?
> >
> > Cheers,
> >
> > *Sorry for the inconvenience of asking by email. IRC seems to be
> > misbehaving for my account at the moment.
> >
> > On Mon, 27 Feb 2023 at 01:08, Jonathan Washington <jonathan.n.washing...@gmail.com> wrote:
> >>
> >> Hi Eiji-san,
> >>
> >> There's also the tokeniser used for Nuosu, which uses the transducer
> >> itself to tokenise:
> >> https://github.com/apertium/apertium-iii
> >>
> >> I believe this is a later implementation of what's described in the
> >> thesis sent by Kevin in [2].
> >>
> >> This method has some downsides, but it also has some advantages over a
> >> statistical model.  Perhaps a way to get started would be to explore the
> >> pros and cons of each approach, and think about what a hybrid model could
> >> achieve.  It would be good to join the IRC channel to discuss all this with
> >> the mentors.
> >>
> >> Another good way to get started (and it would help you do the above
> >> too) would be to integrate the tokeniser from apertium-iii into
> >> apertium-jpn:
> >> https://github.com/apertium/apertium-jpn
> >>
> >> You would need to modify the Makefile.am, the modes.xml file, drop in
> >> the tokeniser script, and that's about it?  Then see if you can get it to
> >> analyse text without spaces (test it first with the same text,
> >> hand-tokenised, to see what the output is).  Again, come to IRC for
> >> guidance.
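
One way to make the hand-tokenised comparison concrete is sketched below. It is only an illustration: `boundaries`, `compare`, and the stand-in `naive` tokeniser are hypothetical names, not part of apertium-iii or apertium-jpn, and the real check would run the actual analyser on both inputs. The sketch takes a hand-tokenised line, removes the spaces, passes the spaceless text to a tokeniser function, and reports which token boundaries are missing or spurious.

```python
def boundaries(tokens):
    """Return the set of character offsets at which a token ends."""
    ends, pos = set(), 0
    for tok in tokens:
        pos += len(tok)
        ends.add(pos)
    return ends


def compare(hand_tokenised, tokenise):
    """Compare a tokeniser's output on spaceless text with a hand tokenisation.

    hand_tokenised: the reference text, tokens separated by spaces.
    tokenise: any callable taking a string and returning a list of tokens
              (a stand-in for the real tokeniser; hypothetical interface).
    """
    gold = hand_tokenised.split()
    spaceless = "".join(gold)
    predicted = tokenise(spaceless)
    # The tokeniser must not add or drop characters.
    assert "".join(predicted) == spaceless
    g, p = boundaries(gold), boundaries(predicted)
    return {
        "gold": gold,
        "predicted": predicted,
        "missing": sorted(g - p),   # boundaries the tokeniser failed to place
        "spurious": sorted(p - g),  # boundaries the tokeniser should not have placed
    }


def naive(text):
    """Stand-in tokeniser for the demo: every character is its own token."""
    return list(text)


report = compare("くる まで まつ", naive)
print(report["missing"], report["spurious"])
```

With the stand-in tokeniser every hand-placed boundary is found, but three extra boundaries (after offsets 1, 3, and 5) are reported as spurious.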
> >>
> >> The tokeniser.py script is a bit slow, mainly because of Python string
> >> processing.  Rewriting it in C/C++ would be useful, and also a good way to
> >> get a better handle on how it works.
> >>
> >> --
> >> Jonathan
> >>
> >>
> >> On Fri, Feb 24, 2023, 13:03 Eiji Miyamoto <motopo...@gmail.com> wrote:
> >>>
> >>> Thank you for your reply. The project seems cool to work on for
> >>> GSOC2023, and I would like to participate in it. I reckon there are two
> >>> tasks on the page; could you tell me where to start?
> >>>
> >>> On Fri, 24 Feb 2023 at 08:20, Kevin Brubeck Unhammer <unham...@fsfe.org> wrote:
> >>>>
> >>>> > I'd like to participate in Google Summer of Code 2023 at Apertium.
> >>>> > In particular, I'm interested in adding a new language pair, and I am
> >>>> > thinking of adding Japanese-English as I speak Japanese. I previously
> >>>> > took an online summer school on natural language processing at Tokyo
> >>>> > University.
> >>>> > Could you tell me more about the project?
> >>>>
> >>>> Hi,
> >>>>
> >>>> Getting some support for Japanese would be great! I'm not sure if you
> >>>> saw the whole IRC discussion, but what we really need in that regard is
> >>>> support for the *tokenisation* step, where our regular methods[1] fail
> >>>> us, since the text might have no spaces and lots of
> >>>> tokenisation-ambiguity. There has been some prior work[2] and it's
> >>>> already listed as a potential GSoC project[3].
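
To illustrate the tokenisation ambiguity described above, here is a minimal sketch in Python. It is not how lttoolbox or the apertium-iii tokeniser is implemented. The five-entry `LEXICON` is a toy assumption, and `longest_match` and `all_segmentations` are hypothetical helper names. The sketch contrasts a greedy left-to-right longest match, which commits to a single reading, with an exhaustive search that finds every segmentation the lexicon allows.

```python
# Toy lexicon (an assumption for illustration, not real apertium-jpn data).
LEXICON = {"くるま", "くる", "まで", "で", "まつ"}


def longest_match(text, lexicon):
    """Greedy left-to-right longest match.

    At each position take the longest lexicon entry that matches;
    characters that match nothing become single-character tokens.
    """
    tokens = []
    max_len = max(map(len, lexicon))
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No entry starts here: emit the character on its own.
            tokens.append(text[i])
            i += 1
    return tokens


def all_segmentations(text, lexicon):
    """Return every way of splitting text into lexicon entries."""
    if not text:
        return [[]]
    results = []
    for j in range(1, len(text) + 1):
        head = text[:j]
        if head in lexicon:
            for rest in all_segmentations(text[j:], lexicon):
                results.append([head] + rest)
    return results


sentence = "くるまでまつ"
print(longest_match(sentence, LEXICON))
for seg in all_segmentations(sentence, LEXICON):
    print(" ".join(seg))
```

In this toy setup the greedy pass returns only くるま で まつ ("wait in the car"), while the exhaustive search also finds くる まで まつ ("wait until [someone] comes"). Choosing between such readings is the part a tokeniser for spaceless text has to handle.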
> >>>>
> >>>> Support for anything-Japanese depends on tokenisation. It's also a big
> >>>> enough job that it would qualify as a full GSoC project, so if you were
> >>>> hoping for jpn-eng in a summer you will be disappointed (but having a
> >>>> toy language pair to test with would help!). On the other hand, if we
> >>>> get good spaceless tokenisation we open up the possibility for not just
> >>>> Japanese, but Thai, Lao, Chinese etc. – and of course all those writing
> >>>> systems used before the invention of the space character :)
> >>>>
> >>>> regards,
> >>>> Kevin
> >>>>
> >>>> [1] https://wiki.apertium.org/wiki/LRLM
> >>>> [2] http://hdl.handle.net/10066/20002
> >>>> [3] https://wiki.apertium.org/wiki/Task_ideas_for_Google_Code-in/Tokenisation_for_spaceless_orthographies
> >>>> _______________________________________________
> >>>> Apertium-stuff mailing list
> >>>> Apertium-stuff@lists.sourceforge.net
> >>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
