Re: [Apertium-stuff] gsoc2023 proposal
Thank you for your feedback! I will make my proposal more detailed and revise some of the weekly goals too.

On Sat, 18 Mar 2023 at 10:05, Kevin Brubeck Unhammer wrote:
> > Hello, I have finished my first draft and I would love to get any
> > feedback from potential mentors.
> > https://wiki.apertium.org/wiki/User:Eiji
>
> Hi,
>
> This looks promising :) Some thoughts:
>
> You've already made kind of an overview of the possibilities in your
> proposal; I would tone down the "investigate possibilities" parts and
> instead try to focus on how you're going to implement one of the
> methods, using apertium-jpn as a testbed.
>
> Try to make clear deliverables per week or at least every other week;
> you should have something like a proof-of-concept by week 2 – especially
> if your ambition is to also work on improving the Japanese language
> data. You currently have week 6 for testing – but you should be testing
> from the start alongside the coding. I would probably plan for 2 weeks
> for converting the PoC from Python to C++ and making it usable as a part
> of the pipeline.
>
> (Think about how this will be integrated into apertium – we have a
> translation pipeline which expects a certain format
> https://wiki.apertium.org/wiki/Apertium_stream_format )
>
> best regards,
> Kevin Brubeck Unhammer

___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
Re: [Apertium-stuff] gsoc2023 proposal
> Hello, I have finished my first draft and I would love to get any
> feedback from potential mentors.
> https://wiki.apertium.org/wiki/User:Eiji

Hi,

This looks promising :) Some thoughts:

You've already made kind of an overview of the possibilities in your
proposal; I would tone down the "investigate possibilities" parts and
instead try to focus on how you're going to implement one of the
methods, using apertium-jpn as a testbed.

Try to make clear deliverables per week or at least every other week;
you should have something like a proof-of-concept by week 2 – especially
if your ambition is to also work on improving the Japanese language
data. You currently have week 6 for testing – but you should be testing
from the start alongside the coding. I would probably plan for 2 weeks
for converting the PoC from Python to C++ and making it usable as a part
of the pipeline.

(Think about how this will be integrated into apertium – we have a
translation pipeline which expects a certain format
https://wiki.apertium.org/wiki/Apertium_stream_format )

best regards,
Kevin Brubeck Unhammer
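To make the stream-format point concrete, here is a minimal sketch of how a tokeniser's output could be wrapped as lexical units for the pipeline. It is an illustration only: the function `to_stream`, the toy `ANALYSES` table and the tags `<n>` and `<post>` are assumptions invented for this example and are not taken from apertium-jpn. The sketch also skips the escaping of reserved characters that a real pipeline stage would need.

```python
def to_stream(tokens, analyses):
    """Wrap tokens as Apertium-style lexical units: ^surface/reading1/reading2$.

    Tokens with no known reading are marked unknown as ^surface/*surface$.
    """
    units = []
    for surface in tokens:
        readings = analyses.get(surface, [])
        if readings:
            units.append("^" + "/".join([surface, *readings]) + "$")
        else:
            units.append("^" + surface + "/*" + surface + "$")
    # Spaceless text has no blanks between units, so they are simply concatenated.
    return "".join(units)


# Hypothetical analyses; a real run would get these from the morphological analyser.
ANALYSES = {
    "猫": ["猫<n>"],
    "が": ["が<post>"],
}

print(to_stream(["猫", "が", "xyz"], ANALYSES))
```

With the toy table above this prints `^猫/猫<n>$^が/が<post>$^xyz/*xyz$`: two analysed units followed by one unknown unit.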
Re: [Apertium-stuff] GSOC2023
Okay, thank you. I will work on the integration and then try to do other tasks!

On Wed, 1 Mar 2023 at 14:56, Daniel Swanson wrote:
> You're certainly welcome to submit pull requests on the Japanese
> repository, but due to the tokenization problems, that probably
> shouldn't be your entire coding challenge, since we also need to see
> that you can work on that aspect of the project.
>
> Daniel
Re: [Apertium-stuff] GSOC2023
You're certainly welcome to submit pull requests on the Japanese
repository, but due to the tokenization problems, that probably
shouldn't be your entire coding challenge, since we also need to see
that you can work on that aspect of the project.

Daniel

On Wed, Mar 1, 2023 at 9:49 AM Eiji Miyamoto wrote:
> Hello, I am thinking of working on the integration of apertium-iii into
> apertium-jpn, as Jonathan-san suggested. Do I need language data for it?
> I have already installed the dev tools locally.
>
> Also, I've found an issue in apertium-jpn, and I wonder whether I should
> work on it as something like a coding challenge?
>
> Cheers,
>
> *Sorry for the inconvenience of asking by email. IRC seems to behave
> strangely for my account right now.
Re: [Apertium-stuff] GSOC2023
Hello,

I am thinking of working on the integration of apertium-iii into apertium-jpn, as Jonathan-san suggested. Do I need language data for it? I have already installed the dev tools locally.

Also, I've found an issue in apertium-jpn, and I wonder whether I should work on it as something like a coding challenge?

Cheers,

*Sorry for the inconvenience of asking by email. IRC seems to behave strangely for my account right now.

On Mon, 27 Feb 2023 at 01:08, Jonathan Washington <jonathan.n.washing...@gmail.com> wrote:
> Hi Eiji-san,
>
> There's also the tokeniser used for Nuosu, which uses the transducer
> itself to tokenise:
> https://github.com/apertium/apertium-iii
>
> I believe this is a later implementation of what's described in the
> thesis sent by Kevin in [2].
>
> This method has some downsides, but it also has some advantages over a
> statistical model. Perhaps a way to get started would be to explore the
> pros and cons of each approach, and think about what a hybrid model
> could achieve. It would be good to join the IRC channel to discuss all
> this with the mentors.
>
> Another good way to get started (and it would help you do the above
> too) would be to integrate the tokeniser from apertium-iii into
> apertium-jpn:
> https://github.com/apertium/apertium-jpn
>
> You would need to modify the Makefile.am, the modes.xml file, drop in
> the tokeniser script, and that's about it? Then see if you can get it
> to analyse text without spaces (test it first with the same text,
> hand-tokenised, to see what the output is). Again, come to IRC for
> guidance.
>
> The tokeniser.py script is a bit slow, mainly because of Python string
> processing. Rewriting it in C/C++ would be useful, and also a good way
> to get a better handle on how it works.
>
> --
> Jonathan
Re: [Apertium-stuff] GSOC2023
Hi Eiji-san,

There's also the tokeniser used for Nuosu, which uses the transducer itself to tokenise:
https://github.com/apertium/apertium-iii

I believe this is a later implementation of what's described in the thesis sent by Kevin in [2].

This method has some downsides, but it also has some advantages over a statistical model. Perhaps a way to get started would be to explore the pros and cons of each approach, and think about what a hybrid model could achieve. It would be good to join the IRC channel to discuss all this with the mentors.

Another good way to get started (and it would help you do the above too) would be to integrate the tokeniser from apertium-iii into apertium-jpn:
https://github.com/apertium/apertium-jpn

You would need to modify the Makefile.am, the modes.xml file, drop in the tokeniser script, and that's about it? Then see if you can get it to analyse text without spaces (test it first with the same text, hand-tokenised, to see what the output is). Again, come to IRC for guidance.

The tokeniser.py script is a bit slow, mainly because of Python string processing. Rewriting it in C/C++ would be useful, and also a good way to get a better handle on how it works.

--
Jonathan

On Fri, Feb 24, 2023, 13:03 Eiji Miyamoto wrote:
> Thank you for your reply. The project seems cool to work on for
> GSOC2023, and I would like to participate. I reckon there are two tasks
> on the page; could you tell me where to start?
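The idea of combining a lexicon-driven tokeniser with a statistical model can be sketched roughly as follows. This is not the apertium-iii tokeniser.py script, whose internals are not described in this thread; it is a toy illustration in which a plain Python set stands in for the transducer, and the word counts in `COUNTS` are invented. The lexicon side enumerates every segmentation it licenses, and a smoothed unigram score then picks one.

```python
import math
from functools import lru_cache


def all_segmentations(text, lexicon):
    """Return every way to split `text` into lexicon entries, as tuples."""

    @lru_cache(maxsize=None)
    def from_pos(i):
        if i == len(text):
            return [()]  # exactly one way to segment the empty remainder
        results = []
        for j in range(i + 1, len(text) + 1):
            piece = text[i:j]
            if piece in lexicon:
                results.extend((piece,) + rest for rest in from_pos(j))
        return results

    return from_pos(0)


def score(segmentation, counts):
    """Sum of add-one-smoothed log relative frequencies (a unigram model)."""
    total = sum(counts.values()) + len(counts)
    return sum(math.log((counts.get(tok, 0) + 1) / total) for tok in segmentation)


# Toy data, invented for illustration only.
COUNTS = {"この": 50, "先": 20, "先生": 30, "生きのこる": 5, "きのこる": 0, "には": 40}
TEXT = "この先生きのこるには"

candidates = all_segmentations(TEXT, set(COUNTS))
best = max(candidates, key=lambda seg: score(seg, COUNTS))
print(candidates)
print(best)
```

With these toy counts the lexicon yields two candidate segmentations, and the score prefers この / 先 / 生きのこる / には over この / 先生 / きのこる / には. Enumerating every path is exponential in the worst case, so a real implementation would more likely search the lattice with dynamic programming.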
Re: [Apertium-stuff] GSOC2023
Thank you for your reply. The project seems cool to work on for GSOC2023, and I would like to participate. I reckon there are two tasks on the page; could you tell me where to start?

On Fri, 24 Feb 2023 at 08:20, Kevin Brubeck Unhammer wrote:
> > I'd like to participate in Google Summer of Code 2023 at Apertium.
> > In particular, I'm interested in adding a new language pair, and I am
> > thinking of adding Japanese-English as I speak Japanese. I previously
> > took an online summer school on natural language processing at Tokyo
> > University.
> > Could you tell me more about the project?
>
> Hi,
>
> Getting some support for Japanese would be great! I'm not sure if you
> saw the whole IRC discussion, but what we really need in that regard is
> support for the *tokenisation* step, where our regular methods[1] fail
> us, since the text might have no spaces and lots of
> tokenisation-ambiguity. There has been some prior work[2] and it's
> already listed as a potential GSoC project.
>
> Support for anything-Japanese depends on tokenisation. It's also a big
> enough job that it would qualify as a full GSoC project, so if you were
> hoping for jpn-eng in a summer you will be disappointed (but having a
> toy language pair to test with would help!). On the other hand, if we
> get good spaceless tokenisation we open up the possibility for not just
> Japanese, but Thai, Lao, Chinese etc. – and of course all those writing
> systems used before the invention of the space character :)
>
> regards,
> Kevin
>
> [1] https://wiki.apertium.org/wiki/LRLM
> [2] http://hdl.handle.net/10066/20002
> [3] https://wiki.apertium.org/wiki/Task_ideas_for_Google_Code-in/Tokenisation_for_spaceless_orthographies
Re: [Apertium-stuff] GSOC2023
> I'd like to participate in Google Summer of Code 2023 at Apertium.
> In particular, I'm interested in adding a new language pair, and I am
> thinking of adding Japanese-English as I speak Japanese. I previously
> took an online summer school on natural language processing at Tokyo
> University.
> Could you tell me more about the project?

Hi,

Getting some support for Japanese would be great! I'm not sure if you
saw the whole IRC discussion, but what we really need in that regard is
support for the *tokenisation* step, where our regular methods[1] fail
us, since the text might have no spaces and lots of
tokenisation-ambiguity. There has been some prior work[2] and it's
already listed as a potential GSoC project.

Support for anything-Japanese depends on tokenisation. It's also a big
enough job that it would qualify as a full GSoC project, so if you were
hoping for jpn-eng in a summer you will be disappointed (but having a
toy language pair to test with would help!). On the other hand, if we
get good spaceless tokenisation we open up the possibility for not just
Japanese, but Thai, Lao, Chinese etc. – and of course all those writing
systems used before the invention of the space character :)

regards,
Kevin

[1] https://wiki.apertium.org/wiki/LRLM
[2] http://hdl.handle.net/10066/20002
[3] https://wiki.apertium.org/wiki/Task_ideas_for_Google_Code-in/Tokenisation_for_spaceless_orthographies
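As a rough illustration of why left-to-right longest match (the method referenced in [1]) runs into trouble on spaceless text, here is a minimal sketch. It is a simplification under stated assumptions: a Python set stands in for the compiled dictionary, the function name `lrlm_tokenise` and the toy `LEXICON` are invented for this example, and the real analyser does not work on a plain word list.

```python
def lrlm_tokenise(text, lexicon):
    """Greedy left-to-right longest-match segmentation over a word set."""
    max_len = max(map(len, lexicon))
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until an entry matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No entry starts here: emit one character as an unknown token.
            tokens.append(text[i])
            i += 1
    return tokens


# Toy lexicon, invented for illustration only.
LEXICON = {"この", "先", "先生", "生きのこる", "きのこる", "には"}

print(lrlm_tokenise("この先生きのこるには", LEXICON))
```

On this input the greedy match commits to 先生 ("teacher") and returns この / 先生 / きのこる / には, although the lexicon also licenses この / 先 / 生きのこる / には ("to survive from here on"). Longest match never revisits an earlier choice, which is the kind of tokenisation ambiguity described above.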