Re: [Apertium-stuff] gsoc2023 proposal

2023-03-18 Thread Eiji Miyamoto
Thank you for your feedback! I will make my proposal more detailed and
revise some of the weekly goals too.

On Sat, 18 Mar 2023 at 10:05, Kevin Brubeck Unhammer 
wrote:

> > Hello, I have finished my first draft and I would love to get any feedback
> > from potential mentors.
> > https://wiki.apertium.org/wiki/User:Eiji
>
> Hi,
>
> This looks promising :) Some thoughts:
>
> You've already made kind of an overview of the possibilities in your
> proposal; I would tone down the "investigate possibilities" parts and
> instead try to focus on how you're going to implement one of the
> methods, using apertium-jpn as a testbed.
>
> Try to set clear deliverables for each week, or at least every other
> week. You should have something like a proof of concept by week 2,
> especially if your ambition is to also work on improving the Japanese
> language data. You currently have week 6 for testing, but you should be
> testing from the start, alongside the coding. I would probably plan for
> 2 weeks for converting the PoC from Python to C++ and making it usable
> as a part of the pipeline.
>
> (Think about how this will be integrated into apertium – we have a
> translation pipeline which expects a certain format
> https://wiki.apertium.org/wiki/Apertium_stream_format )
>
> best regards,
> Kevin Brubeck Unhammer
___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff


Re: [Apertium-stuff] gsoc2023 proposal

2023-03-18 Thread Kevin Brubeck Unhammer
> Hello, I have finished my first draft and I would love to get any feedback
> from potential mentors.
> https://wiki.apertium.org/wiki/User:Eiji

Hi,

This looks promising :) Some thoughts:

You've already made kind of an overview of the possibilities in your
proposal; I would tone down the "investigate possibilities" parts and
instead try to focus on how you're going to implement one of the
methods, using apertium-jpn as a testbed.

Try to set clear deliverables for each week, or at least every other
week. You should have something like a proof of concept by week 2,
especially if your ambition is to also work on improving the Japanese
language data. You currently have week 6 for testing, but you should be
testing from the start, alongside the coding. I would probably plan for
2 weeks for converting the PoC from Python to C++ and making it usable
as a part of the pipeline.

(Think about how this will be integrated into apertium – we have a
translation pipeline which expects a certain format
https://wiki.apertium.org/wiki/Apertium_stream_format )
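
To make the target format concrete, here is a minimal Python sketch of
emitting tokeniser output as lexical units in the stream format. It
assumes a unit looks like ^surface/analysis1/analysis2$, that an unknown
word is marked with a leading asterisk, and that reserved characters in
the surface form are backslash-escaped. The helper names, the set of
escaped characters and the tags in the usage note are assumptions for
illustration, so check them against the wiki page.

```python
# Assumed set of reserved characters; the wiki page has the full list.
RESERVED = set("^$/<>{}[]\\@")


def escape(text):
    """Backslash-escape characters assumed to be reserved in the stream."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)


def lexical_unit(surface, analyses):
    """Build ^surface/analysis/...$ for one token.

    `analyses` is a list of ready-made analysis strings such as "a<n>".
    An empty list marks the token as unknown: ^surface/*surface$.
    """
    escaped = escape(surface)
    if not analyses:
        return "^{0}/*{0}$".format(escaped)
    return "^" + "/".join([escaped] + list(analyses)) + "$"


def to_stream(tokens):
    """Concatenate lexical units for a list of (surface, analyses) pairs."""
    return "".join(lexical_unit(surface, analyses) for surface, analyses in tokens)
```

For example, to_stream([("a", ["a<n>"]), ("b", [])]) gives
^a/a<n>$^b/*b$, where <n> is a made-up tag.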

best regards,
Kevin Brubeck Unhammer 







Re: [Apertium-stuff] GSOC2023

2023-03-01 Thread Eiji Miyamoto
Okay, thank you. I will work on the integration and then try to do other
tasks!


On Wed, 1 Mar 2023 at 14:56, Daniel Swanson 
wrote:

> You're certainly welcome to submit pull requests on the Japanese
> repository, but due to the tokenization problems, that probably
> shouldn't be your entire coding challenge, since we also need to see
> that you can work on that aspect of the project.
>
> Daniel

Re: [Apertium-stuff] GSOC2023

2023-03-01 Thread Daniel Swanson
You're certainly welcome to submit pull requests on the Japanese
repository, but due to the tokenization problems, that probably
shouldn't be your entire coding challenge, since we also need to see
that you can work on that aspect of the project.

Daniel

On Wed, Mar 1, 2023 at 9:49 AM Eiji Miyamoto  wrote:
>
> Hello, I am thinking of working on the integration of apertium-iii into 
> apertium-jpn, as Jonathan-san suggested. Do I need language data for it? I 
> have already installed the dev tools locally.
>
> Also, I’ve found an issue in apertium-jpn; should I work on it as 
> something like a coding challenge?
>
> Cheers,
>
> *Sorry for the inconvenience of asking by email. IRC is behaving strangely 
> for my account at the moment.


Re: [Apertium-stuff] GSOC2023

2023-03-01 Thread Eiji Miyamoto
Hello, I am thinking of working on the integration of apertium-iii into
apertium-jpn, as Jonathan-san suggested. Do I need language data for it?
I have already installed the dev tools locally.

Also, I’ve found an issue in apertium-jpn; should I work on it as
something like a coding challenge?

Cheers,

*Sorry for the inconvenience of asking by email. IRC is behaving strangely
for my account at the moment.

On Mon, 27 Feb 2023 at 01:08, Jonathan Washington <
jonathan.n.washing...@gmail.com> wrote:

> Hi Eijisan,
>
> There's also the tokeniser used for Nuosu, which uses the transducer
> itself to tokenise:
> https://github.com/apertium/apertium-iii
>
> I believe this is a later implementation of what's described in the thesis
> sent by Kevin in [2].
>
> This method has some downsides, but it also has some advantages over a
> statistical model.  Perhaps a way to get started would be to explore the
> pros and cons of each approach, and think about what a hybrid model could
> achieve.  It would be good to join the IRC channel to discuss all this with
> the mentors.
>
> Another good way to get started (and it would help you do the above too)
> would be to integrate the tokeniser from apertium-iii into apertium-jpn:
> https://github.com/apertium/apertium-jpn
>
> You would need to modify the Makefile.am, the modes.xml file, drop in the
> tokeniser script, and that's about it?  Then see if you can get it to
> analyse text without spaces (test it first with the same text,
> hand-tokenised, to see what the output is).  Again, come to IRC for
> guidance.
>
> The tokeniser.py script is a bit slow, mainly because of Python string
> processing.  Rewriting it in C/C++ would be useful, and also a good way to
> get a better handle on how it works.
>
> --
> Jonathan


Re: [Apertium-stuff] GSOC2023

2023-02-26 Thread Jonathan Washington
Hi Eijisan,

There's also the tokeniser used for Nuosu, which uses the transducer itself
to tokenise:
https://github.com/apertium/apertium-iii

I believe this is a later implementation of what's described in the thesis
sent by Kevin in [2].

This method has some downsides, but it also has some advantages over a
statistical model.  Perhaps a way to get started would be to explore the
pros and cons of each approach, and think about what a hybrid model could
achieve.  It would be good to join the IRC channel to discuss all this with
the mentors.

Another good way to get started (and it would help you do the above too)
would be to integrate the tokeniser from apertium-iii into apertium-jpn:
https://github.com/apertium/apertium-jpn

You would need to modify the Makefile.am, the modes.xml file, drop in the
tokeniser script, and that's about it?  Then see if you can get it to
analyse text without spaces (test it first with the same text,
hand-tokenised, to see what the output is).  Again, come to IRC for
guidance.
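
One way to run the comparison suggested above is to pull the lexical
units out of both analyser outputs and look for the first mismatch. A
minimal Python sketch, assuming the analyser prints units as ^...$ in the
Apertium stream format; the function names are invented for illustration,
and the pattern does not handle an escaped backslash directly before a
caret.

```python
import re

# A lexical unit runs from an unescaped ^ to the next unescaped $.
UNIT = re.compile(r"(?<!\\)\^((?:\\.|[^\\$])*)\$")


def units(stream):
    """Return the contents of every ^...$ lexical unit, in order."""
    return UNIT.findall(stream)


def first_difference(spaceless, hand_tokenised):
    """Return (index, unit_a, unit_b) for the first mismatch, or None.

    Text between units (such as the spaces in the hand-tokenised input)
    is ignored, so only the tokenisation and analyses are compared.
    """
    a, b = units(spaceless), units(hand_tokenised)
    for i in range(max(len(a), len(b))):
        unit_a = a[i] if i < len(a) else None
        unit_b = b[i] if i < len(b) else None
        if unit_a != unit_b:
            return i, unit_a, unit_b
    return None
```

If first_difference returns None, the tokeniser split the spaceless text
the same way as the hand-tokenised version; otherwise the index shows
where the two start to diverge.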

The tokeniser.py script is a bit slow, mainly because of Python string
processing.  Rewriting it in C/C++ would be useful, and also a good way to
get a better handle on how it works.

--
Jonathan


On Fri, Feb 24, 2023, 13:03 Eiji Miyamoto  wrote:

> Thank you for your reply. The project seems cool to work on for GSOC2023,
> and I would like to participate. I reckon there are two tasks on the
> page; could you tell me where to start?


Re: [Apertium-stuff] GSOC2023

2023-02-24 Thread Eiji Miyamoto
Thank you for your reply. The project seems cool to work on for GSOC2023,
and I would like to participate. I reckon there are two tasks on the
page; could you tell me where to start?

On Fri, 24 Feb 2023 at 08:20, Kevin Brubeck Unhammer 
wrote:

> > I'd like to participate in Google Summer of Code 2023 at Apertium.
> > In particular, I'm interested in adding a new language pair, and I am
> > thinking of adding Japanese-English as I speak Japanese. I previously
> > took an online summer school at Tokyo University on natural language
> > processing.
> > Could you tell me more about the project?
>
> Hi,
>
> Getting some support for Japanese would be great! I'm not sure if you
> saw the whole IRC discussion, but what we really need in that regard is
> support for the *tokenisation* step, where our regular methods[1] fail
> us, since the text might have no spaces and lots of
> tokenisation-ambiguity. There has been some prior work[2] and it's
> already listed as a potential GSoC project[3].
>
> Support for anything-Japanese depends on tokenisation. It's also a big
> enough job that it would qualify as a full GSoC project, so if you were
> hoping for jpn-eng in a summer you will be disappointed (but having a
> toy language pair to test with would help!). On the other hand, if we
> get good spaceless tokenisation we open up the possibility for not just
> Japanese, but Thai, Lao, Chinese etc. – and of course all those writing
> systems used before the invention of the space character :)
>
> regards,
> Kevin
>
> [1] https://wiki.apertium.org/wiki/LRLM
> [2] http://hdl.handle.net/10066/20002
> [3] https://wiki.apertium.org/wiki/Task_ideas_for_Google_Code-in/Tokenisation_for_spaceless_orthographies


Re: [Apertium-stuff] GSOC2023

2023-02-24 Thread Kevin Brubeck Unhammer
> I'd like to participate in Google Summer of Code 2023 at Apertium.
> In particular, I'm interested in adding a new language pair, and I am
> thinking of adding Japanese-English as I speak Japanese. I previously
> took an online summer school at Tokyo University on natural language
> processing.
> Could you tell me more about the project?

Hi,

Getting some support for Japanese would be great! I'm not sure if you
saw the whole IRC discussion, but what we really need in that regard is
support for the *tokenisation* step, where our regular methods[1] fail
us, since the text might have no spaces and lots of
tokenisation-ambiguity. There has been some prior work[2] and it's
already listed as a potential GSoC project[3].
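
To illustrate why greedy matching struggles on spaceless text, here is a
toy Python sketch that compares a left-to-right longest-match split with
the full set of dictionary-consistent segmentations. The word set stands
in for the transducer and is invented; a real analyser works on
finite-state transducers, not Python sets, and the function names are
hypothetical.

```python
def lrlm(text, lexicon):
    """Greedy left-to-right longest match.

    At each position take the longest lexicon entry; if nothing matches,
    emit a single character as an unknown token and move on.
    """
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # nothing matched: emit one character as unknown
        out.append(text[i:j])
        i = j
    return out


def segmentations(text, lexicon):
    """Every way to split text into lexicon entries.

    Plain recursion, exponential in the worst case; fine for a toy.
    """
    if not text:
        return [[]]
    result = []
    for j in range(1, len(text) + 1):
        head = text[:j]
        if head in lexicon:
            result.extend([head] + rest for rest in segmentations(text[j:], lexicon))
    return result
```

With the lexicon {"東", "東京", "京都"}, lrlm("東京都", lexicon) commits to
東京 and is left with an unmatched 都, while segmentations finds the
split 東 + 京都. With a larger lexicon the second function returns
several candidates, which is the ambiguity a tokeniser has to resolve.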

Support for anything-Japanese depends on tokenisation. It's also a big
enough job that it would qualify as a full GSoC project, so if you were
hoping for jpn-eng in a summer you will be disappointed (but having a
toy language pair to test with would help!). On the other hand, if we
get good spaceless tokenisation we open up the possibility for not just
Japanese, but Thai, Lao, Chinese etc. – and of course all those writing
systems used before the invention of the space character :)

regards,
Kevin

[1] https://wiki.apertium.org/wiki/LRLM
[2] http://hdl.handle.net/10066/20002 
[3] https://wiki.apertium.org/wiki/Task_ideas_for_Google_Code-in/Tokenisation_for_spaceless_orthographies

