Re: NMT survey (was: Roll Cal)

Michael Wall Tue, 20 Oct 2020 06:37:40 -0700

Hi,

I've been watching Joshua since it was incubating.  I may finally have some
free time and would like to get involved.


The NMT stuff looks interesting.  I don't have an Overleaf account, so
maybe my next question is answered there.  What is the end result of
the paper?  Will you be choosing a framework to add to Joshua?  And if
so, what will make it different from just using said framework on its
own?

Thanks

Mike

On Tue, Oct 20, 2020 at 5:34 AM Tommaso Teofili
<tommaso.teof...@gmail.com> wrote:
>
> I've also added M2M-100 from FB-AI [1].
>
> Regarding desiderata, here's an unsorted list of the first things that come
> to mind:
> - also runs on the JVM
> - low resource requirements (e.g. for training)
> - can leverage existing / pretrained models
> - word and phrase translation capabilities
> - good effectiveness :)
>
> Regards,
> Tommaso
>
> [1] :
> https://about.fb.com/news/2020/10/first-multilingual-machine-translation-model/
>
> On Mon, 19 Oct 2020 at 14:09, Tommaso Teofili <tommaso.teof...@gmail.com>
> wrote:
>
> > Thanks a lot Thamme, I stuck to AL-2 compatible ones, but I agree we can
> > surely have a look at others with different licensing too.
> > In the meantime I've added all of your suggestions to the paper (with
> > related references when available).
> > We should decide what our desiderata are and establish a first set of eval
> > benchmarks just to understand what can work for us, at least initially, then
> > we can have a more thorough evaluation with a small number of "candidates".
> >
> > Regards,
> > Tommaso
> >
> > On Mon, 19 Oct 2020 at 09:05, Thamme Gowda <tgow...@gmail.com> wrote:
> >
> >> Tommaso,
> >>
> >> Awesome! Thanks for the links.
> >> I will be happy to join (but I won't be able to contribute to the actual
> >> paper until Oct 24).
> >>
> >> I suggest we consider popular NMT toolkits for the survey regardless
> >> of their compatibility with AL-2.
> >> We should see all the tricks and features, and know if we are missing out
> >> on any useful features after enforcing the AL-2 filter (and create issues
> >> for adding those features).
> >>
> >> Here are some more NMT toolkits to include in the survey:
> >> - Fairseq https://github.com/pytorch/fairseq
> >> - Tensor2tensor https://github.com/tensorflow/tensor2tensor/
> >> - Nematus  https://github.com/EdinburghNLP/nematus
> >> - xNMT https://github.com/neulab/xnmt
> >> - XLM   https://github.com/facebookresearch/XLM/
> >>     |-> MASS  https://github.com/microsoft/MASS/  -->
> >>         https://github.com/thammegowda/unmass  (I took MASS and made it
> >>         easier to install and use)
> >>
> >> Some older toolkits that we are definitely not going to use but are worth
> >> mentioning in the survey (for the sake of completeness):
> >> - https://github.com/google/seq2seq
> >> - https://github.com/tensorflow/nmt
> >> - https://github.com/isi-nlp/Zoph_RNN
> >>
> >>
> >>
> >> Cheers,
> >> TG
> >>
> >>
> >> On Sun, Oct 18, 2020 at 11:17 PM Tommaso Teofili
> >> <tommaso.teof...@gmail.com> wrote:
> >>
> >> > Following up on the report topic, I've created an Overleaf doc for
> >> > everyone who's interested in working on this [1].
> >> >
> >> > First set of (AL-2 compatible) NMT toolkits I've found:
> >> > - Joey NMT [2]
> >> > - OpenNMT [3]
> >> > - MarianNMT [4]
> >> > - Sockeye [5]
> >> > - and of course RTG already shared by Thamme [6]
> >> >
> >> > Regards,
> >> > Tommaso
> >> >
> >> > [1] : https://www.overleaf.com/8617554857qkvtqtpcxxmw
> >> > [2] : https://github.com/joeynmt/joeynmt
> >> > [3] : https://github.com/OpenNMT
> >> > [4] : https://github.com/marian-nmt/marian
> >> > [5] : https://github.com/awslabs/sockeye
> >> > [6] : https://github.com/isi-nlp/rtg-xt
> >> >
> >> > On Wed, 14 Oct 2020 at 11:06, Tommaso Teofili
> >> > <tommaso.teof...@gmail.com> wrote:
> >> >
> >> > > Very good idea, Thamme!
> >> > > I'd be up for writing such a short survey paper as a result of our
> >> > > analysis.
> >> > >
> >> > > Regards,
> >> > > Tommaso
> >> > >
> >> > >
> >> > > On Wed, 14 Oct 2020 at 05:23, Thamme Gowda <tgow...@gmail.com> wrote:
> >> > >
> >> > >> Tommaso and others,
> >> > >>
> >> > >> > I think we may now go into a research phase to understand what
> >> > >> > existing toolkit we can more easily integrate with.
> >> > >> Agreed.
> >> > >> If we can write a (short) report that compares the various NMT
> >> > >> toolkits of 2020, it would be useful both for us in making this
> >> > >> decision and for the NMT community.
> >> > >> Something like a survey paper on NMT research, but focused on
> >> > >> toolkits and the software engineering side.
> >> > >>
> >> > >>
> >> > >>
> >> > >> On Fri, Oct 9, 2020 at 11:39 PM Tommaso Teofili
> >> > >> <tommaso.teof...@gmail.com> wrote:
> >> > >>
> >> > >> > Thamme, Jeff,
> >> > >> >
> >> > >> > your contributions will be very important for the project and the
> >> > >> > community, especially given your NLP background, thanks for your
> >> > >> > support!
> >> > >> >
> >> > >> > I agree moving towards NMT is the best thing to do at this point
> >> > >> > for Joshua.
> >> > >> >
> >> > >> > Thamme, thanks for your suggestions!
> >> > >> > I think we may now go into a research phase to understand what
> >> > >> > existing toolkit we can more easily integrate with.
> >> > >> > Of course if you'd like to integrate your own toolkit then it'd be
> >> > >> > even more interesting to see how it compares to others.
> >> > >> >
> >> > >> > If that means moving to Python I think it's not a problem; we can
> >> > >> > still work on Java bindings to ship a new Joshua Decoder
> >> > >> > implementation.
> >> > >> >
> >> > >> > The pretrained models topic is imho something we will have to
> >> > >> > embrace at some point, so that others can:
> >> > >> > a) just download new LPs
> >> > >> > b) eventually fine tune with their own data
> >> > >> >
> >> > >> > I am looking forward to starting this new phase of research on
> >> > >> > Joshua.
> >> > >> >
> >> > >> > Regards,
> >> > >> > Tommaso
> >> > >> >
> >> > >> > On Tue, 6 Oct 2020 at 18:30, Jeff Zemerick <jzemer...@apache.org>
> >> > >> > wrote:
> >> > >> >
> >> > >> > > I haven't contributed to this point but I would like to see
> >> > >> > > Apache Joshua remain an active project so I am volunteering to
> >> > >> > > help. I may not be a lot of help with code for a bit but I will
> >> > >> > > help out with documentation, releases, etc.
> >> > >> > >
> >> > >> > > I do agree that NMT is the best path forward but I will leave the
> >> > >> > > choice of integrating an existing library into Joshua versus a new
> >> > >> > > NMT implementation in Joshua to those more familiar with the code
> >> > >> > > and what they think is best for the project.
> >> > >> > >
> >> > >> > > Jeff
> >> > >> > >
> >> > >> > >
> >> > >> > > On Tue, Oct 6, 2020 at 2:28 AM Thamme Gowda <tgow...@gmail.com>
> >> > >> > > wrote:
> >> > >> > >
> >> > >> > > > Hi Tommaso, and others
> >> > >> > > >
> >> > >> > > > *1.  I support the addition of a neural MT decoder.*
> >> > >> > > > The world has moved on, and NMT is clearly the way forward.
> >> > >> > > > If you don't believe my words, read what Matt Post himself said
> >> > >> > > > [1] three years ago!
> >> > >> > > >
> >> > >> > > > I have spent the past three years focusing on NMT as part of my
> >> > >> > > > job and Ph.D. -- I'd be glad to contribute in that direction.
> >> > >> > > > There are many NMT toolkits out there today (Fairseq, Sockeye,
> >> > >> > > > tensor2tensor, ...).
> >> > >> > > >
> >> > >> > > > The right thing to do, IMHO, is simply to merge one of the NMT
> >> > >> > > > toolkits with the Joshua project.  We can do that as long as it's
> >> > >> > > > Apache licensed, right?
> >> > >> > > > We will now have to move towards Python land, as most toolkits
> >> > >> > > > are in Python. On the positive side, we will be losing the
> >> > >> > > > ancient Perl scripts that many are not fans of.
> >> > >> > > >
> >> > >> > > > I have been working on my own NMT toolkit for my work and
> >> > >> > > > research -- RTG https://isi-nlp.github.io/rtg/#conf
> >> > >> > > > I worked on Joshua in the past; mainly, I improved the code
> >> > >> > > > quality [2]. So you can tell my new code would be up to Apache's
> >> > >> > > > standards ;)
> >> > >> > > >
> >> > >> > > > *2. Pretrained MT models for lots of languages*
> >> > >> > > > I have been working on a lib to retrieve parallel data from many
> >> > >> > > > sources -- MTData [3]
> >> > >> > > > There is so much parallel data out there for hundreds of
> >> > >> > > > languages.
> >> > >> > > > My recent estimate is that over a billion lines of parallel
> >> > >> > > > sentences for over 500 languages are freely and publicly
> >> > >> > > > available for download using the MTData tool.
> >> > >> > > > If we find some sponsors to lend us some resources, we could
> >> > >> > > > train better MT models and update the Language Packs section [4].
> >> > >> > > > Perhaps one massively multilingual NMT model that supports many
> >> > >> > > > translation directions (I know it's possible with NMT; I tested
> >> > >> > > > it recently with RTG).
> >> > >> > > >
> >> > >> > > > I am interested in hearing what others are thinking.
> >> > >> > > >
> >> > >> > > > [1] -
> >> > >> > > > https://mail-archives.apache.org/mod_mbox/joshua-dev/201709.mbox/%3CA481E867-A845-4BC0-B5AF-5CEAAB3D0B7D%40cs.jhu.edu%3E
> >> > >> > > > [2] - https://github.com/apache/joshua/pulls?q=author%3Athammegowda+
> >> > >> > > > [3] - https://github.com/thammegowda/mtdata
> >> > >> > > > [4] - https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
> >> > >> > > >
> >> > >> > > >
> >> > >> > > > Cheers,
> >> > >> > > > TG
> >> > >> > > >
> >> > >> > > > --
> >> > >> > > > *Thamme Gowda *
> >> > >> > > > @thammegowda <https://twitter.com/thammegowda> |
> >> > >> https://isi.edu/~tg
> >> > >> > > > ~Sent via somebody's Webmail server
> >> > >> > > >
> >> > >> > > >
> >> > >> > > > On Mon, Oct 5, 2020 at 12:16 AM Tommaso Teofili
> >> > >> > > > <tommaso.teof...@gmail.com> wrote:
> >> > >> > > >
> >> > >> > > > > Hi all,
> >> > >> > > > >
> >> > >> > > > > This is a roll call for people interested in contributing to
> >> > >> > > > > Apache Joshua going forward.
> >> > >> > > > > Contributing could be not just code, but anything that may
> >> > >> > > > > help the project or serve the community.
> >> > >> > > > >
> >> > >> > > > > In case you're interested in helping out please speak up :-)
> >> > >> > > > >
> >> > >> > > > > Code-wise Joshua has not evolved much in recent months;
> >> > >> > > > > there's room for both improvements to the current code (make
> >> > >> > > > > a new minor release) and new ideas / code branches (e.g. a
> >> > >> > > > > neural MT based Joshua Decoder).
> >> > >> > > > >
> >> > >> > > > > Regards,
> >> > >> > > > > Tommaso
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> > >
> >> >
> >>
> >
