I've also added M2M-100 from FB-AI [1].

Regarding desiderata, here's an unsorted list of the first things that come to my mind:
- also runs on the JVM
- low resource requirements (e.g. for training)
- can leverage existing / pretrained models (see the sketch below)
- word and phrase translation capabilities
- good effectiveness :)
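On the pretrained-models point, here is a minimal sketch of what consuming such a model could look like, via the Hugging Face transformers port of M2M-100. The model id "facebook/m2m100_418M" and the exact classes follow the transformers documentation as I recall it, so treat this as illustrative rather than tested:

```python
# Minimal sketch (untested): translating Italian -> English with a
# pretrained M2M-100 checkpoint via Hugging Face transformers.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "it"  # source language
encoded = tokenizer("Buongiorno a tutti!", return_tensors="pt")
# Force the decoder to start with the English language token.
generated = model.generate(**encoded,
                           forced_bos_token_id=tokenizer.get_lang_id("en"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

Whether something like this can also satisfy the "runs on the JVM" desideratum is a separate question; it would probably need JVM bindings or a small service wrapper.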
Regards,
Tommaso

[1] : https://about.fb.com/news/2020/10/first-multilingual-machine-translation-model/

On Mon, 19 Oct 2020 at 14:09, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:

> Thanks a lot Thamme, I stuck to AL-2 compatible ones, but I agree we can
> surely have a look at others with different licensing too.
> In the meantime I've added all of your suggestions to the paper (with
> related references when available).
> We should decide what our desiderata are and establish a first set of eval
> benchmarks just to understand what can work for us, at least initially;
> then we can have a more thorough evaluation with a small number of
> "candidates".
>
> Regards,
> Tommaso
>
> On Mon, 19 Oct 2020 at 09:05, Thamme Gowda <tgow...@gmail.com> wrote:
>
>> Tommaso,
>>
>> Awesome! Thanks for the links.
>> I will be happy to join (but I won't be able to contribute to the actual
>> paper until Oct 24).
>>
>> I suggest we consider popular NMT toolkits for the survey regardless of
>> their compatibility with AL-2.
>> We should see all the tricks and features, and know whether we are missing
>> out on any useful features after enforcing the AL-2 filter (and create
>> issues for adding those features).
>>
>> Here are some more NMT toolkits to include in the survey:
>> - Fairseq https://github.com/pytorch/fairseq
>> - Tensor2tensor https://github.com/tensorflow/tensor2tensor/
>> - Nematus https://github.com/EdinburghNLP/nematus
>> - xNMT https://github.com/neulab/xnmt
>> - XLM https://github.com/facebookresearch/XLM/
>>   |-> MASS https://github.com/microsoft/MASS/
>>       --> https://github.com/thammegowda/unmass (took that and made it
>>       easier to install and use)
>>
>> Some old stuff which we are definitely not going to use, but worth
>> mentioning in the survey (for the sake of completeness):
>> - https://github.com/google/seq2seq
>> - https://github.com/tensorflow/nmt
>> - https://github.com/isi-nlp/Zoph_RNN
>>
>> Cheers,
>> TG
>>
>> On Sun, 18 Oct 2020 at 11:17 PM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
>>
>> > Following up on the report topic, I've created an overleaf doc for
>> > everyone who's interested in working on this [1].
>> >
>> > First set of (AL-2 compatible) NMT toolkits I've found:
>> > - Joey NMT [2]
>> > - OpenNMT [3]
>> > - MarianNMT [4]
>> > - Sockeye [5]
>> > - and of course RTG, already shared by Thamme [6]
>> >
>> > Regards,
>> > Tommaso
>> >
>> > [1] : https://www.overleaf.com/8617554857qkvtqtpcxxmw
>> > [2] : https://github.com/joeynmt/joeynmt
>> > [3] : https://github.com/OpenNMT
>> > [4] : https://github.com/marian-nmt/marian
>> > [5] : https://github.com/awslabs/sockeye
>> > [6] : https://github.com/isi-nlp/rtg-xt
>> >
>> > On Wed, 14 Oct 2020 at 11:06, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
>> >
>> > > Very good idea Thamme!
>> > > I'd be up for writing such a short survey paper as a result of our
>> > > analysis.
>> > >
>> > > Regards,
>> > > Tommaso
>> > >
>> > > On Wed, 14 Oct 2020 at 05:23, Thamme Gowda <tgow...@gmail.com> wrote:
>> > >
>> > >> Tommaso and others,
>> > >>
>> > >> > I think we may now go into a research phase to understand what
>> > >> > existing toolkit we can more easily integrate with.
>> > >>
>> > >> Agreed.
>> > >> If we can write a (short) report that compares the various NMT
>> > >> toolkits of 2020, it would be useful for us in making this decision,
>> > >> as well as to the NMT community.
>> > >> Something like a survey paper on NMT research, but focused on the
>> > >> toolkit and software engineering part.
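On the "first set of eval benchmarks" point above: one way to keep the comparison uniform across candidate toolkits is to score each one's output with sacreBLEU. A minimal sketch; the corpus_bleu call follows the sacrebleu API as I recall it, and the file names are placeholders:

```python
# Minimal sketch: scoring one toolkit's output against references with
# sacreBLEU. "hyps.txt" and "refs.txt" are placeholder file names,
# one sentence per line.
import sacrebleu

with open("hyps.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("refs.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# corpus_bleu takes the system output and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```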
>> > >> On Fri, 9 Oct 2020 at 11:39 PM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
>> > >>
>> > >> > Thamme, Jeff,
>> > >> >
>> > >> > your contributions will be very important for the project and the
>> > >> > community, especially given your NLP background. Thanks for your
>> > >> > support!
>> > >> >
>> > >> > I agree that moving towards NMT is the best thing to do for Joshua
>> > >> > at this point.
>> > >> >
>> > >> > Thamme, thanks for your suggestions!
>> > >> > I think we may now go into a research phase to understand what
>> > >> > existing toolkit we can more easily integrate with.
>> > >> > Of course, if you'd like to integrate your own toolkit, it would be
>> > >> > even more interesting to see how it compares to the others.
>> > >> >
>> > >> > If that means moving to Python, I think it's not a problem; we can
>> > >> > still work on Java bindings to ship a new Joshua Decoder
>> > >> > implementation.
>> > >> >
>> > >> > The pretrained models topic is imho something we will have to
>> > >> > embrace at some point, so that others can:
>> > >> > a) just download new LPs
>> > >> > b) eventually fine-tune them with their own data
>> > >> >
>> > >> > I am looking forward to starting this new phase of research on
>> > >> > Joshua.
>> > >> >
>> > >> > Regards,
>> > >> > Tommaso
>> > >> >
>> > >> > On Tue, 6 Oct 2020 at 18:30, Jeff Zemerick <jzemer...@apache.org> wrote:
>> > >> >
>> > >> > > I haven't contributed to this point, but I would like to see
>> > >> > > Apache Joshua remain an active project, so I am volunteering to
>> > >> > > help. I may not be a lot of help with code for a bit, but I will
>> > >> > > help out with documentation, releases, etc.
>> > >> > >
>> > >> > > I do agree that NMT is the best path forward, but I will leave
>> > >> > > the choice of integrating an existing library into Joshua versus
>> > >> > > a new NMT implementation in Joshua to those more familiar with
>> > >> > > the code and what they think is best for the project.
>> > >> > >
>> > >> > > Jeff
>> > >> > >
>> > >> > > On Tue, Oct 6, 2020 at 2:28 AM Thamme Gowda <tgow...@gmail.com> wrote:
>> > >> > >
>> > >> > > > Hi Tommaso, and others,
>> > >> > > >
>> > >> > > > *1. I support the addition of a neural MT decoder.*
>> > >> > > > The world has moved on, and NMT is clearly the way forward.
>> > >> > > > If you don't believe my words, read what Matt Post himself
>> > >> > > > said [1] three years ago!
>> > >> > > >
>> > >> > > > I have spent the past three years focusing on NMT as part of
>> > >> > > > my job and Ph.D. -- I'd be glad to contribute in that
>> > >> > > > direction.
>> > >> > > > There are many NMT toolkits out there today (Fairseq, Sockeye,
>> > >> > > > Tensor2tensor, ...).
>> > >> > > >
>> > >> > > > The right thing to do, IMHO, is simply to merge one of the NMT
>> > >> > > > toolkits into the Joshua project. We can do that as long as
>> > >> > > > it's Apache License, right?
>> > >> > > > We will now have to move towards Python land, as most toolkits
>> > >> > > > are in Python. On the positive side, we will be losing the
>> > >> > > > ancient Perl scripts that many are not fans of.
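On the Python-vs-JVM point in the quoted emails above: one possible route (a sketch, not a decision) is to keep the NMT toolkit in Python and expose it behind a small service that a Java-side Joshua Decoder could call. A stdlib-only illustration, where translate() is a placeholder for whichever toolkit we end up choosing:

```python
# Minimal sketch: a tiny HTTP translation service that a JVM client could
# call. Python stdlib only; translate() is a placeholder, not a real model.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def translate(text: str) -> str:
    # Placeholder: a real implementation would run an NMT model here.
    return text.upper()

class TranslateHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"translation": translate(payload["text"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), TranslateHandler).serve_forever()
```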
>> > >> > > > I have been working on my own NMT toolkit for my work and
>> > >> > > > research -- RTG: https://isi-nlp.github.io/rtg/#conf
>> > >> > > > I worked on Joshua in the past; mainly, I improved the code
>> > >> > > > quality [2]. So you can tell my new code'd be up to Apache's
>> > >> > > > standards ;)
>> > >> > > >
>> > >> > > > *2. Pretrained MT models for lots of languages*
>> > >> > > > I have been working on a lib to retrieve parallel data from
>> > >> > > > many sources -- MTData [3].
>> > >> > > > There is so much parallel data out there for hundreds of
>> > >> > > > languages.
>> > >> > > > My recent estimate is that over a billion lines of parallel
>> > >> > > > sentences, for over 500 languages, are freely and publicly
>> > >> > > > available for download using the MTData tool.
>> > >> > > > If we find some sponsors to lend us some resources, we could
>> > >> > > > train better MT models and update the Language Packs
>> > >> > > > section [4].
>> > >> > > > Perhaps one massively multilingual NMT model that supports
>> > >> > > > many translation directions (I know it's possible with NMT; I
>> > >> > > > tested it recently with RTG).
>> > >> > > >
>> > >> > > > I am interested in hearing what others are thinking.
>> > >> > > >
>> > >> > > > [1] https://mail-archives.apache.org/mod_mbox/joshua-dev/201709.mbox/%3CA481E867-A845-4BC0-B5AF-5CEAAB3D0B7D%40cs.jhu.edu%3E
>> > >> > > > [2] https://github.com/apache/joshua/pulls?q=author%3Athammegowda+
>> > >> > > > [3] https://github.com/thammegowda/mtdata
>> > >> > > > [4] https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
>> > >> > > >
>> > >> > > > Cheers,
>> > >> > > > TG
>> > >> > > >
>> > >> > > > --
>> > >> > > > *Thamme Gowda*
>> > >> > > > @thammegowda <https://twitter.com/thammegowda> | https://isi.edu/~tg
>> > >> > > > ~Sent via somebody's Webmail server
>> > >> > > >
>> > >> > > > On Mon, 5 Oct 2020 at 12:16 AM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
>> > >> > > >
>> > >> > > > > Hi all,
>> > >> > > > >
>> > >> > > > > This is a roll call for people interested in contributing to
>> > >> > > > > Apache Joshua going forward.
>> > >> > > > > Contributing could be not just code, but anything that may
>> > >> > > > > help the project or serve the community.
>> > >> > > > >
>> > >> > > > > In case you're interested in helping out, please speak up :-)
>> > >> > > > >
>> > >> > > > > Code-wise, Joshua has not evolved much in the latest months;
>> > >> > > > > there's room for both improvements to the current code (make
>> > >> > > > > a new minor release) and new ideas / code branches (e.g. a
>> > >> > > > > neural-MT-based Joshua Decoder).
>> > >> > > > >
>> > >> > > > > Regards,
>> > >> > > > > Tommaso
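A closing note on the "one massively multilingual NMT model" idea quoted above: the usual trick (from Google's multilingual NMT work) is to train a single model on the concatenation of all language pairs, prepending a target-language tag to every source sentence. A rough data-preparation sketch; the tag format, pair list, and file names are my own assumptions:

```python
# Minimal sketch: building training files for one multilingual model by
# prepending a target-language tag to each source sentence.
corpora = [  # (src_lang, tgt_lang, [(src_sentence, tgt_sentence), ...])
    ("deu", "eng", [("Guten Morgen", "Good morning")]),
    ("fra", "eng", [("Bonjour", "Good morning")]),
    ("eng", "deu", [("Good morning", "Guten Morgen")]),
]

with open("train.src", "w", encoding="utf-8") as src_f, \
     open("train.tgt", "w", encoding="utf-8") as tgt_f:
    for src_lang, tgt_lang, pairs in corpora:
        for src, tgt in pairs:
            # The <2xxx> tag tells the model which language to produce.
            src_f.write(f"<2{tgt_lang}> {src}\n")
            tgt_f.write(tgt + "\n")
```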