Hi Michael,

nice to hear from you on the dev@ list too! We're looking forward to seeing you involved :)

If I understood Thamme's proposal correctly, the paper is just a way to write down our own evaluation of current approaches to NMT; that would help us decide how we want to pursue MT.
At this stage I am not sure what we'll end up doing. It'd be nice not to be just a wrapper for one of the existing NMT tools, but let's see.

Regards,
Tommaso

On Tue, 20 Oct 2020 at 15:37, Michael Wall <mjw...@apache.org> wrote:

> Hi,
>
> Been watching Joshua since it was incubating. Finally may have some free time and would like to get involved.
>
> The NMT stuff looks interesting. I don't have an overleaf account, so maybe my next question is answered there. What is the end result of the paper? Will you be choosing a framework to add to Joshua? And if so, what will make it different from just using said framework on its own?
>
> Thanks
>
> Mike
>
> On Tue, Oct 20, 2020 at 5:34 AM Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> >
> > I've also added M2M-100 from FB-AI [1].
> >
> > Regarding desiderata, here's an unsorted list of the first things that come to my mind:
> > - runs also on jvm
> > - low resource requirements (e.g. for training)
> > - can leverage existing / pretrained models
> > - word and phrase translation capabilities
> > - good effectiveness :)
> >
> > Regards,
> > Tommaso
> >
> > [1] : https://about.fb.com/news/2020/10/first-multilingual-machine-translation-model/
> >
> > On Mon, 19 Oct 2020 at 14:09, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> >
> > > Thanks a lot Thamme, I stuck to AL-2 compatible ones, but I agree we can surely have a look at others having different licensing too.
> > > In the meantime I've added all of your suggestions to the paper (with related reference when available).
> > > We should decide what our desiderata are and establish a first set of eval benchmarks just to understand what can work for us, at least initially; then we can have a more thorough evaluation with a small number of "candidates".
> > >
> > > Regards,
> > > Tommaso
> > >
> > > On Mon, 19 Oct 2020 at 09:05, Thamme Gowda <tgow...@gmail.com> wrote:
> > >
> > >> Tommaso,
> > >>
> > >> Awesome! Thanks for the links.
> > >> I will be happy to join (but I won't be able to contribute to the actual paper until Oct 24).
> > >>
> > >> I suggest we consider popular NMT toolkits for the survey regardless of their compatibility with AL-2.
> > >> We should see all the tricks and features, and know if we are missing out on any useful features after enforcing the AL-2 filter (and create issues for adding those features).
> > >>
> > >> Here are some more NMT toolkits to be included in the survey:
> > >> - Fairseq https://github.com/pytorch/fairseq
> > >> - Tensor2tensor https://github.com/tensorflow/tensor2tensor/
> > >> - Nematus https://github.com/EdinburghNLP/nematus
> > >> - xNMT https://github.com/neulab/xnmt
> > >> - XLM https://github.com/facebookresearch/XLM/
> > >>   |-> MASS https://github.com/microsoft/MASS/ --> https://github.com/thammegowda/unmass (took that and made it easier to install and use)
> > >>
> > >> Some old stuff which we are definitely not going to use but is worth mentioning in the survey (for the sake of completeness):
> > >> - https://github.com/google/seq2seq
> > >> - https://github.com/tensorflow/nmt
> > >> - https://github.com/isi-nlp/Zoph_RNN
> > >>
> > >> Cheers,
> > >> TG
> > >>
> > >> On Sun, Oct 18, 2020 at 11:17 PM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> > >>
> > >> > Following up on the report topic, I've created an overleaf doc for everyone who's interested in working on this [1].
> > >> >
> > >> > First set of (AL-2 compatible) NMT toolkits I've found:
> > >> > - Joey NMT [2]
> > >> > - OpenNMT [3]
> > >> > - MarianNMT [4]
> > >> > - Sockeye [5]
> > >> > - and of course RTG already shared by Thamme [6]
> > >> >
> > >> > Regards,
> > >> > Tommaso
> > >> >
> > >> > [1] : https://www.overleaf.com/8617554857qkvtqtpcxxmw
> > >> > [2] : https://github.com/joeynmt/joeynmt
> > >> > [3] : https://github.com/OpenNMT
> > >> > [4] : https://github.com/marian-nmt/marian
> > >> > [5] : https://github.com/awslabs/sockeye
> > >> > [6] : https://github.com/isi-nlp/rtg-xt
> > >> >
> > >> > On Wed, 14 Oct 2020 at 11:06, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> > >> >
> > >> > > Very good idea Thamme!
> > >> > > I'd be up for writing such a short survey paper as a result of our analysis.
> > >> > >
> > >> > > Regards,
> > >> > > Tommaso
> > >> > >
> > >> > > On Wed, 14 Oct 2020 at 05:23, Thamme Gowda <tgow...@gmail.com> wrote:
> > >> > >
> > >> > >> Tommaso and others,
> > >> > >>
> > >> > >> > I think we may now go into a research phase to understand what existing toolkit we can more easily integrate with.
> > >> > >> Agreed.
> > >> > >> If we can write a (short) report that compares various NMT toolkits of 2020, it would be useful for us to make this decision as well as to the NMT community.
> > >> > >> Something like a survey paper on NMT research, but focused on toolkits and the software engineering part.
> > >> > >>
> > >> > >> On Fri, Oct 9, 2020 at 11:39 PM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> > >> > >>
> > >> > >> > Thamme, Jeff,
> > >> > >> >
> > >> > >> > your contributions will be very important for the project and the community, especially given your NLP background, thanks for your support!
> > >> > >> >
> > >> > >> > I agree moving towards NMT is the best thing to do at this point for Joshua.
> > >> > >> >
> > >> > >> > Thamme, thanks for your suggestions!
> > >> > >> > I think we may now go into a research phase to understand what existing toolkit we can more easily integrate with.
> > >> > >> > Of course if you'd like to integrate your own toolkit then it'd be even more interesting to see how it compares to others.
> > >> > >> >
> > >> > >> > If that means moving to Python I think it's not a problem; we can still work on Java bindings to ship a new Joshua Decoder implementation.
> > >> > >> >
> > >> > >> > The pretrained models topic is imho something we will have to embrace at some point, so that others can:
> > >> > >> > a) just download new LPs
> > >> > >> > b) eventually fine tune with their own data
> > >> > >> >
> > >> > >> > I am looking forward to starting this new phase of research on Joshua.
> > >> > >> >
> > >> > >> > Regards,
> > >> > >> > Tommaso
> > >> > >> >
> > >> > >> > On Tue, 6 Oct 2020 at 18:30, Jeff Zemerick <jzemer...@apache.org> wrote:
> > >> > >> >
> > >> > >> > > I haven't contributed to this point but I would like to see Apache Joshua remain an active project so I am volunteering to help.
> > >> > >> > > I may not be a lot of help with code for a bit but I will help out with documentation, releases, etc.
> > >> > >> > >
> > >> > >> > > I do agree that NMT is the best path forward but I will leave the choice of integrating an existing library into Joshua versus a new NMT implementation in Joshua to those more familiar with the code and what they think is best for the project.
> > >> > >> > >
> > >> > >> > > Jeff
> > >> > >> > >
> > >> > >> > > On Tue, Oct 6, 2020 at 2:28 AM Thamme Gowda <tgow...@gmail.com> wrote:
> > >> > >> > >
> > >> > >> > > > Hi Tommaso, and others
> > >> > >> > > >
> > >> > >> > > > *1. I support the addition of a neural MT decoder.*
> > >> > >> > > > The world has moved on, and NMT is clearly the way to go forward.
> > >> > >> > > > If you don't believe my words, read what Matt Post himself said [1] three years ago!
> > >> > >> > > >
> > >> > >> > > > I have spent the past three years focusing on NMT as part of my job and Ph.D -- I'd be glad to contribute in that direction.
> > >> > >> > > > There are many NMT toolkits out there today. (Fairseq, sockeye, tensor2tensor, ....)
> > >> > >> > > >
> > >> > >> > > > The right thing to do, IMHO, is simply merge one of the NMT toolkits with the Joshua project. We can do that as long as it's under the Apache License, right?
> > >> > >> > > > We will now have to move towards Python land as most toolkits are in Python. On the positive side, we will be losing the ancient Perl scripts that many are not fans of.
> > >> > >> > > >
> > >> > >> > > > I have been working on my own NMT toolkit for my work and research -- RTG https://isi-nlp.github.io/rtg/#conf
> > >> > >> > > > I had worked on Joshua in the past; mainly, I improved the code quality [2]. So you can tell my new code would be up to Apache's standards ;)
> > >> > >> > > >
> > >> > >> > > > *2. Pretrained MT models for lots of languages*
> > >> > >> > > > I have been working on a lib to retrieve parallel data from many sources -- MTData [3]
> > >> > >> > > > There is so much parallel data out there for hundreds of languages.
> > >> > >> > > > My recent estimate is that over a billion lines of parallel sentences for over 500 languages are freely and publicly available for download using the MTData tool.
> > >> > >> > > > If we find some sponsors to lend us some resources, we could train better MT models and update the Language Packs section [4].
> > >> > >> > > > Perhaps one massively multilingual NMT model that supports many translation directions (I know it's possible with NMT; I tested it recently with RTG).
> > >> > >> > > >
> > >> > >> > > > I am interested in hearing what others are thinking.
> > >> > >> > > >
> > >> > >> > > > [1] https://mail-archives.apache.org/mod_mbox/joshua-dev/201709.mbox/%3CA481E867-A845-4BC0-B5AF-5CEAAB3D0B7D%40cs.jhu.edu%3E
> > >> > >> > > > [2] - https://github.com/apache/joshua/pulls?q=author%3Athammegowda+
> > >> > >> > > > [3] - https://github.com/thammegowda/mtdata
> > >> > >> > > > [4] - https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
> > >> > >> > > >
> > >> > >> > > > Cheers,
> > >> > >> > > > TG
> > >> > >> > > >
> > >> > >> > > > --
> > >> > >> > > > *Thamme Gowda*
> > >> > >> > > > @thammegowda <https://twitter.com/thammegowda> | https://isi.edu/~tg
> > >> > >> > > > ~Sent via somebody's Webmail server
> > >> > >> > > >
> > >> > >> > > > On Mon, Oct 5, 2020 at 12:16 AM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> > >> > >> > > >
> > >> > >> > > > > Hi all,
> > >> > >> > > > >
> > >> > >> > > > > This is a roll call for people interested in contributing to Apache Joshua going forward.
> > >> > >> > > > > Contributions need not be just code; anything that may help the project or serve the community counts.
> > >> > >> > > > >
> > >> > >> > > > > In case you're interested in helping out please speak up :-)
> > >> > >> > > > >
> > >> > >> > > > > Code-wise Joshua has not evolved much in recent months; there's room for both improvements to the current code (make a new minor release) and new ideas / code branches (e.g. neural MT based Joshua Decoder).
> > >> > >> > > > >
> > >> > >> > > > > Regards,
> > >> > >> > > > > Tommaso