Following up on the report topic, I've created an Overleaf doc for everyone who's interested in working on this [1].
First set of (AL-2 compatible) NMT toolkits I've found:
- Joey NMT [2]
- OpenNMT [3]
- MarianNMT [4]
- Sockeye [5]
- and of course RTG, already shared by Thamme [6]

Regards,
Tommaso

[1] : https://www.overleaf.com/8617554857qkvtqtpcxxmw
[2] : https://github.com/joeynmt/joeynmt
[3] : https://github.com/OpenNMT
[4] : https://github.com/marian-nmt/marian
[5] : https://github.com/awslabs/sockeye
[6] : https://github.com/isi-nlp/rtg-xt

On Wed, 14 Oct 2020 at 11:06, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:

> Very good idea, Thamme!
> I'd be up for writing such a short survey paper as a result of our analysis.
>
> Regards,
> Tommaso
>
>
> On Wed, 14 Oct 2020 at 05:23, Thamme Gowda <tgow...@gmail.com> wrote:
>
>> Tommaso and others,
>>
>>> I think we may now go into a research phase to understand what existing toolkit we can more easily integrate with.
>>
>> Agreed.
>> If we can write a (short) report that compares the various NMT toolkits of 2020, it would be useful for us in making this decision, as well as to the NMT community.
>> Something like a survey paper on NMT research, but focused on the toolkits and software engineering part.
>>
>>
>> On Fri, Oct 9, 2020 at 11:39 PM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
>>
>>> Thamme, Jeff,
>>>
>>> your contributions will be very important for the project and the community, especially given your NLP background; thanks for your support!
>>>
>>> I agree that moving towards NMT is the best thing to do at this point for Joshua.
>>>
>>> Thamme, thanks for your suggestions!
>>> I think we may now go into a research phase to understand what existing toolkit we can more easily integrate with.
>>> Of course, if you'd like to integrate your own toolkit, it would be even more interesting to see how it compares to others.
>>>
>>> If that means moving to Python, I think it's not a problem; we can still work on Java bindings to ship a new Joshua Decoder implementation.
>>>
>>> The pretrained models topic is, imho, something we will have to embrace at some point, so that others can:
>>> a) just download new LPs
>>> b) eventually fine-tune with their own data
>>>
>>> I am looking forward to starting this new phase of research on Joshua.
>>>
>>> Regards,
>>> Tommaso
>>>
>>> On Tue, 6 Oct 2020 at 18:30, Jeff Zemerick <jzemer...@apache.org> wrote:
>>>
>>>> I haven't contributed up to this point, but I would like to see Apache Joshua remain an active project, so I am volunteering to help. I may not be a lot of help with code for a bit, but I will help out with documentation, releases, etc.
>>>>
>>>> I do agree that NMT is the best path forward, but I will leave the choice of integrating an existing library into Joshua versus a new NMT implementation in Joshua to those more familiar with the code and what they think is best for the project.
>>>>
>>>> Jeff
>>>>
>>>>
>>>> On Tue, Oct 6, 2020 at 2:28 AM Thamme Gowda <tgow...@gmail.com> wrote:
>>>>
>>>>> Hi Tommaso, and others,
>>>>>
>>>>> *1. I support the addition of a neural MT decoder.*
>>>>> The world has moved on, and NMT is clearly the way to go forward.
>>>>> If you don't believe my words, read what Matt Post himself said [1] three years ago!
>>>>>
>>>>> I have spent the past three years focusing on NMT as part of my job and Ph.D. -- I'd be glad to contribute in that direction.
>>>>> There are many NMT toolkits out there today (Fairseq, Sockeye, tensor2tensor, ...).
>>>>>
>>>>> The right thing to do, IMHO, is to simply merge one of the NMT toolkits with the Joshua project. We can do that as long as it's Apache-licensed, right?
>>>>> We will now have to move towards Python land, as most toolkits are in Python. On the positive side, we will be losing the ancient Perl scripts that many are not fans of.
>>>>>
>>>>> I have been working on my own NMT toolkit for my work and research -- RTG: https://isi-nlp.github.io/rtg/#conf
>>>>> I had worked on Joshua in the past; mainly, I improved the code quality [2]. So you can tell my new code'd be up to Apache's standards ;)
>>>>>
>>>>> *2. Pretrained MT models for lots of languages*
>>>>> I have been working on a lib to retrieve parallel data from many sources -- MTData [3].
>>>>> There is so much parallel data out there for hundreds of languages.
>>>>> My recent estimate is that over a billion lines of parallel sentences in over 500 languages are freely and publicly available for download using the MTData tool.
>>>>> If we find some sponsors to lend us some resources, we could train better MT models and update the Language Packs section [4].
>>>>> Perhaps one massively multilingual NMT model that supports many translation directions (I know it's possible with NMT; I tested it recently with RTG).
>>>>>
>>>>> I am interested in hearing what others are thinking.
>>>>>
>>>>> [1] - https://mail-archives.apache.org/mod_mbox/joshua-dev/201709.mbox/%3CA481E867-A845-4BC0-B5AF-5CEAAB3D0B7D%40cs.jhu.edu%3E
>>>>> [2] - https://github.com/apache/joshua/pulls?q=author%3Athammegowda+
>>>>> [3] - https://github.com/thammegowda/mtdata
>>>>> [4] - https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
>>>>>
>>>>>
>>>>> Cheers,
>>>>> TG
>>>>>
>>>>> --
>>>>> *Thamme Gowda*
>>>>> @thammegowda <https://twitter.com/thammegowda> | https://isi.edu/~tg
>>>>> ~Sent via somebody's Webmail server
>>>>>
>>>>>
>>>>> On Mon, Oct 5, 2020 at 12:16 AM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> This is a roll call for people interested in contributing to Apache Joshua going forward.
>>>>>> Contributing could be not just code, but anything that may help the project or serve the community.
>>>>>>
>>>>>> In case you're interested in helping out, please speak up :-)
>>>>>>
>>>>>> Code-wise, Joshua has not evolved much in recent months; there's room for both improvements to the current code (make a new minor release) and new ideas / code branches (e.g. a neural MT based Joshua Decoder).
>>>>>>
>>>>>> Regards,
>>>>>> Tommaso