Very good idea, Thamme! I'd be up for writing such a short survey paper as a result of our analysis.

Regards,
Tommaso

On Wed, 14 Oct 2020 at 05:23, Thamme Gowda <tgow...@gmail.com> wrote:

> Tommaso and others,
>
> > I think we may now go into a research phase to understand what existing
> > toolkit we can more easily integrate with.
>
> Agreed.
> If we can write a (short) report that compares the various NMT toolkits of
> 2020, it would be useful to us in making this decision, as well as to the
> NMT community.
> Something like a survey paper on NMT research, but focused on toolkits and
> the software engineering part.
>
> On Fri, Oct 9, 2020 at 11:39 PM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
>
> > Thamme, Jeff,
> >
> > your contributions will be very important for the project and the
> > community, especially given your NLP background, thanks for your support!
> >
> > I agree moving towards NMT is the best thing to do at this point for
> > Joshua.
> >
> > Thamme, thanks for your suggestions!
> > I think we may now go into a research phase to understand what existing
> > toolkit we can more easily integrate with.
> > Of course if you'd like to integrate your own toolkit then it'd be even
> > more interesting to see how it compares to others.
> >
> > If that means moving to Python I think it's not a problem; we can still
> > work on Java bindings to ship a new Joshua Decoder implementation.
> >
> > The pretrained models topic is imho something we will have to embrace at
> > some point, so that others can:
> > a) just download new LPs
> > b) eventually fine-tune with their own data
> >
> > I am looking forward to starting this new phase of research on Joshua.
> >
> > Regards,
> > Tommaso
> >
> > On Tue, 6 Oct 2020 at 18:30, Jeff Zemerick <jzemer...@apache.org> wrote:
> >
> > > I haven't contributed to this point but I would like to see Apache Joshua
> > > remain an active project so I am volunteering to help. I may not be a lot
> > > of help with code for a bit but I will help out with documentation,
> > > releases, etc.
> > >
> > > I do agree that NMT is the best path forward but I will leave the choice of
> > > integrating an existing library into Joshua versus a new NMT implementation
> > > in Joshua to those more familiar with the code and what they think is best
> > > for the project.
> > >
> > > Jeff
> > >
> > >
> > > On Tue, Oct 6, 2020 at 2:28 AM Thamme Gowda <tgow...@gmail.com> wrote:
> > >
> > > > Hi Tommaso, and others
> > > >
> > > > *1. I support the addition of a neural MT decoder.*
> > > > The world has moved on, and NMT is clearly the way to go forward.
> > > > If you don't believe my words, read what Matt Post himself said [1] three
> > > > years ago!
> > > >
> > > > I have spent the past three years focusing on NMT as part of my job and
> > > > Ph.D -- I'd be glad to contribute in that direction.
> > > > There are many NMT toolkits out there today. (Fairseq, sockeye,
> > > > tensor2tensor, ....)
> > > >
> > > > The right thing to do, IMHO, is simply to merge one of the NMT toolkits with
> > > > the Joshua project. We can do that as long as it's under the Apache License, right?
> > > > We will now have to move towards Python land as most toolkits are in
> > > > Python. On the positive side, we will be losing the ancient Perl scripts
> > > > that many are not fans of.
> > > >
> > > > I have been working on my own NMT toolkit for my work and research -- RTG
> > > > https://isi-nlp.github.io/rtg/#conf
> > > > I have worked on Joshua in the past; mainly, I improved the code quality
> > > > [2]. So you can tell my new code would be up to Apache's standards ;)
> > > >
> > > > *2. Pretrained MT models for lots of languages*
> > > > I have been working on a lib to retrieve parallel data from many sources --
> > > > MTData [3]
> > > > There is so much parallel data out there for hundreds of languages.
> > > > My recent estimate is that over a billion lines of parallel sentences for over
> > > > 500 languages are freely and publicly available for download using the MTData
> > > > tool.
> > > > If we find some sponsors to lend us some resources, we could train better
> > > > MT models and update the Language Packs section [4].
> > > > Perhaps one massively multilingual NMT model that supports many
> > > > translation directions (I know it's possible with NMT; I tested it recently
> > > > with RTG).
> > > >
> > > > I am interested in hearing what others are thinking.
> > > >
> > > > [1] - https://mail-archives.apache.org/mod_mbox/joshua-dev/201709.mbox/%3CA481E867-A845-4BC0-B5AF-5CEAAB3D0B7D%40cs.jhu.edu%3E
> > > > [2] - https://github.com/apache/joshua/pulls?q=author%3Athammegowda+
> > > > [3] - https://github.com/thammegowda/mtdata
> > > > [4] - https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
> > > >
> > > >
> > > > Cheers,
> > > > TG
> > > >
> > > > --
> > > > *Thamme Gowda*
> > > > @thammegowda <https://twitter.com/thammegowda> | https://isi.edu/~tg
> > > > ~Sent via somebody's Webmail server
> > > >
> > > >
> > > > On Mon, Oct 5, 2020 at 12:16 AM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > This is a roll call for people interested in contributing to Apache Joshua
> > > > > going forward.
> > > > > Contributions need not be just code; anything that may help the project
> > > > > or serve the community counts.
> > > > >
> > > > > In case you're interested in helping out please speak up :-)
> > > > >
> > > > > Code-wise Joshua has not evolved much in recent months; there's room
> > > > > for both improvements to the current code (make a new minor release) and
> > > > > new ideas / code branches (e.g. neural MT based Joshua Decoder).
> > > > >
> > > > > Regards,
> > > > > Tommaso