I found the reference for that 1,000,000 number a bit too late -- according to this more recent paper from Koehn, it's more like 15,000,000 tokens before NMT catches up with phrase-based MT, and that paper omits syntax-based MT.
https://arxiv.org/pdf/1706.03872.pdf

-John

On Sun, Jul 2, 2017 at 12:38 PM, John Hewitt <john...@seas.upenn.edu> wrote:

> I've talked with the ModernMT people; they're well aware that they're in a
> neural MT world, and they also know that there's a sizable market for
> non-neural MT solutions.
> To back this up -- Philipp Koehn gave a talk in March comparing
> phrase-based, syntax-based, and neural MT in low-resource settings, that
> is, when the amount of bilingual text to train on is small.
>
> Neural MT needs (if I remember correctly) about 1,000,000 tokens of
> training data to outpace syntax-based MT.
> Many language pairs (and, for that matter, domains within a single
> language pair) do not meet that requirement, and in those cases
> syntax-based MT performs best.
>
> That being said, there are some cool opportunities to combine neural and
> syntax-based MT. I can't necessarily commit the work hours right now, but
> I've worked with xnmt <https://github.com/neulab/xnmt>, an MIT-licensed
> neural MT library that is purpose-built to be highly modular. It may offer
> some good opportunities to make an ensemble system.
>
> On Sun, Jul 2, 2017 at 4:22 AM, Tommaso Teofili <tommaso.teof...@gmail.com>
> wrote:
>
>> I think it's interesting as it extends some features that Joshua also
>> has, it's open source, and it has good results compared with NMT.
>>
>> Tommaso
>>
>> On Sat, Jul 1, 2017 at 6:56 PM Suneel Marthi <suneel.mar...@gmail.com>
>> wrote:
>>
>> > Is this the latest/greatest paper around MT, @tommaso?
>> >
>> > On Sat, Jul 1, 2017 at 7:55 AM, Tommaso Teofili
>> > <tommaso.teof...@gmail.com> wrote:
>> >
>> > > I accidentally found the paper about MMT [1].
>> > >
>> > > [1]:
>> > > https://ufal.mff.cuni.cz/eamt2017/user-project-product-papers/papers/user/EAMT2017_paper_88.pdf
>> > >
>> > > On Thu, Dec 1, 2016 at 10:19 PM Mattmann, Chris A (3010)
>> > > <chris.a.mattm...@jpl.nasa.gov> wrote:
>> > >
>> > > > Guys, I want to point you at the DARPA D3M program:
>> > > >
>> > > > http://www.darpa.mil/program/data-driven-discovery-of-models
>> > > >
>> > > > I'm part of the Government Team for the program. This will be a good
>> > > > connection to have b/c it's focused on automatically doing model and
>> > > > code building for ML-based approaches.
>> > > >
>> > > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> > > > Chris Mattmann, Ph.D.
>> > > > Principal Data Scientist, Engineering Administrative Office (3010)
>> > > > Manager, Open Source Projects Formulation and Development Office (8212)
>> > > > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> > > > Office: 180-503E, Mailstop: 180-503
>> > > > Email: chris.a.mattm...@nasa.gov
>> > > > WWW: http://sunset.usc.edu/~mattmann/
>> > > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> > > > Director, Information Retrieval and Data Science Group (IRDS)
>> > > > Adjunct Associate Professor, Computer Science Department
>> > > > University of Southern California, Los Angeles, CA 90089 USA
>> > > > WWW: http://irds.usc.edu/
>> > > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> > > >
>> > > > On 12/1/16, 1:15 PM, "Matt Post" <p...@cs.jhu.edu> wrote:
>> > > >
>> > > > John,
>> > > >
>> > > > Thanks for sharing, this is really helpful. I didn't realize that
>> > > > Marcello was involved.
>> > > >
>> > > > I think we can identify with the NMT danger.
>> > > > I still think there is a big niche that deep learning approaches
>> > > > won't reach for a few years, until GPUs become super prevalent.
>> > > > Which is why I like ModernMT's approaches, which overlap with many
>> > > > of the things I've been thinking about. One thing I really like is
>> > > > their automatic context-switching approach. This is a great way to
>> > > > build general-purpose models, and I'd like to mimic it. I have some
>> > > > general ideas about how this should be implemented but am also
>> > > > looking into the literature here.
>> > > >
>> > > > matt
>> > > >
>> > > > > On Dec 1, 2016, at 1:46 PM, John Hewitt <john...@seas.upenn.edu> wrote:
>> > > > >
>> > > > > I had a few good conversations over dinner with this team at AMTA
>> > > > > in Austin in October.
>> > > > > They seem to be in the interesting position where their work is
>> > > > > good, but is in danger of being superseded by neural MT as they
>> > > > > come out of the gate.
>> > > > > Clearly, it has benefits over NMT, and is easier to adopt, but may
>> > > > > not be the winner over the long run.
>> > > > >
>> > > > > Here's the link
>> > > > > <https://amtaweb.org/wp-content/uploads/2016/11/MMT_Tutorial_FedericoTrombetti_wide-cover.pdf>
>> > > > > to their AMTA tutorial.
>> > > > >
>> > > > > -John
>> > > > >
>> > > > > On Thu, Dec 1, 2016 at 10:17 AM, Mattmann, Chris A (3010)
>> > > > > <chris.a.mattm...@jpl.nasa.gov> wrote:
>> > > > >
>> > > > >> Wow, seems like this kind of overlaps with BigTranslate as well.
>> > > > >> Thanks for passing along, Matt.
>> > > > >>
>> > > > >> On 12/1/16, 4:47 AM, "Matt Post" <p...@cs.jhu.edu> wrote:
>> > > > >>
>> > > > >> Just came across this, and it's really cool:
>> > > > >>
>> > > > >> https://github.com/ModernMT/MMT
>> > > > >>
>> > > > >> See the README for some great use cases. I'm surprised I'd never
>> > > > >> heard of this before, as it's EU-funded and associated with U
>> > > > >> Edinburgh.