It wouldn't be hard to add some TMX-like features, no. There are some technical
challenges, though — for example, the current demo lets you add phrases, but
that doesn't affect the language model at all.
Ideally, we'd also allow people to add whole sentences, and would then run
John's fast_align implementation (with a saved model) to break down that new
sentence, and do proper incremental updating.
How do you image Lucene fitting into this?
> On Dec 1, 2016, at 9:22 AM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> really nice least of very useful features, thanks for this!
> One comment only on the translation memories one: as seen by one that had
> never heard about it, it sounds not too complicated to implement on top of
> current Joshua (with IR library like Apache Lucene), is my understanding
> correct ?
> My 2 cents,
> Il giorno mar 29 nov 2016 alle ore 04:08 Matt Post <p...@cs.jhu.edu
> <mailto:p...@cs.jhu.edu>> ha
>> One project I think could be interesting for Joshua's future is sketched
>> - Dynamic phrase tables. Joshua currently lets people add custom phrases
>> to the existing models that then get used. There is a research topic here
>> for how to make it better (particularly, how to set the weights of rules
>> that are added at runtime instead of learned from bitext), but it works
>> really well for adding words that are OOV (since it's always cheaper to use
>> the OOV). Here's a demo of how this works (this feature is included in the
>> language packs).
>> - Translation memories. There is a large commercial market (billions) for
>> tools called "translation memories", where translators are translating
>> documents, and the sentences get queried against their past translations
>> and matched in a fuzzy fashion. The big tool on the market for this is SDL
>> Trados <
>> I'm not talking about selling a product, but in a space that big, there
>> have got to be a lot of people who'd rather just run their own system, than
>> shell out for an expensive (and ugly) tool. So there is a big niche for an
>> open source tool, and currently nothing really filling it. The "dynamic
>> phrase table" feature above provides the beginnings of offering a TM
>> competitor, but one that is "seeded" with a regular statistical machine
>> translation model.
>> - Dynamic re-tuning. One thing that'd be *really* cool is to revamp the
>> tuning infrastructure in Joshua. The use-case I imagine is that Joshua
>> could sit on top of a large tuning set across diverse domains (e.g, formal
>> news, informal web logs, spoken dialogue, etc). You could then add new
>> phrases in sentences as above, which would get automatically aligned, and
>> then everything could be retuned at the user's request (or perhaps at
>> night). This way, when people added new data to their models, Joshua would
>> automatically find the best weights, either immediately or on some
>> schedule. There'd be less worry about bit rot.
>> - Data collection and sharing. Another cool idea would be to allow people
>> to easily send us data. If we get to a place where people are building
>> custom dynamic phrase tables, a cool ability would be to make it easy for
>> people to upload the data they have added to their private systems, which
>> we could then collect and further distribute. So Joshua could become an
>> easy means for people to crowdsource data used for translation systems.
>> This is obviously just a high-level idea that would require a lot of
>> details to be figured out, but it would be super cool.