[ 
https://issues.apache.org/jira/browse/TIKA-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310456#comment-14310456
 ] 

Chris A. Mattmann commented on TIKA-1343:
-----------------------------------------

Hi Lewis, the current status is the following:

I have an example of running large-scale translations with OODT and Joshua 
here: https://github.com/chrismattmann/xdata-employment. I am cleaning this up 
for a SIGIR paper on large-scale machine translation. It is a framework built 
on OODT, Solr, Tika and https://github.com/chrismattmann/etllib/ that lets you 
switch easily between MT implementations. Current support covers: 1) Joshua 
(you have to bring your own trained model); 2) Moses (you have to bring your 
own trained model); and 3) API-based translation through Bing Translate, 
Google Translate and Lingo24. 

It needs a little cleanup, but it's all there. 

As for this implementation, the current patch I put up uses the Joshua Java 
API directly. After thinking about this and running Joshua at scale with OODT, 
I realized we need to make this talk to JoshuaServer instead. Moses has the 
same capability in general: it runs as a server endpoint. So we should create 
something like a NetworkTranslator base class and have Joshua and Moses 
subclass it. I think it should be a REST-based endpoint. 
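
To make the NetworkTranslator idea concrete, here is a rough sketch of how such 
a base class could look. Everything in it is an assumption for illustration: 
the class names, the endpoint URL, the query parameter names (q, source, 
target) and the plain-text response format are invented here and are not the 
actual JoshuaServer or Moses server protocol, nor the code in the patch. It 
also uses java.net.http.HttpClient (Java 11+) only to keep the sketch 
self-contained.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

/** Hypothetical base class holding what any server-backed translator shares. */
abstract class NetworkTranslator {
    // Assumed endpoint, e.g. "http://localhost:8080/translate"
    private final String endpoint;

    protected NetworkTranslator(String endpoint) {
        this.endpoint = endpoint;
    }

    /** Builds the request URI. The parameter names are invented for this sketch. */
    URI buildUri(String text, String source, String target) {
        String query = "q=" + enc(text) + "&source=" + enc(source) + "&target=" + enc(target);
        return URI.create(endpoint + "?" + query);
    }

    /** Each subclass extracts the translation from its own server's reply. */
    protected abstract String parseResponse(String body);

    /** Sends the text to the server and returns the parsed translation. */
    public String translate(String text, String source, String target)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(buildUri(text, source, target)).GET().build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Translation server returned HTTP " + response.statusCode());
        }
        return parseResponse(response.body());
    }

    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}

/** Assumes (for illustration only) that the Joshua server replies with plain text. */
class JoshuaNetworkTranslator extends NetworkTranslator {
    JoshuaNetworkTranslator(String endpoint) {
        super(endpoint);
    }

    @Override
    protected String parseResponse(String body) {
        return body.trim();
    }
}

/** Placeholder: a real Moses subclass would parse whatever format its server returns. */
class MosesNetworkTranslator extends NetworkTranslator {
    MosesNetworkTranslator(String endpoint) {
        super(endpoint);
    }

    @Override
    protected String parseResponse(String body) {
        return body.trim();
    }
}
```

With this shape, only the response parsing (and, if needed, the request 
building) differs per engine, and the Tika side never has to load a model into 
its own JVM.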

> Create a Tika Translator implementation that uses JoshuaDecoder
> ---------------------------------------------------------------
>
>                 Key: TIKA-1343
>                 URL: https://issues.apache.org/jira/browse/TIKA-1343
>             Project: Tika
>          Issue Type: New Feature
>          Components: translation
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>
> The Joshua Decoder toolkit is a BSD-licensed, Java-based statistical machine 
> translation system hosted at GitHub:
> http://joshua-decoder.org/
> Joshua takes in corpora and trains models that can then be used to do 
> language translation. Currently there is support for, e.g., Spanish->English, 
> Indian dialects->English, Chinese->English, and a few others. 
> https://github.com/joshua-decoder/joshua/
> It would be nice to build a Tika Translator on top of Joshua. There are of 
> course several issues with this:
> * the models are huge - so we'll need a separate package or Maven module, 
> maybe tika-translate-joshua or something to release the models and we'll need 
> to build the models. I just went through the process of building the 
> Spanish->English one, and it still needs to be rebuilt b/c I did it wrong, 
> but it took over a day
> * there is a configuration for Joshua, and so we need some way of passing 
> that config into the Translator. Not sure of the best way to do this.
> * Joshua isn't in the Central repository. I've started a discussion on the 
> Joshua lists about this: 
> https://groups.google.com/forum/#!topic/joshua_support/9Y04miboUj0
> Anyhoo, I've got a working patch right now with hard-coded stuff, and a 
> manual install into my Maven repo for brave souls out there who want to try it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
