-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22761/#review76313
-----------------------------------------------------------

./trunk/tika-translate/src/main/java/org/apache/tika/language/translate/JoshuaTranslator.java
<https://reviews.apache.org/r/22761/#comment123859>

    Chris, can you provide a sample configuration here? I am struggling to find what this should look like!


- Lewis McGibbney


On June 18, 2014, 10:04 p.m., Chris Mattmann wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22761/
> -----------------------------------------------------------
>
> (Updated June 18, 2014, 10:04 p.m.)
>
>
> Review request for tika.
>
>
> Bugs: tika-1343
>     https://issues.apache.org/jira/browse/tika-1343
>
>
> Repository: tika
>
>
> Description
> -------
>
> The Joshua Decoder toolkit is a BSD licensed Java-based statistical machine
> translation system hosted at GitHub:
>
> http://joshua-decoder.org/
>
> Joshua takes in corpora and trains models that can then be used to do
> language translation. Currently there is support for e.g., Spanish->English,
> Indian dialects->English, Chinese->English, and a few others.
>
> https://github.com/joshua-decoder/joshua/
>
> It would be nice to build a Tika Translator on top of Joshua. There are of
> course several issues with this:
>
> * the models are huge, so we'll need a separate package or Maven module,
> maybe tika-translate-joshua or something, to release the models, and we'll
> need to build the models. I just went through the process of building the
> Spanish->English one, and it still needs to be rebuilt b/c I did it wrong,
> but it took over a day
> * there is a configuration for Joshua, and so we need some way of passing
> that config into the Translator. Not sure of the best way to do this.
> * Joshua isn't in the Central repository. I've started a discussion on the
> Joshua lists about this:
> https://groups.google.com/forum/#!topic/joshua_support/9Y04miboUj0
>
> Anyhoo, I've got a working patch right now with hard-coded stuff, and a manual
> install into my Maven repo for brave souls out there that want to try it.
>
>
> Diffs
> -----
>
>   ./trunk/tika-translate/pom.xml 1603529
>   ./trunk/tika-translate/src/main/java/org/apache/tika/language/translate/JoshuaTranslator.java PRE-CREATION
>   ./trunk/tika-translate/src/test/java/org/apache/tika/language/translate/JoshuaTranslatorTest.java PRE-CREATION
>
> Diff: https://reviews.apache.org/r/22761/diff/
>
>
> Testing
> -------
>
> Ran through on my locally built Spanish->English corpus, built using
> http://joshua-decoder.org/data/fisher-callhome-corpus/
> My dataset isn't perfect, but it can do basic translations. Also wrote a unit
> test, part of the patch.
>
>
> Thanks,
>
> Chris Mattmann
>
>
