-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22761/
-----------------------------------------------------------
Review request for tika.
Bugs: tika-1343
https://issues.apache.org/jira/browse/tika-1343
Repository: tika
Description
-------
The Joshua Decoder toolkit is a BSD-licensed, Java-based statistical machine
translation system hosted on GitHub:
http://joshua-decoder.org/
Joshua takes in corpora and trains models that can then be used for language
translation. Currently there is support for Spanish->English, Indian
dialects->English, Chinese->English, and a few others.
https://github.com/joshua-decoder/joshua/
It would be nice to build a Tika Translator on top of Joshua. There are of
course several issues with this:
* the models are huge, so we'll need a separate package or Maven module (maybe
tika-translate-joshua or something) to release the models, and we'll need to
build the models. I just went through the process of building the
Spanish->English one, and it took over a day; it still needs to be rebuilt
b/c I did it wrong.
* there is a configuration for Joshua, and so we need some way of passing that
config into the Translator. Not sure of the best way to do this.
* Joshua isn't in the Central repository. I've started a discussion on the
Joshua lists about this:
https://groups.google.com/forum/#!topic/joshua_support/9Y04miboUj0
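On the config question, one possible approach is to keep the Joshua settings
in a small properties file and load them before constructing the Translator.
The following is only a minimal sketch of that idea, not what the patch does:
the class name JoshuaSettings and both property keys are hypothetical, and
neither Tika nor Joshua defines them.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

/** Hypothetical holder for the settings a Joshua-backed Translator would need. */
final class JoshuaSettings {
    // Property keys invented for this sketch; neither Tika nor Joshua defines them.
    static final String CONFIG_KEY = "translator.joshua.config";
    static final String MODEL_DIR_KEY = "translator.joshua.modelDir";

    final String configPath; // path to the Joshua configuration file
    final String modelDir;   // directory holding the (large) trained model

    private JoshuaSettings(String configPath, String modelDir) {
        this.configPath = configPath;
        this.modelDir = modelDir;
    }

    /** Reads the settings from any Reader, e.g. a file or a classpath resource. */
    static JoshuaSettings load(Reader source) throws IOException {
        Properties props = new Properties();
        props.load(source);

        // The config path is required; fail early instead of hard-coding a default.
        String config = props.getProperty(CONFIG_KEY);
        if (config == null || config.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing property: " + CONFIG_KEY);
        }

        // The model directory is optional and falls back to the working directory.
        String modelDir = props.getProperty(MODEL_DIR_KEY, ".");
        return new JoshuaSettings(config.trim(), modelDir.trim());
    }
}
```

A Translator implementation could then take a JoshuaSettings instance in its
constructor, which would keep the paths out of the code and let the separately
packaged models live wherever the user installs them.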
Anyhoo, I've got a working patch right now with hard-coded values, and a manual
install into my Maven repo, for brave souls out there who want to try it.
Diffs
-----
Diff: https://reviews.apache.org/r/22761/diff/
Testing
-------
Ran the patch on my locally built Spanish->English corpus, built using
http://joshua-decoder.org/data/fisher-callhome-corpus/
My dataset isn't perfect, but it can do basic translations. I also wrote a unit
test, which is part of the patch.
Thanks,
Chris Mattmann