[ https://issues.apache.org/jira/browse/TIKA-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026662#comment-14026662 ]
Nick Burch commented on TIKA-1319:
----------------------------------
Thinking about this, what about making the translator a wrapping
(decorating) parser? You initialise it with a language and a "real" parser
(e.g. the auto-detect parser), then call parse on it as normal. It gets the
real parser to handle the content, then calls out to the translation API for
the parts which need translating, and skips the parts which don't.
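
A rough sketch of the decorator idea, for illustration only. SimpleParser and
TranslatingParser are hypothetical stand-ins, not Tika's real Parser API, and
the translation call is modelled as a plain String-to-String function (which
would capture the target language); the set of translatable metadata keys is
also an assumption.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a parser: fills in metadata and returns body text.
interface SimpleParser {
    String parse(String input, Map<String, String> metadata);
}

// Decorator: delegates to the wrapped "real" parser, then translates
// only the parts that were selected for translation.
final class TranslatingParser implements SimpleParser {
    private final SimpleParser delegate;
    private final UnaryOperator<String> translator; // stand-in for the translation API
    private final Set<String> translatableKeys;     // metadata keys to translate

    TranslatingParser(SimpleParser delegate,
                      UnaryOperator<String> translator,
                      Set<String> translatableKeys) {
        this.delegate = delegate;
        this.translator = translator;
        this.translatableKeys = translatableKeys;
    }

    @Override
    public String parse(String input, Map<String, String> metadata) {
        // Let the real parser do the actual work first.
        String body = delegate.parse(input, metadata);
        // Translate only the selected metadata values; others are left alone.
        for (String key : translatableKeys) {
            metadata.computeIfPresent(key, (k, v) -> translator.apply(v));
        }
        return translator.apply(body);
    }
}
```

Callers would use the wrapper exactly like the parser it wraps, so existing
code would not need to know that translation is happening.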
That might let us control which metadata values get translated and which
don't, and translate some of the body content but not all of it (e.g. an XML
handler which sends the characters to the translator but not the tags).
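
The handler part could look something like the sketch below, which uses only
the standard org.xml.sax classes. TranslatingContentHandler is a hypothetical
name, and the translator is again just a String-to-String function standing in
for the real translation API. Text is buffered because SAX may deliver one run
of text in several characters() calls.

```java
import java.util.function.UnaryOperator;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Forwards tags unchanged and sends only character data to the translator.
final class TranslatingContentHandler extends DefaultHandler {
    private final ContentHandler downstream;
    private final UnaryOperator<String> translator; // stand-in for the translation API
    private final StringBuilder text = new StringBuilder();

    TranslatingContentHandler(ContentHandler downstream, UnaryOperator<String> translator) {
        this.downstream = downstream;
        this.translator = translator;
    }

    @Override
    public void startDocument() throws SAXException {
        downstream.startDocument();
    }

    @Override
    public void endDocument() throws SAXException {
        flushText();
        downstream.endDocument();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        flushText();
        downstream.startElement(uri, localName, qName, atts); // tag passes through as-is
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        flushText();
        downstream.endElement(uri, localName, qName); // tag passes through as-is
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // buffer until the next tag boundary
    }

    // Translate the buffered text (whitespace-only runs are forwarded untouched).
    private void flushText() throws SAXException {
        if (text.length() == 0) {
            return;
        }
        String original = text.toString();
        text.setLength(0);
        String out = original.trim().isEmpty() ? original : translator.apply(original);
        downstream.characters(out.toCharArray(), 0, out.length());
    }
}
```

Translating each text run on its own keeps the structure intact, at the cost
of giving the translator less context when a sentence spans inline tags.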
> Translation
> -----------
>
> Key: TIKA-1319
> URL: https://issues.apache.org/jira/browse/TIKA-1319
> Project: Tika
> Issue Type: New Feature
> Reporter: Tyler Palsulich
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.6
>
>
> I just opened up a review on reviews.apache.org --
> https://reviews.apache.org/r/22219/. I copied the description below.
> This patch adds basic language translation functionality to Tika. Translation
> is provided by a Microsoft API, but accessed through Apache 2 licensed
> com.memetix.microsoft-translator-java-api
> (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants
> to use the translation feature, they have to add a client id and client
> secret to the
> tika-core/src/main/resources/org/apache/tika/language/translator.properties
> file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added
> com.memetix as a dependency in tika-core. I put the Translator class in
> org.apache.tika.language. There is no integration with the server or CLI,
> yet. Further, only Strings are translated right now -- if you pass in a full
> document with XML tags, the structure will be mangled. Still, I think
> structure-aware translation would be a cool feature -- translate the body,
> title, subtitle, etc., but not the structural elements.
> There is still more work to do, but I wanted some more eyes on this to make
> sure I'm heading in the right direction and this is a desired feature. Let me
> know what you think!
> There are two simple unit tests for now, which translate "hello" to French
> ("salut"): one passes in both the source and target languages, and the other
> passes in just the target language (detecting the source language
> automatically).