[jira] [Commented] (TIKA-1723) Integrate language-detector into Tika

Tim Allison (JIRA) Mon, 31 Aug 2015 05:22:21 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14723352#comment-14723352
 ]


Tim Allison commented on TIKA-1723:
-----------------------------------

My personal preference would be to add to whatever metadata we already have 
about the document, not overwrite it.  We might use that existing information 
as a prior when running lang id on the doc.

So, again, my personal preference would be to apply the "added by Tika" 
prefix (TikaCoreProperties.TIKA_META_PREFIX) to any metadata that we compute 
via lang id.
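
As a rough illustration of the "add, don't overwrite" idea, here is a minimal 
sketch that uses a plain Map in place of Tika's Metadata class so that it runs 
without any dependencies. The literal prefix value "X-TIKA:" and the key name 
"detected_language" are assumptions made for this example, not names taken 
from the patch, and addDetectedLanguage is a hypothetical helper.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LangIdMetadataSketch {
    // Stand-in for TikaCoreProperties.TIKA_META_PREFIX; the literal value is assumed.
    static final String TIKA_META_PREFIX = "X-TIKA:";

    // Hypothetical key for the computed language; not an actual Tika property.
    static final String DETECTED_LANGUAGE = TIKA_META_PREFIX + "detected_language";

    // Returns a copy of the metadata with the lang id result stored under the
    // prefixed key, leaving any language the document declared for itself intact.
    static Map<String, String> addDetectedLanguage(Map<String, String> metadata,
                                                   String detected) {
        Map<String, String> result = new LinkedHashMap<>(metadata);
        result.put(DETECTED_LANGUAGE, detected);
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Language", "en"); // declared by the document itself

        Map<String, String> updated = addDetectedLanguage(metadata, "de");

        // Both the declared and the computed value are still available.
        System.out.println(updated);
    }
}
```

Keeping both values means a later step can still see what the document 
claimed about itself alongside what lang id computed.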

> Integrate language-detector into Tika
> -------------------------------------
>
>                 Key: TIKA-1723
>                 URL: https://issues.apache.org/jira/browse/TIKA-1723
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 1.11
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>            Priority: Minor
>         Attachments: TIKA-1723.patch, TIKA-1723v2.patch
>
>
> The language-detector project at 
> https://github.com/optimaize/language-detector is faster, has more languages 
> (70 vs 13) and better accuracy than the built-in language detector.
> This is a stab at integrating it, with some initial findings. There are a 
> number of issues this raises, especially if [~chrismattmann] moves forward 
> with turning language detection into a pluggable extension point.
> I'll add comments with results below.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
